我正在尝试使用python编写网络搜寻器。我正在使用re和request模块。我想从第一页(这是一个论坛)获取URL,并从每个URL获取信息。
我现在的问题是,我已经将URL存储在列表中。但是我无法进一步获得这些URL的正确源代码。
这是我的代码:
import re
import requests
url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'
sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
for eachLink in allLinksinPage:
url = 'http://bbs.skykiwi.com/' + eachLink.encode('utf-8')
html = getsourse(url) #THIS IS WHERE I CAN'T GET THE RIGHT SOURCE CODE
#To get the source code of current url
def getsourse(url):
header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
html = requests.get(url, headers=header)
return html.text
#To get all the links in current page
def getallLinksinPage(sourceCode):
bigClasses = re.findall('<th class="new">(.*?)</th>', sourceCode, re.S)
allLinks = []
for each in bigClasses:
everylink = re.findall('</em><a href="(.*?)" onclick', each, re.S)[0]
allLinks.append(everylink)
return allLinks
您在使用函数后定义函数,这样您的代码将出错。您也不应该使用re来解析html,而是使用如下的beautifulsoup这样的解析器。还可以使用urlparse.urljoin将基本URL链接到链接,您真正想要的是id为div的锚标签中的hrefs:threadlist
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'
def getsourse(url):
header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
html = requests.get(url, headers=header)
return html.content
#To get all the links in current page
def getallLinksinPage(sourceCode):
soup = BeautifulSoup(sourceCode)
return [a["href"] for a in soup.select("#threadlist a.xst")]
sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
for eachLink in allLinksinPage:
url = 'http://bbs.skykiwi.com/'
html = getsourse(urljoin(url, eachLink))
print(html)
如果urljoin(url, eachLink)
在循环中打印,您会看到获得该表的所有正确链接并返回了正确的源代码,以下是返回的链接的片段:
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3177846&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3197510&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3201399&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3170748&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3152747&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3168498&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3176639&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203657&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3190138&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3140191&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199154&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3156814&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203435&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3089967&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199384&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3173489&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3204107&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
如果您在浏览器中访问上面的链接,您将看到它获得正确的页面,http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3187289&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
根据结果,您将看到:
Sorry, specified thread does not exist or has been deleted or is being reviewed
[New Zealand day-dimensional network Community Home]
您可以清楚地看到url的区别。如果您希望自己的工作,则需要在正则表达式中进行替换:
everylink = re.findall('</em><a href="(.*?)" onclick', each.replace("&","%26"), re.S)[0]
但是真的不解析html会是一个正则表达式。
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句