How to get the correct source code from a URL with a web crawler in Python?

Wilhelm

I am trying to write a web crawler in Python, using the requests module. I want to collect the URLs from the first page (a forum thread list) and then fetch information from each of those URLs.

My problem is this: I have stored the URLs in a list, but I cannot get the correct source code for those URLs afterwards.

Here is my code:

import re
import requests

url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'

sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/' + eachLink.encode('utf-8')
    html = getsourse(url) #THIS IS WHERE I CAN'T GET THE RIGHT SOURCE CODE


#To get the source code of current url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows  NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.text

#To get all the links in current page
def getallLinksinPage(sourceCode):
    bigClasses = re.findall('<th class="new">(.*?)</th>', sourceCode, re.S)
    allLinks = []
    for each in bigClasses:
        everylink = re.findall('</em><a href="(.*?)" onclick', each, re.S)[0]
        allLinks.append(everylink)
    return allLinks
Padraic Cunningham

You define your functions after calling them, so your code will error out. You also should not use re to parse html; use a parser such as BeautifulSoup instead. You can also use urljoin (from urlparse on Python 2, urllib.parse on Python 3) to join the base URL to each link. What you actually want are the hrefs inside the div tag with the id threadlist:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 3; on Python 2 use `from urlparse import urljoin`

url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'


#To get the source code of current url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.content

#To get all the links in current page
def getallLinksinPage(sourceCode):
    soup = BeautifulSoup(sourceCode, "html.parser")
    return [a["href"] for a in soup.select("#threadlist a.xst")]



sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/'
    html = getsourse(urljoin(url, eachLink))
    print(html)
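To see what urljoin contributes here, a minimal stdlib-only sketch (the thread id below is made up for illustration, not taken from the forum):

```python
from urllib.parse import urljoin  # stdlib; Python 2 spelled this module `urlparse`

base = 'http://bbs.skykiwi.com/'
# A relative href as it would appear in the thread list (the tid is made up)
relative = 'forum.php?mod=viewthread&tid=12345'
full = urljoin(base, relative)
print(full)  # → http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=12345

# An href that is already absolute passes through unchanged
absolute = urljoin(base, 'http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=12345')
print(absolute == full)  # → True
```

This is why urljoin is safer than naive string concatenation: it handles both relative and absolute hrefs correctly.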

If you print urljoin(url, eachLink) in the loop, you will see that you get all the correct links from the thread table and that the correct source code is returned. Here is a snippet of the returned links:

http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3177846&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3197510&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3201399&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3170748&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3152747&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3168498&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3176639&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203657&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3190138&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3140191&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199154&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3156814&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203435&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3089967&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199384&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3173489&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3204107&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231

If you visit the links above in your browser, you will see that they return the correct pages. With your own code you instead get links like http://bbs.skykiwi.com/forum.php?mod=viewthread&amp;tid=3187289&amp;extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231, and if you visit that you will see:

Sorry, specified thread does not exist or has been deleted or is being reviewed
[New Zealand day-dimensional network Community Home]

You can clearly see the difference in the URLs. If you want your own code to work, you need to decode the HTML entities in what the regex matches against:

 everylink = re.findall('</em><a href="(.*?)" onclick', each.replace("&amp;", "&"), re.S)[0]
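The root cause is that hrefs in the raw HTML carry entity-encoded ampersands (&amp;). Rather than a hand-rolled replace, the stdlib html.unescape decodes all such entities; a minimal sketch on a made-up href:

```python
import html

# An href exactly as it sits in the raw HTML, with entity-encoded ampersands
raw_href = 'forum.php?mod=viewthread&amp;tid=12345&amp;extra=page%3D1'
print(html.unescape(raw_href))  # → forum.php?mod=viewthread&tid=12345&extra=page%3D1
```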

But really, you should not be parsing html with a regex.
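As a stdlib-only illustration of the point (a sketch on a made-up snippet, not the forum's actual markup): even Python's built-in html.parser handles attribute order and entity-encoded hrefs that a regex pattern silently breaks on.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrive as (name, value) pairs with entities already decoded
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Made-up snippet: the attributes are in a different order than the regex
# expects, and the href uses &amp; — the parser copes with both
snippet = '<a onclick="x()" href="forum.php?mod=viewthread&amp;tid=1">t</a>'
p = LinkCollector()
p.feed(snippet)
print(p.links)
```

Note that the parser hands you the href with &amp; already decoded to &, so the entity problem above never arises.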
