I'm almost done writing my first scraper!
I've hit a snag, however: I can't seem to grab the contents of posts that contain a table (in other words, posts that quote another post).
This is the code that extracts post contents from the soup object. It works just fine:
def getPost_contents(soup0bj):
    try:
        soup0bj = (soup0bj)
        post_contents = []
        for content in soup0bj.findAll('div', {'class': 'post_content'}, recursive=True):
            post_contents.append(content.text.strip())
    ...  # Error management
    return (post_contents)
Here is an example of what I need to scrape (highlighted in yellow):
(The URL, just in case: http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906)
How do I grab the highlighted content, and why does my current getPost_contents function not work in this particular instance? As far as I can tell, the string is still under div class=post_content.
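To check that assumption offline, here is a minimal sketch using only the standard library's html.parser (instead of BeautifulSoup) on a made-up approximation of the forum markup: text inside a nested quote table is still a descendant of the post_content div, so a correct class match should reach it.

```python
from html.parser import HTMLParser

# Hypothetical snippet mimicking the forum markup: a quoted post is a
# <table> nested inside the div with class "post_content".
HTML = (
    '<div class="post_content">'
    '<table><tr><td>quoted text</td></tr></table>'
    'reply text'
    '</div>'
)

class PostText(HTMLParser):
    """Collect all character data that appears inside the post_content div."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth once inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == 'div' and ('class', 'post_content') in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

p = PostText()
p.feed(HTML)
print(p.chunks)  # both the quoted text and the reply are collected
```

Both strings come back, which supports the idea that the problem lies elsewhere (the markup match or the decoding), not in the nesting itself.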
EDIT
This is how I fetch the BeautifulSoup object:
from bs4 import BeautifulSoup as Soup

def getHTMLsoup(url):
    try:
        html = urlopen(url)
    ...  # Error management
    try:
        soup0bj = Soup(html.read().decode('utf-8', 'replace'))
        time.sleep(5)
    ...  # Error management
    return (soup0bj)
EDIT 2
These are the relevant parts of the scraper (sorry for the dump!):
from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError, URLError
import time, re

def getHTMLsoup(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print('The server hosting {} is unavailable.'.format(url), '\n')
        print('Trying again in 10 minutes...', '\n')
        time.sleep(600)
        return getHTMLsoup(url)
    except URLError as e:
        print('The webpage found at {} is unavailable.'.format(url), '\n')
        print('Trying again in 10 minutes...', '\n')
        time.sleep(600)
        return getHTMLsoup(url)
    try:
        soup0bj = Soup(html.read().decode('utf-8', 'replace'))
        time.sleep(5)
    except AttributeError as e:
        print("Ooops, {}'s HTML structure wasn't detected.".format(url), '\n')
        return None
    return soup0bj

def getMessagetable(soup0bj):
    try:
        soup0bj = (soup0bj)
        messagetable = []
        for data in soup0bj.findAll('tr', {'class': re.compile('message.*')}, recursive=True):
            messagetable.append(data)  # loop body was missing from the paste
    except AttributeError as e:
        print(' ')
    return (messagetable)

def getTime_stamps(soup0bj):
    try:
        soup0bj = (soup0bj)
        time_stamps = []
        for stamp in soup0bj.findAll('span', {'class': 'topic_posted'}):
            time_stamps.append(re.search(r'../../20..', stamp.text).group(0))
    except AttributeError as e:
        print('No time-stamps found. Moving on.', '\n')
    return (time_stamps)

def getHandles(soup0bj):
    try:
        soup0bj = (soup0bj)
        handles = []
        for handle in soup0bj.findAll('span', {'data-id_user': re.compile('.*')}, limit=1):
            handles.append(handle.text)
    except AttributeError as e:
        print("")
    return (handles)

def getPost_contents(soup0bj):
    try:
        soup0bj = (soup0bj)
        post_contents = []
        for content in soup0bj.findAll('div', {'class': 'post_content'}, recursive=True):
            post_contents.append(content.text.strip())
    except AttributeError as e:
        print('Ooops, something has gone wrong!')
    return (post_contents)

html = ('http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm')
for soup in getHTMLsoup(html):
    for messagetable in getMessagetable(soup):
        print(getTime_stamps(messagetable), '\n')
        print(getHandles(messagetable), '\n')
        print(getPost_contents(messagetable), '\n')
The problem is your decoding: the data is not utf-8. If you remove the "replace" from your code, you get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 253835: invalid continuation byte
The data seems to be latin-1 encoded: decoding as latin-1 raises no error, but the output does look off in certain parts. Using
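That error can be reproduced without the network. A small sketch (the word diabétique is my own example, chosen because 'é' is the byte 0xe9 in latin-1) shows why 'replace' merely masks the problem:

```python
# 'é' is the single byte 0xe9 in latin-1, reproducing the bad byte above.
raw = 'diabétique'.encode('latin-1')      # b'diab\xe9tique'

try:
    raw.decode('utf-8')                   # strict utf-8 decoding raises
except UnicodeDecodeError as e:
    print(e.reason)                       # invalid continuation byte

print(raw.decode('utf-8', 'replace'))     # diab\ufffdtique: error silently hidden
print(raw.decode('latin-1'))              # diabétique: round-trips correctly
```

With 'replace', every undecodable byte becomes U+FFFD, so the scraper keeps running while quietly corrupting exactly the accented characters the forum text is full of.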
html = urlopen(r).read().decode("latin-1")
works, but as I mentioned, you get strange output like:
"diabète en cas d'accident de la route ou malaise isolÊ ou autre ???"
Another option is to pass an accept-charset header:
from urllib.request import Request, urlopen
headers = {"accept-charset":"utf-8"}
r = Request("http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906",headers=headers)
html = urlopen(r).read()
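Constructing the Request object makes no network call, so the header that would be sent can be inspected offline (urllib stores header names capitalized):

```python
from urllib.request import Request

# Same header as above; building the Request does not open a connection.
r = Request('http://forum.doctissimo.fr/sante/diabete/'
            'savoir-diabetique-sujet_170840_1.htm#t657906',
            headers={'accept-charset': 'utf-8'})
print(r.get_header('Accept-charset'))  # utf-8
```

Whether the server actually honours Accept-Charset is entirely up to the server, which is presumably why this made no difference here.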
Using requests and letting it handle the encoding, I get exactly the same encoding problem; it is as if the data is a mix of encodings, some utf-8 and some latin-1. The headers returned by requests show the content encoding as gzip:
'Content-Encoding': 'gzip'
If we specify that we want gzip and decompress it ourselves:
from urllib.request import Request, urlopen
headers = {"Accept-Encoding":"gzip"}
r = Request("http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906",headers=headers)
r = urlopen(r)
import gzip
gzipFile = gzip.GzipFile(fileobj=r)
print(gzipFile.read().decode("latin-1"))
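The decompression step can be exercised offline as well: GzipFile only needs a file-like object, which io.BytesIO provides (the sample text is my own stand-in for the response body).

```python
import gzip
import io

payload = 'diabète'.encode('latin-1')
compressed = gzip.compress(payload)            # stand-in for the gzip response

gz = gzip.GzipFile(fileobj=io.BytesIO(compressed))
print(gz.read().decode('latin-1'))             # diabète
```

This confirms the gzip layer itself is lossless; the remaining problem is purely which codec to decode the decompressed bytes with.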
Decoding as utf-8 we get the same error, and as latin-1 the same weird output. Interestingly enough, in python2 both requests and urllib work fine.
Using chardet:
r = urlopen(r)
import chardet
print(chardet.detect(r.read()))
It is roughly 71 percent confident that the data is ISO-8859-2:
{'confidence': 0.711104254322944, 'encoding': 'ISO-8859-2'}
but that again gives the same bad output.
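Since no single codec fits the whole page, one pragmatic workaround (my own sketch, not part of the original answer) is a fallback chain: try strict utf-8 first, and fall back to latin-1, which never raises.

```python
def decode_best_effort(raw: bytes) -> str:
    """Decode as utf-8 if possible, otherwise fall back to latin-1."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('latin-1')

print(decode_best_effort('déjà'.encode('utf-8')))    # déjà
print(decode_best_effort('déjà'.encode('latin-1')))  # déjà
```

On a genuinely mixed buffer this still garbles one of the two runs, so splitting the payload first (for example, per post) and decoding each piece separately may give better results.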