I am trying to extract data from the Four Factors table on this page: https://www.basketball-reference.com/boxscores/201101100CHA.html. I am having trouble getting to the table. I tried:
import requests
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
div = soup.find('div', id='all_four_factors')
Then, when I try to pull the rows with tr = div.find_all('tr'), I get nothing back.
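I looked at the HTML you are trying to scrape, and the problem is that the tags you want are all inside a comment, <!-- like this -->. BeautifulSoup treats the contents of a comment as one block of text rather than as actual HTML. So you need to take the contents of the comment and feed that string back into BeautifulSoup: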
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
div = soup.find('div', id='all_four_factors')

# Get everything in here that's a comment
comments = div.find_all(string=lambda text: isinstance(text, Comment))

# Loop through each comment until you find the one that
# has the stuff you want.
for c in comments:
    # A perhaps crude but effective way of stopping at a comment
    # with HTML inside: see if the first character inside is '<'.
    # startswith() also copes with an empty comment.
    if c.strip().startswith('<'):
        newsoup = BeautifulSoup(c.strip(), 'html.parser')
        tr = newsoup.find_all('tr')
        print(tr)
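One caveat: BeautifulSoup will assume that the commented-out code is valid, well-formed HTML. It works for me, though, so as long as the page stays roughly the same it should keep working.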
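If you want to check the approach without hitting the site, here is a minimal offline sketch of the same idea. The SAMPLE_HTML string, its team names and numbers, and the helper name rows_from_commented_table are all made up for illustration; they are not taken from the real page, whose markup may differ. Instead of checking the first character, this version simply re-parses each comment and keeps the first one that contains table rows.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table sits inside an HTML comment.
# The team names and values are placeholders, not real data.
SAMPLE_HTML = """
<div id="all_four_factors">
  <!-- just a note, no markup here -->
  <!--
  <table id="four_factors">
    <tr><th>Team</th><th>Pace</th></tr>
    <tr><td>AAA</td><td>90.0</td></tr>
    <tr><td>BBB</td><td>90.0</td></tr>
  </table>
  -->
</div>
"""


def rows_from_commented_table(html, div_id):
    """Return the cell text of each <tr> found in the first comment
    inside the given div that contains table rows (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id=div_id)
    if div is None:
        return []
    # Collect every comment node under the div.
    for comment in div.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment text as HTML and look for rows.
        inner = BeautifulSoup(comment, "html.parser")
        rows = inner.find_all("tr")
        if rows:
            return [
                [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                for row in rows
            ]
    return []


rows = rows_from_commented_table(SAMPLE_HTML, "all_four_factors")
print(rows)
```

Returning plain lists of cell text rather than Tag objects makes the result easy to hand to csv or pandas afterwards, but that is a design choice on my part, not something the page requires.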