Extracting data from a table with Beautiful Soup

GNMO11

I'm trying to extract data from the Four Factors table on this page: https://www.basketball-reference.com/boxscores/201101100CHA.html. I'm having trouble getting to the table. This is what I've tried:

import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")

div = soup.find('div',id='all_four_factors')

Then, when I try to pull the rows with tr = div.find_all('tr'), I get nothing back.

Bill M.

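I looked at the HTML you're trying to scrape, and the problem is that the tags you want are all inside an HTML comment. BeautifulSoup treats the contents of a comment as one block of text, not as actual HTML. So what you need to do is get the contents of the comment and then feed that string back into BeautifulSoup: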

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")

div = soup.find('div', id='all_four_factors')

# Get everything in here that's a comment.
# (string= replaces the older text= argument, which is deprecated.)
comments = div.find_all(string=lambda text: isinstance(text, Comment))

# Loop through each comment until you find the one that
# has the stuff you want.
for c in comments:

    # A perhaps crude but effective way of stopping at a comment
    # with HTML inside: see if it starts with '<'. Using startswith
    # also avoids an IndexError on an empty comment.
    if c.strip().startswith('<'):
        newsoup = BeautifulSoup(c.strip(), 'html.parser')
        tr = newsoup.find_all('tr')
        print(tr)
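
To see the same behavior without hitting the site, here is a minimal offline sketch. The markup in SAMPLE_HTML is made up for illustration (it only imitates the wrapper div and the commented-out table, it is not copied from the real page), so treat it as an assumption about the page's shape:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup that imitates the page's structure: the table
# sits inside an HTML comment within the wrapper div.
SAMPLE_HTML = """
<div id="all_four_factors">
  <!--
  <table id="four_factors">
    <tr><th>Team</th><th>Pace</th></tr>
    <tr><td>AAA</td><td>90.1</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
div = soup.find("div", id="all_four_factors")

# The parser sees no <tr> tags here, because the whole comment is one text node.
rows_before = div.find_all("tr")

# Pull out the Comment nodes and parse the first one as HTML in its own right.
comments = div.find_all(string=lambda s: isinstance(s, Comment))
inner = BeautifulSoup(comments[0].strip(), "html.parser")
rows_after = inner.find_all("tr")

print(len(rows_before), len(rows_after))
```

With this sample, the first count is 0 and the second is 2, which matches what the question describes: the rows only show up after the comment text is parsed a second time.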

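One caveat is that BS will assume the commented-out code is valid, well-formed HTML. It worked for me, though, so as long as the page stays relatively the same, it should continue to work.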
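
Since this depends on the page layout staying the same, it may help to wrap the steps in a small function that returns an empty list when the wrapper div or the commented-out markup is missing, instead of raising. The sketch below is one possible way to do that, not the only one. The function name rows_from_commented_table and the sample markup are hypothetical, and it returns the text of each cell rather than the tr tags:

```python
from bs4 import BeautifulSoup, Comment


def rows_from_commented_table(html, wrapper_id):
    """Return each <tr> found inside commented-out markup as a list of cell strings.

    Hypothetical helper: returns [] if the wrapper div is missing or
    none of its comments contain markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id=wrapper_id)
    if div is None:
        return []

    rows = []
    for comment in div.find_all(string=lambda s: isinstance(s, Comment)):
        text = comment.strip()
        if not text.startswith("<"):
            continue  # ordinary comment with no markup inside
        inner = BeautifulSoup(text, "html.parser")
        for tr in inner.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            if cells:
                rows.append(cells)
    return rows


# Made-up sample input; the real page's columns will differ.
sample = """
<div id="all_four_factors">
  <!-- a plain note -->
  <!--
  <table>
    <tr><th>Team</th><th>Pace</th></tr>
    <tr><td>AAA</td><td>90.1</td></tr>
  </table>
  -->
</div>
"""

table = rows_from_commented_table(sample, "all_four_factors")
print(table)
```

For the real page, the same function could be called with requests.get(url).content as the first argument. An empty result would then be a hint that the layout has changed.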
