BeautifulSoup: extracting data from multiple tables

Posted by user2447387 on Dev


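I am trying to use BeautifulSoup to extract some data from two HTML tables in an HTML file.

This is actually my first time using it, and I have searched through many questions and examples, but none of them seem to fit my case. The HTML contains two tables: the first holds the headers for the first column (always text), and the second holds the data for the remaining columns. The tables also contain text, numbers, and symbols, which makes everything more complicated for a newbie like me. The HTML layout is as copied from the browser. I was able to extract the entire HTML content of the rows, but only for the first table, so I am not actually getting any data, only the contents of the first column.

The output I am trying to get is a string containing the "joined" information of the tables (Col1 = text, Col2 = number, Col3 = number, Col4 = number, Col5 = number), for example: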

Canada, 6, 5, 2, 1

Here is the list of XPaths for each item:

"Canada": /html/body/div/div[1]/table/tbody[2]/tr[2]/td/div/a
"6": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[1] 
"5": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[3] 
"2": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[5]
"1": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[7]

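I would also be happy with a string in "raw" HTML format, as long as there is one string per row, so that I can parse it further with methods I already know. Here is the code I have so far. Thank you!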

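from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
html=""" 
my html code
"""
soup = BeautifulSoup(html)
# find() returns only the first <table>, so the second table is never visited
table=soup.find("table")
for row in table.findAll('tr'):
    col = row.findAll('td')
    print row, col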

空置

This uses bs4, but it should work:

from bs4 import BeautifulSoup as bsoup

# Parse the saved page; naming the parser avoids bs4's "no parser specified" warning.
with open("htmlsample.html") as ofile:
    soup = bsoup(ofile, "html.parser")

# Every <tbody> in document order, across both tables.
tables = soup.find_all("tbody")

storeTable = tables[0].find_all("tr")        # label rows (first column)
storeValueRows = tables[2].find_all("tr")    # matching value rows

storeRank = []
for row in storeTable:
    storeRank.append(row.get_text().strip())

storeMatrix = []
for row in storeValueRows:
    storeMatrixRow = []
    # Take every other <td>, matching td[1], td[3], td[5], td[7] in the XPaths.
    for cell in row.find_all("td")[::2]:
        storeMatrixRow.append(cell.get_text().strip())
    storeMatrix.append(", ".join(storeMatrixRow))

for record in zip(storeRank, storeMatrix):
    print(" ".join(record))

The above prints:

# of countries - rank 1 reached 0, 0, 1, 9
# of countries - rank 5 reached 0, 8, 49, 29
# of countries - rank 10 reached 25, 31, 49, 32
# of countries - rank 100 reached 49, 49, 49, 32
# of countries - rank 500 reached 49, 49, 49, 32
# of countries - rank 1000 reached 49, 49, 49, 32
[Finished in 0.5s]
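
The `[::2]` slice is what skips the in-between cells: the XPaths in the question address td[1], td[3], td[5] and td[7], which suggests that every other cell holds no value. A small sketch on a made-up row shows the effect. The HTML below is hypothetical, not the asker's page, and the empty spacer cells are an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical row: value cells alternate with empty spacer cells,
# mirroring the td[1], td[3], td[5], td[7] positions in the XPaths.
row_html = (
    "<table><tr>"
    "<td>6</td><td></td><td>5</td><td></td>"
    "<td>2</td><td></td><td>1</td>"
    "</tr></table>"
)
row = BeautifulSoup(row_html, "html.parser").find("tr")

cells = row.find_all("td")
every_cell = [c.get_text().strip() for c in cells]
value_cells = [c.get_text().strip() for c in cells[::2]]  # indices 0, 2, 4, 6

print(every_cell)
print(", ".join(value_cells))
```

If the real page puts values in different positions, the start and step of the slice would need adjusting accordingly.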

Changing storeTable to use tables[1] and storeValueRows to use tables[3] prints:

Country 
Canada 6, 5, 2, 1
Brazil 7, 5, 2, 1
Hungary 7, 6, 2, 2
Sweden 9, 5, 1, 1
Malaysia 10, 5, 2, 1
Mexico 10, 5, 2, 2
Greece 10, 6, 2, 1
Israel 10, 6, 2, 1
Bulgaria 10, 6, 2, -
Chile 10, 6, 2, -
Vietnam 10, 6, 2, -
Ireland 10, 6, 2, -
Kuwait 10, 6, 2, -
Finland 10, 7, 2, -
United Arab Emirates 10, 7, 2, -
Argentina 10, 7, 2, -
Slovakia 10, 7, 2, -
Romania 10, 8, 2, -
Belgium 10, 9, 2, 3
New Zealand 10, 13, 2, -
Portugal 10, 14, 2, -
Indonesia 10, 14, 2, -
South Africa 10, 15, 2, -
Ukraine 10, 15, 2, -
Philippines 10, 16, 2, -
United Kingdom 11, 5, 2, 1
Denmark 11, 6, 2, 2
Australia 12, 9, 2, 3
United States 13, 9, 2, 2
Austria 13, 9, 2, 3
Turkey 14, 5, 2, 1
Egypt 14, 5, 2, 1
Netherlands 14, 8, 2, 2
Spain 14, 11, 2, 4
Thailand 15, 10, 2, 3
Singapore 16, 10, 2, 2
Switzerland 16, 10, 2, 3
Taiwan 17, 12, 2, 4
Poland 17, 13, 2, 5
France 18, 8, 2, 3
Czech Republic 18, 13, 2, 6
Germany 19, 11, 2, 3
Norway 20, 14, 2, 5
India 20, 14, 2, 5
Italy 20, 15, 2, 7
Hong Kong 26, 21, 2, -
Japan 33, 16, 4, 5
Russia 33, 17, 2, 7
South Korea 46, 27, 2, 5
[Finished in 0.6s]

This is not the best code and it can be improved further, but the logic works well.
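
As one possible tidy-up, the same logic could be wrapped in a function that takes the two tbody indices. This is only a sketch: `join_tables`, its parameters, and the `SAMPLE` markup are hypothetical names made up for illustration, and the sample has one tbody per table, so the indices are 0 and 1 here rather than the 1 and 3 used on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: names in one table, values in another.
SAMPLE = """
<div><table>
  <tbody><tr><td>Country</td></tr>
         <tr><td><div><a>Canada</a></div></td></tr>
         <tr><td><div><a>Brazil</a></div></td></tr></tbody>
</table></div>
<div><table>
  <tbody><tr></tr>
         <tr><td>6</td><td></td><td>5</td><td></td><td>2</td><td></td><td>1</td></tr>
         <tr><td>7</td><td></td><td>5</td><td></td><td>2</td><td></td><td>1</td></tr></tbody>
</table></div>
"""

def join_tables(html, label_idx, value_idx, step=2):
    """Pair label rows with value rows; returns 'Label, v1, v2, ...' strings."""
    bodies = BeautifulSoup(html, "html.parser").find_all("tbody")
    labels = [tr.get_text().strip() for tr in bodies[label_idx].find_all("tr")]
    values = [
        [td.get_text().strip() for td in tr.find_all("td")[::step]]
        for tr in bodies[value_idx].find_all("tr")
    ]
    # Skip rows without any value cells (for example a header row).
    return [", ".join([label] + vals) for label, vals in zip(labels, values) if vals]

lines = join_tables(SAMPLE, label_idx=0, value_idx=1)
print("\n".join(lines))
```

Like the original, this relies on both tbody elements listing their rows in the same order, since `zip` pairs them purely by position.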

Hope this helps.

Edit:

If you want the format South Korea, 46, 27, 2, 5 rather than South Korea 46, 27, 2, 5 (note the , after the country name), just change this:

storeRank.append(row.get_text().strip())

to this:

storeRank.append(row.get_text().strip() + ",")
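
An alternative to appending the comma by hand is to put the name and the values in one list and join once, which also avoids a trailing comma on rows that have no values, such as the Country header line. The sketch below assumes the per-row value lists are kept instead of being pre-joined as in the answer; `store_rank` and `store_rows` are stand-in names with values copied from the output above.

```python
# Stand-in data shaped like the answer's variables (values copied from the output above).
store_rank = ["Country", "South Korea", "Japan"]
store_rows = [[], ["46", "27", "2", "5"], ["33", "16", "4", "5"]]

# One join per record puts ", " between every field, name included.
records = [", ".join([name] + values) for name, values in zip(store_rank, store_rows)]

for record in records:
    print(record)
```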


Edited on 2021-02-7
