Unable to web-scrape an HTML table with BeautifulSoup and load it into a Pandas DataFrame using Python


Tony Pendleton

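My goal is to access the table on the web page https://www.countries-ofthe-world.com/world-currencies.html and convert it into a DataFrame with the columns "Country or territory", "Currency", and "ISO-4217".

I am able to get the column names correctly, but I am struggling to work out how to append each row to the DataFrame. Does anyone have suggestions on how I could do this? For example, on the web page the first row of the table is the letter "A". However, I need the first row of the DataFrame to be Afghanistan, Afghan afghani, and AFN.

Here is what I have so far: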

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.countries-ofthe-world.com/world-currencies.html"
req = Request(url, headers={"User-Agent":"Mozilla/5.0"})
webpage=urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")
table = soup.find("table", {"class":"codes"})
rows = table.find_all('tr')
columns = [v.text for v in rows[0].find_all('th')] 
print(columns) # ['Country or territory', 'Currency', 'ISO-4217']

Please also see this image.

Thank you all for your time.

Tony

Randy

With the request set up as in your code, the table can be parsed quite easily with pd.read_html, as follows:

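from io import BytesIO
from urllib.request import Request, urlopen
import pandas as pd

url = "https://www.countries-ofthe-world.com/world-currencies.html"
# Send a browser-like User-Agent header with the request
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(req).read()

# read_html returns a list of DataFrames, one per table found; take the first
df = pd.read_html(BytesIO(webpage))[0]
print(df.head())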

         Country or territory        Currency ISO-4217
0                           A               A        A
1                 Afghanistan  Afghan afghani      AFN
2  Akrotiri and Dhekelia (UK)   European euro      EUR
3     Aland Islands (Finland)   European euro      EUR
4                     Albania    Albanian lek      ALL

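The result still contains those letter heading rows (such as row 0, where every column holds "A"), but you can filter them out with something like df = df[df['Currency'] != df['ISO-4217']]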
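To see what that filter does without fetching the page, here is a minimal sketch that rebuilds the five rows from the df.head() output above by hand and then applies the same comparison. It assumes that a letter heading row repeats the same letter in every column and that no real currency name is identical to its ISO code. The name clean and the reset_index call are additions for illustration, not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the df.head() output shown above
df = pd.DataFrame(
    [
        ["A", "A", "A"],
        ["Afghanistan", "Afghan afghani", "AFN"],
        ["Akrotiri and Dhekelia (UK)", "European euro", "EUR"],
        ["Aland Islands (Finland)", "European euro", "EUR"],
        ["Albania", "Albanian lek", "ALL"],
    ],
    columns=["Country or territory", "Currency", "ISO-4217"],
)

# Letter heading rows repeat the same letter in every column,
# so keep only rows where Currency differs from the ISO code.
# reset_index(drop=True) renumbers the remaining rows from 0.
clean = df[df["Currency"] != df["ISO-4217"]].reset_index(drop=True)

print(clean.iloc[0].tolist())  # ['Afghanistan', 'Afghan afghani', 'AFN']
```

Resetting the index is optional; without it the first remaining row keeps its original label 1.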
