Here is the HTML code:
```html
<div class="wp-block-atomic-blocks-ab-accordion ab-block-accordion ab-font-size-18"><details><summary class="ab-accordion-title"><strong>American Samoa</strong></summary><div class="ab-accordion-text">
<ul><li><strong><a href="https://www.americansamoa.gov/covid-19-advisories" target="_blank" rel="noreferrer noopener" aria-label="American Samoa Department of Health Travel Advisory (opens in a new tab)">American Samoa Department of Health Travel Advisory</a></strong></li><li>March 2, 2020—Governor Moliga <a rel="noreferrer noopener" href="https://www.rnz.co.nz/international/pacific-news/410783/american-samoa-establishes-govt-taskforce-to-plan-for-coronavirus" target="_blank">appointed</a> a government taskforce to provide a plan for preparation and response to the covid-19 coronavirus. </li></ul>
<ul><li>March 25, 2020 – The Governor <a href="https://6fe16cc8-c42f-411f-9950-4abb1763c703.filesusr.com/ugd/4bfff9_2d3c78a841824b8aafe05032f853585b.pdf">issued</a> an Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health. The order requires the immediate and comprehensive enforcement by the Commissioner of Public Safety, Director of Health, Attorney General, and other agency leaders.
<ul>
<li>Business are also required to provide necessary supplies to the public and are prohibited from price gouging.</li>
</ul>
</li></ul>
</div></details></div>
```
I want to extract the state, date, and text, and add these three columns to a dataframe:

State: American Samoa
Date: 2020-03-25
Text: The Governor issued Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health
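For reference, the target frame with those three columns (the column names here are my own choice) can be sketched as:

```python
import pandas as pd

# Hypothetical target shape: one row per dated action, three named columns.
df = pd.DataFrame(
    [["American Samoa", "2020-03-25",
      "The Governor issued Executive Order 001 recognizing the Declared "
      "Public Health Emergency and State of Emergency"]],
    columns=["State", "Date", "Text"],
)
print(df.shape)  # (1, 3)
```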
My code so far:
```python
soup = bs4.BeautifulSoup(data)
for tag in soup.find_all("summary"):
    print("{0}: {1}".format(tag.name, tag.text))
for tag1 in soup.find_all("li"):
    #print(type(tag1))
    ln = tag1.text
    dt = (ln.split(' – ')[0])
    dt = (dt.split('—')[0])
    #txt = ln.split(' – ')[1]
    print(dt)
```
Any help is appreciated. Thanks!
First, I added the code below. Unfortunately, the webpage is not uniform in how it uses HTML lists: some `ul` elements contain nested `ul` elements while others do not. This code is not perfect, just a starting point. American Samoa, for example, has nested `ul` elements that are an absolute mess, so it appears only once in the `df`.
```python
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}

# You need to specify User-Agent headers or else you get a 403
data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
soup = BeautifulSoup(data, 'lxml')

rows_list = []
for detail in soup.find_all("details"):
    state = detail.find('summary')
    ul = detail.find('ul')
    for li in ul.find_all('li', recursive=False):
        # Three types of hyphen are used on this webpage
        split = re.split('(?:-|–|—)', li.text, maxsplit=1)
        if len(split) == 2:
            rows_list.append([state.text, split[0], split[1]])
        else:
            print("Error", li.text)

df = pd.DataFrame(rows_list)
# Note: max_colwidth must be None (not -1) in recent pandas versions
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
    print(df)
```
It creates and prints a dataframe with 547 rows, and prints some error messages for text that could not be split. You will have to decide exactly which data you need and how to adapt the code for your purposes.

If `lxml` is not installed, you can use `html.parser` instead.
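Since the date portion produced by the split is free text such as "March 25, 2020 ", it can be normalized to the ISO format you described using pandas. A minimal sketch, assuming the frame's columns are named State/Date/Text (the names are my assumption):

```python
import pandas as pd

# Stand-in rows shaped like the scraper's output: free-text dates with stray whitespace
rows_list = [
    ["American Samoa", "March 25, 2020 ", " The Governor issued ..."],
    ["Alaska", "March 11, 2020", " Governor announced ..."],
]
df = pd.DataFrame(rows_list, columns=["State", "Date", "Text"])

# errors='coerce' turns unparseable strings into NaT instead of raising
df["Date"] = pd.to_datetime(df["Date"].str.strip(), errors="coerce").dt.strftime("%Y-%m-%d")
print(df["Date"].tolist())  # ['2020-03-25', '2020-03-11']
```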
UPDATE: Another approach is to use a regular expression to match any string that starts with a date:
```python
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}

# You need to specify User Agent headers or else you get a 403
data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
soup = BeautifulSoup(data, 'html.parser')

# Matches a leading date such as "March 25, 2020" (compiled once, outside the loop)
p = re.compile(
    r'(\s*(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?'
    r'|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
    r'\s*(\d{1,2}),?\s*(\d{4}))',
    re.IGNORECASE)

rows_list = []
for detail in soup.find_all("details"):
    state = detail.find('summary')
    for li in detail.find_all('li'):
        m = p.match(li.text)
        if m:
            rows_list.append([state.text, m.group(0), m.string.replace(m.group(0), '')])
        else:
            print("Error", li.text)

df = pd.DataFrame(rows_list)
df.to_csv('out.csv')
```
This gives many more records: 4,785. Again, it is a starting point and some data is still missed, but far less of it. It writes the data to the CSV file out.csv.
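As an optional follow-up (my assumption about the cleanup you may want), out.csv can be reloaded, given column names, and de-duplicated, since nested lists can produce repeated rows. Shown here with an in-memory stand-in for the file:

```python
import pandas as pd
from io import StringIO

# Stand-in for out.csv as written above (first column is the unnamed index)
raw = StringIO(
    ",0,1,2\n"
    "0,American Samoa,March 2 2020,Governor Moliga appointed a taskforce\n"
    "1,American Samoa,March 2 2020,Governor Moliga appointed a taskforce\n"
)
df = pd.read_csv(raw, index_col=0)
df.columns = ["State", "Date", "Text"]
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))  # 1
```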