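I'm trying to pull information out of many different tables at an HTML URL, without any of the HTML indentation/tab formatting. I use get_text to produce the content I want, but it prints a lot of whitespace and tabs. I've tried .strip, but it doesn't do what I want.
Here is the Python script I'm using: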
import csv, simplejson, urllib
from bs4 import BeautifulSoup

url = "http://www.thecomedystudio.com/schedule.html"
response = urllib.urlopen(url)  # Python 2 API
html = response
soup = BeautifulSoup(html.read())
text = soup.get_text()  # keeps the page's whitespace and tabs
print text
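Eventually I'd like to create a csv of the event calendar, but first I'd like to create a .txt or something that doesn't need much manual cleanup.
Any help is appreciated.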
You don't need to "clean up" the HTML before parsing it with BeautifulSoup.
Just parse the dates and events directly into a csv file:
import csv
from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "http://www.thecomedystudio.com/schedule.html"
soup = BeautifulSoup(urlopen(url))

# Python 2: the csv module expects the file opened in binary mode
with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    # each date sits in a <b> directly inside a centered div within a table cell
    for item in soup.select('td div[align=center] > b'):
        date = ' '.join(el.strip() for el in item.find_all(text=True))
        # the event description is in the table cell that follows the date cell
        event = item.parent.parent.find_next_sibling('td').get_text(strip=True)
        writer.writerow([date, event])
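For anyone on Python 3, here is a rough sketch of the same idea. It is not a tested port against the live page: SAMPLE_HTML is made-up markup that only imitates the structure the selectors above rely on, and extract_rows and rows_to_csv are hypothetical helper names. It also swaps get_text(strip=True) for a split/join, which collapses runs of whitespace inside the event text to single spaces.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the selectors expect;
# the real schedule page may be laid out differently.
SAMPLE_HTML = """
<table>
  <tr>
    <td><div align="center"><b>Fri<br>
        2.27.15</b></div></td>
    <td>
        Rick Canavan hosts with Christine An, Rachel Bloom.
    </td>
  </tr>
  <tr>
    <td><div align="center"><b>Sat<br>2.28.15</b></div></td>
    <td>Rick Jenkins hosts Taylor Connelly.</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return [date, event] pairs from schedule-like HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select('td div[align="center"] > b'):
        # stripped_strings yields each text fragment with whitespace trimmed
        date = " ".join(item.stripped_strings)
        # the event text lives in the cell right after the date cell
        cell = item.find_parent("td").find_next_sibling("td")
        # split/join collapses tabs and newlines into single spaces
        event = " ".join(cell.get_text().split())
        rows.append([date, event])
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text (in memory, so nothing is written to disk)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows), end="")
```

To write a real file in Python 3, open it in text mode with open('output.csv', 'w', newline='') rather than 'wb'.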
The contents of output.csv after running the script:
Fri 2.27.15,"Rick Canavan hosts with Christine An, Rachel Bloom, Dan Crohn, Wes Hazard, James Huessy, Kelly MacFarland, Peter Martin, Ted Pettingell."
Sat 2.28.15,"Rick Jenkins hosts Taylor Connelly, Lilian DeVane, Andrew Durso, Nate Johnson, Peter Martin, Andrew Mayer, Kofi Thomas, Tim Willis."
Sun 3.1.15,"Peter Martin hosts Sunday Funnies with Nonye Brown-West, Ryan Donahue, Joe Kozlowski, Casey Malone, Etrane Martinez, Kwasi Mensah, Anthony Zonfrelli, Christa Weiss and Sam Jay closing."
Tue 3.3.15,Mystery Lounge! The old-est and only-est magic show in New England! with guest comedian Ryan Donahue.
...
Thu 12.31.15,"New Year's Eve! with Rick Jenkins, Nathan Burke."
Fri 1.1.16,Rick Canavan hosts New Year's Day.
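If the goal is an event calendar, it may help to turn the date column into real dates. The sketch below assumes the labels are always "weekday month.day.two-digit-year", which is only inferred from the sample output above; parse_schedule_date is a hypothetical helper name.

```python
from datetime import datetime


def parse_schedule_date(label):
    """Convert a label such as 'Fri 2.27.15' into a datetime.date.

    Assumes month.day.two-digit-year after the weekday name.
    """
    # drop the weekday name so parsing does not depend on the locale
    _weekday, numeric = label.split(None, 1)
    return datetime.strptime(numeric, "%m.%d.%y").date()


for label in ("Fri 2.27.15", "Thu 12.31.15", "Fri 1.1.16"):
    print(label, "->", parse_schedule_date(label).isoformat())
```

Writing the ISO form (for example 2015-02-27) into the csv keeps the rows sortable in a spreadsheet.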