So, I am using the following code to scrape statutes from a site.
from bs4 import BeautifulSoup
import requests
f = open('C:\Python27\projects\FL_final.doc','w')
base_url = "http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"
for chapter in range (1,9):
    url = base_url.format(chapter=chapter)
    r = requests.get(url)
    soup = BeautifulSoup((r.content),"html.parser")
    tableContents = soup.find('div', {'class': 'Chapters' })
    for title in tableContents.find_all ('div', {'class': 'Title' }):
        f.write (title.text)
    for data in tableContents.find_all('div',{'class':'Section' }):
        data = data.text.encode("utf-8","ignore")
        data = "\n\n" + str(data)+ "\n"
        f.write(data)
f.close()
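The problem is that some chapters are missing. For example, there are pages for chapters 1 and 2, but the pages for chapters 3, 4, and 5 do not exist. So when I use range(1, 9), I get an error because the contents of chapters 3, 4, and 5 cannot be retrieved: their URLs (0003/0003, 0004/0004, 0005/0005) do not exist.
How can I skip the missing URLs in the loop and let the program move on to the next available URL in the range?
Here is the URL for chapter 1: http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html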
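To make the URL pattern concrete, here is a minimal sketch of how the format string expands for each chapter in the range. The helper name `chapter_url` is hypothetical and only used for illustration; the URL template is the one from the code above.

```python
# URL template taken from the question; {chapter:02d} zero-pads to two digits.
BASE_URL = (
    "http://www.leg.state.fl.us/statutes/index.cfm"
    "?App_mode=Display_Statute"
    "&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"
)


def chapter_url(chapter):
    """Build the page URL for one chapter number (hypothetical helper)."""
    return BASE_URL.format(chapter=chapter)


if __name__ == "__main__":
    # Prints one URL per chapter; some of these pages may not exist on the site.
    for chapter in range(1, 9):
        print(chapter_url(chapter))
```

Every number in the range produces a syntactically valid URL, so the loop has no way of knowing in advance which chapters are missing; that only shows up once the page is requested and parsed.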
You can wrap the URL request in a try, and check that tableContents is not None before calling find_all:
from bs4 import BeautifulSoup
import requests

f = open(r'C:\Python27\projects\FL_final.doc', 'w')
base_url = "http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"
for chapter in range(1, 9):
    url = base_url.format(chapter=chapter)
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException as e:
        # The request itself failed (e.g. a connection error): report it and move on.
        print("missing url")
        print(e)
        continue
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})
    # soup.find returns None when the page has no such div, so those chapters are skipped.
    if tableContents is not None:
        for title in tableContents.find_all('div', {'class': 'Title'}):
            f.write(title.text)
        for data in tableContents.find_all('div', {'class': 'Section'}):
            data = data.text.encode("utf-8", "ignore")
            data = "\n\n" + str(data) + "\n"
            print(data)
            f.write(data)
f.close()
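As a variation on the same idea, here is a rough sketch that separates "fetch one chapter" from "loop over the range and skip what is missing". The function names `fetch_chapter` and `scrape_available` are hypothetical, and the sketch assumes a missing chapter shows up either as a non-200 status or as a page without the 'Chapters' div; how the site actually responds has not been verified here. Note that requests.get does not raise an exception for an HTTP error status by default, so the explicit checks are what do the skipping.

```python
BASE_URL = (
    "http://www.leg.state.fl.us/statutes/index.cfm"
    "?App_mode=Display_Statute"
    "&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"
)


def fetch_chapter(chapter):
    """Return (titles, sections) for a chapter, or None if it looks missing."""
    # Third-party libraries are imported here so that scrape_available()
    # can be tried out with a stand-in fetcher and no network access.
    import requests
    from bs4 import BeautifulSoup

    try:
        response = requests.get(BASE_URL.format(chapter=chapter), timeout=10)
    except requests.exceptions.RequestException:
        return None  # connection problem, timeout, etc.
    if response.status_code != 200:
        return None  # assumption: missing pages may answer with an error status
    soup = BeautifulSoup(response.content, "html.parser")
    container = soup.find("div", {"class": "Chapters"})
    if container is None:
        return None  # page came back but has no chapter content
    titles = [d.get_text() for d in container.find_all("div", {"class": "Title"})]
    sections = [d.get_text() for d in container.find_all("div", {"class": "Section"})]
    return titles, sections


def scrape_available(chapters, fetch=fetch_chapter):
    """Collect results for the chapters that exist, skipping the rest."""
    found = {}
    for chapter in chapters:
        result = fetch(chapter)
        if result is None:
            continue  # skip this chapter and try the next one in the range
        found[chapter] = result
    return found
```

Calling `scrape_available(range(1, 9))` would then return a dict keyed by the chapter numbers that were actually found, which can be written to a file afterwards. Passing a different `fetch` function makes it possible to exercise the skipping logic without hitting the site.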