I'm trying to learn how to scrape data from a web page in Python, and I'm having trouble with how to structure a nested loop. I got some help earlier with part of this problem (how to pull the links out of the "a" tags). I'm trying to get that code to loop over the pages for different weeks (and eventually years). What I currently have is below, but it doesn't iterate over the two weeks I want and save the results.
import requests, re, json
import pandas as pd
from bs4 import BeautifulSoup

weeks = ['1', '2']
data = pd.DataFrame(columns=['Teams', 'Link'])
scripts_head = soup.find('head').find_all('script')
all_links = {}
for i in weeks:
    r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2018/seasontype/2/week/' + i)
    soup = BeautifulSoup(r.text, 'html.parser')
    for script in scripts_head:
        if 'window.espn.scoreboardData' in script.text:
            json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
            for event in json_scoreboard['events']:
                name = event['name']
                for link in event['links']:
                    if link['text'] == 'Gamecast':
                        gamecast = link['href']
                        all_links[name] = gamecast
    # Save data to dataframe
    data2 = pd.DataFrame(list(all_links.items()), columns=['Teams', 'Link'])
    # Append new data to existing data
    data = data.append(data2, ignore_index=True)
# Save dataframe with all links to csv for future use
data.to_csv(r'game_id_data.csv')
EDIT: To add some clarification: it creates duplicates of the data from one week and appends them to the end over and over. I have also edited the code to include the proper libraries, so it should be possible to copy, paste, and run it in Python.
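For reference, nothing changes between weeks because `scripts_head` is built once, before the loop, from a `soup` that is only assigned inside the loop, so the same (or an undefined) `<head>` is searched on every pass. One way to sketch the fix is to move the per-page parsing into a helper that is called on each week's HTML (`extract_gamecast_links` is a hypothetical name, not part of the original code; the regex and JSON structure are taken from the question):

```python
import re, json
from bs4 import BeautifulSoup

def extract_gamecast_links(html):
    """Parse one scoreboard page's HTML and return {event name: Gamecast href}.

    Hypothetical helper; the selector, marker string, and regex are
    copied from the question's code."""
    soup = BeautifulSoup(html, 'html.parser')
    links = {}
    head = soup.find('head')
    if head is None:
        return links
    for script in head.find_all('script'):
        if 'window.espn.scoreboardData' in script.text:
            # Grab the first {...}; blob and parse it as JSON
            data = json.loads(re.search(r'({.*?});', script.text).group(1))
            for event in data['events']:
                for link in event['links']:
                    if link['text'] == 'Gamecast':
                        links[event['name']] = link['href']
    return links
```

Calling this once per `requests.get` inside the loop guarantees each week's freshly fetched page is the one being searched.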
The workaround I came up with is below. I still get duplicate game IDs in my final dataset, but at least I'm iterating over the entire desired set and getting all of them. Then at the end I de-duplicate.
import requests, re, json
from bs4 import BeautifulSoup
import csv
import pandas as pd

years = ['2015', '2016', '2017', '2018']
weeks = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14']
data = pd.DataFrame(columns=['Teams', 'Link'])
all_links = {}
for year in years:
    for i in weeks:
        r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/' + year + '/seasontype/2/week/' + i)
        soup = BeautifulSoup(r.text, 'html.parser')
        scripts_head = soup.find('head').find_all('script')
        for script in scripts_head:
            if 'window.espn.scoreboardData' in script.text:
                json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
                for event in json_scoreboard['events']:
                    name = event['name']
                    for link in event['links']:
                        if link['text'] == 'Gamecast':
                            gamecast = link['href']
                            all_links[name] = gamecast
        # Save data to dataframe
        data2 = pd.DataFrame(list(all_links.items()), columns=['Teams', 'Link'])
        # Append new data to existing data
        data = data.append(data2, ignore_index=True)
# Save dataframe with all links to csv for future use
data_test = data.drop_duplicates(keep='first')
data_test.to_csv(r'all_years_deduped.csv')
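The duplicates come from `all_links` never being cleared: each iteration rebuilds `data2` from the whole accumulated dict and appends it again. A sketch of a loop shape that avoids this entirely (and also avoids `DataFrame.append`, which was removed in pandas 2.0) is to collect one frame per week and concatenate once at the end. Here `fetch_week_links` is a hypothetical stand-in for the requests/BeautifulSoup code above, injected so the structure is clear:

```python
import pandas as pd

def collect_links(years, weeks, fetch_week_links):
    """Build one DataFrame of (Teams, Link) rows across all years/weeks.

    `fetch_week_links(year, week)` is a hypothetical callable standing in
    for the scraping code above; it should return {event name: href}
    for that single week only.
    """
    frames = []
    for year in years:
        for week in weeks:
            week_links = fetch_week_links(year, week)  # fresh dict per page
            frames.append(pd.DataFrame(list(week_links.items()),
                                       columns=['Teams', 'Link']))
    # One concat at the end instead of repeated appends
    return pd.concat(frames, ignore_index=True)
```

Because each week contributes its rows exactly once, no `drop_duplicates` pass is needed afterwards (unless the site itself repeats a game across weeks).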