I am trying to extract information from the BBC Good Food website, but I am having trouble narrowing down the data I collect.
This is what I have so far:
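from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(webpage.content, 'html.parser')
links = soup.find_all("a")
for anchor in links:
    # Print each link target together with its visible text
    print(anchor.get('href'), anchor.text)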
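This returns every link on the page along with its text description, but I only want the links from the "article" type objects on the page. Those are the links to the individual recipes.
With some experimenting I managed to return the text from the articles, but I cannot seem to extract the links.
The only two things I can see associated with the article tag are the href and img.src: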
from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
soup = BeautifulSoup(webpage.content, 'html.parser')
links = soup.find_all("article")
for ele in links:
    # First <a> and first <img> found inside each <article>
    print(ele.a["href"])
    print(ele.img["src"])
The links are in the h2 elements with class="node-title":
from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
soup = BeautifulSoup(webpage.content, 'html.parser')
# Restrict the search to the results container, then take each recipe title heading
links = soup.find("div", {"class": "main row grid-padding"}).find_all("h2", {"class": "node-title"})
for l in links:
    print(l.a["href"])
/recipes/681646/tomato-tart
/recipes/4468/stuffed-tomatoes
/recipes/1641/charred-tomatoes
/recipes/tomato-confit
/recipes/1575635/roast-tomatoes
/recipes/2536638/tomato-passata
/recipes/2518/cherry-tomatoes
/recipes/681653/stuffed-tomatoes
/recipes/2852676/tomato-sauce
/recipes/2075/tomato-soup
/recipes/339605/tomato-sauce
/recipes/2130/essence-of-tomatoes-
/recipes/2942/tomato-tarts
/recipes/741638/fried-green-tomatoes-with-ripe-tomato-salsa
/recipes/3509/honey-and-thyme-tomatoes
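The same narrowing can also be written as a CSS selector, which skips any title heading that has no link instead of raising an error. This is only a sketch: SAMPLE_HTML is a made-up, simplified stand-in for the search page (the real markup may differ), and recipe_links is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the search results markup.
SAMPLE_HTML = """
<div class="main row grid-padding">
  <article><h2 class="node-title"><a href="/recipes/681646/tomato-tart">Tomato tart</a></h2></article>
  <article><h2 class="node-title"><a href="/recipes/2075/tomato-soup">Tomato soup</a></h2></article>
  <article><h2 class="node-title">Title without a link</h2></article>
</div>
<a href="/about">About</a>
"""


def recipe_links(html):
    """Return the href of every <a> nested inside an <h2 class="node-title">."""
    soup = BeautifulSoup(html, "html.parser")
    # "a[href]" only matches anchors that actually carry an href attribute
    return [a["href"] for a in soup.select("h2.node-title a[href]")]


print(recipe_links(SAMPLE_HTML))
```

With the live page, webpage.content would be passed in place of SAMPLE_HTML.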
To request them, you first need to prepend http://www.bbcgoodfood.com:
for l in links:
    # status_code holds the HTTP status of each response
    print(requests.get("http://www.bbcgoodfood.com{}".format(l.a["href"])).status_code)
200
200
200
200
200
200
200
200
200
200
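As an alternative to string formatting, the relative paths can be resolved with urljoin from the standard library, which also leaves an already absolute href untouched. A minimal sketch, where absolute_urls and BASE_URL are illustrative names rather than anything from the site or the libraries above:

```python
from urllib.parse import urljoin

BASE_URL = "http://www.bbcgoodfood.com"


def absolute_urls(hrefs, base=BASE_URL):
    """Resolve each href against the site root; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


for url in absolute_urls(["/recipes/2075/tomato-soup", "/recipes/tomato-confit"]):
    print(url)
```

The resulting URLs can then be passed to requests.get as in the loop above.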