Objective:
I am trying to scrape hundreds of web pages, specifically the ingredients for the recipe on each. Take as an example a page containing the recipe for an Egg Sandwich (url). I'm using several Python dependencies, including BeautifulSoup, splinter.Browser, and ChromeDriverManager.
Expected output:
Once I have scraped the ingredients, I would like to save them in a dictionary. Example below -
recipes = {"quick_and_easy_egg_salad_sandwich_recipe":
    ['1-2 tablespoons mayonnaise (to taste)',
     '2 tablespoons chopped celery',
     '2 slices white, wheat, multigrain, or rye bread, toasted or plain']}
What I've achieved:
1. I have been able to determine roughly (through the Web Inspector) what I need to focus on: it looks like each ingredient has its own <li class='ingredient'>, but either I have misinterpreted the hierarchy or my code is incorrect.
2. My code is as follows -
import time
from bs4 import BeautifulSoup
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager

executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path)
webpage_url = 'https://www.simplyrecipes.com/recipes/egg_salad_sandwich/'
browser.visit(webpage_url)
time.sleep(1)
website_html = browser.html
website_soup = BeautifulSoup(website_html, 'html.parser')
ingredients = website_soup.find('h3', class_="Ingredients")
ingredientsList = ingredients.find('li', class_ = "ingredient")
print({ingredients})
When I attempt to print ingredients, I get an AttributeError: 'NoneType' object has no attribute 'find'
I know my code is flawed, but I don't know how to approach this and wondered if anyone has any suggestions.
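For context on the error itself: BeautifulSoup's find() returns None when nothing matches, so chaining a second .find() onto a failed lookup raises exactly this AttributeError. A minimal sketch (using made-up sample HTML, not the live page) showing the failure mode and a None guard:

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the real page; class names here are
# illustrative assumptions, not the site's actual structure.
html = """
<div class="recipe-callout">
  <h2>Quick and Easy Egg Salad Sandwich</h2>
  <ul>
    <li class="ingredient">2 slices bread</li>
    <li class="ingredient">1 tablespoon mayonnaise</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# No <h3 class="Ingredients"> exists in this markup, so find() returns
# None; calling .find() on that None is what raises AttributeError.
missing = soup.find("h3", class_="Ingredients")
print(missing)

# Guard before chaining so the script fails gracefully instead.
section = soup.find("div", class_="recipe-callout")
items = []
if section is not None:
    items = [li.get_text(strip=True)
             for li in section.find_all("li", class_="ingredient")]
print(items)
```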
Try this:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.simplyrecipes.com/recipes/egg_salad_sandwich/")
soup = BeautifulSoup(resp.text, "html.parser")

# The ingredients sit inside a <div class="recipe-callout">, not under an <h3>
div_ = soup.find("div", attrs={"class": "recipe-callout"})
recipes = {
    "_".join(div_.find("h2").text.split()):
        [x.text for x in div_.find_all("li", attrs={"class": "ingredient"})]
}
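Since the goal is hundreds of pages, it may help to split parsing from fetching so one bad page doesn't crash the whole run. A sketch along the lines of the answer above, assuming the same "recipe-callout"/"ingredient" class names (which may have changed on the site); the key is lowercased here to match the expected-output format, and parse_recipe/scrape_many are hypothetical helper names:

```python
import requests
from bs4 import BeautifulSoup

def parse_recipe(html):
    """Extract (key, ingredients) from one page's HTML.
    Returns (None, []) if the expected block is missing."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", attrs={"class": "recipe-callout"})
    if block is None:
        return None, []
    # Build a snake_case dict key from the recipe heading.
    key = "_".join(block.find("h2").get_text().lower().split())
    items = [li.get_text(strip=True)
             for li in block.find_all("li", attrs={"class": "ingredient"})]
    return key, items

def scrape_many(urls):
    # Skip pages that fail to load or lack the expected markup
    # instead of crashing partway through hundreds of requests.
    recipes = {}
    for url in urls:
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue
        key, items = parse_recipe(resp.text)
        if key:
            recipes[key] = items
    return recipes
```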