Objective:
I am trying to scrape hundreds of web pages, specifically the ingredients for the recipe on each. Take as an example a page containing the recipe for an Egg Sandwich (url). I'm using several Python dependencies, including BeautifulSoup, splinter.Browser, and ChromeDriverManager.
Expected output:
Once I have scraped the ingredients, I would like to save them in a dictionary. Example below -
recipes = {"quick_and_easy_egg_salad_sandwich_recipe":
    ['1-2 tablespoons mayonnaise (to taste)',
     '2 tablespoons chopped celery',
     '2 slices white, wheat, multigrain, or rye bread, toasted or plain']}
What I've achieved:
1. I have been able to determine roughly (through the Web Inspector) what I need to focus on: it looks like each ingredient has its own <li class='ingredient'>, but either I have misinterpreted the hierarchy or my code is incorrect.
2. My code is as follows -
import time
from bs4 import BeautifulSoup
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager

executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path)
webpage_url = 'https://www.simplyrecipes.com/recipes/egg_salad_sandwich/'
browser.visit(webpage_url)
time.sleep(1)
website_html = browser.html
website_soup = BeautifulSoup(website_html, 'html.parser')
ingredients = website_soup.find('h3', class_="Ingredients")
ingredientsList = ingredients.find('li', class_ = "ingredient")
print({ingredients})
When I attempt to print ingredients, I get an AttributeError: 'NoneType' object has no attribute 'find'
I know my code is flawed, but I don't know how to approach this and wondered if anyone has any suggestions.
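For context on the error itself: BeautifulSoup's find() returns None when nothing matches, so chaining a second .find() onto a failed lookup raises exactly this AttributeError. A minimal sketch (using made-up sample HTML, not the live page) showing the failure mode and a None guard:

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the real page; class names here are
# illustrative assumptions, not the site's actual structure.
html = """
<div class="recipe-callout">
  <h2>Quick and Easy Egg Salad Sandwich</h2>
  <ul>
    <li class="ingredient">2 slices bread</li>
    <li class="ingredient">1 tablespoon mayonnaise</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# No <h3 class="Ingredients"> exists in this markup, so find() returns
# None; calling .find() on that None is what raises AttributeError.
missing = soup.find("h3", class_="Ingredients")
print(missing)

# Guard before chaining so the script fails gracefully instead.
section = soup.find("div", class_="recipe-callout")
items = []
if section is not None:
    items = [li.get_text(strip=True)
             for li in section.find_all("li", class_="ingredient")]
print(items)
```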
Try this:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.simplyrecipes.com/recipes/egg_salad_sandwich/")
soup = BeautifulSoup(resp.text, "html.parser")

# The ingredients sit inside a <div class="recipe-callout">, not under an <h3>
div_ = soup.find("div", attrs={"class": "recipe-callout"})
recipes = {
    "_".join(div_.find("h2").text.split()):
        [x.text for x in div_.find_all("li", attrs={"class": "ingredient"})]
}
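Since the goal is hundreds of pages, it may help to split parsing from fetching so one bad page doesn't crash the whole run. A sketch along the lines of the answer above, assuming the same "recipe-callout"/"ingredient" class names (which may have changed on the site); the key is lowercased here to match the expected-output format, and parse_recipe/scrape_many are hypothetical helper names:

```python
import requests
from bs4 import BeautifulSoup

def parse_recipe(html):
    """Extract (key, ingredients) from one page's HTML.
    Returns (None, []) if the expected block is missing."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", attrs={"class": "recipe-callout"})
    if block is None:
        return None, []
    # Build a snake_case dict key from the recipe heading.
    key = "_".join(block.find("h2").get_text().lower().split())
    items = [li.get_text(strip=True)
             for li in block.find_all("li", attrs={"class": "ingredient"})]
    return key, items

def scrape_many(urls):
    # Skip pages that fail to load or lack the expected markup
    # instead of crashing partway through hundreds of requests.
    recipes = {}
    for url in urls:
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue
        key, items = parse_recipe(resp.text)
        if key:
            recipes[key] = items
    return recipes
```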