Returning Items in scrapy's start_requests()

pintoch

I am writing a Scrapy spider that takes many URLs as input and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method.

Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which Scrapy does not allow. How can I work around this?

I have thought about catching these requests in a custom middleware that would turn them into dummy Response objects, which I could then convert into Item objects in the request callback, but any cleaner solution would be welcome.

Ruehri

I think using a spider middleware and overriding start_requests() would be a good start.

In your middleware, loop over all URLs in start_urls and use conditional statements to handle the different types of URL.
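The conditional step can be kept apart from Scrapy itself. The following is only a minimal sketch: the extension-to-category table and the names `OFFLINE_CATEGORIES`, `classify_offline` and `split_urls` are hypothetical, and it assumes that a file extension is enough to classify some URLs, which may not match the real rules.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical rules: a file extension that is enough to decide the category.
OFFLINE_CATEGORIES = {".pdf": "document", ".jpg": "image", ".png": "image"}


def classify_offline(url):
    """Return a category if the URL alone is enough to classify it, else None."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return OFFLINE_CATEGORIES.get(suffix)


def split_urls(urls):
    """Separate URLs that can be classified offline from those needing a download."""
    known, to_fetch = {}, []
    for url in urls:
        category = classify_offline(url)
        if category is None:
            to_fetch.append(url)
        else:
            known[url] = category
    return known, to_fetch


known, to_fetch = split_urls([
    "https://example.com/report.PDF",
    "https://example.com/page",
])
print(known)     # URLs classified without any request
print(to_fetch)  # URLs that still need a normal Request
```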

For the special URLs that do not require a request, you can either:
- call your pipeline's process_item() directly (remember to import your pipeline and create a scrapy.Item from the URL), or
- as you mentioned, pass the URL as meta in a Request and have a separate parse function that only returns the URL.

For all remaining URLs, you can issue a "normal" Request, as you have probably already defined.

Edited at 2021-02-25