Scraping many pages using scrapy

Prakhar Mohan Srivastava

I am trying to scrape multiple webpages using scrapy. The links of the pages look like:

http://www.example.com/id=some-number

On each subsequent page, the number at the end decreases by 1.
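Assuming the constants used in the code below (starting_number = 1000 and 500 pages), the full URL sequence can be sketched in plain Python:

```python
URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500

# page IDs count down from 1000; 500 pages means the last one is id=501
urls = [URL % i for i in range(starting_number, starting_number - number_of_pages, -1)]
```
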

So I am trying to build a spider which navigates to the other pages and scrapes them too. The code that I have is given below:

import scrapy
import requests
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def start_request(self):
        for i in range (starting_number, number_of_pages, -1):
            yield Request(url = URL % i, callback = self.parse)

    def parse(self, response):
        # parsing data from the webpage
        pass

This is running into an infinite loop: when I print the page number I get negative numbers. I think that is happening because I am requesting a page from within my parse() function.

But then the example given here works okay. Where am I going wrong?

paul trmbrth

The first page requested is "http://www.example.com/id=1000" (starting_number)

Its response goes through parse(), and with for i in range(0, 500): you are requesting http://www.example.com/id=999, http://www.example.com/id=998, http://www.example.com/id=997, ... http://www.example.com/id=500

self.page_number is a spider attribute, so when you're decrementing its value, you have self.page_number == 500 after the first parse().

So when Scrapy calls parse for the response of http://www.example.com/id=999, you're generating requests for http://www.example.com/id=499, http://www.example.com/id=498, http://www.example.com/id=497...http://www.example.com/id=0

You can guess what happens the 3rd time: http://www.example.com/id=-1, http://www.example.com/id=-2, ... http://www.example.com/id=-500

For each response, you're generating 500 requests.
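The runaway countdown can be simulated in plain Python (a sketch, assuming each parse() call decrements the shared spider-level counter 500 times, as described above):

```python
page_number = 1000  # shared spider attribute, decremented by every parse() call

# simulate three parsed responses, each running its 500-iteration request loop
for _ in range(3):
    for _ in range(500):
        page_number -= 1

print(page_number)  # 1000 - 3 * 500 = -500, so negative IDs get requested
```
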

You can stop the loop by testing self.page_number >= 0 before yielding a new request.
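A minimal sketch of that guard in plain Python (next_urls is a hypothetical helper, not Scrapy API; the request-yielding part is represented by collecting URLs):

```python
URL = "http://www.example.com/id=%d"

def next_urls(page_number, batch=500):
    """Collect up to `batch` URLs, stopping once the counter goes below 0."""
    urls = []
    for _ in range(batch):
        if page_number < 0:  # the guard: never request negative IDs
            break
        urls.append(URL % page_number)
        page_number -= 1
    return urls, page_number
```

Starting at page_number == 1, for example, this yields only id=1 and id=0 and then stops instead of counting into negative IDs.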


Edit after OP question in comments:

No need for multiple threads: Scrapy works asynchronously, and you can enqueue all your requests in an overridden start_requests() method (instead of requesting 1 page and then returning Request instances from the parse method). Scrapy will take enough requests to fill its pipeline, parse the pages, pick new requests to send, and so on.

See start_requests documentation.

Something like this would work:

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self, *args, **kwargs):
        super(FinalSpider, self).__init__(*args, **kwargs)
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # parsing data from the webpage
        pass
