Get scrapy spider to crawl entire site

Lewis Smith

I am using Scrapy to crawl old sites that I own, with the code below as my spider. I don't mind whether the output is one file per web page or a database holding all the content. What I need is for the spider to crawl the whole site without my having to list every single URL, which is what I currently have to do.

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["www.example.com"]
    start_urls = [
        "http://www.example.com/contactus"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
Daniil Mashkin

To crawl the whole site, use CrawlSpider instead of scrapy.Spider. A CrawlSpider follows the links it finds on each page according to its rules, so you only need to give it a starting URL.

For your purposes, try something like this:

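from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # Follow every in-domain link and pass each fetched page to parse_item.
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Name the file after the URL path so pages do not overwrite each other.
        # split("/")[-2] returns the host name for URLs without a trailing slash.
        path = urlparse(response.url).path.strip('/')
        filename = (path.replace('/', '_') or 'index') + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)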
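
Since a full crawl writes one file per page, the filename has to be unique for each URL. Below is a minimal sketch of one way to derive it, using only the standard library. The helper name `url_to_filename` and the hash-suffix scheme for query strings are assumptions made for illustration, not part of Scrapy or of the answer above.

```python
import hashlib
import re
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Map a page URL to a flat, filesystem-safe .html filename (hypothetical helper)."""
    parsed = urlparse(url)
    # Use the path without leading/trailing slashes; the site root becomes "index".
    stem = parsed.path.strip("/") or "index"
    # Avoid names like "about.html.html" when the path already ends in .html.
    if stem.endswith(".html"):
        stem = stem[: -len(".html")]
    # Replace anything outside a conservative character set, including "/".
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem)
    if parsed.query:
        # Pages that differ only by query string get a short hash suffix.
        digest = hashlib.sha256(parsed.query.encode("utf-8")).hexdigest()[:8]
        stem = f"{stem}_{digest}"
    return stem + ".html"
```

Inside the spider callback, `filename = url_to_filename(response.url)` could then stand in for the line that builds the filename from `response.url`.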

Also, take a look at this article
