Getting the functionality of the scrapy crawl command when running a spider from a script

loremIpsum1771

I have written a crawl spider within a Scrapy project that properly scrapes data from a URL and pipelines the response into a PostgreSQL table, but only when the scrapy crawl command is used. When the spider is run from a script in the root directory of the project, only the parse method of the spider class appears to be called, since the table is not created when the script is run with the plain python command. I think the problem is that the crawl command has a specific protocol for locating and loading modules in the directory above the spiders package (e.g. the models, pipelines, and settings modules), and these aren't being loaded when the spider is run from a script.

I followed the directions included in the docs, but they don't seem to address pipelining data after it is scraped. This raises the question of whether I should even be trying to run the spider from a script, or whether I should somehow use the scrapy crawl command instead. The problem is that I planned to run the spider from a Django project when the user submits text in a form, which led me to this SO post, but the provided answer doesn't seem to address my problem. I would also need to pass the text from the form so it can be added to the spider URL (I was previously just using raw_input to build the URL). How should I properly go about running the spider? I have the code for the script and the spider below if they are needed. Any help/code provided would be appreciated, thanks.

script file

from ticket_city_scraper import *
from ticket_city_scraper.spiders import tc_spider 

tc_spider.spiderCrawl()

spider file

import scrapy
import re
import json
from scrapy.crawler import CrawlerProcess
from scrapy import Request
from scrapy.contrib.spiders import CrawlSpider , Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from ticket_city_scraper.items import ComparatorItem
from urlparse import urljoin

bandname = raw_input("Enter bandname\n")
tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"  

class MySpider3(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]

    start_urls = [tc_url]
    tickets_list_xpath = './/div[@class = "vevent"]'
    def create_link(self, bandname):
        tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"  
        self.start_urls = [tc_url]
        #return tc_url      

    def parse_json(self, response):
        loader = response.meta['loader']
        jsonresponse = json.loads(response.body_as_unicode())
        ticket_info = jsonresponse.get('B')
        price_list = [i.get('P') for i in ticket_info]
        if len(price_list) > 0:
            str_Price = str(price_list[0])
            ticketPrice = unicode(str_Price, "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        else:
            ticketPrice = unicode("sold out", "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        return loader.load_item()

    def parse_price(self, response):
        print "parse price function entered \n"
        loader = response.meta['loader']
        event_City = response.xpath('.//span[@itemprop="addressLocality"]/text()').extract() 
        eventCity = ''.join(event_City) 
        loader.add_value('eventCity' , eventCity)
        event_State = response.xpath('.//span[@itemprop="addressRegion"]/text()').extract() 
        eventState = ''.join(event_State) 
        loader.add_value('eventState' , eventState) 
        event_Date = response.xpath('.//span[@class="event_datetime"]/text()').extract() 
        eventDate = ''.join(event_Date)  
        loader.add_value('eventDate' , eventDate)    
        ticketsLink = loader.get_output_value("ticketsLink")
        json_id_list= re.findall(r"(\d+)[^-]*$", ticketsLink)
        json_id=  "".join(json_id_list)
        json_url = "https://www.ticketcity.com/Catalog/public/v1/events/" + json_id + "/ticketblocks?P=0,99999999&q=0&per_page=250&page=1&sort=p.asc&f.t=s&_=1436642392938"
        yield scrapy.Request(json_url, meta={'loader': loader}, callback = self.parse_json, dont_filter = True) 

    def parse(self, response):
        """
        # """
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            loader.add_xpath('eventName' , './/span[@class="summary listingEventName"]/text()')
            loader.add_xpath('eventLocation' , './/div[@class="divVenue location"]/text()')
            loader.add_xpath('ticketsLink' , './/a[@class="divEventDetails url"]/@href')
            #loader.add_xpath('eventDateTime' , '//div[@id="divEventDate"]/@title') #datetime type
            #loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')

            print "Here is ticket link \n" + loader.get_output_value("ticketsLink")
            #sel.xpath("//span[@id='PractitionerDetails1_Label4']/text()").extract()
            ticketsURL = "https://www.ticketcity.com/" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback = self.parse_price, dont_filter = True)

def spiderCrawl():
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider3)
    process.start()
MrPandav

To Answer Your Question

  1. Scrapy does not differentiate between execution via the crawl command and execution from a script.

The only part (and difference) that you are missing is:

  1. The scrapy crawl command must always be executed from within the project directory, where the scrapy.cfg file is located. If you look closely, scrapy.cfg tells Scrapy where the settings file is located, and the settings file is the central place where all your project-specific settings live: cache policy, pipelines, header settings, proxy settings, etc. So while using scrapy crawl, all of these settings are loaded internally (see the sketch after this list).
  2. When executing Scrapy from a script, you are just providing the location of the spider and running it, without any of your custom settings from the settings.py file.
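
As a rough sketch (not part of the original answer), here is one way a standalone script can locate the same settings that scrapy crawl loads; the module path ticket_city_scraper.settings is an assumption based on the imports in the question:

import os
from scrapy.utils.project import get_project_settings

# Either run the script from the directory that contains scrapy.cfg, or point
# Scrapy at the settings module explicitly before calling get_project_settings():
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'ticket_city_scraper.settings')

settings = get_project_settings()
# If the project settings were found, ITEM_PIPELINES (and therefore the
# postgresql pipeline) will show up here, just as it does with scrapy crawl.
print(settings.get('ITEM_PIPELINES'))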

For these settings to take effect, create the CrawlerProcess object with the project settings:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
process = CrawlerProcess(settings)
process.crawl(MySpider3)
process.start()
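
For the other part of the question (feeding the text from the Django form into the spider URL instead of calling raw_input at import time), a hedged sketch: keyword arguments passed to process.crawl() are forwarded to the spider's __init__, so the spider can build its own start_urls. The bandname parameter and URL format below simply mirror the question's code.

# Sketch only: the spider accepts bandname as a constructor argument.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class MySpider3(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]
    tickets_list_xpath = './/div[@class = "vevent"]'
    # ... parse, parse_price and parse_json as in the question ...

    def __init__(self, bandname=None, *args, **kwargs):
        super(MySpider3, self).__init__(*args, **kwargs)
        self.start_urls = [
            "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"
        ]

# and in the script (or in the Django view handling the form):
def spiderCrawl(bandname):
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(MySpider3, bandname=bandname)   # bandname reaches __init__
    process.start()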
