Why isn't my Scrapy spider working as expected?

Posted by zgall1 in Dev

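When I run the code below, I end up with a file that contains all of the expected data from the second block of code but none from the first. In other words, every field from eventLocation to eventURL is populated, but nothing from eventArtist to eventDetails is. What do I need to change to make this work?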

import urlparse
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
#from NT.items import NowTorontoItem
from scrapy.item import Item, Field

class NowTorontoItem(Item):
    eventArtist = Field()
    eventTitle = Field()
    eventHolder = Field()
    eventDetails = Field()
    eventLocation = Field()
    eventOrganization = Field()
    eventName = Field()
    eventAddress = Field()
    eventLocality = Field()
    eventPostalCode = Field()
    eventPhone = Field()
    eventURL = Field()

class MySpider(BaseSpider):
    name = "NTSpider"
    allowed_domains = ["nowtoronto.com"]
    start_urls = ["http://www.nowtoronto.com/music/listings/"]

    def parse(self, response):
        selector = Selector(response)
        listings = selector.css("div.listing-item0, div.listing-item1")

        for listing in listings:
            item = NowTorontoItem()
            for body in listing.css('span.listing-body > div.List-Body'):
                item ["eventArtist"] = body.css("span.List-Name::text").extract()
                item ["eventTitle"] = body.css("span.List-Body-Emphasis::text").extract()
                item ["eventHolder"] = body.css("span.List-Body-Strong::text").extract()
                item ["eventDetails"] = body.css("::text").extract()


            # yield a Request()
            # so that scrapy enqueues a new page to fetch
            detail_url = listing.css("div.listing-readmore > a::attr(href)")

            if detail_url:
                yield Request(urlparse.urljoin(response.url,
                                               detail_url.extract()[0]),
                              callback=self.parse_details)

    def parse_details(self, response):
        self.log("parse_details: %r" % response.url)
        selector = Selector(response)
        listings = selector.css("div.whenwhereContent")

        for listing in listings:
            for body in listing.css('td.small-txt.dkgrey-txt.rightInfoTD'):
                item = NowTorontoItem()
                item ["eventLocation"] = body.css("span[property='v:location']::text").extract()
                item ["eventOrganization"] = body.css("span[property='v:organization'] span[property='v:name']::text").extract()
                item ["eventName"] = body.css("span[property='v:name']::text").extract()
                item ["eventAddress"] = body.css("span[property='v:street-address']::text").extract()
                item ["eventLocality"] = body.css("span[property='v:locality']::text").extract()
                item ["eventPostalCode"] = body.css("span[property='v:postal-code']::text").extract()
                item ["eventPhone"] = body.css("span[property='v:tel']::text").extract()
                item ["eventURL"] = body.css("span[property='v:url']::text").extract()
                yield item

Edit

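It now seems to be working, with one small problem. For each event it returns either two rows (one with all of the details and one with only the details extracted in the first block of code) or three rows (one with all of the details and two identical rows with only the details extracted in the first block of code).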

Here is an example of the first case:

2014-03-21 11:12:40-0400 [NTSpider] DEBUG: parse_details: 'http://www.nowtoronto.com/music/listings/listing.cfm?listingid=129761&subsection=&category=&criticspicks=&date1=&date2=&locationId=0'
2014-03-21 11:12:40-0400 [NTSpider] DEBUG: Scraped from <200 http://www.nowtoronto.com/music/listings/listing.cfm?listingid=129761&subsection=&category=&criticspicks=&date1=&date2=&locationId=0>
    {'eventAddress': [u'875 Bloor W'],
     'eventArtist': [u'Andria Simone & Those Guys'],
     'eventDetails': [u'Andria Simone & Those Guys',
                      u' (pop/soul) ',
                      u'Baltic Avenue',
                      u' 8 pm, $15.'],
     'eventHolder': [u'Baltic Avenue'],
     'eventLocality': [u'Toronto'],
     'eventLocation': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t'],
     'eventName': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBaltic Avenue'],
     'eventOrganization': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBaltic Avenue'],
     'eventPhone': [u'647-898-5324'],
     'eventPostalCode': [u'M6G 3T6'],
     'eventTitle': [],
     'eventURL': []}
2014-03-21 11:12:40-0400 [NTSpider] DEBUG: Scraped from <200 http://www.nowtoronto.com/music/listings/listing.cfm?listingid=129761&subsection=&category=&criticspicks=&date1=&date2=&locationId=0>
    {'eventAddress': [],
     'eventArtist': [u'Andria Simone & Those Guys'],
     'eventDetails': [u'Andria Simone & Those Guys',
                      u' (pop/soul) ',
                      u'Baltic Avenue',
                      u' 8 pm, $15.'],
     'eventHolder': [u'Baltic Avenue'],
     'eventLocality': [],
     'eventLocation': [],
     'eventName': [],
     'eventOrganization': [],
     'eventPhone': [],
     'eventPostalCode': [],
     'eventTitle': [],
     'eventURL': []}

Here is an example of the second case:

2014-03-21 11:21:23-0400 [NTSpider] DEBUG: parse_details: 'http://www.nowtoronto.com/music/listings/listing.cfm?listingid=130096&subsection=&category=&criticspicks=&date1=&date2=&locationId=0'
2014-03-21 11:21:23-0400 [NTSpider] DEBUG: Scraped from <200 http://www.nowtoronto.com/music/listings/listing.cfm?listingid=130096&subsection=&category=&criticspicks=&date1=&date2=&locationId=0>
    {'eventAddress': [u'11 Polson'],
     'eventArtist': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy '],
     'eventDetails': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy ',
                      u'Bassweek: Projek-Hospitality ',
                      u'Sound Academy',
                      u' $35 or wristband TM.'],
     'eventHolder': [u'Sound Academy'],
     'eventLocality': [u'Toronto'],
     'eventLocation': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t'],
     'eventName': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tSound Academy'],
     'eventOrganization': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tSound Academy'],
     'eventPhone': [u'416-461-3625'],
     'eventPostalCode': [u'M5A 1A4'],
     'eventTitle': [u'Bassweek: Projek-Hospitality '],
     'eventURL': [u'sound-academy.com']}
2014-03-21 11:21:23-0400 [NTSpider] DEBUG: Crawled (200) <GET http://www.nowtoronto.com/music/listings/listing.cfm?listingid=122291&subsection=&category=&criticspicks=&date1=&date2=&locationId=0> (referer: http://www.nowtoronto.com/music/listings/)
2014-03-21 11:21:24-0400 [NTSpider] DEBUG: Scraped from <200 http://www.nowtoronto.com/music/listings/listing.cfm?listingid=130096&subsection=&category=&criticspicks=&date1=&date2=&locationId=0>
    {'eventAddress': [],
     'eventArtist': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy '],
     'eventDetails': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy ',
                      u'Bassweek: Projek-Hospitality ',
                      u'Sound Academy',
                      u' $35 or wristband TM.'],
     'eventHolder': [u'Sound Academy'],
     'eventLocality': [],
     'eventLocation': [],
     'eventName': [],
     'eventOrganization': [],
     'eventPhone': [],
     'eventPostalCode': [],
     'eventTitle': [u'Bassweek: Projek-Hospitality '],
     'eventURL': []}
2014-03-21 11:21:24-0400 [NTSpider] DEBUG: Scraped from <200 http://www.nowtoronto.com/music/listings/listing.cfm?listingid=130096&subsection=&category=&criticspicks=&date1=&date2=&locationId=0>
    {'eventAddress': [],
     'eventArtist': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy '],
     'eventDetails': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy ',
                      u'Bassweek: Projek-Hospitality ',
                      u'Sound Academy',
                      u' $35 or wristband TM.'],
     'eventHolder': [u'Sound Academy'],
     'eventLocality': [],
     'eventLocation': [],
     'eventName': [],
     'eventOrganization': [],
     'eventPhone': [],
     'eventPostalCode': [],
     'eventTitle': [u'Bassweek: Projek-Hospitality '],
     'eventURL': []}

You should pass the item from parse() to parse_details() in the meta argument of the Request:

yield Request(urlparse.urljoin(response.url,
              detail_url.extract()[0]),
              meta={'item': item},
              callback=self.parse_details)

Then, in parse_details(), you can get the item from response.meta['item'] (see the docs).

Also, you may want to yield the item directly if no details link was found:

if detail_url:
    yield Request(urlparse.urljoin(response.url,
                  detail_url.extract()[0]),
                  meta={'item': item},
                  callback=self.parse_details)
else:
    yield item
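Both snippets above build the absolute detail URL with urlparse.urljoin, which is a Python 2 module. On Python 3 the same function lives in urllib.parse. The short sketch below shows how a relative link is resolved against the listing page; the relative_href value is a hypothetical example of what the a::attr(href) selector might return.

```python
from urllib.parse import urljoin  # Python 3 location of Python 2's urlparse.urljoin

listing_page = "http://www.nowtoronto.com/music/listings/"
relative_href = "listing.cfm?listingid=129761"  # hypothetical href value

# Resolve the relative link against the page it was found on.
detail_url = urljoin(listing_page, relative_href)
print(detail_url)
# http://www.nowtoronto.com/music/listings/listing.cfm?listingid=129761
```

Recent Scrapy versions also provide response.urljoin(href), which resolves the link against the response URL without a separate import.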

Hope that helps.

Edited on 2021-02-08
