My spider runs fine. Everything works except for this:
# -*- coding: utf-8 -*-
import scrapy
from info.items import InfoItem

class HeiseSpider(scrapy.Spider):
    name = "heise"
    start_urls = ['https://www.heise.de/']

    def parse(self, response):
        print("Parse")
        yield scrapy.Request(response.url, callback=self.getSubList)

    def getSubList(self, response):
        item = InfoItem()
        print("Sub List: Will it work?")
        yield scrapy.Request('https://www.test.de/', callback=self.getScore, dont_filter=True)
        print("Should have")
        yield item

    def getScore(self, response):
        print("--------- Get Score ----------")
        print(response)
        return True
The output is:
Will it work?
Should have
Why is getScore never called? What am I doing wrong?
Edit: reduced the code to the bare-bones version above, which has the same problem - getScore is not called.
Just did a test run and it went through all the callbacks as expected:
...
2017-05-13 12:27:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.heise.de/> (referer: None)
Parse
2017-05-13 12:27:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.heise.de/> (referer: https://www.heise.de/)
Sub List: Will it work?
Should have
2017-05-13 12:27:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.test.de/> (referer: https://www.heise.de/)
--------- Get Score ----------
<200 https://www.test.de/>
2017-05-13 12:27:59 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'bool' in <GET https://www.test.de/>
2017-05-13 12:27:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-13 12:27:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 693,
...
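The ERROR on the last line comes from getScore's `return True`: a Scrapy callback must return or yield a Request, an item/dict, or None. A minimal sketch of a corrected callback (using a hypothetical stand-in response object so it can run outside Scrapy):

```python
class FakeResponse:
    """Stand-in for scrapy.http.Response, purely for illustration."""
    def __init__(self, url, status=200):
        self.url = url
        self.status = status

def getScore(self, response):
    # Yield an item (a plain dict is fine) instead of `return True`;
    # Scrapy rejects any other return type with the 'bool' error above.
    print("--------- Get Score ----------")
    print(response.url)
    yield {"url": response.url, "status": response.status}

# Example run with the stand-in response:
items = list(getScore(None, FakeResponse("https://www.test.de/")))
```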
Without any log output and with your settings.py missing, it's a bit of guesswork, but most likely your settings.py contains ROBOTSTXT_OBEY=True.
This means Scrapy will respect any restrictions imposed by a site's robots.txt file, and https://www.test.de has a robots.txt that disallows crawling.
So change the ROBOTSTXT line in settings.py to ROBOTSTXT_OBEY=False and it should work.
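In the project's settings.py, that change would look like:

```python
# settings.py
# Disable robots.txt enforcement so requests to sites like
# https://www.test.de/ are not filtered out by the RobotsTxtMiddleware.
ROBOTSTXT_OBEY = False
```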