I used scrapy-redis to build a simple distributed crawler. The slave machines need to read URLs from the master's Redis URL queue, but the problem is that the data the slaves get is cPickle-serialized, not plain URLs. How can I read proper URLs from the Redis URL queue? What do you suggest?
Example:
from scrapy_redis.spiders import RedisSpider
from example.items import ExampleLoader

class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'redisspider'
    redis_key = 'wzws:requests'

    def parse(self, response):
        el = ExampleLoader(response=response)
        el.add_xpath('name', '//title[1]/text()')
        el.add_value('url', response.url)
        return el.load_item()
MySpider inherits from RedisSpider. When I run scrapy runspider myspider_redis.py, it fails with a "not legal URL" error.
scrapy-redis GitHub repository: scrapy-redis
scrapy-redis uses a few internal queues. One is for start URLs (by default <spider>:start_urls), another for shared requests (by default <spider>:requests), and another for the dupefilter.

The start-urls queue and the requests queue can't be the same: the start-urls queue expects plain string values, while the requests queue holds pickled data. So you should not be using <spider>:requests as redis_key in the spider.
Let me know if this helps; otherwise, please share the messages stored in the redis_key queue.
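To see why pointing redis_key at the requests queue breaks, here is a minimal sketch using only the standard library (no live Redis needed). The dict below is a simplified stand-in for a serialized request object, not scrapy-redis's actual wire format; the point is that pickled entries are binary blobs, while the start-urls queue holds plain URL strings:

```python
import pickle

# What a <spider>:requests entry looks like in spirit: a pickled
# (serialized) object. Treating these bytes as a URL gives the
# "not legal URL" error, because they are binary data.
request = {'url': 'http://example.com', 'callback': 'parse'}
pickled_entry = pickle.dumps(request)
print(pickled_entry.startswith(b'http'))  # False: not a usable URL

# What a <spider>:start_urls entry should be: one plain URL string.
start_url_entry = 'http://example.com'
print(start_url_entry.startswith('http'))  # True
```

In practice that means setting redis_key to a start-urls key (e.g. myspider:start_urls, matching your spider's name) and seeding it with plain strings, for example from redis-cli: lpush myspider:start_urls http://example.com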