I used scrapy-redis to build a simple distributed crawler. The slave machines need to read URLs from the master's Redis URL queue, but the problem is that the data the slaves get is cPickle-serialized, not plain URLs. How can I read proper URLs from the Redis URL queue? What do you suggest?
Example:
from scrapy_redis.spiders import RedisSpider
from example.items import ExampleLoader

class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'redisspider'
    redis_key = 'wzws:requests'

    def parse(self, response):
        el = ExampleLoader(response=response)
        el.add_xpath('name', '//title[1]/text()')
        el.add_value('url', response.url)
        return el.load_item()
MySpider inherits from RedisSpider. When I run scrapy runspider myspider_redis.py, it fails with a "not legal URL" error.
scrapy-redis GitHub repository: scrapy-redis
scrapy-redis uses a few internal queues. One is for start URLs (by default <spider>:start_urls), another for shared requests (by default <spider>:requests), and another for the dupefilter.

The start-urls queue and the requests queue can't be the same: the start-urls queue expects plain string values, while the requests queue holds pickled data. So you should not be using <spider>:requests as redis_key in the spider.
Let me know if this helps; otherwise, please share the messages stored in the redis_key queue.
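To see why pointing redis_key at the requests queue breaks, here is a minimal sketch using only the standard library (no live Redis needed). The dict below is a simplified stand-in for a serialized request object, not scrapy-redis's actual wire format; the point is that pickled entries are binary blobs, while the start-urls queue holds plain URL strings:

```python
import pickle

# What a <spider>:requests entry looks like in spirit: a pickled
# (serialized) object. Treating these bytes as a URL gives the
# "not legal URL" error, because they are binary data.
request = {'url': 'http://example.com', 'callback': 'parse'}
pickled_entry = pickle.dumps(request)
print(pickled_entry.startswith(b'http'))  # False: not a usable URL

# What a <spider>:start_urls entry should be: one plain URL string.
start_url_entry = 'http://example.com'
print(start_url_entry.startswith('http'))  # True
```

In practice that means setting redis_key to a start-urls key (e.g. myspider:start_urls, matching your spider's name) and seeding it with plain strings, for example from redis-cli: lpush myspider:start_urls http://example.com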