I'm trying to fetch a file from a password-protected FTP server. This is the code I'm using:
import scrapy
from scrapy.contrib.spiders import XMLFeedSpider
from scrapy.http import Request
from crawler.items import CrawlerItem

class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['ftp.site.co.uk']
    itertag = 'item'

    def start_requests(self):
        yield Request('ftp.site.co.uk/feed.xml',
                      meta={'ftp_user': 'test', 'ftp_password': 'test'})

    def parse_node(self, response, selector):
        item = CrawlerItem()
        item['title'] = (selector.xpath('//title/text()').extract() or [''])[0]
        return item
This is the traceback I get:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1192, in run
    self.mainLoop()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 112, in _next_request
    request = next(slot.start_requests)
  File "/var/www/spider/crawler/spiders/site.py", line 13, in start_requests
    meta={'ftp_user': 'test', 'ftp_password': 'test'})
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 26, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: ftp.site.co.uk/feed.xml
The URL you pass to Request has no scheme, which is what the ValueError is reporting. You need to add one:
ftp://ftp.site.co.uk
The FTP URL syntax is defined as:
ftp://[<user>[:<password>]@]<host>[:<port>]/<url-path>
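To make that syntax concrete, here is a minimal sketch that splits an FTP URL into its parts with `urllib.parse` from the Python 3 standard library. The password `secret` and port `21` are placeholder values chosen for illustration. They do not come from the question, which passes its credentials through `meta` instead of the URL.

```python
from urllib.parse import urlsplit

# Placeholder credentials and port, for illustration only.
full = urlsplit('ftp://test:secret@ftp.site.co.uk:21/feed.xml')

scheme = full.scheme      # 'ftp'
user = full.username      # 'test'
password = full.password  # 'secret'
host = full.hostname      # 'ftp.site.co.uk'
port = full.port          # 21 (as an int)
path = full.path          # '/feed.xml'

# The URL from the question has no scheme, so the whole string
# is treated as a path and the scheme comes back empty.
bare = urlsplit('ftp.site.co.uk/feed.xml')
bare_scheme = bare.scheme  # ''
bare_path = bare.path      # 'ftp.site.co.uk/feed.xml'
```

The empty scheme in the second case is the same condition the traceback complains about.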
Basically, you can do this:
yield Request('ftp://ftp.site.co.uk/feed.xml', ...)
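If the feed addresses come from a config file or user input, it may help to normalize them before building the request. The helper below is a hypothetical sketch, not part of Scrapy: `with_ftp_scheme` is a name made up for this example, and it assumes that any address without `://` is meant to be an FTP address.

```python
def with_ftp_scheme(url):
    """Return url unchanged if it already has a scheme, else prepend ftp://.

    Hypothetical helper: assumes scheme-less addresses are FTP hosts.
    """
    if '://' in url:
        return url
    return 'ftp://' + url


fixed = with_ftp_scheme('ftp.site.co.uk/feed.xml')
```

The result could then be passed as the first argument to `Request` in `start_requests`, with the `meta` dictionary left as it is in the question.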
Read more about URI schemes on Wikipedia: http://en.wikipedia.org/wiki/URI_scheme