Scrapy: "Missing scheme in request url"

星际飞船9

This is my code:

import scrapy
from scrapy.http import Request

class lyricsFetch(scrapy.Spider):
    name = "lyricsFetch"
    allowed_domains = ["metrolyrics.com"]


    print "\nEnter the name of the ARTIST of the song for which you want the lyrics for. Minimise the spelling mistakes, if possible."
    artist_name = raw_input('>')

    print "\nNow comes the main part. Enter the NAME of the song itself now. Again, try not to have any spelling mistakes."
    song_name = raw_input('>')


    artist_name = artist_name.replace(" ", "_")
    song_name = song_name.replace(" ","_")
    first_letter = artist_name[0]
    print artist_name
    print song_name

    start_urls = ["www.lyricsmode.com/lyrics/"+first_letter+"/"+artist_name+"/"+song_name+".html" ]

    print "\nParsing this link\t "+ str(start_urls)

    def start_requests(self):
        yield Request("www.lyricsmode.com/feed.xml")

    def parse(self, response):

        lyrics = response.xpath('//p[@id="lyrics_text"]/text()').extract()

        with open ("lyrics.txt",'wb') as lyr:
            lyr.write(str(lyrics))

        #yield lyrics

        print lyrics

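When I use scrapy shell I get the correct output, but whenever I try to run the script with scrapy crawl, I get a ValueError. What am I doing wrong? I have searched this site and others and found nothing. I got the idea of yielding a Request from another question here, but it still does not work. Any help?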

My traceback:

Enter the name of the ARTIST of the song for which you want the lyrics for. Minimise the spelling mistakes, if possible.
>bullet for my valentine

Now comes the main part. Enter the NAME of the song itself now. Again, try not to have any spelling mistakes.
>your betrayal
bullet_for_my_valentine
your_betrayal

Parsing this link        ['www.lyricsmode.com/lyrics/b/bullet_for_my_valentine/your_betrayal.html']
2016-01-24 19:58:25 [scrapy] INFO: Scrapy 1.0.3 started (bot: lyricsFetch)
2016-01-24 19:58:25 [scrapy] INFO: Optional features available: ssl, http11
2016-01-24 19:58:25 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'lyricsFetch.spiders', 'SPIDER_MODULES': ['lyricsFetch.spiders'], 'BOT_NAME': 'lyricsFetch'}
2016-01-24 19:58:27 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-24 19:58:28 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-24 19:58:28 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-24 19:58:28 [scrapy] INFO: Enabled item pipelines:
2016-01-24 19:58:28 [scrapy] INFO: Spider opened
2016-01-24 19:58:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-24 19:58:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-24 19:58:28 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Users\Nishank\Miniconda2\lib\site-packages\scrapy\core\engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "C:\Users\Nishank\Desktop\SNU\Python\lyricsFetch\lyricsFetch\spiders\lyricsFetch.py", line 26, in start_requests
    yield Request("www.lyricsmode.com/feed.xml")
  File "C:\Users\Nishank\Miniconda2\lib\site-packages\scrapy\http\request\__init__.py", line 24, in __init__
    self._set_url(url)
  File "C:\Users\Nishank\Miniconda2\lib\site-packages\scrapy\http\request\__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: www.lyricsmode.com/feed.xml
2016-01-24 19:58:28 [scrapy] INFO: Closing spider (finished)
2016-01-24 19:58:28 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 1, 24, 14, 28, 28, 231000),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2016, 1, 24, 14, 28, 28, 215000)}
2016-01-24 19:58:28 [scrapy] INFO: Spider closed (finished)
马克斯

As @tintin said, you are missing the http scheme in your URLs. Scrapy needs fully qualified URLs in order to process requests.

As far as I can see, you are missing the scheme in:

start_urls = ["www.lyricsmode.com/lyrics/ ...

yield Request("www.lyricsmode.com/feed.xml")
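
A minimal sketch of the fix for both places, assuming the site is served over plain http (the scheme is an assumption; use https if that is what the site actually serves). The helper names `build_lyrics_url` and `has_scheme` are hypothetical and not part of Scrapy, and the sketch is written for Python 3 with only the standard library, whereas the question runs on Python 2.

```python
from urllib.parse import urlparse

# Assumed scheme; the question's code left it out entirely.
BASE = "http://www.lyricsmode.com"


def has_scheme(url):
    """Return True if the URL carries an explicit http or https scheme."""
    return urlparse(url).scheme in ("http", "https")


def build_lyrics_url(artist_name, song_name):
    """Build the lyrics URL the way the question does, but with a scheme."""
    artist = artist_name.replace(" ", "_")
    song = song_name.replace(" ", "_")
    first_letter = artist[0]
    return "{}/lyrics/{}/{}/{}.html".format(BASE, first_letter, artist, song)


url = build_lyrics_url("bullet for my valentine", "your betrayal")
print(url)
print(has_scheme(url))                            # URL built with the scheme
print(has_scheme("www.lyricsmode.com/feed.xml"))  # URL from the traceback
```

The string returned by `build_lyrics_url` is what would go into `start_urls`, and the same prefix would be needed on the URL passed to `Request` in `start_requests`.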

If you are parsing URLs out of HTML content, you should use urljoin to make sure you end up with a fully qualified URL, for example:

next_url = response.urljoin(href)
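
To illustrate what that resolution does, here is a small sketch using the standard library's `urljoin`, which resolves a link against a base URL in much the same way `response.urljoin` resolves it against the page that was fetched. The page URL and the sample hrefs are assumptions made up for the illustration, and the code targets Python 3.

```python
from urllib.parse import urljoin

# Assumed URL of the page the links were extracted from.
page_url = "http://www.lyricsmode.com/lyrics/b/bullet_for_my_valentine/"

# Sample hrefs: relative to the page, relative to the site root, already absolute.
hrefs = ["your_betrayal.html", "/feed.xml", "http://example.com/other.html"]

resolved = [urljoin(page_url, href) for href in hrefs]
for link in resolved:
    print(link)
```

Each resolved link carries a scheme, so it can be passed to `Request` without raising the ValueError from the question.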
