I am scraping a web shop with Scrapy. The products are loaded dynamically, which is why I use Selenium to browse through the pages. I start by scraping all the categories, which are then used in the main function.
The problem appears when crawling each category: the spider is told to scrape all the data on the first page, then click a button to go to the next page, repeating until there is no button left. The code works fine if I put a single category URL in as the start_url, but strangely, when it runs inside the main code it does not click through all the pages. It randomly switches to a new category before it has finished clicking every NEXT button.
I have no idea why this happens.
import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher  # old-style signal hookup; newer Scrapy uses from_crawler
from horni.items import HorniItem
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class horniSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ["example.com"]
    start_urls = ['https://www.example.com']

    def __init__(self, *args, **kwargs):
        super(horniSpider, self).__init__(*args, **kwargs)
        # one shared browser for the whole crawl, closed when the spider finishes
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def parse(self, response):
        # follow every main category link
        for href in response.xpath('//li[@class="sub"]/a/@href'):
            item = HorniItem()
            item['maincategory'] = response.urljoin(href.extract())
            yield scrapy.Request(item['maincategory'], callback=self.parse_subcategories)

    def parse_subcategories(self, response):
        # follow every subcategory link
        for href in response.xpath('//li[@class="sub"]/a/@href'):
            item = HorniItem()
            item['subcategory'] = response.urljoin(href.extract())
            yield scrapy.Request(item['subcategory'], callback=self.parse_articles)

    def parse_articles(self, response):
        # the articles are rendered by JavaScript, so load the page in Selenium
        self.driver.get(response.url)
        response = TextResponse(url=self.driver.current_url,
                                body=self.driver.page_source, encoding='utf-8')
        item = HorniItem()
        item['title'] = response.xpath('//div[@id="article-list-headline"]/div/h1/text()').extract()
        yield item
        # scrape the first page, then keep clicking NEXT until the button disappears
        for item in self.parse_page(response):
            yield item
        while True:
            try:
                next_button = self.driver.find_element_by_xpath(
                    '//div[@class="paging-wrapper"]/a[@class="paging-btn right"]')
            except NoSuchElementException:
                break  # no NEXT button left: last page reached
            next_button.click()
            # note: no explicit wait here, so page_source may still show the old page
            response = TextResponse(url=self.driver.current_url,
                                    body=self.driver.page_source, encoding='utf-8')
            for item in self.parse_page(response):
                yield item

    def parse_page(self, response):
        # one result page: the article id is the second-to-last segment of the link URL
        ids = response.xpath('//a[@class="title-link"]/@href').extract()
        prices = response.xpath('//span[@class="price ng-binding"]/text()').extract()
        names = response.xpath('//a[@class="title-link"]/span[normalize-space()]/text()').extract()
        ids = [i.split('/')[-2] for i in ids]
        prices = [p for p in prices if p != u'\xa0']  # drop empty non-breaking-space prices
        names = [n.replace(u'\n', '') for n in names]
        for article_id, price, name in zip(ids, prices, names):
            item = HorniItem()
            item['id'] = article_id
            item['price'] = price
            item['name'] = name
            yield item
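For reference, the spider is started from the project directory in the usual way; the output file name here is only an example:

    scrapy crawl final -o items.json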
So it seems the problem lies with the DOWNLOAD_DELAY setting. Since clicking the NEXT button on the site does not generate a new URL but only executes JavaScript, the site's URL never changes.
I found the answer:
The problem is that, since the page content is generated dynamically, clicking the NEXT button does not actually change the URL. Combined with the project's DOWNLOAD_DELAY setting, this means the spider stays on a given page for a fixed amount of time, regardless of whether it managed to click every available NEXT button.
Setting DOWNLOAD_DELAY high enough makes the spider stay on each URL long enough to click through and crawl every page.
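A minimal sketch of how the delay could be raised for just this spider via Scrapy's per-spider custom_settings; the value of 10 seconds is an assumption and has to cover the slowest category's click-through time:

    class horniSpider(scrapy.Spider):
        name = "final"
        # 10 is an assumed value, not measured: it must be at least as long
        # as Selenium needs to click through the longest category
        custom_settings = {
            'DOWNLOAD_DELAY': 10,
        }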
The downside, however, is that this forces the spider to wait the full delay on every URL, even when there is no NEXT button left to click. But…
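An alternative worth noting (a sketch only, untested against this site, reusing the paging XPath from the question): instead of relying on DOWNLOAD_DELAY, the click loop itself can wait for the old NEXT button to go stale after each click, so the loop finishes on its own and the delay can stay low. If the site updates the button in place rather than re-rendering it, a different wait condition would be needed.

    from selenium.common.exceptions import NoSuchElementException, TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    while True:
        try:
            next_button = self.driver.find_element_by_xpath(
                '//div[@class="paging-wrapper"]/a[@class="paging-btn right"]')
        except NoSuchElementException:
            break  # last page: no NEXT button left
        next_button.click()
        try:
            # wait (up to an assumed 10 s) until the clicked button is
            # detached from the DOM, i.e. the next page has rendered
            WebDriverWait(self.driver, 10).until(EC.staleness_of(next_button))
        except TimeoutException:
            break  # the page never changed; stop rather than loop forever
        # ...re-parse self.driver.page_source here, as in parse_articles...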