I am scraping a web shop with Scrapy. The products are loaded dynamically, which is why I use Selenium to browse through the pages. I start by scraping all the categories, which are then used in the main function.
The problem appears when crawling each category: the spider is told to scrape all the data on the first page, then click a button to go to the next page, repeating until there is no button left. The code works fine if I put a single category URL in as the start_url, but strangely, when it runs inside the main code it does not click through all the pages. It randomly switches to a new category before it has finished clicking every NEXT button.
I have no idea why this happens.
import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher  # old-style signal hookup; newer Scrapy uses from_crawler
from horni.items import HorniItem
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class horniSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ["example.com"]
    start_urls = ['https://www.example.com']

    def __init__(self, *args, **kwargs):
        super(horniSpider, self).__init__(*args, **kwargs)
        # one shared browser for the whole crawl, closed when the spider finishes
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def parse(self, response):
        # follow every main category link
        for href in response.xpath('//li[@class="sub"]/a/@href'):
            item = HorniItem()
            item['maincategory'] = response.urljoin(href.extract())
            yield scrapy.Request(item['maincategory'], callback=self.parse_subcategories)

    def parse_subcategories(self, response):
        # follow every subcategory link
        for href in response.xpath('//li[@class="sub"]/a/@href'):
            item = HorniItem()
            item['subcategory'] = response.urljoin(href.extract())
            yield scrapy.Request(item['subcategory'], callback=self.parse_articles)

    def parse_articles(self, response):
        # the articles are rendered by JavaScript, so load the page in Selenium
        self.driver.get(response.url)
        response = TextResponse(url=self.driver.current_url,
                                body=self.driver.page_source, encoding='utf-8')
        item = HorniItem()
        item['title'] = response.xpath('//div[@id="article-list-headline"]/div/h1/text()').extract()
        yield item
        # scrape the first page, then keep clicking NEXT until the button disappears
        for item in self.parse_page(response):
            yield item
        while True:
            try:
                next_button = self.driver.find_element_by_xpath(
                    '//div[@class="paging-wrapper"]/a[@class="paging-btn right"]')
            except NoSuchElementException:
                break  # no NEXT button left: last page reached
            next_button.click()
            # note: no explicit wait here, so page_source may still show the old page
            response = TextResponse(url=self.driver.current_url,
                                    body=self.driver.page_source, encoding='utf-8')
            for item in self.parse_page(response):
                yield item

    def parse_page(self, response):
        # one result page: the article id is the second-to-last segment of the link URL
        ids = response.xpath('//a[@class="title-link"]/@href').extract()
        prices = response.xpath('//span[@class="price ng-binding"]/text()').extract()
        names = response.xpath('//a[@class="title-link"]/span[normalize-space()]/text()').extract()
        ids = [i.split('/')[-2] for i in ids]
        prices = [p for p in prices if p != u'\xa0']  # drop empty non-breaking-space prices
        names = [n.replace(u'\n', '') for n in names]
        for article_id, price, name in zip(ids, prices, names):
            item = HorniItem()
            item['id'] = article_id
            item['price'] = price
            item['name'] = name
            yield item
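For reference, the spider is started from the project directory in the usual way; the output file name here is only an example:

    scrapy crawl final -o items.json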
So it seems the problem lies with the DOWNLOAD_DELAY setting. Since clicking the NEXT button on the site does not generate a new URL but only executes JavaScript, the site's URL never changes.
I found the answer:
The problem is that, since the page content is generated dynamically, clicking the NEXT button does not actually change the URL. Combined with the project's DOWNLOAD_DELAY setting, this means the spider stays on a given page for a fixed amount of time, regardless of whether it managed to click every available NEXT button.
Setting DOWNLOAD_DELAY high enough makes the spider stay on each URL long enough to click through and crawl every page.
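A minimal sketch of how the delay could be raised for just this spider via Scrapy's per-spider custom_settings; the value of 10 seconds is an assumption and has to cover the slowest category's click-through time:

    class horniSpider(scrapy.Spider):
        name = "final"
        # 10 is an assumed value, not measured: it must be at least as long
        # as Selenium needs to click through the longest category
        custom_settings = {
            'DOWNLOAD_DELAY': 10,
        }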
The downside, however, is that this forces the spider to wait the full delay on every URL, even when there is no NEXT button left to click. But…
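An alternative worth noting (a sketch only, untested against this site, reusing the paging XPath from the question): instead of relying on DOWNLOAD_DELAY, the click loop itself can wait for the old NEXT button to go stale after each click, so the loop finishes on its own and the delay can stay low. If the site updates the button in place rather than re-rendering it, a different wait condition would be needed.

    from selenium.common.exceptions import NoSuchElementException, TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    while True:
        try:
            next_button = self.driver.find_element_by_xpath(
                '//div[@class="paging-wrapper"]/a[@class="paging-btn right"]')
        except NoSuchElementException:
            break  # last page: no NEXT button left
        next_button.click()
        try:
            # wait (up to an assumed 10 s) until the clicked button is
            # detached from the DOM, i.e. the next page has rendered
            WebDriverWait(self.driver, 10).until(EC.staleness_of(next_button))
        except TimeoutException:
            break  # the page never changed; stop rather than loop forever
        # ...re-parse self.driver.page_source here, as in parse_articles...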