EDIT:
OK, so what I have done today is try to figure this out, but unfortunately I still haven't managed it. What I have now is:
import scrapy
from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        yield scrapy.Request(response.url, callback=self.primary_parse)
        yield scrapy.Request(response.url, callback=self.secondary_parse)

    def primary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        itemlist = []
        product = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        price = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()
        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

    def secondary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        itemlist = []
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()
        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist
The problem is that I can't seem to get the second parse going: I can only ever get one parse to run.
Is there any way to run both parses, either simultaneously or one after the other?
ORIGINAL:
I am slowly getting to know Python and Scrapy, but I have just hit a wall. What I am trying to do is the following:
There is a photography retail site that lists its products like this:
Name of Camera Body
Price
With Such and Such Lens
Price
With Another Such and Such Lens
Price
What I would like to do is grab that information and arrange it in a list like the one below (outputting to a CSV file is no problem):
product,price
camerabody1,$100
camerabody1+lens1,$200
camerabody1+lens1+lens2,$300
camerabody2,$150
camerabody2+lens1,$200
camerabody2+lens1+lens2,$250
My current spider code:
import scrapy
from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()
        subproduct = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        subprice = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()
        itemlist = []
        for product, price, subproduct, subprice in zip(product, price, subproduct, subprice):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            item['product'] = product + " " + subproduct.strip().upper()
            item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist
This does not do what I want, and I am not sure what to try next. I attempted a for loop inside the for loop, but that did not work; it just output mixed-up results.
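The mixed-up results come from how zip pairs the lists: it matches the i-th element of each list purely by position and stops at the shortest list, so subproducts end up attached to the wrong camera body. A minimal stdlib sketch with made-up values shows the effect:

```python
# Hypothetical flat lists, as produced by the page-wide XPath queries above:
# two camera bodies, but three lens-kit rows that all belong to the first body
products = ["CAMERABODY1", "CAMERABODY2"]
subproducts = ["WITH LENS1", "WITH LENS1 + LENS2", "WITH LENS1"]

# zip pairs by index and truncates to the shortest list, so
# "CAMERABODY2" is paired with a lens kit belonging to camera 1
pairs = list(zip(products, subproducts))
print(pairs)  # [('CAMERABODY1', 'WITH LENS1'), ('CAMERABODY2', 'WITH LENS1 + LENS2')]
```

Looping per listing-item element, instead of zipping page-wide lists, keeps each product's subproducts grouped under their parent.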
Also for reference, my items.py:
import scrapy

class ArcherItemGeorges(scrapy.Item):
    product = scrapy.Field()
    price = scrapy.Field()
    subproduct = scrapy.Field()
    subprice = scrapy.Field()
Any help would be greatly appreciated. I am trying to learn, but being new to Python I feel I need some guidance.
As your intuition suggests, the structure of the elements you are scraping seems to ask for a loop within a loop. Rearranging your code a little, you can get a list containing every product joined with its subproducts. I have renamed requests to product and introduced a subproduct variable for clarity. I guess the subproduct loop is the one you were trying to figure out.
def parse(self, response):
    # Loop all the product elements
    for product in response.xpath('//div[@class="listing-item"]'):
        item = ArcherItemGeorges()
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        item['product'] = product_name
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the raw primary item
        yield item
        # Yield the primary item with its secondary items
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            yield item
Of course, you will need to apply the uppercasing, price cleaning, etc. to the corresponding fields.
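For instance, the cleaning chain already used in the question's spider can be pulled out into two small helpers (the names are my own) and applied to each field before yielding:

```python
def clean_product(raw):
    # Trim surrounding whitespace and normalize to uppercase,
    # as the original spider does with .strip().upper()
    return raw.strip().upper()

def clean_price(raw):
    # Drop the currency sign, thousands commas, a trailing ".00" and spaces,
    # matching the replace() chain from the question
    return raw.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')

print(clean_product("  Canon EOS 70D Body "))  # CANON EOS 70D BODY
print(clean_price(" $1,299.00 "))              # 1299
```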
A brief explanation: once the page has been downloaded, the parse method is called with the Response object (the HTML page). From that Response we have to extract/scrape the data in the form of items. In this case, we want to return a list of product-price items, and this is where the magic of the yield expression comes into play. You can think of it as an on-demand return that does not finish the function's execution, i.e. a generator. Scrapy will call the parse generator until there are no more items to yield, and hence no more items to scrape in the Response.
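That on-demand behaviour can be seen in plain Python, with nothing Scrapy-specific (a toy generator, not part of the spider):

```python
def items():
    # Execution pauses at each yield and resumes on the next request
    yield "item 1"
    yield "item 2"

gen = items()     # calling the function runs no body code yet
print(next(gen))  # item 1
print(next(gen))  # item 2
# a further next(gen) would raise StopIteration: nothing left to yield
```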
Commented code:
def parse(self, response):
    # Loop all the product elements, those div elements with a "listing-item" class
    for product in response.xpath('//div[@class="listing-item"]'):
        # Create an empty item container
        item = ArcherItemGeorges()
        # Scrape the primary product name and keep it in a variable for later use
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        # Fill the 'product' field with the product name
        item['product'] = product_name
        # Fill the 'price' field with the scraped primary product price
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the primary product item, with the primary name and price
        yield item
        # Now, for each product, we need to loop through all the subproducts
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Let's prepare a new item with the subproduct appended to the previously
            # stored product_name, that is, product + subproduct
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            # And set the item price field with the subproduct price
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            # Now yield the composed product + subproduct item
            yield item
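The same two-level yield structure can be exercised without Scrapy, using plain dicts in place of ArcherItemGeorges. This is a dependency-free sketch with hypothetical input data; note that it builds a fresh dict per yield rather than reusing a single item object:

```python
def parse_products(listings):
    # listings: an iterable of (name, price, [(sub_name, sub_price), ...]) tuples,
    # standing in for the XPath-selected listing-item elements
    for name, price, subs in listings:
        # Primary product first...
        yield {"product": name, "price": price}
        for sub_name, sub_price in subs:
            # ...then one composed product + subproduct item each
            yield {"product": name + " " + sub_name, "price": sub_price}

items = list(parse_products([
    ("CAMERABODY1", "100", [("WITH LENS1", "200"), ("WITH LENS1 + LENS2", "300")]),
]))
print(items)
```

Running this yields three dicts: the bare camera body, then the body composed with each lens option, mirroring the desired CSV rows.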