EDIT:
OK, so what I have done today is try to figure this out, but unfortunately I still haven't managed it. What I have now is:
import scrapy
from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        yield scrapy.Request(response.url, callback=self.primary_parse)
        yield scrapy.Request(response.url, callback=self.secondary_parse)

    def primary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        itemlist = []
        product = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        price = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()
        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

    def secondary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        itemlist = []
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()
        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist
The problem is that I can't seem to get the second parse going: I can only ever get one parse to run.
Is there any way to run both parses, either simultaneously or one after the other?
ORIGINAL:
I am slowly getting to know Python and Scrapy, but I have just hit a wall. What I am trying to do is the following:
There is a photography retail site that lists its products like this:
Name of Camera Body
Price
With Such and Such Lens
Price
With Another Such and Such Lens
Price
What I would like to do is grab that information and arrange it in a list like the one below (outputting to a CSV file is no problem):
product,price
camerabody1,$100
camerabody1+lens1,$200
camerabody1+lens1+lens2,$300
camerabody2,$150
camerabody2+lens1,$200
camerabody2+lens1+lens2,$250
My current spider code:
import scrapy
from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()
        subproduct = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        subprice = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()
        itemlist = []
        for product, price, subproduct, subprice in zip(product, price, subproduct, subprice):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            item['product'] = product + " " + subproduct.strip().upper()
            item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist
This does not do what I want, and I am not sure what to try next. I attempted a for loop inside the for loop, but that did not work; it just output mixed-up results.
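The mixed-up results come from how zip pairs the lists: it matches the i-th element of each list purely by position and stops at the shortest list, so subproducts end up attached to the wrong camera body. A minimal stdlib sketch with made-up values shows the effect:

```python
# Hypothetical flat lists, as produced by the page-wide XPath queries above:
# two camera bodies, but three lens-kit rows that all belong to the first body
products = ["CAMERABODY1", "CAMERABODY2"]
subproducts = ["WITH LENS1", "WITH LENS1 + LENS2", "WITH LENS1"]

# zip pairs by index and truncates to the shortest list, so
# "CAMERABODY2" is paired with a lens kit belonging to camera 1
pairs = list(zip(products, subproducts))
print(pairs)  # [('CAMERABODY1', 'WITH LENS1'), ('CAMERABODY2', 'WITH LENS1 + LENS2')]
```

Looping per listing-item element, instead of zipping page-wide lists, keeps each product's subproducts grouped under their parent.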
Also for reference, my items.py:
import scrapy

class ArcherItemGeorges(scrapy.Item):
    product = scrapy.Field()
    price = scrapy.Field()
    subproduct = scrapy.Field()
    subprice = scrapy.Field()
Any help would be greatly appreciated. I am trying to learn, but being new to Python I feel I need some guidance.
As your intuition suggests, the structure of the elements you are scraping seems to ask for a loop within a loop. Rearranging your code a little, you can get a list containing every product joined with its subproducts. I have renamed requests to product and introduced a subproduct variable for clarity. I guess the subproduct loop is the one you were trying to figure out.
def parse(self, response):
    # Loop all the product elements
    for product in response.xpath('//div[@class="listing-item"]'):
        item = ArcherItemGeorges()
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        item['product'] = product_name
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the raw primary item
        yield item
        # Yield the primary item with its secondary items
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            yield item
Of course, you will need to apply the uppercasing, price cleaning, etc. to the corresponding fields.
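For instance, the cleaning chain already used in the question's spider can be pulled out into two small helpers (the names are my own) and applied to each field before yielding:

```python
def clean_product(raw):
    # Trim surrounding whitespace and normalize to uppercase,
    # as the original spider does with .strip().upper()
    return raw.strip().upper()

def clean_price(raw):
    # Drop the currency sign, thousands commas, a trailing ".00" and spaces,
    # matching the replace() chain from the question
    return raw.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')

print(clean_product("  Canon EOS 70D Body "))  # CANON EOS 70D BODY
print(clean_price(" $1,299.00 "))              # 1299
```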
A brief explanation: once the page has been downloaded, the parse method is called with the Response object (the HTML page). From that Response we have to extract/scrape the data in the form of items. In this case, we want to return a list of product-price items, and this is where the magic of the yield expression comes into play. You can think of it as an on-demand return that does not finish the function's execution, i.e. a generator. Scrapy will call the parse generator until there are no more items to yield, and hence no more items to scrape in the Response.
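That on-demand behaviour can be seen in plain Python, with nothing Scrapy-specific (a toy generator, not part of the spider):

```python
def items():
    # Execution pauses at each yield and resumes on the next request
    yield "item 1"
    yield "item 2"

gen = items()     # calling the function runs no body code yet
print(next(gen))  # item 1
print(next(gen))  # item 2
# a further next(gen) would raise StopIteration: nothing left to yield
```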
Commented code:
def parse(self, response):
    # Loop all the product elements, those div elements with a "listing-item" class
    for product in response.xpath('//div[@class="listing-item"]'):
        # Create an empty item container
        item = ArcherItemGeorges()
        # Scrape the primary product name and keep it in a variable for later use
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        # Fill the 'product' field with the product name
        item['product'] = product_name
        # Fill the 'price' field with the scraped primary product price
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the primary product item, with the primary name and price
        yield item
        # Now, for each product, we need to loop through all the subproducts
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Let's prepare a new item with the subproduct appended to the previously
            # stored product_name, that is, product + subproduct
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            # And set the item price field with the subproduct price
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            # Now yield the composed product + subproduct item
            yield item
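The same two-level yield structure can be exercised without Scrapy, using plain dicts in place of ArcherItemGeorges. This is a dependency-free sketch with hypothetical input data; note that it builds a fresh dict per yield rather than reusing a single item object:

```python
def parse_products(listings):
    # listings: an iterable of (name, price, [(sub_name, sub_price), ...]) tuples,
    # standing in for the XPath-selected listing-item elements
    for name, price, subs in listings:
        # Primary product first...
        yield {"product": name, "price": price}
        for sub_name, sub_price in subs:
            # ...then one composed product + subproduct item each
            yield {"product": name + " " + sub_name, "price": sub_price}

items = list(parse_products([
    ("CAMERABODY1", "100", [("WITH LENS1", "200"), ("WITH LENS1 + LENS2", "300")]),
]))
print(items)
```

Running this yields three dicts: the bare camera body, then the body composed with each lens option, mirroring the desired CSV rows.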