I am trying to find a way to scrape and parse more pages in the signed-in area. These example links are accessible from the logged-in account that I want to parse:
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create a spider that can open each of these links and then parse them. I have created another spider that can open and parse only one of them.
When I tried to add a "for" or "while" loop to call more requests with the other links, it would not let me, because I cannot return more values from a generator; it returned an error. I also tried a link extractor, but it did not work for me.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor

class Spider3(Spider):
    name = "Spider3"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/login"]  # this link leads to the login page
After I sign in, the site redirects me to a page whose url contains "stat", which is why I put the first "if" condition here. Once logged in, I request one link and call the parse_items function.
    def parse(self, response):
        # when "stat" is in the url it means that I have just signed in
        if "stat" in response.url:
            return Request("http://example.com/seller/demand/?id=305554", callback=self.parse_items)
        else:
            # a successful login redirects me to a page whose url contains "stat"
            return [FormRequest.from_response(response,
                formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login',
                          'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
                callback=self.parse)]
The parse_items function simply parses the desired content from a single page:
    def parse_items(self, response):
        questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
        for question in questions:
            item = StackItem()
            item['name'] = question.xpath('th/text()').extract()[0]
            item['value'] = question.xpath('td/text()').extract()[0]
            yield item
Could you help me update this code so that it opens and parses multiple pages within one session? I do not want to sign in again for every request.
The session most likely depends on cookies, and scrapy manages that by itself, i.e.:
    def parse_items(self, response):
        questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
        for question in questions:
            item = StackItem()
            item['name'] = question.xpath('th/text()').extract()[0]
            item['value'] = question.xpath('td/text()').extract()[0]
            yield item
        next_url = ''  # find the url of the next page in the current page
        if next_url:
            yield Request(next_url, self.parse_items)
        # scrapy will retain the session for the next page if it's managed by cookies
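The "cannot return more values from a generator" error in the question comes from mixing `return <value>` with `yield` in the same callback: once a function body contains `yield`, every result must be yielded. A minimal, scrapy-free sketch of that pattern (`parse_after_login` is a hypothetical name; plain url strings stand in for Request objects, and the ids are the example ids from the question):

```python
def parse_after_login(ids):
    # A function containing 'yield' is a generator: it emits one value
    # per 'yield' instead of returning a single value. Yielding inside
    # a for loop is how one callback can produce many results.
    for demand_id in ids:
        yield "http://example.com/seller/demand/?id=%d" % demand_id

urls = list(parse_after_login([305554, 305553, 305552]))
print(urls[0])  # http://example.com/seller/demand/?id=305554
```

Inside the spider the same loop would become `yield Request(url, callback=self.parse_items)` in the `if "stat" in response.url:` branch of `parse`; scrapy keeps the login cookies across all requests yielded this way, so you only sign in once per session.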