I am trying to find a way to scrape and parse more pages in the signed-in area. These example links are accessible from the logged-in account that I want to parse:
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create a spider that can open each of these links and then parse them. I have created another spider that can open and parse only one of them.
When I tried to add a "for" or "while" loop to call more requests with the other links, it would not let me, because I cannot return more values from a generator; it returned an error. I also tried a link extractor, but it did not work for me.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor

class Spider3(Spider):
    name = "Spider3"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/login"]  # this link leads to the login page
After I sign in, the site redirects me to a page whose url contains "stat", which is why I put the first "if" condition here. Once logged in, I request one link and call the parse_items function.
    def parse(self, response):
        # when "stat" is in the url it means that I have just signed in
        if "stat" in response.url:
            return Request("http://example.com/seller/demand/?id=305554", callback=self.parse_items)
        else:
            # a successful login redirects me to a page whose url contains "stat"
            return [FormRequest.from_response(response,
                formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login',
                          'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
                callback=self.parse)]
The parse_items function simply parses the desired content from a single page:
    def parse_items(self, response):
        questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
        for question in questions:
            item = StackItem()
            item['name'] = question.xpath('th/text()').extract()[0]
            item['value'] = question.xpath('td/text()').extract()[0]
            yield item
Could you help me update this code so that it opens and parses multiple pages within one session? I do not want to sign in again for every request.
The session most likely depends on cookies, and scrapy manages that by itself, i.e.:
    def parse_items(self, response):
        questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
        for question in questions:
            item = StackItem()
            item['name'] = question.xpath('th/text()').extract()[0]
            item['value'] = question.xpath('td/text()').extract()[0]
            yield item
        next_url = ''  # find the url of the next page in the current page
        if next_url:
            yield Request(next_url, self.parse_items)
        # scrapy will retain the session for the next page if it's managed by cookies
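The "cannot return more values from a generator" error in the question comes from mixing `return <value>` with `yield` in the same callback: once a function body contains `yield`, every result must be yielded. A minimal, scrapy-free sketch of that pattern (`parse_after_login` is a hypothetical name; plain url strings stand in for Request objects, and the ids are the example ids from the question):

```python
def parse_after_login(ids):
    # A function containing 'yield' is a generator: it emits one value
    # per 'yield' instead of returning a single value. Yielding inside
    # a for loop is how one callback can produce many results.
    for demand_id in ids:
        yield "http://example.com/seller/demand/?id=%d" % demand_id

urls = list(parse_after_login([305554, 305553, 305552]))
print(urls[0])  # http://example.com/seller/demand/?id=305554
```

Inside the spider the same loop would become `yield Request(url, callback=self.parse_items)` in the `if "stat" in response.url:` branch of `parse`; scrapy keeps the login cookies across all requests yielded this way, so you only sign in once per session.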