I want to scrape all the formation (training course) links on this site: https://www.formatic-centre.fr/formation/
Apparently the next pages are loaded dynamically with AJAX, so I need to simulate those requests with Scrapy's FormRequest.
That's what I did: I looked up the parameters with the developer tools (screenshot: ajax1) and put them into the FormRequest.
But apparently that wasn't enough, so I figured I also had to include the headers, which I did (screenshot: ajax2).
That didn't work either. I must be doing something wrong, but what?
Here is my script, if it helps (sorry it's long, it's because I put in all the parameters and headers):
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html
from scrapy.http import FormRequest


class LinkSpider(scrapy.Spider):
    name = "link"
    # allow_domains = ['https://www.formatic-centre.fr/']
    start_urls = ['https://www.formatic-centre.fr/formation/']
    rules = (Rule(LinkExtractor(allow=r'formation'), callback="parse", follow=True),)

    def parse(self, response):
        card = response.xpath('//a[@class="title"]')
        for a in card:
            yield {'links': a.xpath('@href').get()}
        return [FormRequest(url="https://www.formatic-centre.fr/formation/",
                            formdata={'action': "swlabscore",
                                      'module[0]': "top.Top_Controller",
                                      'module[1]': "ajax_get_course_pagination",
                                      'page': "2",
                                      'layout': "course",
                                      'limit_post': "",
                                      'offset_post': "0",
                                      'sort_by': "",
                                      'pagination': "yes",
                                      'location_slug': "",
                                      'columns': "2",
                                      'paged': "",
                                      'cur_limit': "",
                                      'rows': "0",
                                      'btn_content': "En+savoir+plus",
                                      'uniq_id': "block-13759488265f916bca45c89",
                                      'ZmfUNQ': "63y[Jt",
                                      'PmhpIuZ_cTnUxqg': "7v@IahmJNMplbCu",
                                      'cZWVDbSPzTXRe': "n9oa2k5u4GHWm",
                                      'eOBITfdGRuriQ': "hBPN5nObe.ktH",
                                      "Accept": "*/*",
                                      "Accept-Encoding": "gzip, deflate, br",
                                      "Accept-Language": "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
                                      "Connection": "keep-alive",
                                      "Content-Length": "1010",
                                      "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
                                      "Cookie": "_ga=GA1.2.815964309.1603392091; _gid=GA1.2.1686929506.1603392091; jlFYkafUWiyJe=LGAWcXg_wUjFo; z-byDgTnkdcQJSNH=03d1yiqH%40h8uZNtw; YeAhrFumyo-HQwpn=5uOhD6viWy%5BYeq3o",
                                      "Host": "www.formatic-centre.fr",
                                      "Origin": "https://www.formatic-centre.fr",
                                      "Referer": "https://www.formatic-centre.fr/formation/",
                                      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0",
                                      "X-Requested-With": "XMLHttpRequest",
                                      "access-control-allow-credentials": "true",
                                      "access-control-allow-origin": "https://www.formatic-centre.fr",
                                      "cache-control": "no-cache, must-revalidate, max-age=0",
                                      "content-encoding": "gzip",
                                      "content-length": "2497",
                                      "content-type": "text/html; charset=UTF-8",
                                      "date": "Thu, 22 Oct 2020 18:42:54 GMT",
                                      "expires": "Wed, 11 Jan 1984 05:00:00 GMT",
                                      "referrer-policy": "strict-origin-when-cross-origin",
                                      "server": "Apache",
                                      "set-cookie": "jlFYkafUWiyJe=LGAWcXg_wUjFo; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                                      "set-cookie": "z-byDgTnkdcQJSNH=03d1yiqH%40h8uZNtw; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                                      "set-cookie": "YeAhrFumyo-HQwpn=5uOhD6viWy%5BYeq3o; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                                      "strict-transport-security": "max-age=15552001; preload",
                                      "vary": "Accept-Encoding",
                                      "x-content-type-options": "nosniff",
                                      "X-Firefox-Spdy": "h2",
                                      "x-frame-options": "SAMEORIGIN",
                                      "x-robots-tag": "noindex"})]
The script works for the first page and I get its links, but when it gets to the FormRequest, nothing happens, so I can't get the links from the next pages.
Any ideas?
EDIT: I hadn't noticed it, but the terminal shows me this error:
2020-10-23 03:51:30 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://www.formatic-centre.fr/formation/> (referer: https://www.formatic-centre.fr/formation/) ['partial']
2020-10-23 03:51:30 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://www.formatic-centre.fr/formation/>: HTTP status code is not handled or not allowed
Maybe that helps?
There are a few problems with the way you're formatting and sending your headers and payload. Note in particular that the AJAX request goes to /wp-admin/admin-ajax.php, not to /formation/. Also, you have to keep changing the page number so the server knows where you are and can send back the right response.
I didn't want to set up a whole new scrapy project, but here's how I got all the links, so hopefully this will nudge you in the right direction.
And if it feels like a hack, that's because it is.
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup

headers = {
    "accept": "*/*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "origin": "https://www.formatic-centre.fr",
    "referer": "https://www.formatic-centre.fr/formation/",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.99 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

# Raw payload as captured in the browser's developer tools, kept for reference;
# the same fields are rebuilt as the list of tuples below.
raw_string = "action=swlabscore&module%5B%5D=top.Top_Controller&module%5B%5D=ajax_get_course_pagination&params%5B0%5D%5Bpage%5D=2&params%5B0%5D%5Batts%5D%5Blayout%5D=course&params%5B0%5D%5Batts%5D%5Blimit_post%5D=&params%5B0%5D%5Batts%5D%5Boffset_post%5D=0&params%5B0%5D%5Batts%5D%5Bsort_by%5D=&params%5B0%5D%5Batts%5D%5Bpagination%5D=yes&params%5B0%5D%5Batts%5D%5Blocation_slug%5D=&params%5B0%5D%5Batts%5D%5Bcolumns%5D=2&params%5B0%5D%5Batts%5D%5Bpaged%5D=&params%5B0%5D%5Batts%5D%5Bcur_limit%5D=&params%5B0%5D%5Batts%5D%5Brows%5D=0&params%5B0%5D%5Batts%5D%5Bbtn_content%5D=En+savoir+plus&params%5B0%5D%5Batts%5D%5Buniq_id%5D=block-13759488265f916bca45c89&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Blarge%5D=swedugate-thumb-300x225&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Bno-image%5D=thumb-300x225.gif&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Bsmall%5D=swedugate-thumb-300x225&params%5B0%5D%5Blayout_course%5D=style-grid&ZmfUNQ=63y[Jt&PmhpIuZ_cTnUxqg=7v@IahmJNMplbCu&cZWVDbSPzTXRe=n9oa2k5u4GHWm&eOBITfdGRuriQ=hBPN5nObe.ktH"

payloadd = [
    ('action', 'swlabscore'),
    ('module[]', 'top.Top_Controller'),
    ('module[]', 'ajax_get_course_pagination'),
    ('params[0][page]', '1'),
    ('params[0][atts][layout]', 'course'),
    ('params[0][atts][offset_post]', '0'),
    ('params[0][atts][pagination]', 'yes'),
    ('params[0][atts][columns]', '2'),
    ('params[0][atts][rows]', '0'),
    ('params[0][atts][btn_content]', 'En savoir plus'),
    ('params[0][atts][uniq_id]', 'block-13759488265f916bca45c89'),
    ('params[0][atts][thumb-size][large]', 'swedugate-thumb-300x225'),
    ('params[0][atts][thumb-size][no-image]', 'thumb-300x225.gif'),
    ('params[0][atts][thumb-size][small]', 'swedugate-thumb-300x225'),
    ('params[0][layout_course]', 'style-grid'),
    ('ZmfUNQ', '63y[Jt'),
    ('PmhpIuZ_cTnUxqg', '7v@IahmJNMplbCu'),
    ('cZWVDbSPzTXRe', 'n9oa2k5u4GHWm'),
    ('eOBITfdGRuriQ', 'hBPN5nObe.ktH'),
]

all_links = []
for page in range(1, 10):
    # Swap in the current page number (index 3 holds the 'params[0][page]' tuple).
    payloadd.pop(3)
    payloadd.insert(3, ('params[0][page]', str(page)))
    response = requests.post(
        "https://www.formatic-centre.fr/wp-admin/admin-ajax.php?",
        headers=headers,
        data=urlencode(payloadd),
    )
    print(f"Getting links from page {page}...")
    soup = BeautifulSoup(response.text, "html.parser").find_all("a", class_="btn btn-green")
    links = [i["href"] for i in soup]
    print('\n'.join(links))
    all_links.extend(links)

with open("formatic-center_links.txt", "w") as f:
    f.writelines("\n".join(all_links) + "\n")
This produces a file with all the links found under the EN SAVOIR PLUS buttons:
https://www.formatic-centre.fr/formation/les-regles-juridiques-du-teletravail/
https://www.formatic-centre.fr/formation/mieux-gerer-son-stress-en-periode-du-covid-19/
https://www.formatic-centre.fr/formation/dynamiser-vos-equipes-special-post-confinement/
https://www.formatic-centre.fr/formation/conduire-ses-entretiens-specifique-post-confinement/
https://www.formatic-centre.fr/formation/cours-excel/
https://www.formatic-centre.fr/formation/autocad-3d-2/
https://www.formatic-centre.fr/formation/concevoir-et-developper-une-strategie-marketing/
https://www.formatic-centre.fr/formation/preparer-soutenance/
https://www.formatic-centre.fr/formation/mettre-en-place-une-campagne-adwords/
https://www.formatic-centre.fr/formation/utiliser-google-analytics/
and so on ...
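Since the question was about Scrapy, here is a minimal, untested sketch of the same idea ported back into a spider. It assumes the trimmed payload below is enough for the endpoint (the site may also insist on the extra token fields like ZmfUNQ from the full payload above); FormationLinksSpider and parse_page are just names I picked.

import scrapy
from scrapy.http import FormRequest

# Trimmed-down payload; the duplicate "module[]" key requires a list of tuples.
BASE_PAYLOAD = [
    ("action", "swlabscore"),
    ("module[]", "top.Top_Controller"),
    ("module[]", "ajax_get_course_pagination"),
    ("params[0][atts][layout]", "course"),
    ("params[0][atts][pagination]", "yes"),
    ("params[0][atts][columns]", "2"),
    ("params[0][atts][uniq_id]", "block-13759488265f916bca45c89"),
    ("params[0][layout_course]", "style-grid"),
]

class FormationLinksSpider(scrapy.Spider):
    name = "formation_links"

    def start_requests(self):
        # One POST per results page, straight to the AJAX endpoint.
        for page in range(1, 10):
            formdata = BASE_PAYLOAD + [("params[0][page]", str(page))]
            yield FormRequest(
                url="https://www.formatic-centre.fr/wp-admin/admin-ajax.php",
                formdata=formdata,
                headers={
                    "X-Requested-With": "XMLHttpRequest",
                    "Referer": "https://www.formatic-centre.fr/formation/",
                },
                callback=self.parse_page,
            )

    def parse_page(self, response):
        # The endpoint returns an HTML fragment, so normal selectors work on it.
        for href in response.css("a.btn.btn-green::attr(href)").getall():
            yield {"link": href}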