I want to scrape all the formation (training course) links on this site: https://www.formatic-centre.fr/formation/
Apparently the next pages are loaded dynamically with AJAX, so I need to simulate those requests with Scrapy's FormRequest.
That's what I did: I looked up the parameters with the developer tools (screenshot: ajax1) and put them into the FormRequest.
But apparently that wasn't enough, so I figured I also had to include the headers, which I did (screenshot: ajax2).
That didn't work either. I must be doing something wrong, but what?
Here is my script, if it helps (sorry it's long, it's because I put in all the parameters and headers):
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html
from scrapy.http import FormRequest


class LinkSpider(scrapy.Spider):
    name = "link"
    # allow_domains = ['https://www.formatic-centre.fr/']
    start_urls = ['https://www.formatic-centre.fr/formation/']
    rules = (Rule(LinkExtractor(allow=r'formation'), callback="parse", follow=True),)

    def parse(self, response):
        card = response.xpath('//a[@class="title"]')
        for a in card:
            yield {'links': a.xpath('@href').get()}
        return [FormRequest(url="https://www.formatic-centre.fr/formation/",
                            formdata={'action': "swlabscore",
                                      'module[0]': "top.Top_Controller",
                                      'module[1]': "ajax_get_course_pagination",
                                      'page': "2",
                                      'layout': "course",
                                      'limit_post': "",
                                      'offset_post': "0",
                                      'sort_by': "",
                                      'pagination': "yes",
                                      'location_slug': "",
                                      'columns': "2",
                                      'paged': "",
                                      'cur_limit': "",
                                      'rows': "0",
                                      'btn_content': "En+savoir+plus",
                                      'uniq_id': "block-13759488265f916bca45c89",
                                      'ZmfUNQ': "63y[Jt",
                                      'PmhpIuZ_cTnUxqg': "7v@IahmJNMplbCu",
                                      'cZWVDbSPzTXRe': "n9oa2k5u4GHWm",
                                      'eOBITfdGRuriQ': "hBPN5nObe.ktH",
                                      "Accept": "*/*",
                                      "Accept-Encoding": "gzip, deflate, br",
                                      "Accept-Language": "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
                                      "Connection": "keep-alive",
                                      "Content-Length": "1010",
                                      "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
                                      "Cookie": "_ga=GA1.2.815964309.1603392091; _gid=GA1.2.1686929506.1603392091; jlFYkafUWiyJe=LGAWcXg_wUjFo; z-byDgTnkdcQJSNH=03d1yiqH%40h8uZNtw; YeAhrFumyo-HQwpn=5uOhD6viWy%5BYeq3o",
                                      "Host": "www.formatic-centre.fr",
                                      "Origin": "https://www.formatic-centre.fr",
                                      "Referer": "https://www.formatic-centre.fr/formation/",
                                      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0",
                                      "X-Requested-With": "XMLHttpRequest",
                                      "access-control-allow-credentials": "true",
                                      "access-control-allow-origin": "https://www.formatic-centre.fr",
                                      "cache-control": "no-cache, must-revalidate, max-age=0",
                                      "content-encoding": "gzip",
                                      "content-length": "2497",
                                      "content-type": "text/html; charset=UTF-8",
                                      "date": "Thu, 22 Oct 2020 18:42:54 GMT",
                                      "expires": "Wed, 11 Jan 1984 05:00:00 GMT",
                                      "referrer-policy": "strict-origin-when-cross-origin",
                                      "server": "Apache",
                                      "set-cookie": "jlFYkafUWiyJe=LGAWcXg_wUjFo; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                                      "set-cookie": "z-byDgTnkdcQJSNH=03d1yiqH%40h8uZNtw; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                                      "set-cookie": "YeAhrFumyo-HQwpn=5uOhD6viWy%5BYeq3o; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                                      "strict-transport-security": "max-age=15552001; preload",
                                      "vary": "Accept-Encoding",
                                      "x-content-type-options": "nosniff",
                                      "X-Firefox-Spdy": "h2",
                                      "x-frame-options": "SAMEORIGIN",
                                      "x-robots-tag": "noindex"})]
The script works for the first page and I get its links, but when it gets to the FormRequest, nothing happens, so I can't get the links from the next pages.
Any ideas?
EDIT: I hadn't noticed it, but the terminal shows me this error:
2020-10-23 03:51:30 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://www.formatic-centre.fr/formation/> (referer: https://www.formatic-centre.fr/formation/) ['partial']
2020-10-23 03:51:30 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://www.formatic-centre.fr/formation/>: HTTP status code is not handled or not allowed
Maybe that helps?
There are a few problems with the way you're formatting and sending your headers and payload. Note in particular that the AJAX request goes to /wp-admin/admin-ajax.php, not to /formation/. Also, you have to keep changing the page number so the server knows where you are and can send back the right response.
I didn't want to set up a whole new scrapy project, but here's how I got all the links, so hopefully this will nudge you in the right direction.
And if it feels like a hack, that's because it is.
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup

headers = {
    "accept": "*/*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "origin": "https://www.formatic-centre.fr",
    "referer": "https://www.formatic-centre.fr/formation/",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.99 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

# Raw payload as captured in the browser's developer tools, kept for reference;
# the same fields are rebuilt as the list of tuples below.
raw_string = "action=swlabscore&module%5B%5D=top.Top_Controller&module%5B%5D=ajax_get_course_pagination&params%5B0%5D%5Bpage%5D=2&params%5B0%5D%5Batts%5D%5Blayout%5D=course&params%5B0%5D%5Batts%5D%5Blimit_post%5D=&params%5B0%5D%5Batts%5D%5Boffset_post%5D=0&params%5B0%5D%5Batts%5D%5Bsort_by%5D=&params%5B0%5D%5Batts%5D%5Bpagination%5D=yes&params%5B0%5D%5Batts%5D%5Blocation_slug%5D=&params%5B0%5D%5Batts%5D%5Bcolumns%5D=2&params%5B0%5D%5Batts%5D%5Bpaged%5D=&params%5B0%5D%5Batts%5D%5Bcur_limit%5D=&params%5B0%5D%5Batts%5D%5Brows%5D=0&params%5B0%5D%5Batts%5D%5Bbtn_content%5D=En+savoir+plus&params%5B0%5D%5Batts%5D%5Buniq_id%5D=block-13759488265f916bca45c89&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Blarge%5D=swedugate-thumb-300x225&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Bno-image%5D=thumb-300x225.gif&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Bsmall%5D=swedugate-thumb-300x225&params%5B0%5D%5Blayout_course%5D=style-grid&ZmfUNQ=63y[Jt&PmhpIuZ_cTnUxqg=7v@IahmJNMplbCu&cZWVDbSPzTXRe=n9oa2k5u4GHWm&eOBITfdGRuriQ=hBPN5nObe.ktH"

payloadd = [
    ('action', 'swlabscore'),
    ('module[]', 'top.Top_Controller'),
    ('module[]', 'ajax_get_course_pagination'),
    ('params[0][page]', '1'),
    ('params[0][atts][layout]', 'course'),
    ('params[0][atts][offset_post]', '0'),
    ('params[0][atts][pagination]', 'yes'),
    ('params[0][atts][columns]', '2'),
    ('params[0][atts][rows]', '0'),
    ('params[0][atts][btn_content]', 'En savoir plus'),
    ('params[0][atts][uniq_id]', 'block-13759488265f916bca45c89'),
    ('params[0][atts][thumb-size][large]', 'swedugate-thumb-300x225'),
    ('params[0][atts][thumb-size][no-image]', 'thumb-300x225.gif'),
    ('params[0][atts][thumb-size][small]', 'swedugate-thumb-300x225'),
    ('params[0][layout_course]', 'style-grid'),
    ('ZmfUNQ', '63y[Jt'),
    ('PmhpIuZ_cTnUxqg', '7v@IahmJNMplbCu'),
    ('cZWVDbSPzTXRe', 'n9oa2k5u4GHWm'),
    ('eOBITfdGRuriQ', 'hBPN5nObe.ktH'),
]

all_links = []
for page in range(1, 10):
    # Swap in the current page number (index 3 holds the 'params[0][page]' tuple).
    payloadd.pop(3)
    payloadd.insert(3, ('params[0][page]', str(page)))
    response = requests.post(
        "https://www.formatic-centre.fr/wp-admin/admin-ajax.php?",
        headers=headers,
        data=urlencode(payloadd),
    )
    print(f"Getting links from page {page}...")
    soup = BeautifulSoup(response.text, "html.parser").find_all("a", class_="btn btn-green")
    links = [i["href"] for i in soup]
    print('\n'.join(links))
    all_links.extend(links)

with open("formatic-center_links.txt", "w") as f:
    f.writelines("\n".join(all_links) + "\n")
This produces a file with all the links found under the EN SAVOIR PLUS buttons:
https://www.formatic-centre.fr/formation/les-regles-juridiques-du-teletravail/
https://www.formatic-centre.fr/formation/mieux-gerer-son-stress-en-periode-du-covid-19/
https://www.formatic-centre.fr/formation/dynamiser-vos-equipes-special-post-confinement/
https://www.formatic-centre.fr/formation/conduire-ses-entretiens-specifique-post-confinement/
https://www.formatic-centre.fr/formation/cours-excel/
https://www.formatic-centre.fr/formation/autocad-3d-2/
https://www.formatic-centre.fr/formation/concevoir-et-developper-une-strategie-marketing/
https://www.formatic-centre.fr/formation/preparer-soutenance/
https://www.formatic-centre.fr/formation/mettre-en-place-une-campagne-adwords/
https://www.formatic-centre.fr/formation/utiliser-google-analytics/
and so on ...
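Since the question was about Scrapy, here is a minimal, untested sketch of the same idea ported back into a spider. It assumes the trimmed payload below is enough for the endpoint (the site may also insist on the extra token fields like ZmfUNQ from the full payload above); FormationLinksSpider and parse_page are just names I picked.

import scrapy
from scrapy.http import FormRequest

# Trimmed-down payload; the duplicate "module[]" key requires a list of tuples.
BASE_PAYLOAD = [
    ("action", "swlabscore"),
    ("module[]", "top.Top_Controller"),
    ("module[]", "ajax_get_course_pagination"),
    ("params[0][atts][layout]", "course"),
    ("params[0][atts][pagination]", "yes"),
    ("params[0][atts][columns]", "2"),
    ("params[0][atts][uniq_id]", "block-13759488265f916bca45c89"),
    ("params[0][layout_course]", "style-grid"),
]

class FormationLinksSpider(scrapy.Spider):
    name = "formation_links"

    def start_requests(self):
        # One POST per results page, straight to the AJAX endpoint.
        for page in range(1, 10):
            formdata = BASE_PAYLOAD + [("params[0][page]", str(page))]
            yield FormRequest(
                url="https://www.formatic-centre.fr/wp-admin/admin-ajax.php",
                formdata=formdata,
                headers={
                    "X-Requested-With": "XMLHttpRequest",
                    "Referer": "https://www.formatic-centre.fr/formation/",
                },
                callback=self.parse_page,
            )

    def parse_page(self, response):
        # The endpoint returns an HTML fragment, so normal selectors work on it.
        for href in response.css("a.btn.btn-green::attr(href)").getall():
            yield {"link": href}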