我创建了一个脚本来使用请求登录到linkedin。该脚本运行良好。
登录后,我使用此URL从此处https://www.linkedin.com/groups/137920/
刮取了该名称Marketing Intelligence Professionals
,您可以在此图像中看到该名称。
该脚本可以完美地解析名称。但是,我现在想做的就是刮掉连接到此图See all
所示页面底部的按钮的链接。
群组连结 you gotta log in to access the content
到目前为止,我已经创建了(它可以抓取第一个图像中显示的名称):
import json
import requests
from bs4 import BeautifulSoup
link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'
with requests.Session() as s:
s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
r = s.get(link)
soup = BeautifulSoup(r.text,"lxml")
payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
payload['session_key'] = 'your email' #put your username here
payload['session_password'] = 'your password' #put your password here
r = s.post(post_url,data=payload)
r = s.get(target_url)
soup = BeautifulSoup(r.text,"lxml")
items = soup.select_one("code:contains('viewerGroupMembership')").get_text(strip=True)
print(json.loads(items)['data']['name']['text'])
如何See all
从那里刮掉连接到按钮的链接?
当您单击“查看全部”时,将调用一个内部Rest API:
GET https://www.linkedin.com/voyager/api/search/blended
该keywords
查询参数包含您所请求最初组(在初始页面的组标题)的标题。
为了获取组名,您可以抓取初始页面的html,但是有一个API在您提供组ID时返回组信息:
GET https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:GROUP_ID
您的组ID是137920,可以直接从URL中提取
一个例子 :
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urlencode
username = 'your username'
password = 'your password'
link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'
group_res = re.search('.*/(.*)/$', target_url)
group_id = group_res.group(1)
with requests.Session() as s:
# login
s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
r = s.get(link)
soup = BeautifulSoup(r.text,"lxml")
payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
payload['session_key'] = username
payload['session_password'] = password
r = s.post(post_url, data = payload)
# API
csrf_token = s.cookies.get_dict()["JSESSIONID"].replace("\"","")
r = s.get(f"https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:{group_id}",
headers= {
"csrf-token": csrf_token
})
group_name = r.json()["name"]["text"]
print(f"searching data for group {group_name}")
params = {
"count": 10,
"keywords": group_name,
"origin": "SWITCH_SEARCH_VERTICAL",
"q": "all",
"start": 0
}
r = s.get(f"https://www.linkedin.com/voyager/api/search/blended?{urlencode(params)}&filters=List(resultType-%3EGROUPS)&queryContext=List(spellCorrectionEnabled-%3Etrue)",
headers= {
"csrf-token": csrf_token,
"Accept": "application/vnd.linkedin.normalized+json+2.1",
"x-restli-protocol-version": "2.0.0"
})
result = r.json()["included"]
print(result)
print("list of groupName/link")
print([
(t["groupName"], f'https://www.linkedin.com/groups/{t["objectUrn"].split(":")[3]}')
for t in result
])
一些注意事项:
application/vnd.linkedin.normalized+json+2.1
搜索电话必须使用特殊的媒体类型queryContext
并且filters
不应该使用url编码,否则它将不考虑这些参数本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句