I have a lot of web scraping to do, so I switched to a headless browser hoping that would make things faster, but it didn't improve the speed by much.
I looked at this Stack Overflow post, but I don't understand the answer someone wrote: is Selenium slow, or is my code wrong?
Here is my slow code:
# followed this tutorial https://medium.com/@stevennatera/web-scraping-with-selenium-and-chrome-canary-on-macos-fc2eff723f9e
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

options = webdriver.ChromeOptions()
options.binary_location = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary'
options.add_argument('window-size=800x841')
options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)

driver.get('https://poshmark.com/search?')
xpath = '//input[@id="user-search-box"]'
search_box = driver.find_element_by_xpath(xpath)
brand = "anthropology"
style = "headband"
search_box.send_keys(' '.join([brand, style]))
# equivalent of hitting the Enter key
search_box.send_keys(Keys.ENTER)

url = driver.current_url
print(url)
response = requests.get(url)
print(response)
print(response.text)

# use BeautifulSoup to grab the listings
html = response.content
soup = BeautifulSoup(html, 'html.parser')

# 'a' tags are links (anchor tags); the href attribute holds the hyperlink
hyper_links = [link.get("href") for link in soup.find_all("a")]

# keep only links containing "listing"; the `if listing` guard skips hrefs
# that are None, and building a set removes duplicates
clothing_listings = {listing for listing in hyper_links if listing and "listing" in listing}
print(len(clothing_listings))
print(clothing_listings)

# for some reason links containing "unlike" also show up, so drop those
clothing_listings = {listing for listing in hyper_links
                     if listing and "listing" in listing and "unlike" not in listing}
print(len(clothing_listings))  # the number of clothing items returned by that search
driver.quit()
Why does the scraping take so long?
You are already using requests to fetch the URL. So why not use it for the entire task? The part where you use selenium seems redundant: you only use it to open the link, and then use requests to fetch the resulting URL. All you have to do is pass the appropriate headers, which you can gather by looking at the Network tab of the developer tools in Chrome or Firefox.
rh = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://poshmark.com/search?',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
Modify the URL to search for the specific term:
query = 'anthropology headband'
url = 'https://poshmark.com/search?query={}&type=listings&department=Women'.format(query)
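As a side note, the space in the search term should be URL-encoded before it goes into the query string; a minimal sketch using the standard library's urlencode, with the parameter names taken from the URL above:

```python
from urllib.parse import urlencode

query = 'anthropology headband'
# urlencode escapes the space in the query for us
url = 'https://poshmark.com/search?' + urlencode({
    'query': query,
    'type': 'listings',
    'department': 'Women',
})
print(url)  # https://poshmark.com/search?query=anthropology+headband&type=listings&department=Women
```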
Then, parse the response using BeautifulSoup. Additionally, you can narrow down the links you scrape by using any attribute specific to the ones you want. In your case, that's the class covershot-con.
r = requests.get(url, headers = rh)
soup = BeautifulSoup(r.content, 'lxml')
links = soup.find_all('a', {'class': 'covershot-con'})
Here is the result:
for i in links:
print(i['href'])
/listing/Anthro-Beaded-Headband-5a78fb899a9455e90aef438e
/listing/NWT-ANTHROPOLOGIE-Twisted-Vines-Crystal-Headband-5abbfb4a07003ad2dc58142f
/listing/Anthropologie-Nicole-Co-White-Floral-Headband-59dea5adeaf0302a5600bc41
/listing/NWT-ANTHROPOLOGIE-Namrata-Spring-Blossom-Headband-5ab5509d72769b52ba31829e
.
.
.
/listing/Anthropologie-By-Lilla-Spiky-Blue-Headband-59064f2ffbf6f90bfb01b854
/listing/Anthropologie-Beaded-Headband-5ab2cfe79d20f01a73ab0ddb
/listing/Anthropologie-Floral-Hawaiian-Headband-59d09eb941b4e0e1710871ec
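The hrefs above come back relative to the site root; if you need absolute URLs (for example, to feed them back into requests), the standard library's urljoin can resolve them. A small sketch using one of the hrefs printed above:

```python
from urllib.parse import urljoin

base = 'https://poshmark.com'
href = '/listing/Anthro-Beaded-Headband-5a78fb899a9455e90aef438e'
# urljoin resolves a root-relative href against the site's base URL
print(urljoin(base, href))
# https://poshmark.com/listing/Anthro-Beaded-Headband-5a78fb899a9455e90aef438e
```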
Edit (tips):
Use selenium only as a last resort (when everything else fails). As @Gilles Quenot said, selenium is not meant for quickly executing web requests.
Learn how to work with the requests library (using headers, passing data, etc.). Their documentation page is enough to get started. It will suffice for most scraping tasks, and it's fast.
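If you run many searches, a requests.Session is worth knowing about: it reuses the underlying connection and sends the same headers on every request. A sketch reusing the rh header dict shown earlier (abbreviated here):

```python
import requests

# abbreviated copy of the rh header dict from the answer above
rh = {
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}

session = requests.Session()
session.headers.update(rh)  # these headers now go out with every request
# r = session.get('https://poshmark.com/search?query=anthropology+headband')
```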
Even for pages that require JS execution, you can get by with requests if you can figure out how to execute the JS part with a library like js2py.