I successfully ran the script below, which returns a list of search links based on the cite tags. Unfortunately, some of the returned links come back truncated, for example: www.intel.com/.../i-o-controller-hub-8-9-10-82566-82567-82562v-software-dev-manual.pdf. Is there a way to return the full links?
import urllib.request
from bs4 import BeautifulSoup

# Opener with the default headers cleared
opener = urllib.request.build_opener()
opener.addheaders = []

num_pages = 2
search_query = 'algorithm+encoding+desirable+character+signal+64-bit+communication+binary+propert'

for start in range(0, num_pages):
    url = 'http://www.google.com/search?q=' + search_query + '&start=' + str(start*num_pages)
    page = opener.open(url)
    soup = BeautifulSoup(page, "lxml")
    # Print the text of every <cite> element on the results page
    for cite in soup.findAll('cite'):
        print(cite.text)
Is there a setting, or a better way, to get the full search links from Google?
Thanks in advance.
Instead of searching for the <cite> elements, you can get all the <h3>s with class r. Then you can scrape the <a> tag inside each one and take the anchor's href, like so:
for link in soup.find_all('h3', class_='r'):
    print(link.a['href'][7:])
The slice ([7:]) is there because every href begins with /url?q=, so that Google can track the clicks. Your final solution would then look like this:
import urllib.request
from bs4 import BeautifulSoup

opener = urllib.request.build_opener()
opener.addheaders = []

num_pages = 2
search_query = 'algorithm+encoding+desirable+character+signal+64-bit+communication+binary+propert'

for start in range(0, num_pages):
    url = 'http://www.google.com/search?q=' + search_query + '&start=' + str(start*num_pages)
    page = opener.open(url)
    soup = BeautifulSoup(page, "lxml")
    for link in soup.find_all('h3', class_='r'):
        # Strip the leading '/url?q=' and cut off the '&sa=...' tracking suffix
        text = link.a['href'][7:]
        head, sep, tail = text.partition('&sa')
        print(head)
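
If you would rather not depend on a fixed 7-character offset and string splitting, you can parse the q parameter out of the /url?q=... href with urllib.parse instead. A minimal sketch under that assumption (extract_target is a hypothetical helper, not part of the answer above):

from urllib.parse import urlparse, parse_qs

def extract_target(href):
    # A result href looks like '/url?q=https://example.com/page&sa=U&...'
    params = parse_qs(urlparse(href).query)
    # 'q' holds the real destination; fall back to the raw href if it is missing
    return params.get('q', [href])[0]

Inside the loop you would then call extract_target(link.a['href']) instead of slicing and partitioning.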