I successfully ran the script below, which returns a list of search links based on the cite tags. Unfortunately, some of the returned links come back truncated, for example: www.intel.com/.../i-o-controller-hub-8-9-10-82566-82567-82562v-software-dev-manual.pdf. Is there a way to return the full links?
import urllib.request
from bs4 import BeautifulSoup

# Opener with the default headers cleared
opener = urllib.request.build_opener()
opener.addheaders = []

num_pages = 2
search_query = 'algorithm+encoding+desirable+character+signal+64-bit+communication+binary+propert'

for start in range(0, num_pages):
    url = 'http://www.google.com/search?q=' + search_query + '&start=' + str(start*num_pages)
    page = opener.open(url)
    soup = BeautifulSoup(page, "lxml")
    # Print the text of every <cite> element on the results page
    for cite in soup.findAll('cite'):
        print(cite.text)
Is there a setting, or a better way, to get the full search links from Google?
Thanks in advance.
Instead of searching for the <cite> elements, you can get all the <h3>s with class r. Then you can scrape the <a> tag inside each one and take the anchor's href, like so:
for link in soup.find_all('h3', class_='r'):
    print(link.a['href'][7:])
The slice ([7:]) is there because every href begins with /url?q=, so that Google can track the clicks. Your final solution would then look like this:
import urllib.request
from bs4 import BeautifulSoup

opener = urllib.request.build_opener()
opener.addheaders = []

num_pages = 2
search_query = 'algorithm+encoding+desirable+character+signal+64-bit+communication+binary+propert'

for start in range(0, num_pages):
    url = 'http://www.google.com/search?q=' + search_query + '&start=' + str(start*num_pages)
    page = opener.open(url)
    soup = BeautifulSoup(page, "lxml")
    for link in soup.find_all('h3', class_='r'):
        # Strip the leading '/url?q=' and cut off the '&sa=...' tracking suffix
        text = link.a['href'][7:]
        head, sep, tail = text.partition('&sa')
        print(head)
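
If you would rather not depend on a fixed 7-character offset and string splitting, you can parse the q parameter out of the /url?q=... href with urllib.parse instead. A minimal sketch under that assumption (extract_target is a hypothetical helper, not part of the answer above):

from urllib.parse import urlparse, parse_qs

def extract_target(href):
    # A result href looks like '/url?q=https://example.com/page&sa=U&...'
    params = parse_qs(urlparse(href).query)
    # 'q' holds the real destination; fall back to the raw href if it is missing
    return params.get('q', [href])[0]

Inside the loop you would then call extract_target(link.a['href']) instead of slicing and partitioning.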