I am trying to make a simple crawler that goes through the page https://en.wikipedia.org/wiki/Web_scraping and then extracts the 19 links from the "About" section. I managed to do that, but I am also trying to extract the first paragraph from each of those 19 links, and that is where it stops "working". I get the same paragraph from the first page instead of one from each page. This is what I have so far. I know there are probably better options for doing this, but I want to stick with BeautifulSoup and simple Python code.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get('https://en.wikipedia.org/wiki/Web_scraping').text
soup = BeautifulSoup(data, 'html.parser')

def visit():
    try:
        p = soup.p
        print(p.get_text())
    except AttributeError:
        print('<p> Tag was not found')

links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
    if 'href' in link.attrs:
        links_todo.append(urljoin(url, link.attrs['href']))

while links_todo:
    url_to_visit = links_todo.pop()
    print('Now visiting:', url_to_visit)
    visit()
Example of the first print:
Now visiting: https://en.wikipedia.org/wiki/OpenSocial
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
The expected functionality is that it prints the first paragraph of each new link that is printed, not the same paragraph from the first link every time. What do I need to do to fix this? Or any hints about what I am missing would be appreciated. I am fairly new to Python, so I am still learning concepts as I work through things.
At the top of your code, you define `data` and `soup`. Both are tied to https://en.wikipedia.org/wiki/Web_scraping.

Every time you call `visit()`, you print from `soup`, and `soup` never changes.

You need to pass the url into `visit()`, e.g. `visit(url_to_visit)`. The `visit` function should accept the url as an argument, then use `requests` to fetch that page, create a new soup from the returned data, and then print the first paragraph.
Edited to add code illustrating my original answer:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

start_url = 'https://en.wikipedia.org/wiki/Web_scraping'
# Renamed this to start_url to make it clear that this is the source page
data = requests.get(start_url).text
soup = BeautifulSoup(data, 'html.parser')

def visit(new_url):  # function now accepts a url as an argument
    try:
        new_data = requests.get(new_url).text  # retrieve the text from the url
        new_soup = BeautifulSoup(new_data, 'html.parser')  # process the retrieved html in beautiful soup
        p = new_soup.p
        print(p.get_text())
    except AttributeError:
        print('<p> Tag was not found')

links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
    if 'href' in link.attrs:
        links_todo.append(urljoin(start_url, link.attrs['href']))

while links_todo:
    url_to_visit = links_todo.pop()
    print('Now visiting:', url_to_visit)
    visit(url_to_visit)  # here's where we pass each url to the visit() function
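One refinement you may eventually want (this is my own addition, not part of the fix above): on some Wikipedia pages the first `<p>` is an empty placeholder (class `mw-empty-elt`), so `new_soup.p` can print a blank line. A small helper that skips blank paragraphs avoids that; here is a minimal sketch, with `first_paragraph` being a hypothetical helper name:

```python
from bs4 import BeautifulSoup

def first_paragraph(html):
    """Return the text of the first non-empty <p> in the page, or None.

    Skips paragraphs whose text is blank, e.g. Wikipedia's empty
    'mw-empty-elt' placeholder paragraphs.
    """
    page = BeautifulSoup(html, 'html.parser')
    for p in page.find_all('p'):
        text = p.get_text(strip=True)
        if text:  # keep the first paragraph that actually has text
            return text
    return None

# Small offline example: the first <p> is empty, so it is skipped.
sample = '<p class="mw-empty-elt"></p><p>Real first paragraph.</p>'
print(first_paragraph(sample))  # Real first paragraph.
```

Inside `visit()` you would then call `first_paragraph(new_data)` instead of `new_soup.p.get_text()`.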