How to get simple information with a crawler


두산 비가

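I'm trying to build a simple crawler that scrapes the page https://en.wikipedia.org/wiki/Web_scraping and then extracts the 19 links from its "See also" section. I've managed to do that, but I also want to extract the first paragraph from each of those 19 links, and that's where it stops "working": I get the same paragraph from the first page every time, not the one from each linked page. This is what I have so far. I know there may be better options for doing this, but I'd like to stick with BeautifulSoup and simple Python code.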

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Web_scraping'

data = requests.get('https://en.wikipedia.org/wiki/Web_scraping').text

soup = BeautifulSoup(data, 'html.parser')


def visit():
    try:
        p = soup.p
        print(p.get_text())
    except AttributeError:
        print('<p> Tag was not found')


links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
    if 'href' in link.attrs:
        links_todo.append(urljoin(url, link.attrs['href']))

while links_todo:
    url_to_visit = links_todo.pop()
    print('Now visiting:', url_to_visit)
    visit()

Example of the first output:

Now visiting: https://en.wikipedia.org/wiki/OpenSocial
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

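The intended behavior is to print the first paragraph of each new link, not the same paragraph from the first page every time. How can I fix this? Any tips on what I've missed are welcome too. I'm new to Python, so I'm learning the concepts as I go.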

소나무

At the top of your code you define data and soup. Both are tied to https://en.wikipedia.org/wiki/Web_scraping.

Every time you call visit(), you print from soup, and soup never changes.
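The effect can be reproduced without any network access. The following is a minimal sketch, assuming two made-up inline HTML strings in place of real pages; PAGES, visit_broken and visit_fixed are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for two web pages, so the sketch runs offline.
PAGES = {
    'start': '<html><body><p>Start page paragraph.</p></body></html>',
    'other': '<html><body><p>Other page paragraph.</p></body></html>',
}

# Parsed once at module level, like `soup` in the question.
soup = BeautifulSoup(PAGES['start'], 'html.parser')


def visit_broken():
    # Always reads the module-level soup, so the result never changes.
    return soup.p.get_text()


def visit_fixed(name):
    # Parses the page it was asked about, so the result follows the argument.
    page_soup = BeautifulSoup(PAGES[name], 'html.parser')
    return page_soup.p.get_text()


print(visit_broken())        # text from the start page, whatever we meant to visit
print(visit_fixed('other'))  # text from the page named in the argument
```

No matter how many times visit_broken() is called, it reads the one soup that was built from the start page.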

You need to pass the URL to visit(), e.g. visit(url_to_visit). The visit function should accept a URL as an argument, fetch that page with requests, build a new soup from the returned data, and then print the first paragraph.

Edited to add code illustrating my original answer.

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
 
start_url = 'https://en.wikipedia.org/wiki/Web_scraping'
# Renamed this to start_url to make it clear that this is the source page 
data = requests.get(start_url).text
 
soup = BeautifulSoup(data, 'html.parser')
 
 
def visit(new_url): # function now accepts a url as an argument
    try:
        new_data = requests.get(new_url).text # retrieve the text from the url
        new_soup = BeautifulSoup(new_data, 'html.parser') # process the retrieved html in beautiful soup
        p = new_soup.p
        print(p.get_text())
    except AttributeError:
        print('<p> Tag was not found')
 
 
links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
    if 'href' in link.attrs:
        links_todo.append(urljoin(start_url, link.attrs['href']))
 
while links_todo:
    url_to_visit = links_todo.pop()
    print('Now visiting:', url_to_visit)
    visit(url_to_visit) # here's where we pass each link to the visit() function
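One possible refinement, not part of the answer above: on some pages the first <p> tag may be empty, in which case soup.p prints a blank line. The sketch below assumes you want the first paragraph that actually contains text, and it separates parsing from fetching so the parsing can be checked offline. first_paragraph is a hypothetical helper name, and the timeout and raise_for_status() call are defensive additions of mine.

```python
import requests
from bs4 import BeautifulSoup


def first_paragraph(html):
    """Return the text of the first non-empty <p> in html, or None if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    for p in soup.find_all('p'):
        text = p.get_text().strip()
        if text:
            return text
    return None


def visit(url):
    # Fetch the page for this specific URL, then parse that page only.
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return first_paragraph(response.text)


# Offline check with a made-up snippet whose first <p> is empty.
sample = '<p class="empty"></p><p>Real first paragraph.</p>'
print(first_paragraph(sample))
```

In the loop you would then write print(visit(url_to_visit)) and handle a None result however suits you.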


Edited on 2021-05-28
