Web scraping multiple pages with BeautifulSoup

回切

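I have collected all the information I need from the first page, but I don't know how to collect it from every page of the site. I looked for a solution in other Stack Overflow threads but couldn't find one. I'd be very grateful for any help.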

The site I'm parsing: https://jaze.ru/forum/topic?id=50&page=1

Source:

from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

# build the request with a browser User-Agent to get past mod_security
my_url = Request('http://jaze.ru/forum/topic?id=50&page=1', headers={'User-Agent': 'Mozilla/5.0'})
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs each name of player
containers = page_soup.findAll("div", {"class":"top-area"})


for container in containers:
    playerName = container.div.a.text.strip()
    print("BattlePass PlayerName: " + playerName)

Source 2:

from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

# save all info to csv file; open it once, before the loop,
# so that each page adds to the file instead of overwriting it
filename = "BattlePassNicknames.csv"
with open(filename, "w", encoding="utf-8") as f:
    f.write("Member of JAZE Battle Pass 2019\n")

    # start page
    i = 1
    while True:
        link = 'https://jaze.ru/forum/topic?id=50&page='+str(i)
        my_url = Request(
            link,
            headers={'User-Agent': 'Mozilla/5.0'}
        )
        i += 1  # increment page no for next run
        uClient = uReq(my_url)
        # Check if there was a redirect
        if uClient.url != link:
            uClient.close()
            break
        page_html = uClient.read()
        uClient.close()
        # html parsing
        page_soup = soup(page_html, "html.parser")
        # grabs each name of player
        containers = page_soup.findAll("div", {"class": "top-area"})

        for container in containers:
            playerName = container.div.a.text.strip()
            print("BattlePass PlayerName: " + playerName)
            f.write(playerName + "\n")

Bitto Bennichan

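If the page query parameter is greater than the last available page, the site redirects you to another page. You can use this to keep incrementing page until you get redirected. This works if you already know the topic id (50 in this case).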

from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

# start page
i = 1
while True:
    link = 'https://jaze.ru/forum/topic?id=50&page='+str(i)
    my_url = Request(
        link,
        headers={'User-Agent': 'Mozilla/5.0'}
    )
    i += 1  # increment page no for next run
    uClient = uReq(my_url)
    # Check if there was a redirect (the requested page is past the last one)
    if uClient.url != link:
        uClient.close()
        break
    page_html = uClient.read()
    uClient.close()
    # html parsing
    page_soup = soup(page_html, "html.parser")
    # grabs each name of player
    containers = page_soup.findAll("div", {"class": "top-area"})

    for container in containers:
        playerName = container.div.a.text.strip()
        print("BattlePass PlayerName: " + playerName)

Output

BattlePass PlayerName: VANTY3
BattlePass PlayerName: VANTY3
BattlePass PlayerName: KK#キング
BattlePass PlayerName: memories
BattlePass PlayerName: Waffel
BattlePass PlayerName: CynoBap
...
BattlePass PlayerName: Switchback

If you also want to try this with random topic ids, you will have to handle urllib.error.HTTPError somewhere in your code to deal with 404s and the like.
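
One way to do that, as a rough sketch rather than a tested solution for this particular forum: wrap the urlopen call in a try/except and treat an HTTP error like "no such page". The function name fetch_topic_page and its opener parameter are made up for this example (opener only exists so the function can be tried without hitting the network). The URL pattern and the redirect check are the same as in the loop above.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_topic_page(topic_id, page, opener=urlopen):
    """Return the page HTML as bytes, or None on an HTTP error or a redirect.

    fetch_topic_page and opener are hypothetical names for illustration;
    opener defaults to urllib.request.urlopen.
    """
    link = 'https://jaze.ru/forum/topic?id=' + str(topic_id) + '&page=' + str(page)
    request = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        response = opener(request)
    except HTTPError as err:
        # e.g. a 404 for a topic id that does not exist
        print("Skipping " + link + ": HTTP " + str(err.code))
        return None
    try:
        # same redirect check as in the loop above
        if response.url != link:
            return None
        return response.read()
    finally:
        response.close()
```

In the while loop you would then call fetch_topic_page(topic_id, i) and break as soon as it returns None. Whether a missing topic on this site really answers with a 404 or with a redirect is something I have not checked, so the sketch handles both.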
