Python web scraping: looping over pages

Posted by debugcn in Dev

小牛

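I appreciate this question has been asked here many times, but I can't seem to get it to work for me.

I have written a scraper that successfully scrapes everything I need from the first page of the site. However, I can't figure out how to make it loop through the other pages.

The URL simply increments like this: BLAH/3 + 'page=x'

I haven't been learning to code for very long, so any advice would be appreciated!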

import requests
from bs4 import BeautifulSoup


url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3'

r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# String substitution for HTML
for link in soup.find_all("a"):
    "<a href='>%s'>%s</a>" %(link.get("href"), link.text)

# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})


for item in general_data:
    name = print(item.contents[0].text)
    address = print(item.contents[1].text.replace('.',''))
    care_type = print(item.contents[2].text)

Update:

r = requests.get('http://www.URL.org/BLAH1/BLAH2/BLAH3')

for page in range(10):

    r = requests.get('http://www.URL.org/BLAH1/BLAH2/BLAH3' + 'page=' + page)

soup = BeautifulSoup(r.content, "html.parser")
#print(soup.prettify())


# String substitution for HTML
for link in soup.find_all("a"):
    "<a href='>%s'>%s</a>" %(link.get("href"), link.text)

# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})


for item in general_data:
    name = print(item.contents[0].text)
    address = print(item.contents[1].text.replace('.',''))
    care_type = print(item.contents[2].text)

Update 2:

import requests
from bs4 import BeautifulSoup

url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3&page='

for page in range(10):

r = requests.get(url + str(page))

soup = BeautifulSoup(r.content, "html.parser")

# String substitution for HTML
for link in soup.find_all("a"):
    print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})

for item in general_data:
    print(item.contents[0].text)
    print(item.contents[1].text.replace('.',''))
    print(item.contents[2].text)

简单的

To loop over pages with page=x, you need a for loop like this:

import requests
from bs4 import BeautifulSoup

url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10&page='

for page in range(10):

    print('---', page, '---')

    r = requests.get(url + str(page))

    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML
    for link in soup.find_all("a"):
        print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class' : 'title'})

    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.',''))
        print(item.contents[2].text)
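
The loop above builds each address by appending the page number to a URL that already ends in `page=`. As a minimal sketch of an alternative, the page parameter can be set with the standard library, which also takes care of the `?` or `&` separator that was missing in the question's attempts. The helper name `page_url` and the `example.org` address are placeholders for illustration, not part of the original code.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_url(base_url, page):
    """Return base_url with its `page` query parameter set to `page`."""
    parts = urlsplit(base_url)
    # Keep every existing parameter except an old `page` value.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != 'page']
    # str() avoids the TypeError raised by 'page=' + page with an int.
    query.append(('page', str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# example.org is a placeholder, not the site from the question.
for page in range(1, 4):
    print(page_url('http://www.example.org/results.aspx?rp=10', page))
```

Each resulting string can be passed to `requests.get` in place of `url + str(page)`. Note that `urlencode` re-encodes the existing parameter values, so escapes such as `%2c` may come back in upper case.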

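Every site can be different, so a better solution needs more information about the page. Sometimes you can find a link to the last page, and then you can use that information instead of the 10 in range(10).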
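
One way to do that, sketched under the assumption that the pagination links carry the page number in a `page` query parameter, is to collect the `href` values (for example from `link.get("href")` in the loop above) and take the largest page number found. The function name `last_page` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def last_page(hrefs, default=1):
    """Return the largest `page` number found in the given hrefs."""
    pages = []
    for href in hrefs:
        if not href:
            # link.get("href") returns None for <a> tags without href
            continue
        values = parse_qs(urlsplit(href).query).get('page', [])
        pages.extend(int(value) for value in values if value.isdecimal())
    return max(pages, default=default)


# Sample hrefs, invented for this sketch.
hrefs = ['results.aspx?rp=10&page=2', 'results.aspx?rp=10&page=17', '/about', None]
print(last_page(hrefs))
```

The result could then replace the fixed count, for example `range(1, last_page(hrefs) + 1)` if the site numbers its pages from 1.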

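Alternatively, you can use a while True loop and follow the link to the next page, using break to leave the loop when there is no such link. But first you have to show the page in the question (the URL of the real page).

EDIT: an example of how to get the link to the next page and then fetch all pages, not only 10 pages as in the previous version.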

import requests
from bs4 import BeautifulSoup

# link to first page - without `page=`
url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10'

# only for information, not used in url
page = 0 

while True:

    print('---', page, '---')

    r = requests.get(url)

    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML
    for link in soup.find_all("a"):
        print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class' : 'title'})

    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.',''))
        print(item.contents[2].text)

    # link to next page

    next_page = soup.find('a', {'class': 'next'})

    if next_page:
        url = next_page.get('href')
        page += 1
    else:
        break # exit `while True`
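
The loop above passes the `href` of the next link straight to `requests.get`, which works when the site writes absolute URLs. If the site uses relative links instead (an assumption, not checked against the real page), the href has to be resolved against the current URL first. A minimal sketch with the standard library, where `resolve_next` and the `example.org` address are hypothetical names:

```python
from urllib.parse import urljoin


def resolve_next(current_url, href):
    """Return the absolute URL for a "next" link, or None if there is no link."""
    if not href:
        return None
    # urljoin leaves absolute hrefs unchanged and resolves relative ones.
    return urljoin(current_url, href)


# example.org is a placeholder, not the site from the question.
current = 'http://www.example.org/housing/results.aspx?rp=10'
print(resolve_next(current, 'results.aspx?rp=10&page=2'))
print(resolve_next(current, None))
```

In the loop this would look like `url = resolve_next(url, next_page.get('href'))`. It may also be worth capping the number of iterations, so that a site whose last page links back to itself cannot keep the loop running forever.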
