Web scraping with Python / BeautifulSoup: website with multiple links to profiles > need profile content

Posted by debugcn in Dev

Elena

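For my master's thesis I want to send a survey to as many people in the field (early childhood education) as possible, so my goal is to scrape the e-mail addresses of daycare centers (KiTa) from a public site. I am very new to Python, so although this probably seems trivial to most people, it is turning out to be quite a challenge at my level of knowledge. I am also not familiar with the jargon, so I don't even know what I need to look for.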

This is the site (in German): https://www.kitanetz.de/

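To get to the content I need, I first have to select a state ("Bundesland"), which takes me to the next level, where I need to click "Kreise auflisten". That takes me to another level that lists all the districts within that state. Each district link opens a page with postal codes and links to profiles. Some of those profiles contain an e-mail address and some don't (I found a tutorial that covers that case).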

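It took me two days to scrape the postal codes and center names from one of those pages. What do I need to do so that Python iterates over every state, every district, and every profile to get the links? If you know of a resource or a keyword I should look for, that would be a great next step. I also haven't yet managed to put the data from this code into a data frame with pandas; my other attempts didn't work.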
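The level-by-level traversal described above (state, then district, then profile) could be sketched roughly as follows. This is only an illustration under assumptions: `extract_links`, `crawl`, the `fetch` callable, and the `example.org` pages are hypothetical names and data, not the real link structure of kitanetz.de, which would have to be checked in the browser first.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url, must_contain):
    """Return absolute URLs of all <a href> values that contain `must_contain`."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if must_contain in href:
            # urljoin turns relative links into absolute ones
            links.append(urljoin(base_url, href))
    return links


def crawl(start_url, fetch, patterns):
    """Follow links level by level; `patterns` holds one substring per level.

    `fetch` is any callable that maps a URL to its HTML text,
    for example: lambda u: requests.get(u).text
    """
    current = [start_url]
    for pattern in patterns:
        next_level = []
        for url in current:
            next_level.extend(extract_links(fetch(url), url, pattern))
        current = next_level
    return current


# Hypothetical three-level site, used only to demonstrate the traversal offline.
FAKE_PAGES = {
    "https://example.org/": '<a href="/state/a">A</a><a href="/about">x</a>',
    "https://example.org/state/a": '<a href="/district/1">1</a><a href="/district/2">2</a>',
    "https://example.org/district/1": '<a href="/profile/k1.php">K1</a>',
    "https://example.org/district/2": '<a href="/profile/k2.php">K2</a>',
}

profiles = crawl(
    "https://example.org/",
    FAKE_PAGES.__getitem__,  # stands in for a real HTTP request
    ["/state/", "/district/", "/profile/"],
)
print(profiles)
```

Passing the download function in as `fetch` keeps the traversal logic testable without network access; for the real site it would be replaced by a function based on `requests.get`, ideally with a short pause between requests.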

This is my attempt so far. I marked my own comments/questions in the code with ##. Lines starting with # are comments from the tutorial:

import requests
from bs4 import BeautifulSoup

## Here's the tutorial I was following: https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup

# Step 1: Sending a HTTP request to a URL
url = requests.get("https://www.kitanetz.de/bezirke/bezirke.php?land=Baden-W%C3%BCrttemberg&kreis=Alb-Donau-Kreis")

# Step 2: Parse the html content
soup = BeautifulSoup(url.text, 'lxml')
# print(soup.prettify()) # print the parsed data of html

# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}
## it says in the tutorial, but what does that actually do? 

## Get the table inside the <div id="inhalt">
table = soup.find_all('table')[0]

## Get the data you want: PLZ, Name Kita (ids) and href to profiles
plz = table.find_all('td', attrs={"headers": "header2"})
ids = table.find_all('td', attrs={"headers": "header3"})

table_data = table.find_all("tr")  ## contains 101 rows. row [0] is header, using th tags. Rows [1]:[101] use td tags

for link in table.find_all("a"):
    print("Name: {}".format(link.text))
    print("href: {}".format(link.get("href")))
    
# Get the headers of the list
t_headers = []
for th in table.find_all("th"):
    # remove any newlines and extra spaces from left and right
    t_headers.append(th.text.replace('\n', ' ').strip())
    
# Get all the rows of table
table_data = []
for tr in table.find_all('tr'): # find all tr's from table ## no, it doesn't
    t_row = {}
    # Each table row is stored in the form of
    ## t_row = {'.': '', 'PLZ': '', 'Name Kita': '', 'Alter', '', 'Profil':''}
    ## we want: t_row = {'PLZ':'', 'Name Kita': '', 'EMail': ''}. Emails are stored in the hrefs -> next layer
    ## how do I get my plz, ids and hrefs in one dataframe? I'd know in R but this here works different.

    # find all td's(3) in tr and zip it with t_header
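    for td, th in zip(tr.find_all("td"), t_headers):
        t_row[th] = td.text.replace('\n', '').strip()
    # append each row once, after all of its cells have been collected;
    # the header row has no td cells, so its empty dict is skipped
    if t_row:
        table_data.append(t_row)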
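One possible way to get the postal code, the name, and the profile link of each row into a single structure is sketched below. The markup in `SAMPLE_HTML` is an assumption that only mimics the layout described in the comments above (`header2` for the PLZ, `header3` for the name with its link); the real page may differ, and `table_to_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the table described above; not copied from the site.
SAMPLE_HTML = """
<table>
  <tr><th id="header1">.</th><th id="header2">PLZ</th><th id="header3">Name Kita</th></tr>
  <tr><td headers="header1">1</td><td headers="header2">89073</td>
      <td headers="header3"><a href="/bw/89073/beispiel1.php">Kita Eins</a></td></tr>
  <tr><td headers="header1">2</td><td headers="header2">89075</td>
      <td headers="header3"><a href="/bw/89075/beispiel2.php">Kita Zwei</a></td></tr>
</table>
"""


def table_to_rows(html):
    """Collect PLZ, name and profile link of every data row as a list of dicts."""
    table = BeautifulSoup(html, "html.parser").find("table")
    rows = []
    for tr in table.find_all("tr"):
        plz = tr.find("td", attrs={"headers": "header2"})
        name = tr.find("td", attrs={"headers": "header3"})
        if plz is None or name is None:
            continue  # header row: it uses th cells, so there is nothing to collect
        link = name.find("a")
        rows.append({
            "PLZ": plz.get_text(strip=True),
            "Name Kita": name.get_text(strip=True),
            "href": link.get("href") if link else None,
        })
    return rows


rows = table_to_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)`, which creates one column per key.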

Andrej Kesely

You can use the site's sitemap.xml to get all the links to the profiles. Once you have all the links, the rest is just simple parsing:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.kitanetz.de/sitemap.xml'

sitemap = BeautifulSoup(requests.get(url).content, 'html.parser')
r = re.compile(r'/\d+/[^/]+\.php')
for loc in sitemap.select('loc'):
    if r.search(loc.text):
        html_data = requests.get(loc.text).text
        soup = BeautifulSoup(html_data, 'html.parser')

        title = soup.h1.text

        email = re.search(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", html_data, flags=re.S)
        if email:
            email = email[1] + '@' + email[2] + '.' + email[3]
        else:
            email = '-'

        print('{:<60} {:<35} {}'.format(title, email, loc.text))

Prints:

Evangelisch-lutherische Kindertagessstätte Lemförde          [email protected]              https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php
Kindertagesstätte Stuhr I                                    [email protected]                 https://www.kitanetz.de/niedersachsen/28816/stuhrer-landstrasse33a.php
Kita St. Bonifatius (Frankestraße)                           [email protected]     https://www.kitanetz.de/niedersachsen/31515/frankestrasse11.php
Ev. Kita Ketzin                                              [email protected]     https://www.kitanetz.de/brandenburg/14669/rathausstr17.php
Humanistische Kindertagesstätte `Die kleinen Strolche´       [email protected]              https://www.kitanetz.de/niedersachsen/30823/auf_der_horst115.php
Kindertagesstätte Idensen                                    [email protected]            https://www.kitanetz.de/niedersachsen/31515/an_der_sigwardskirche2.php
Kindergroßtagespflege `Nesthäkchen´                          [email protected]      https://www.kitanetz.de/niedersachsen/30916/am_rathfeld4.php
Venhof Kindertagesstätte                                     [email protected]                  https://www.kitanetz.de/niedersachsen/31515/schulstrasse14.php
Kindergarten Uetze `Buddelkiste´                             [email protected]                https://www.kitanetz.de/niedersachsen/31311/eichendorffstrasse2b.php
Kita Lindenblüte                                             [email protected]          https://www.kitanetz.de/niedersachsen/27232/lindern17.php
DRK Kita Luthe                                               [email protected]          https://www.kitanetz.de/niedersachsen/31515/an_der_boehmerke7.php
Freier Kindergarten Allerleirauh                             [email protected]   https://www.kitanetz.de/niedersachsen/31303/dachtmisser_weg3.php
Ev.-luth. Kindergarten St. Johannis                          [email protected]           https://www.kitanetz.de/niedersachsen/38102/leonhardstr40.php
Kindertagesstätte Immensen-Arpke I                           [email protected]            https://www.kitanetz.de/niedersachsen/31275/am_schnittgraben15.php
SV Mörsen-Scharrendorf Mini-Club                             [email protected]           https://www.kitanetz.de/niedersachsen/27239/am-sportheim6.php
Kindergarten Transvaal                                       [email protected]         https://www.kitanetz.de/niedersachsen/26723/althusiusstr89.php
Städtische Kindertagesstätte Gartenstadt                     [email protected]    https://www.kitanetz.de/niedersachsen/38122/wurmbergstr48.php
Kindergruppe Till Eulenspiegel e.V. - Bärenbande & Windelrocker [email protected]          https://www.kitanetz.de/niedersachsen/38102/kurt-schumacher-str7.php
Ev. luth. Kindertagesstätte der Versöhnun                    [email protected]    https://www.kitanetz.de/niedersachsen/30823/im_alten_dorfe6.php
Kinderkrippe Ratzenspatz                                     [email protected]             https://www.kitanetz.de/niedersachsen/31535/am_goetheplatz5.php
Kinderkrippe Hemmingen-Westerfeld                            [email protected]         https://www.kitanetz.de/niedersachsen/30966/berliner_strasse16-22.php

... and so on.
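The two regular expressions in the script above can be tried in isolation, without any network access. The sketch below wraps them in two hypothetical helpers, `is_profile_url` and `extract_email`. The `SAMPLE_PAGE` snippet is an assumption about what the obfuscated address might look like in a page's source (three quoted values named `ez`, `ey`, `ex`, in that order); it is not copied from the site.

```python
import re

# Same pattern as in the script above: /<digits>/<file name>.php
PROFILE_URL = re.compile(r"/\d+/[^/]+\.php")
# Three quoted pieces: local part (ez), domain (ey), top-level domain (ex).
EMAIL_PARTS = re.compile(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", flags=re.S)


def is_profile_url(url):
    """True if the URL looks like a profile page (digits segment + .php file)."""
    return PROFILE_URL.search(url) is not None


def extract_email(html):
    """Reassemble the address from its three parts, or return None if absent."""
    match = EMAIL_PARTS.search(html)
    if match is None:
        return None
    return "{}@{}.{}".format(match[1], match[2], match[3])


# Hypothetical page fragment, for illustration only.
SAMPLE_PAGE = """
<script>
var ez='info';
var ey='kita-beispiel';
var ex='de';
</script>
"""

print(is_profile_url("https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php"))
print(is_profile_url("https://www.kitanetz.de/bezirke/bezirke.php"))
print(extract_email(SAMPLE_PAGE))
print(extract_email("<p>no address here</p>"))
```

Returning `None` instead of `'-'` for a missing address makes it easier to filter out centers without an e-mail later on.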


Edited on 2021-04-2
