Web scraping with Python / BeautifulSoup: site with multiple links to profiles > need the profile contents



This is the site (in German): https://www.kitanetz.de/

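To get to the content I need, I first have to select a federal state ("Bundesland"), which takes me to the next level, where I have to click "Kreise auflisten". That brings me to the next level, which lists all the districts within that state. Each link there opens a next-level page containing postal codes and links to profiles. Some of these profiles contain an email address and some don't (I found tutorials that cover how to handle that).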



import requests
from bs4 import BeautifulSoup

## Here's the tutorial I was following: https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup

# Step 1: Sending an HTTP request to a URL
url = requests.get("https://www.kitanetz.de/bezirke/bezirke.php?land=Baden-W%C3%BCrttemberg&kreis=Alb-Donau-Kreis")

# Step 2: Parse the html content
soup = BeautifulSoup(url.text, 'lxml')
# print(soup.prettify()) # print the parsed data of html

# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}
## it says in the tutorial, but what does that actually do? 

## Get the table inside the <div id="inhalt">
table = soup.find_all('table')[0]

## Get the data you want: PLZ, Name Kita (ids) and href to profiles
plz = table.find_all('td', attrs={"headers": "header2"})
ids = table.find_all('td', attrs={"headers": "header3"})

table_data = table.find_all("tr")  ## contains 101 rows. row [0] is header, using th tags. Rows [1]:[101] use td tags

for link in table.find_all("a"):
    print("Name: {}".format(link.text))
    print("href: {}".format(link.get("href")))
# Get the headers of the list
t_headers = []
for th in table.find_all("th"):
    # remove any newlines and extra spaces from left and right
    t_headers.append(th.text.replace('\n', ' ').strip())
# Get all the rows of table
table_data = []
for tr in table.find_all('tr'): # find all tr's from table ## no, it doesn't
    t_row = {}
    # Each table row is stored in the form of
    ## t_row = {'.': '', 'PLZ': '', 'Name Kita': '', 'Alter': '', 'Profil': ''}
    ## we want: t_row = {'PLZ': '', 'Name Kita': '', 'EMail': ''}. Emails are stored behind the hrefs -> next layer
    ## how do I get my plz, ids and hrefs into one dataframe? I'd know how in R, but this works differently.

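    # find all td's in tr and zip them with t_headers
    for td, th in zip(tr.find_all("td"), t_headers):
        t_row[th] = td.text.replace('\n', '').strip()
    # keep the row; the header row has no td cells, so skip empty dicts
    if t_row:
        table_data.append(t_row)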
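
A minimal sketch of one way to get the postal code, name and profile link of each row into a single record. The sample markup is hypothetical: it only mirrors the column names mentioned in the comments above ('.', 'PLZ', 'Name Kita', 'Alter', 'Profil'), and the real page may be structured differently. The helper name `parse_rows` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table; the live page may use different markup.
SAMPLE_HTML = """
<table>
  <tr><th>.</th><th>PLZ</th><th>Name Kita</th><th>Alter</th><th>Profil</th></tr>
  <tr>
    <td>1</td><td>89143</td><td>Kita A</td><td>3-6</td>
    <td><a href="/example/89143/kita-a.php">Profil</a></td>
  </tr>
  <tr>
    <td>2</td><td>89150</td><td>Kita B</td><td>1-6</td>
    <td><a href="/example/89150/kita-b.php">Profil</a></td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row, keyed by column header plus 'href'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]

    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # the header row only has th cells
        row = {h: td.get_text(strip=True) for h, td in zip(headers, cells)}
        link = tr.find("a")
        row["href"] = link.get("href") if link else None
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row["PLZ"], row["Name Kita"], row["href"])
```

A list of dicts like `rows` can be passed straight to `pandas.DataFrame(rows)` if a dataframe is wanted.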

Andrej Kesely


import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.kitanetz.de/sitemap.xml'

sitemap = BeautifulSoup(requests.get(url).content, 'html.parser')
r = re.compile(r'/\d+/[^/]+\.php')
for loc in sitemap.select('loc'):
    if r.search(loc.text):
        html_data = requests.get(loc.text).text
        soup = BeautifulSoup(html_data, 'html.parser')

        title = soup.h1.text

        email = re.search(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", html_data, flags=re.S)
        if email:
            email = email[1] + '@' + email[2] + '.' + email[3]
        else:
            email = '-'

        print('{:<60} {:<35} {}'.format(title, email, loc.text))


Evangelisch-lutherische Kindertagessstätte Lemförde          [email protected]              https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php
Kindertagesstätte Stuhr I                                    [email protected]                 https://www.kitanetz.de/niedersachsen/28816/stuhrer-landstrasse33a.php
Kita St. Bonifatius (Frankestraße)                           [email protected]     https://www.kitanetz.de/niedersachsen/31515/frankestrasse11.php
Ev. Kita Ketzin                                              [email protected]     https://www.kitanetz.de/brandenburg/14669/rathausstr17.php
Humanistische Kindertagesstätte `Die kleinen Strolche´       [email protected]              https://www.kitanetz.de/niedersachsen/30823/auf_der_horst115.php
Kindertagesstätte Idensen                                    [email protected]            https://www.kitanetz.de/niedersachsen/31515/an_der_sigwardskirche2.php
Kindergroßtagespflege `Nesthäkchen´                          [email protected]      https://www.kitanetz.de/niedersachsen/30916/am_rathfeld4.php
Venhof Kindertagesstätte                                     [email protected]                  https://www.kitanetz.de/niedersachsen/31515/schulstrasse14.php
Kindergarten Uetze `Buddelkiste´                             [email protected]                https://www.kitanetz.de/niedersachsen/31311/eichendorffstrasse2b.php
Kita Lindenblüte                                             [email protected]          https://www.kitanetz.de/niedersachsen/27232/lindern17.php
DRK Kita Luthe                                               [email protected]          https://www.kitanetz.de/niedersachsen/31515/an_der_boehmerke7.php
Freier Kindergarten Allerleirauh                             [email protected]   https://www.kitanetz.de/niedersachsen/31303/dachtmisser_weg3.php
Ev.-luth. Kindergarten St. Johannis                          [email protected]           https://www.kitanetz.de/niedersachsen/38102/leonhardstr40.php
Kindertagesstätte Immensen-Arpke I                           [email protected]            https://www.kitanetz.de/niedersachsen/31275/am_schnittgraben15.php
SV Mörsen-Scharrendorf Mini-Club                             [email protected]           https://www.kitanetz.de/niedersachsen/27239/am-sportheim6.php
Kindergarten Transvaal                                       [email protected]         https://www.kitanetz.de/niedersachsen/26723/althusiusstr89.php
Städtische Kindertagesstätte Gartenstadt                     [email protected]    https://www.kitanetz.de/niedersachsen/38122/wurmbergstr48.php
Kindergruppe Till Eulenspiegel e.V. - Bärenbande & Windelrocker [email protected]          https://www.kitanetz.de/niedersachsen/38102/kurt-schumacher-str7.php
Ev. luth. Kindertagesstätte der Versöhnun                    [email protected]    https://www.kitanetz.de/niedersachsen/30823/im_alten_dorfe6.php
Kinderkrippe Ratzenspatz                                     [email protected]             https://www.kitanetz.de/niedersachsen/31535/am_goetheplatz5.php
Kinderkrippe Hemmingen-Westerfeld                            [email protected]         https://www.kitanetz.de/niedersachsen/30966/berliner_strasse16-22.php

... and so on.
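
The two regular expressions in the script above can be tried offline, without sending any requests. The sketch below wraps them in two small helpers. The helper names (`is_profile_url`, `extract_email`) and the sample snippet are assumptions made for illustration. Only the patterns themselves, including the variable names `ez`, `ey` and `ex`, are taken from the script.

```python
import re

# Same patterns as in the script above.
PROFILE_URL = re.compile(r'/\d+/[^/]+\.php')
EMAIL_PARTS = re.compile(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", flags=re.S)


def is_profile_url(url):
    """True if the URL has a numeric path segment followed by a .php page."""
    return PROFILE_URL.search(url) is not None


def extract_email(html_data):
    """Join the three captured parts into an address, or return '-' if absent."""
    match = EMAIL_PARTS.search(html_data)
    if match is None:
        return '-'
    return match[1] + '@' + match[2] + '.' + match[3]


# Hypothetical snippet; the real profile pages may embed the parts differently.
sample = "<script>var ez='info';\nvar ey='example';\nvar ex='org';</script>"

print(is_profile_url('https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php'))
print(is_profile_url('https://www.kitanetz.de/sitemap.xml'))
print(extract_email(sample))
print(extract_email('<p>no address here</p>'))
```

Keeping the "no match" case in its own branch is what makes profiles without an address show up as `-` instead of raising an error.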


