Web scraping with Python / BeautifulSoup: site with multiple links to profiles > need the profile contents

Elena

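For my master's thesis, I want to send a survey to as many people as possible in the field (early childhood education), so my goal is to scrape the e-mail addresses of daycare centers (KiTa) from a public site. I'm still very new to Python, so although this may seem trivial to most people, it has turned out to be a big challenge at my level of knowledge. I'm also not familiar with the jargon, so I don't even know what I need to search for.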

This is the site (in German): https://www.kitanetz.de/

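To get to the content I need, I first have to select a state ("Bundesland"), which takes me to the next level, where I have to click "Kreise auflisten". That takes me to another level, which lists all the districts within that state. Each district link opens a next-level page containing postal codes and links to profiles. Some of these profiles contain an e-mail address and some don't (I found a tutorial that covers how to handle that).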

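It took me two days to scrape the postal codes and center names from just one of these pages. What do I need to do so that Python iterates through every state, every district, and every profile to get the links? If you know of a resource or a keyword I should look for, that would be a great next step. I also haven't yet managed to get the data from this code into a dataframe with pandas; my other attempts at that didn't work.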

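Here is my attempt so far. My own comments and questions in the code are marked with ##. Lines marked with a single # are comments from the tutorial: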

import requests
from bs4 import BeautifulSoup

## Here's the tutorial I was following: https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup

# Step 1: Sending a HTTP request to a URL
url = requests.get("https://www.kitanetz.de/bezirke/bezirke.php?land=Baden-W%C3%BCrttemberg&kreis=Alb-Donau-Kreis")

# Step 2: Parse the html content
soup = BeautifulSoup(url.text, 'lxml')
# print(soup.prettify()) # print the parsed data of html

# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}
## it says in the tutorial, but what does that actually do? 

## Get the table inside the <div id="inhalt">
table = soup.find_all('table')[0]

## Get the data you want: PLZ, Name Kita (ids) and href to profiles
plz = table.find_all('td', attrs={"headers": "header2"})
ids = table.find_all('td', attrs={"headers": "header3"})

table_data = table.find_all("tr")  ## contains 101 rows. row [0] is header, using th tags. Rows [1]:[101] use td tags

for link in table.find_all("a"):
    print("Name: {}".format(link.text))
    print("href: {}".format(link.get("href")))
    
# Get the headers of the list
t_headers = []
for th in table.find_all("th"):
    # remove any newlines and extra spaces from left and right
    t_headers.append(th.text.replace('\n', ' ').strip())
    
# Get all the rows of table
table_data = []
for tr in table.find_all('tr'): # find all tr's from table ## no, it doesn't
    t_row = {}
    # Each table row is stored in the form of
    ## t_row = {'.': '', 'PLZ': '', 'Name Kita': '', 'Alter', '', 'Profil':''}
    ## we want: t_row = {'PLZ':'', 'Name Kita': '', 'EMail': ''}. Emails are stored in the hrefs -> next layer
    ## how do I get my plz, ids and hrefs in one dataframe? I'd know in R but this here works different.

    # find all td's(3) in tr and zip it with t_header
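    for td, th in zip(tr.find_all("td"), t_headers):
        t_row[th] = td.text.replace('\n', '').strip()
    # append each row once, after all of its cells are filled in;
    # the header row has no td cells, so its empty dict is skipped
    if t_row:
        table_data.append(t_row)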

 
Andrej Kesely

You can get all the links to the profiles from the site's sitemap.xml. Once you have all the links, the rest is just simple parsing:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.kitanetz.de/sitemap.xml'

sitemap = BeautifulSoup(requests.get(url).content, 'html.parser')
r = re.compile(r'/\d+/[^/]+\.php')
for loc in sitemap.select('loc'):
    if r.search(loc.text):
        html_data = requests.get(loc.text).text
        soup = BeautifulSoup(html_data, 'html.parser')

        title = soup.h1.text

        email = re.search(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", html_data, flags=re.S)
        if email:
            email = email[1] + '@' + email[2] + '.' + email[3]
        else:
            email = '-'

        print('{:<60} {:<35} {}'.format(title, email, loc.text))

Prints:

Evangelisch-lutherische Kindertagessstätte Lemförde          [email protected]              https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php
Kindertagesstätte Stuhr I                                    [email protected]                 https://www.kitanetz.de/niedersachsen/28816/stuhrer-landstrasse33a.php
Kita St. Bonifatius (Frankestraße)                           [email protected]     https://www.kitanetz.de/niedersachsen/31515/frankestrasse11.php
Ev. Kita Ketzin                                              [email protected]     https://www.kitanetz.de/brandenburg/14669/rathausstr17.php
Humanistische Kindertagesstätte `Die kleinen Strolche´       [email protected]              https://www.kitanetz.de/niedersachsen/30823/auf_der_horst115.php
Kindertagesstätte Idensen                                    [email protected]            https://www.kitanetz.de/niedersachsen/31515/an_der_sigwardskirche2.php
Kindergroßtagespflege `Nesthäkchen´                          [email protected]      https://www.kitanetz.de/niedersachsen/30916/am_rathfeld4.php
Venhof Kindertagesstätte                                     [email protected]                  https://www.kitanetz.de/niedersachsen/31515/schulstrasse14.php
Kindergarten Uetze `Buddelkiste´                             [email protected]                https://www.kitanetz.de/niedersachsen/31311/eichendorffstrasse2b.php
Kita Lindenblüte                                             [email protected]          https://www.kitanetz.de/niedersachsen/27232/lindern17.php
DRK Kita Luthe                                               [email protected]          https://www.kitanetz.de/niedersachsen/31515/an_der_boehmerke7.php
Freier Kindergarten Allerleirauh                             [email protected]   https://www.kitanetz.de/niedersachsen/31303/dachtmisser_weg3.php
Ev.-luth. Kindergarten St. Johannis                          [email protected]           https://www.kitanetz.de/niedersachsen/38102/leonhardstr40.php
Kindertagesstätte Immensen-Arpke I                           [email protected]            https://www.kitanetz.de/niedersachsen/31275/am_schnittgraben15.php
SV Mörsen-Scharrendorf Mini-Club                             [email protected]           https://www.kitanetz.de/niedersachsen/27239/am-sportheim6.php
Kindergarten Transvaal                                       [email protected]         https://www.kitanetz.de/niedersachsen/26723/althusiusstr89.php
Städtische Kindertagesstätte Gartenstadt                     [email protected]    https://www.kitanetz.de/niedersachsen/38122/wurmbergstr48.php
Kindergruppe Till Eulenspiegel e.V. - Bärenbande & Windelrocker [email protected]          https://www.kitanetz.de/niedersachsen/38102/kurt-schumacher-str7.php
Ev. luth. Kindertagesstätte der Versöhnun                    [email protected]    https://www.kitanetz.de/niedersachsen/30823/im_alten_dorfe6.php
Kinderkrippe Ratzenspatz                                     [email protected]             https://www.kitanetz.de/niedersachsen/31535/am_goetheplatz5.php
Kinderkrippe Hemmingen-Westerfeld                            [email protected]         https://www.kitanetz.de/niedersachsen/30966/berliner_strasse16-22.php

... and so on.
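
To see what the two regular expressions in the code above do without requesting anything from the site, here is a minimal offline sketch. The `SAMPLE_HTML` snippet is made up for illustration (the real pages may lay out the `ez`/`ey`/`ex` variables differently), and `is_profile_url` and `extract_email` are hypothetical helper names, not part of the original code.

```python
import re

# Same two patterns as in the code above.
PROFILE_RE = re.compile(r'/\d+/[^/]+\.php')
EMAIL_RE = re.compile(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", flags=re.S)

def is_profile_url(url):
    """True if the URL has an all-digit path segment followed by a .php page."""
    return PROFILE_RE.search(url) is not None

def extract_email(html):
    """Reassemble an address from the three variables, or return None if absent."""
    m = EMAIL_RE.search(html)
    if m is None:
        return None
    return '{}@{}.{}'.format(m[1], m[2], m[3])

SAMPLE_URLS = [
    # profile page: contains the numeric postal-code segment
    'https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php',
    # listing page: no numeric segment, so it is filtered out
    'https://www.kitanetz.de/bezirke/bezirke.php',
]

# Assumption: an invented snippet that only mimics the three variables.
SAMPLE_HTML = "<script>var ez='kita';\nvar ey='example';\nvar ex='org';</script>"

profile_urls = [u for u in SAMPLE_URLS if is_profile_url(u)]
print(profile_urls)
print(extract_email(SAMPLE_HTML))
print(extract_email('<p>no address here</p>'))
```

The `re.S` flag lets `.*?` span line breaks, which matters if the three variables sit on separate lines, as they do in the invented snippet.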

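Since the question also asks how to get the results into a table, one option is to collect one dictionary per profile instead of printing each line. The following is a minimal sketch using only the standard library; `rows_to_csv`, the field names, and the sample rows are placeholders invented for illustration, not data from the site.

```python
import csv
import io

FIELDS = ['title', 'email', 'url']

def rows_to_csv(rows, fields=FIELDS):
    """Serialize a list of dicts to CSV text, with a header row first."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Placeholder rows; in the scraping loop you would instead do something like
# rows.append({'title': title, 'email': email, 'url': loc.text})
rows = [
    {'title': 'Kita A', 'email': 'kita-a@example.org', 'url': 'https://example.org/12345/a.php'},
    {'title': 'Kita B', 'email': '-', 'url': 'https://example.org/12345/b.php'},
]

csv_text = rows_to_csv(rows)
print(csv_text)
```

A list of dictionaries in this shape can also be passed directly to `pandas.DataFrame(rows)` if a dataframe is preferred over a CSV file.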