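For my master's thesis, I want to send a survey to as many people in the field (early childhood education) as possible, so my goal is to scrape e-mail addresses of daycare centers (KiTa) from a public site. I'm still very new to Python, so while this may seem trivial to most people, it has turned out to be quite a challenge at my level of knowledge. I'm also not familiar with the jargon, so I don't even know what I need to search for.
This is the site (in German): https://www.kitanetz.de/
To get to the content I want, I first have to select a federal state ("Bundesland"), which takes me to the next level, where I need to click "Kreise auflisten". That brings me to the next level, which lists all the districts within that state. Each link opens the next level, a page with postal codes and links to profiles. Some of those profiles contain an e-mail address and some don't (I found a tutorial that covers how to handle that).
It took me two days to scrape the postal codes and center names from one of those pages. What do I need to do so that Python iterates through every state, every district, and every profile to get the links? If you know of a resource or a keyword I should look for, that would be a great next step. I also haven't yet tried to use pandas to put the data from this code into a data frame, and my other attempts didn't work.
Here's my attempt so far. I marked my own comments/questions in the code with ##. Lines with a single # are comments from the tutorial: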
import requests
from bs4 import BeautifulSoup

## Here's the tutorial I was following: https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup

# Step 1: Sending a HTTP request to a URL
url = requests.get("https://www.kitanetz.de/bezirke/bezirke.php?land=Baden-W%C3%BCrttemberg&kreis=Alb-Donau-Kreis")

# Step 2: Parse the html content
soup = BeautifulSoup(url.text, 'lxml')
# print(soup.prettify()) # print the parsed data of html

# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}
## it says in the tutorial, but what does that actually do?

## Get the table inside the <div id="inhalt">
table = soup.find_all('table')[0]

## Get the data you want: PLZ, Name Kita (ids) and href to profiles
plz = table.find_all('td', attrs={"headers": "header2"})
ids = table.find_all('td', attrs={"headers": "header3"})
table_data = table.find_all("tr")  ## contains 101 rows. row [0] is header, using th tags. Rows [1]:[101] use td tags

for link in table.find_all("a"):
    print("Name: {}".format(link.text))
    print("href: {}".format(link.get("href")))

# Get the headers of the list
t_headers = []
for th in table.find_all("th"):
    # remove any newlines and extra spaces from left and right
    t_headers.append(th.text.replace('\n', ' ').strip())

# Get all the rows of table
table_data = []
for tr in table.find_all('tr'):  # find all tr's from table ## no, it doesn't
    t_row = {}
    # Each table row is stored in the form of
    ## t_row = {'.': '', 'PLZ': '', 'Name Kita': '', 'Alter': '', 'Profil': ''}
    ## we want: t_row = {'PLZ':'', 'Name Kita': '', 'EMail': ''}. Emails are stored in the hrefs -> next layer
    ## how do I get my plz, ids and hrefs in one dataframe? I'd know in R but this here works different.

    # find all td's(3) in tr and zip it with t_header
    for td, th in zip(tr.find_all("td"), t_headers):
        t_row[th] = td.text.replace('\n', '').strip()
    table_data.append(t_row)
You can use the site's sitemap.xml to get the links to all profiles. Once you have all the links, the rest is just simple parsing:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.kitanetz.de/sitemap.xml'
sitemap = BeautifulSoup(requests.get(url).content, 'html.parser')

# Profile pages have the form .../<digits>/<name>.php
r = re.compile(r'/\d+/[^/]+\.php')

for loc in sitemap.select('loc'):
    if r.search(loc.text):
        html_data = requests.get(loc.text).text
        soup = BeautifulSoup(html_data, 'html.parser')

        title = soup.h1.text
        # The address is split across three quoted values (ez, ey, ex)
        email = re.search(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", html_data, flags=re.S)

        if email:
            # Reassemble the parts as ez@ey.ex
            email = email[1] + '@' + email[2] + '.' + email[3]
        else:
            email = '-'

        print('{:<60} {:<35} {}'.format(title, email, loc.text))
Prints:
Evangelisch-lutherische Kindertagessstätte Lemförde [email protected] https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php
Kindertagesstätte Stuhr I [email protected] https://www.kitanetz.de/niedersachsen/28816/stuhrer-landstrasse33a.php
Kita St. Bonifatius (Frankestraße) [email protected] https://www.kitanetz.de/niedersachsen/31515/frankestrasse11.php
Ev. Kita Ketzin [email protected] https://www.kitanetz.de/brandenburg/14669/rathausstr17.php
Humanistische Kindertagesstätte `Die kleinen Strolche´ [email protected] https://www.kitanetz.de/niedersachsen/30823/auf_der_horst115.php
Kindertagesstätte Idensen [email protected] https://www.kitanetz.de/niedersachsen/31515/an_der_sigwardskirche2.php
Kindergroßtagespflege `Nesthäkchen´ [email protected] https://www.kitanetz.de/niedersachsen/30916/am_rathfeld4.php
Venhof Kindertagesstätte [email protected] https://www.kitanetz.de/niedersachsen/31515/schulstrasse14.php
Kindergarten Uetze `Buddelkiste´ [email protected] https://www.kitanetz.de/niedersachsen/31311/eichendorffstrasse2b.php
Kita Lindenblüte [email protected] https://www.kitanetz.de/niedersachsen/27232/lindern17.php
DRK Kita Luthe [email protected] https://www.kitanetz.de/niedersachsen/31515/an_der_boehmerke7.php
Freier Kindergarten Allerleirauh [email protected] https://www.kitanetz.de/niedersachsen/31303/dachtmisser_weg3.php
Ev.-luth. Kindergarten St. Johannis [email protected] https://www.kitanetz.de/niedersachsen/38102/leonhardstr40.php
Kindertagesstätte Immensen-Arpke I [email protected] https://www.kitanetz.de/niedersachsen/31275/am_schnittgraben15.php
SV Mörsen-Scharrendorf Mini-Club [email protected] https://www.kitanetz.de/niedersachsen/27239/am-sportheim6.php
Kindergarten Transvaal [email protected] https://www.kitanetz.de/niedersachsen/26723/althusiusstr89.php
Städtische Kindertagesstätte Gartenstadt [email protected] https://www.kitanetz.de/niedersachsen/38122/wurmbergstr48.php
Kindergruppe Till Eulenspiegel e.V. - Bärenbande & Windelrocker [email protected] https://www.kitanetz.de/niedersachsen/38102/kurt-schumacher-str7.php
Ev. luth. Kindertagesstätte der Versöhnun [email protected] https://www.kitanetz.de/niedersachsen/30823/im_alten_dorfe6.php
Kinderkrippe Ratzenspatz [email protected] https://www.kitanetz.de/niedersachsen/31535/am_goetheplatz5.php
Kinderkrippe Hemmingen-Westerfeld [email protected] https://www.kitanetz.de/niedersachsen/30966/berliner_strasse16-22.php
... and so on.
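Since the question also asks how to get the results into one table, here is a minimal sketch of how the two regular expressions from the code above could be wrapped in small functions and the results collected as a list of dicts. It runs offline on made-up sample data: the function names (is_profile_url, extract_email, rows_to_csv), the column names, and the sample HTML are assumptions for illustration, and the real page markup may differ from the sample.

```python
import csv
import io
import re

# Same patterns as in the answer's code.
PROFILE_URL = re.compile(r'/\d+/[^/]+\.php')
EMAIL_PARTS = re.compile(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", flags=re.S)


def is_profile_url(url):
    """Return True if the URL contains a /<digits>/<name>.php segment."""
    return PROFILE_URL.search(url) is not None


def extract_email(html):
    """Reassemble the address from the ez/ey/ex parts, or return None."""
    match = EMAIL_PARTS.search(html)
    if match is None:
        return None
    return '{}@{}.{}'.format(match[1], match[2], match[3])


def rows_to_csv(rows):
    """Serialize a list of dicts with name/email/url keys to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=['name', 'email', 'url'])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data (assumption): three quoted values as the regex expects.
sample_html = "<script>var ez='info';\nvar ey='example-kita';\nvar ex='de';</script>"
sample_url = 'https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php'

rows = []
if is_profile_url(sample_url):
    rows.append({
        'name': 'Example Kita',  # hypothetical name, not taken from the site
        'email': extract_email(sample_html),
        'url': sample_url,
    })

print(rows_to_csv(rows))
```

In the real loop, each profile page would contribute one such dict instead of a print call. If pandas is installed, pandas.DataFrame(rows) should turn the same list of dicts into a data frame with one column per key.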