Using BeautifulSoup to access a web page within a web page?

Posted by debugcn in Dev

Zain Ather

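I wrote a Python script that uses BeautifulSoup to parse data from a web page. What I want to do next is click each person's name on the page, go to their profile, then click the website link on that profile page and get the email ID (if available) from that website. Can anyone help me with this? I am new to BeautifulSoup and Python, so I can't get any further. Any help is much appreciated. Thanks! This is the kind of link I am working with: https://www.realtor.com/realestateagents/agentname-john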

Here is my code:

from bs4 import BeautifulSoup
import requests
import csv

# Website URL (entered without the leading "https://www.")
w_url = 'https://www.' + input('Please Enter Website URL :')

# Number of pages to scrape
pages = int(input(' Please specify number of pages: '))

# Website name (in case of multiple websites)
#site_name = int(input('Enter the website name ( IN CAPITALS ) :'))

# Collected agent names and phone numbers
agent_info = []

# Exact class attribute values of the name and phone <div> elements
class1 = 'jsx-1448471805 agent-name text-bold'
class2 = 'jsx-1448471805 agent-phone hidden-xs hidden-xxs'

# Fetch each results page and collect the text of every matching <div>
for k in range(pages):
    website = requests.get(w_url + '/pg-{}'.format(k)).text
    soup = BeautifulSoup(website, 'lxml')

    # A flat list matches a <div> whose class attribute equals either value
    for i in soup.find_all('div', class_=[class1, class2]):
        agent_info.append(i.text)

# Remove duplicates while preserving order
updated_info = list(dict.fromkeys(agent_info))

# Write the data to a CSV file; newline='' avoids blank rows on Windows
with open(r'D:\Webscraping\real_estate_agents.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Name and Number'])
    for t in updated_info:
        print(t)
        csv_writer.writerow([t])
        print('\n')

chitown88

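Getting the data from the API is more efficient (and takes fewer lines of code). The agent's website and email also appear to be included in the API response, so you don't need to visit each of the 30,000+ websites to find the email, and you can fetch everything in a fraction of the time.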
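For comparison, the crawl described in the question (listing page, then profile, then the agent's own site) could be sketched roughly as below. This is only an illustration under assumptions: the helper names `profile_links` and `find_emails`, the `/realestateagents/` path filter, and the inline HTML snippets are all made up for the example, and the real pages may be structured differently. In a live script the HTML strings would come from `requests.get(url).text`.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Deliberately simple pattern; it will not cover every valid address
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')


def profile_links(listing_html, base_url):
    """Return absolute URLs of agent profiles linked from a listing page."""
    soup = BeautifulSoup(listing_html, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        # Assumption: profile links contain this path segment
        if '/realestateagents/' in a['href']:
            links.append(urljoin(base_url, a['href']))
    # Drop duplicates while keeping order
    return list(dict.fromkeys(links))


def find_emails(page_html):
    """Return e-mail addresses found in mailto: links or in the page text."""
    soup = BeautifulSoup(page_html, 'html.parser')
    found = []
    for a in soup.find_all('a', href=True):
        if a['href'].lower().startswith('mailto:'):
            # Strip the scheme and any ?subject=... suffix
            found.append(a['href'][7:].split('?')[0])
    found.extend(EMAIL_RE.findall(soup.get_text(' ')))
    return list(dict.fromkeys(found))


# Tiny inline pages stand in for downloaded HTML (hypothetical content)
listing = '<a href="/realestateagents/jane-doe_1">Jane Doe</a><a href="/about">About</a>'
site = '<p>Contact: <a href="mailto:jane@example.com?subject=Hi">mail me</a> or jane@example.com</p>'

print(profile_links(listing, 'https://www.realtor.com'))
print(find_emails(site))
```

Each hop costs one HTTP request per agent, which is why reading the same fields from the API response is so much faster.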

The API also has all the data you could want or need. For example, here is everything it returns for just one agent:

{'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'advertiser_id': 2121274, 'agent_rating': 5, 'background_photo': {'href': 'https://ap.rdcpix.com/1223152681/cc48579b6a0fe6ccbbf44d83e8f82145g-c0o.jpg'}, 'broker': {'fulfillment_id': 3860509, 'designations': [], 'name': 'BRIDGE REALTY, LLC.', 'accent_color': '', 'photo': {'href': ''}, 'video': ''}, 'description': 'As a professional real estate agent licensed in the State of Minnesota, I am committed to providing only the highest standard of care as I assist you in navigating the twists and turns of home ownership. Whether you are buying or selling your home, I will do everything it takes to turn your real estate goals and desires into a reality. If you are looking for a real estate Agent who will put your needs first and go above and beyond to help you reach your goals, I am the agent for you.', 'designations': [], 'first_month': 0, 'first_name': 'John', 'first_year': 2010, 'has_photo': True, 'href': 'http://www.twincityhomes4sale.com', 'id': '56b63efd7e54f7010021459d', 'is_realtor': True, 'languages': [], 'last_name': 'Palomino', 'last_updated': 'Mon, 04 Jan 2021 18:46:12 GMT', 'marketing_area_cities': [{'city_state': 'Columbus_MN', 'name': 'Columbus', 'state_code': 'MN'}, {'city_state': 'Blaine_MN', 'name': 'Blaine', 'state_code': 'MN'}, {'city_state': 'Circle Pines_MN', 'name': 'Circle Pines', 'state_code': 'MN'}, {'city_state': 'Lino Lakes_MN', 'name': 'Lino Lakes', 'state_code': 'MN'}, {'city_state': 'Lexington_MN', 'name': 'Lexington', 'state_code': 'MN'}, {'city_state': 'Forest Lake_MN', 'name': 'Forest Lake', 'state_code': 'MN'}, {'city_state': 'Chisago City_MN', 'name': 'Chisago City', 'state_code': 'MN'}, {'city_state': 'Wyoming_MN', 'name': 'Wyoming', 'state_code': 'MN'}, {'city_state': 'Centerville_MN', 'name': 'Centerville', 'state_code': 'MN'}, {'city_state': 'Hugo_MN', 'name': 'Hugo', 
'state_code': 'MN'}, {'city_state': 'Grant_MN', 'name': 'Grant', 'state_code': 'MN'}, {'city_state': 'St. Anthony_MN', 'name': 'St. Anthony', 'state_code': 'MN'}, {'city_state': 'Arden Hills_MN', 'name': 'Arden Hills', 'state_code': 'MN'}, {'city_state': 'New Brighton_MN', 'name': 'New Brighton', 'state_code': 'MN'}, {'city_state': 'Mounds View_MN', 'name': 'Mounds View', 'state_code': 'MN'}, {'city_state': 'White Bear Township_MN', 'name': 'White Bear Township', 'state_code': 'MN'}, {'city_state': 'Vadnais Heights_MN', 'name': 'Vadnais Heights', 'state_code': 'MN'}, {'city_state': 'Shoreview_MN', 'name': 'Shoreview', 'state_code': 'MN'}, {'city_state': 'Little Canada_MN', 'name': 'Little Canada', 'state_code': 'MN'}, {'city_state': 'Columbia Heights_MN', 'name': 'Columbia Heights', 'state_code': 'MN'}, {'city_state': 'Hilltop_MN', 'name': 'Hilltop', 'state_code': 'MN'}, {'city_state': 'Fridley_MN', 'name': 'Fridley', 'state_code': 'MN'}, {'city_state': 'Linwood_MN', 'name': 'Linwood', 'state_code': 'MN'}, {'city_state': 'East Bethel_MN', 'name': 'East Bethel', 'state_code': 'MN'}, {'city_state': 'Spring Lake Park_MN', 'name': 'Spring Lake Park', 'state_code': 'MN'}, {'city_state': 'North St. Paul_MN', 'name': 'North St. Paul', 'state_code': 'MN'}, {'city_state': 'Maplewood_MN', 'name': 'Maplewood', 'state_code': 'MN'}, {'city_state': 'St. Paul_MN', 'name': 'St. 
Paul', 'state_code': 'MN'}], 'mls': [{'member': {'id': '506004321'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'A', 'primary': True}], 'nar_only': 1, 'nick_name': '', 'nrds_id': '506004321', 'office': {'name': 'Bridge Realty, Llc', 'mls': [{'member': {'id': '10982'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'O', 'primary': True}], 'phones': [{'ext': '', 'number': '(952) 368-0021', 'type': 'Home'}], 'phone_list': {'phone_1': {'type': 'Home', 'number': '(952) 368-0021', 'ext': ''}}, 'photo': {'href': ''}, 'slogan': '', 'website': None, 'video': None, 'fulfillment_id': 3027311, 'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'email': '[email protected]', 'nrds_id': None}, 'party_id': 23115328, 'person_name': 'John Palomino', 'phones': [{'ext': '', 'number': '(763) 458-0788', 'type': 'Mobile'}], 'photo': {'href': 'https://ap.rdcpix.com/900899898/cc48579b6a0fe6ccbbf44d83e8f82145a-c0o.jpg'}, 'recommendations_count': 2, 'review_count': 7, 'role': 'agent', 'served_areas': [{'name': 'Circle Pines', 'state_code': 'MN'}, {'name': 'Forest Lake', 'state_code': 'MN'}, {'name': 'Hugo', 'state_code': 'MN'}, {'name': 'St. 
Paul', 'state_code': 'MN'}, {'name': 'Minneapolis', 'state_code': 'MN'}, {'name': 'Wyoming', 'state_code': 'MN'}], 'settings': {'share_contacts': False, 'full_access': False, 'recommendations': {'realsatisfied': {'user': 'John-Palomino', 'id': '1073IJk', 'linked': '3d91C', 'updated': '1529551719'}}, 'display_listings': True, 'far_override': True, 'show_stream': True, 'terms_of_use': True, 'has_dotrealtor': False, 'display_sold_listings': True, 'display_price_range': True, 'display_ratings': True, 'loaded_from_sb': True, 'broker_data_feed_opt_out': False, 'unsubscribe': {'autorecs': False, 'recapprove': False, 'account_notify': False}, 'new_feature_popup_closed': {'agent_left_nav_avatar_to_profile': False}}, 'slogan': 'Bridging the gap between buyers & sellers', 'specializations': [{'name': '1st time home buyers'}, {'name': 'Residential Listings'}, {'name': 'Rental/Investment Properties'}, {'name': 'Move Up Buyers'}], 'title': 'Agent', 'types': 'agent', 'user_languages': [], 'web_url': 'https://www.realtor.com/realestateagents/John-Palomino_BLOOMINGTON_MN_2121274_876599394', 'zips': ['55014', '55025', '55038', '55112', '55126', '55421', '55449', '55092', '55434', '55109'], 'email': '[email protected]', 'full_name': 'John Palomino', 'name': 'John Palomino, Agent', 'social_media': {'facebook': {'type': 'facebook', 'href': 'https://www.facebook.com/Johnpalominorealestate'}}, 'for_sale_price': {'count': 1, 'min': 299900, 'max': 299900, 'last_listing_date': '2021-01-29T11:10:24Z'}, 'recently_sold': {'count': 35, 'min': 115000, 'max': 460000, 'last_sold_date': '2020-12-18'}, 'agent_team_details': {'is_team_member': False}}
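Records like this one are nested, and some values are `None` or empty strings, so it can help to read fields defensively. The helper below is a minimal sketch, not part of the original answer: `dig` is a hypothetical name, the record is a trimmed-down copy of the one above, and the e-mail values are placeholders because the real ones are obscured in the dump.

```python
def dig(record, *keys, default='N/A'):
    """Follow keys through nested dicts; return default if a step is missing or empty."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current if current not in (None, '') else default


# Trimmed-down record with the same shape as the dump; e-mails are placeholders
agent = {
    'person_name': 'John Palomino',
    'href': 'http://www.twincityhomes4sale.com',
    'email': 'agent@example.com',
    'office': {'name': 'Bridge Realty, Llc', 'website': None, 'email': 'office@example.com'},
    'phones': [{'ext': '', 'number': '(763) 458-0788', 'type': 'Mobile'}],
}

print(dig(agent, 'person_name'))
print(dig(agent, 'office', 'email'))
print(dig(agent, 'office', 'website'))   # value is None in the record
print(dig(agent, 'broker', 'name'))      # key absent in this trimmed record
```

Treating `None` and `''` the same as a missing key is a design choice; drop the final check if empty values should be kept as they are.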

Code:

import requests
import pandas as pd
import math

# Function to pull the data
def get_agent_info(jsonData, rows):
    agents = jsonData['agents']
    for agent in agents:
        name = agent['person_name']

        # Fall back to 'N/A' when a field is absent from the record
        email = agent.get('email', 'N/A')
        website = agent.get('href', 'N/A')

        # 'office' may be missing, None, or lack an 'email' key
        try:
            office_email = agent['office']['email']
        except (KeyError, TypeError):
            office_email = 'N/A'

        row = {'name':name, 'email':email, 'website':website, 'office_email':office_email}
        rows.append(row)
    return rows

rows = []   
url = 'https://www.realtor.com/realestateagents/api/v3/search'
headers= {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'}
payload = {'nar_only': '1','offset': '','limit': '300','marketing_area_cities':  '_',
           'postal_code': '','is_postal_search': 'true','name': 'john','types': 'agent',
           'sort': 'recent_activity_high','far_opt_out': 'false','client_id': 'FAR2.0',
           'recommendations_count_min': '','agent_rating_min': '','languages': '',
           'agent_type': '','price_min': '','price_max': '','designations': '',
           'photo': 'true'}

# Get the first page, work out how many pages you'll need to go through, and parse the data
jsonData = requests.get(url, headers=headers, params=payload).json()
total_matches = jsonData['matching_rows']
total_pages = math.ceil(total_matches/300)
rows = get_agent_info(jsonData, rows)
print ('Completed: %s of %s' %(1,total_pages))

# Iterate through next pages
for page in range(1,total_pages):
    payload.update({'offset':page*300})
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = get_agent_info(jsonData, rows)
    print ('Completed: %s of %s' %(page+1,total_pages))

df = pd.DataFrame(rows)
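The paging arithmetic in the loop can be checked on its own. This small sketch (`page_offsets` is a hypothetical helper, not part of the answer) lists the `offset` value sent with each request, using the 30,600 rows reported for this search and the limit of 300 from the payload.

```python
import math


def page_offsets(total_rows, limit):
    """Return the offset to request for each page of `limit` rows."""
    total_pages = math.ceil(total_rows / limit)
    return [page * limit for page in range(total_pages)]


offsets = page_offsets(30600, 300)
print(len(offsets))   # number of requests needed
print(offsets[:3])    # first few offsets
print(offsets[-1])    # offset of the final page
```

The first request corresponds to offset 0 (the code above sends an empty offset for it), and the loop then covers pages 1 through `total_pages - 1`.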

Output: only the first 10 of 30,600 rows

print(df.head(10).to_string())
                name                       email                                 website                   office_email
0       John Croteau           [email protected]  https://www.facebook.com/JCtherealtor/      [email protected]
1  Stephanie St John       [email protected]   https://stephaniestjohn.shorewest.com     [email protected]
2     Johnine Larsen     [email protected]               http://realestategals.com  [email protected]
3    Leonard Johnson  [email protected]                 http://www.adrhomes.net     [email protected]
4  John C Fitzgerald           [email protected]                 http://www.JCFHomes.com                               
5  John Vrsansky  Jr     [email protected]           http://www.OnTargetRealty.com        [email protected]
6      John Williams    [email protected]        http://www.johnwilliamsidaho.com               [email protected]
7        John Zeiter          [email protected]                                                         [email protected]
8      Mitch Johnson  [email protected]                                            [email protected]
9          John Lowe           [email protected]                http://johnlowegroup.com  [email protected]
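
Since the question writes its results to a CSV file, the collected rows can be saved the same way. Below is a minimal sketch using only the standard library; `rows_to_csv` is a hypothetical helper and the two rows are placeholders in the same shape as those built by `get_agent_info`.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows (made-up values) with the same keys as the answer's rows
rows = [
    {'name': 'Jane Doe', 'email': 'jane@example.com',
     'website': 'http://example.com', 'office_email': 'N/A'},
    {'name': 'John Roe, Jr', 'email': 'N/A',
     'website': 'N/A', 'office_email': 'office@example.com'},
]

text = rows_to_csv(rows, ['name', 'email', 'website', 'office_email'])
print(text)
```

With the DataFrame already built, `df.to_csv('agents.csv', index=False)` should give an equivalent file in one call.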


Edited 2021-06-14
