Optimizing code to extract information from many HTML files

Posted by The_Real_MP in Dev

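I'm trying to extract specific information from a directory of HTML files that I previously retrieved with Python's requests library. Retrieving the HTML was already slow because I built in a randomized wait timer, but now that I want to process each of the retrieved HTML files, it seems my script is not well optimized. This is a problem because I want to go through 42,000 HTML files of more than 8,000 lines each, which would probably take a long time.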

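I have never run into a task this demanding on my computer before, so I don't know where to start learning how to optimize my code. My question is: should I approach this problem differently, in a more time-efficient way? Any suggestions would be much appreciated.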

This is the code I'm using; I changed some sensitive information:

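from bs4 import BeautifulSoup
import pandas as pd

# Empty lists for the features of the houses
link = []
name_list = []
agent_list = []
description_list = []
features_list = []

# link_list was retrieved in a previous step and holds all the links to the original HTML pages.
# open_html is a helper (not shown here) that presumably reads a saved HTML file.
# Note: range(1, ...) starts at index 1, so link_list[0] is never used.
for i in range(1, len(link_list)):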
    html = open_html('C:\\Users\\Documents\\file_p{}.html'.format(i))
    soup = BeautifulSoup(html, 'html.parser')
    link.append(link_list[i])
    name_list.append(soup.select_one('.object-header__title').text)
    agent_list.append(soup.select_one('.object-contact-agent-link').text)
    description_list.append(soup.select_one('.object-description-body').text)
    features_list.append(soup.select_one('.object-features').text)
    

d = {'Link_P': link, 'Name_P': name_list, 'Features_P': features_list, 'Description_P': description_list, 'Agent_P': agent_list}
df = pd.DataFrame(data=d)
df
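One thing worth noting about the loop above: `select_one` returns `None` when a selector matches nothing, so calling `.text` on the result raises `AttributeError` and stops the whole run. Below is a minimal sketch of a helper that returns an empty string instead. The names `text_or_empty`, `SELECTORS` and `extract_fields` are made up for illustration, and `soup` is assumed to be a `BeautifulSoup` object like the one created in the loop.

```python
def text_or_empty(soup, selector):
    """Return the stripped text of the first element matching `selector`, or ''."""
    node = soup.select_one(selector)  # None when nothing matches
    if node is None:
        return ""
    return node.get_text(strip=True)


# Selectors taken from the code above; the keys mirror the DataFrame columns.
SELECTORS = {
    "Name_P": ".object-header__title",
    "Agent_P": ".object-contact-agent-link",
    "Description_P": ".object-description-body",
    "Features_P": ".object-features",
}


def extract_fields(soup):
    """Build one row (a dict) with an entry for every selector."""
    return {column: text_or_empty(soup, selector)
            for column, selector in SELECTORS.items()}
```

A list of such dictionaries can be passed straight to `pd.DataFrame`. Unlike plain `.text`, `get_text(strip=True)` also trims surrounding whitespace, so the values may differ slightly from the original output.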

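Update: thanks to the help here, I managed to make my code more efficient. In the end I tried multiprocessing and ran a lot of speed tests using Python's time module. I learned a lot about optimizing code from this.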
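For readers who want to run similar speed tests, here is a minimal sketch of timing a function with `time.perf_counter`. The helper `timed` and the stand-in workload `count_words` are hypothetical and only serve the illustration; they are not part of the script.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def count_words(texts):
    """Stand-in workload: count whitespace-separated words in a list of strings."""
    return sum(len(text.split()) for text in texts)


if __name__ == "__main__":
    sample = ["some text taken from a page"] * 10_000
    total, seconds = timed(count_words, sample)
    print('Counted {} words in {} seconds'.format(total, round(seconds, 4)))
```

Timing the same workload once serially and once through a pool, on a small subset of the files, gives a quick indication of whether the extra processes pay off before committing to the full run.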

import numpy as np
import requests
from bs4 import BeautifulSoup
from time import sleep
import time
from tqdm import tqdm_notebook
import json
from random import randint
import csv
import concurrent.futures
import re
from itertools import product
from multiprocessing import Pool

def extract_indiv_page_html(page, start_num, end_num):
    """Retrieve an individual page's HTML source and save it,
    then open the saved page and extract specific data.
    The data is finally appended to a dedicated CSV file."""
    sleep(randint(1, 3))  # random wait between requests
    url = "https://www.xxx.xx{}".format(page)
    headers = {
        'authority': 'www.xxx.xx',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'upgrade-insecure-requests': '1',
        'user-agent': xxx,  # redacted in the original post
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.xxx.xx/xxxx/',
        'accept-language': 'en;q=0.9,en-GB;q=0.8,en-US;q=0.7',
    }
    r = requests.get(url, headers=headers)
    name = re.findall('.+?/', page)
    # Build the file path once and reuse it for saving and reopening
    html_path = 'path\\file_{}_{}.html'.format(name[1].replace('/', ''), name[2].replace('/', ''))
    save_html(r.content, html_path)
    html = open_html(html_path)
    soup = BeautifulSoup(html, 'html.parser')
    link = url
    try:
        Name_P = soup.select_one('.object-header__title').text
    except AttributeError:
        Name_P = ''
    try:
        Agent_P = soup.select_one('.object-contact-agent-link').text
    except AttributeError:
        Agent_P = ''
    try:
        Description_P = soup.select_one('.object-description-body').text
    except AttributeError:
        Description_P = ''
    try:
        Features_P = soup.select_one('.object-features').text
    except AttributeError:
        Features_P = ''
    # Append one row per page; pages from the same segment share one CSV file
    with open('path\\P_details_df_date_{}_{}.csv'.format(start_num, end_num), 'a', newline='') as file:
        writer = csv.writer(file)
        writer.writerow([link, Name_P, Agent_P, Description_P, Features_P, start_num, end_num])


start = time.perf_counter()
pages = 3015

# Create a list of all the links that we could extract from the original listing HTML pages
link_list = []
if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        indiv_p_link = executor.map(extract_links, range(1, pages))
        for page_links in indiv_p_link:
            link_list.extend(page_links)

# Segment the list into slices of 1% of its total length.
# This splits the saved data over more CSV files, which
# reduces working time when extracting data from many HTML links.
start_indices = [(len(link_list)//100*i) for i in range(0, 100, 1)]
end_indices = [(len(link_list)//100*i) for i in range(1, 100, 1)]
end_indices.append(len(link_list))

# Create a list of tuples; starmap unpacks each tuple into the arguments of the worker function
tuple_list = []
for indice_segment in range(0, 100):
    for number in range(start_indices[indice_segment], end_indices[indice_segment]):
        extract_indiv_page_tuple = (link_list[number], start_indices[indice_segment], end_indices[indice_segment])
        tuple_list.append(extract_indiv_page_tuple)
        
 
if __name__ == "__main__":
    with Pool(4) as p:
        # starmap blocks until every page has been processed
        p.starmap(extract_indiv_page_html, tuple_list)

finish = time.perf_counter()
print('Finished in {} seconds'.format(round(finish-start, 2)))
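The 1% segmentation and the tuple list built in the script can also be written more compactly. The following is a minimal sketch under the same slicing rule (equal slices, with the last slice absorbing the remainder); the function names `chunk_bounds` and `build_tasks` are hypothetical and not part of the original script.

```python
def chunk_bounds(total, parts=100):
    """Return (start, end) index pairs splitting range(total) into `parts` slices.

    Every slice has total // parts items, except the last one,
    which also takes the remainder (as in the script above).
    """
    step = total // parts
    bounds = [(step * i, step * (i + 1)) for i in range(parts - 1)]
    bounds.append((step * (parts - 1), total))
    return bounds


def build_tasks(links, parts=100):
    """Return one (link, start, end) tuple per link, ready for Pool.starmap."""
    return [
        (links[i], start, end)
        for start, end in chunk_bounds(len(links), parts)
        for i in range(start, end)
    ]
```

With this in place, `tuple_list = build_tasks(link_list)` would replace the two index lists and the nested loop. The worker function still receives the same three arguments.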