Pandas Date Groupby & Apply: Improving Performance


Chris

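I am grouping 30-minute data by date and using apply to calculate daily statistics for the dataset, but it is slow. Is there a way to improve the performance of the following functions? I have read about vectorization, but I am not sure how to implement it.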

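I have used apply and transform to get the output I want, but each takes about 2-3 seconds for one year of data. I have a lot of data, so I would like to make this faster. Can anyone point me in the right direction?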

import pandas as pd
import numpy as np
import timeit

# dummy data
date_range = pd.date_range('2017-01-01 00:00', '2018-01-01 00:00', freq='30Min')
df = pd.DataFrame(np.random.randint(2, 20, (date_range.shape[0], 2)), index=date_range, columns=['Electricity', 'Natural Gas'])

print(df.head())
print(df.shape)

t1 = timeit.default_timer()
onhour = df.groupby([pd.Grouper(freq='D')]).apply(lambda x: np.count_nonzero(
    x[x > x.quantile(0.05) + x.mean() * .1] >
    x.quantile(0.05) + 0.25 * (x.quantile(0.95)-x.quantile(0.05)),
    axis=0) / 2)

onhour = pd.DataFrame(
    onhour.values.tolist(),
    index=onhour.index,
    columns=df.columns)

print(f"apply: elapsed {timeit.default_timer() - t1} s")
print(onhour.head())

t1 = timeit.default_timer()
onhour = df.groupby([pd.Grouper(freq='D')]).transform(lambda x: np.count_nonzero(
    x[x > x.quantile(0.05) + x.mean() * .1] >
    x.quantile(0.05) + 0.25 * (x.quantile(0.95)-x.quantile(0.05)),
    axis=0) / 2).resample('D').mean()

print(f"transform: elapsed {timeit.default_timer() - t1} s")
print(onhour.head())

godot

You are already using pandas' vectorized operations, so you cannot gain a lot of time, but a few tricks can get you down to about 1.5 s.

1) Use agg

Using agg instead of transform or apply gives better results, because the same computation is carried out for each column (Electricity and Natural Gas).

2) Avoid recomputing the quantiles.

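You are computing the 5% quantile three times. I used a Python function instead of a lambda, but you could still use a lambda if you add a memoized quantile function (that might actually help speed things up, though I am not sure).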

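def count_something(row):
    # agg passes one column of one day at a time, so `row` is a Series.
    # Both quantiles are computed in a single call instead of three separate ones.
    qt_df = row.quantile([0.05, 0.95])
    return np.count_nonzero(
        row[row > qt_df.loc[0.05] + row.mean() * .1] > qt_df.loc[0.05] + 0.25 * (qt_df.loc[0.95] - qt_df.loc[0.05]),
        axis=0) / 2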

t1 = timeit.default_timer()

onhour = df.groupby([pd.Grouper(freq='D')]).agg(count_something)

print(f"agg: elapsed {timeit.default_timer() - t1} s")
print(onhour.head())
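
Since the question asks about vectorization, here is a minimal sketch of one way to avoid calling a Python function for every group. It is not part of the answer above: the name onhour_vectorized is hypothetical, and the approach assumes the index is a DatetimeIndex and that every sample should be compared against the statistics of its own calendar day. It computes the per-day quantiles and mean with built-in groupby reductions, broadcasts them back to one row per sample, and counts the samples above both thresholds. I have not timed it, so whether it is actually faster on your data would need to be measured.

```python
import numpy as np
import pandas as pd


def onhour_vectorized(df):
    """Per-day (count / 2) of samples above both thresholds, using built-in reductions only."""
    # One label per row: midnight of the day the row belongs to.
    days = df.index.floor('D')
    grouped = df.groupby(days)

    # Per-day statistics: one row per day, one column per meter.
    q05 = grouped.quantile(0.05)
    q95 = grouped.quantile(0.95)
    mean = grouped.mean()

    # The two thresholds from the original expression, repeated for every sample of each day.
    first = (q05 + 0.1 * mean).reindex(days).to_numpy()
    second = (q05 + 0.25 * (q95 - q05)).reindex(days).to_numpy()

    # x[x > first] > second is True exactly where x exceeds both thresholds.
    values = df.to_numpy()
    above = pd.DataFrame(
        ((values > first) & (values > second)).astype(np.int64),
        index=df.index,
        columns=df.columns,
    )
    # Count the qualifying half-hour samples per day and convert to hours.
    return above.groupby(days).sum() / 2


# Small dummy data set in the same shape as the question's.
date_range = pd.date_range('2017-01-01 00:00', '2017-01-08 00:00', freq='30min')
df = pd.DataFrame(np.random.randint(2, 20, (date_range.shape[0], 2)),
                  index=date_range, columns=['Electricity', 'Natural Gas'])
print(onhour_vectorized(df).head())
```

The result has one row per day and one column per meter, like the agg version. Before relying on it, compare its output against count_something on a slice of your real data.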

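If you really want to speed up the computation and you have a way to parallelize or distribute it, I think you could use Python dask, but I do not know how much it would help with your problem.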


Edited 2021-06-10
