I have a Pandas DataFrame where I am trying to replace the values in each group with the mean of the group. On my machine, the line df["signal"].groupby(g).transform(np.mean) takes about 10 seconds to run with N and N_TRANSITIONS set to the numbers below. Is there any faster way to achieve the same result?
import pandas as pd
import numpy as np
from time import time
np.random.seed(0)
N = 120000
N_TRANSITIONS = 1400
# generate groups
transition_points = np.random.permutation(np.arange(N))[:N_TRANSITIONS]
transition_points.sort()
transitions = np.zeros((N,), dtype=bool)  # np.bool is removed in NumPy >= 1.24
transitions[transition_points] = True
g = transitions.cumsum()
df = pd.DataFrame({ "signal" : np.random.rand(N)})
# here is my bottleneck for large N
tic = time()
result = df["signal"].groupby(g).transform(np.mean)
toc = time()
print(toc - tic)
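One thing worth trying (my suggestion, not from the original question): pass the string "mean" instead of the np.mean callable. The string form lets pandas dispatch to its Cythonized group-mean path, whereas a Python callable is typically applied group by group and is much slower. A minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
N = 120000
N_TRANSITIONS = 1400

# Same group construction as in the question.
transition_points = np.sort(np.random.permutation(np.arange(N))[:N_TRANSITIONS])
transitions = np.zeros(N, dtype=bool)
transitions[transition_points] = True
g = transitions.cumsum()
df = pd.DataFrame({"signal": np.random.rand(N)})

# Built-in "mean" hits pandas' fast aggregation path instead of
# calling np.mean once per group from Python.
fast = df["signal"].groupby(g).transform("mean")

# Same values as transform(np.mean), just computed faster.
slow = df["signal"].groupby(g).transform(np.mean)
assert np.allclose(fast.values, slow.values)
```

How large the speedup is depends on the pandas version, but the two calls produce identical results.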
Inspired by Jeff's answer, this is the fastest method on my machine (with grp = df["signal"].groupby(g)):

grp = df["signal"].groupby(g)
result = pd.Series(np.repeat(grp.mean().values, grp.count().values))

This lines up with the original rows only because g is nondecreasing, so each group occupies a contiguous run and repeating its mean count-many times lands on the right positions.
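A pure-NumPy variant (my addition, not part of the answer above) that does not require the group labels to be sorted: compute per-group sums and sizes with np.bincount, then broadcast the means back by indexing with the label array.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
N = 120000
N_TRANSITIONS = 1400
transition_points = np.sort(np.random.permutation(np.arange(N))[:N_TRANSITIONS])
transitions = np.zeros(N, dtype=bool)
transitions[transition_points] = True
g = transitions.cumsum()
df = pd.DataFrame({"signal": np.random.rand(N)})

# One pass for per-group sums, one for per-group sizes.
sums = np.bincount(g, weights=df["signal"].values)
counts = np.bincount(g)
means = sums / counts

# Fancy-indexing by the label array expands the means to full length,
# so this works even when groups are not contiguous.
result = pd.Series(means[g])

assert np.allclose(result.values, df["signal"].groupby(g).transform("mean").values)
```

The indexing step is what makes this order-independent, unlike the np.repeat trick, which relies on groups being contiguous runs.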