Create entirely new dataframe efficiently from groupby .agg() or .apply() in Pandas?

Doctor J Published at Dev

I'd like to create a new dataframe from the results of groupby on another. The result should have one row per group (basically a vectorized map-reduce), and the new column names bear no relation to the existing names. This seems like a natural use for agg, but it only seems to produce existing columns.

import pandas as pd
d = pd.DataFrame({'a': [0,0,1,1], 'b': [3,4,5,6], 'c': [7,8,9,0]})

   a  b  c
0  0  3  7
1  0  4  8
2  1  5  9
3  1  6  0

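agg() will create new columns when applied to a grouped Series (this dict-renaming form worked in older pandas; it was deprecated in 0.20 and removed in 1.0):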

d.groupby('a')['b'].agg({'x': lambda g: g.sum()})

    x
a    
0   7
1  11

But frustratingly not with a DataFrame:

d.groupby('a').agg({'x': lambda g: g.b.sum()})
KeyError: 'x'

I can do it by returning a one-row DataFrame from apply():

d.groupby('a').apply(lambda g: pd.DataFrame([{'x': g.b.mean(), 'y': (g.b * g.c).sum()}])).reset_index(level=1, drop=True)

     x   y
a         
0  3.5  53
1  5.5  45

but this is ugly and, as you can imagine, creating a new dict, list, and DataFrame for every group is slow for even modestly sized inputs.
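For the case where each output column reduces a single input column, one possible alternative is named aggregation, which assumes pandas 0.25 or later. The sketch below is not from the original post: it precomputes the row-wise product in a helper column (the name `bc` is made up for illustration) so that `agg` can produce both new columns without a per-group `apply`.

```python
import pandas as pd

d = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [3, 4, 5, 6], 'c': [7, 8, 9, 0]})

# 'bc' is a hypothetical helper column holding the row-wise product b * c,
# so that every output column is a reduction of exactly one input column.
result = (
    d.assign(bc=d.b * d.c)
     .groupby('a')
     # Named aggregation: new_column=(input_column, reduction)
     .agg(x=('b', 'mean'), y=('bc', 'sum'))
)
print(result)
```

This only covers reductions that can be expressed column by column; anything that needs several columns inside the reduction itself (such as a weighted mean) still needs either precomputed helper columns like this or `apply`.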

Doctor J

Here is a comparison of a few different ways to do it. I prefer returning a Series: it is reasonably succinct, clear, and efficient. Thanks to @Siraj S for the inspiration.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(1000000, 5), columns=list('abcde'))
grp = df.groupby((df.a * 100).astype(int))

%timeit grp.apply(lambda g: pd.DataFrame([{'n': g.e.count(), 'x': (g.b * g.c).sum() / g.c.sum(), 'y': g.d.mean(), 'z': g.e.std()}])).reset_index(level=1, drop=True)
1 loop, best of 3: 328 ms per loop

%timeit grp.apply(lambda g: (g.e.count(), (g.b * g.c).sum() / g.c.sum(), g.d.mean(), g.e.std())).apply(pd.Series)
1 loop, best of 3: 266 ms per loop

%timeit grp.apply(lambda g: pd.Series({'n': g.e.count(), 'x': (g.b * g.c).sum() / g.c.sum(), 'y': g.d.mean(), 'z': g.e.std()}))
1 loop, best of 3: 265 ms per loop

%timeit grp.apply(lambda g: {'n': g.e.count(), 'x': (g.b * g.c).sum() / g.c.sum(), 'y': g.d.mean(), 'z': g.e.std()}).apply(pd.Series)
1 loop, best of 3: 273 ms per loop

%timeit pd.concat([grp.apply(lambda g: g.e.count()), grp.apply(lambda g: (g.b * g.c).sum() / g.c.sum()), grp.apply(lambda g: g.d.mean()), grp.apply(lambda g: g.e.std())], axis=1)
1 loop, best of 3: 708 ms per loop
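As a minimal sketch of the Series-returning variant, here it is applied to the small frame from the question. The function name `summarize` and the explicit `[['b', 'c']]` column selection are additions for illustration; the selection keeps the grouping column out of each group, which avoids a deprecation warning in recent pandas releases.

```python
import pandas as pd

d = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [3, 4, 5, 6], 'c': [7, 8, 9, 0]})

def summarize(g):
    # Return one Series per group; its index labels become the new column names.
    return pd.Series({'x': g.b.mean(), 'y': (g.b * g.c).sum()})

# Select only the value columns so each group passed to summarize() holds b and c.
result = d.groupby('a')[['b', 'c']].apply(summarize)
print(result)
```

One thing to be aware of: because each group's values are packed into a single Series, mixed integer and float results are upcast to a common dtype, so `y` comes back as float here rather than int.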

Edited at 2021-02-26