Create entirely new dataframe efficiently from groupby .agg() or .apply() in Pandas?

Doctor J

I'd like to create a new dataframe from the results of a groupby on another. The result should have one row per group (basically a vectorized map-reduce), and the new column names bear no relation to the existing names. This seems like a natural use for agg, but agg only seems to produce columns that already exist.

d = pd.DataFrame({'a': [0,0,1,1], 'b': [3,4,5,6], 'c': [7,8,9,0]})

   a  b  c
0  0  3  7
1  0  4  8
2  1  5  9
3  1  6  0

agg() will create new columns when called on a single selected column (a Series group):

d.groupby('a')['b'].agg({'x': lambda g: g.sum()})

    x
a    
0   7
1  11

But frustratingly not with a DataFrame:

d.groupby('a').agg({'x': lambda g: g.b.sum()})
KeyError: 'x'
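
A possible reading of the KeyError: on a DataFrame group, the dict keys passed to agg() are looked up as existing column names, so 'x' is not found. As a minimal sketch (not part of the original question, and assuming pandas 0.25 or later), named aggregation lets each keyword name a new output column built from one source column. Note also that the dict-renaming form on a single column shown above was deprecated in pandas 0.20 and removed in 1.0, so it only runs on older versions. The output name `n` is made up for illustration.

```python
import pandas as pd

d = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [3, 4, 5, 6], 'c': [7, 8, 9, 0]})

# Named aggregation: new_column=(source_column, reducer).
# The keyword names ('x', 'n') are new; only the source columns must already exist.
out = d.groupby('a').agg(
    x=('b', 'sum'),    # sum of b per group
    n=('c', 'count'),  # number of non-null c values per group (illustrative)
)
print(out)
```

Each reducer here sees only one column, so this form does not directly cover expressions that combine several columns within a group.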

I can do it by returning a one-row DataFrame from apply():

d.groupby('a').apply(lambda g: pd.DataFrame([{'x': g.b.mean(), 'y': (g.b * g.c).sum()}])).reset_index(level=1, drop=True)

     x   y
a         
0  3.5  53
1  5.5  45

but this is ugly and, as you can imagine, creating a new dict, list, and DataFrame for every group (that is, for every output row) is slow for even modestly-sized inputs.

Doctor J

Here is a comparison of a few different ways to do it. I prefer returning a Series: it is reasonably succinct, clear, and efficient. Thanks to @Siraj S for the inspiration.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000000, 5), columns=list('abcde'))
grp = df.groupby((df.a * 100).astype(int))

%timeit grp.apply(lambda g: pd.DataFrame([{'n': g.e.count(), 'x': (g.b * g.c).sum() / g.c.sum(), 'y': g.d.mean(), 'z': g.e.std()}])).reset_index(level=1, drop=True)
1 loop, best of 3: 328 ms per loop

%timeit grp.apply(lambda g: (g.e.count(), (g.b * g.c).sum() / g.c.sum(), g.d.mean(), g.e.std())).apply(pd.Series)
1 loop, best of 3: 266 ms per loop

%timeit grp.apply(lambda g: pd.Series({'n': g.e.count(), 'x': (g.b * g.c).sum() / g.c.sum(), 'y': g.d.mean(), 'z': g.e.std()}))
1 loop, best of 3: 265 ms per loop

%timeit grp.apply(lambda g: {'n': g.e.count(), 'x': (g.b * g.c).sum() / g.c.sum(), 'y': g.d.mean(), 'z': g.e.std()}).apply(pd.Series)
1 loop, best of 3: 273 ms per loop

%timeit pd.concat([grp.apply(lambda g: g.e.count()), grp.apply(lambda g: (g.b * g.c).sum() / g.c.sum()), grp.apply(lambda g: g.d.mean()), grp.apply(lambda g: g.e.std())], axis=1)
1 loop, best of 3: 708 ms per loop
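
To connect the preferred variant back to the question, here is a small sketch that applies the Series-returning approach to the question's four-row frame. The helper name `summarize` is hypothetical. The second half is an alternative that is not in the answer above: it precomputes the product column and then uses named aggregation (assuming pandas 0.25 or later), which avoids calling a Python function per group. It has not been timed here, so treat any speed benefit as an assumption to check on your own data.

```python
import pandas as pd

d = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [3, 4, 5, 6], 'c': [7, 8, 9, 0]})

def summarize(g):
    # Return one Series per group; its index labels become the result's columns.
    return pd.Series({'x': g['b'].mean(), 'y': (g['b'] * g['c']).sum()})

# Select the value columns explicitly so the grouping key is not passed to summarize.
by_apply = d.groupby('a')[['b', 'c']].apply(summarize)

# Alternative sketch (not from the answer): compute the row-wise product first,
# then reduce each column with a built-in aggregation.
by_agg = (
    d.assign(bc=d['b'] * d['c'])   # 'bc' is a temporary helper column
     .groupby('a')
     .agg(x=('b', 'mean'), y=('bc', 'sum'))
)

print(by_apply)
print(by_agg)
```

Both results have one row per value of a, with x equal to 3.5 and 5.5 and y equal to 53 and 45, matching the output shown in the question. The apply version returns y as floats because the Series mixes a float and an integer.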
