How to iterate over consecutive chunks of Pandas dataframe efficiently

Andrew Clegg

I have a large dataframe (several million rows).

I want to be able to do a groupby operation on it, but just grouping by arbitrary consecutive (preferably equal-sized) subsets of rows, rather than using any particular property of the individual rows to decide which group they go to.

The use case: I want to apply a function to each row via a parallel map in IPython. It doesn't matter which rows go to which back-end engine, as the function calculates a result based on one row at a time. (Conceptually at least; in reality it's vectorized.)

I've come up with something like this:

# Generate a number from 0-9 for each row, indicating which tenth of the DF it belongs to
max_idx = dataframe.index.max()
tenths = ((10 * dataframe.index) / (1 + max_idx)).astype(np.uint32)

# Use this value to perform a groupby, yielding 10 consecutive chunks
groups = [g[1] for g in dataframe.groupby(tenths)]

# Process chunks in parallel
results = dview.map_sync(my_function, groups)

But this seems very long-winded, and it doesn't guarantee equal-sized chunks, especially if the index is sparse or non-integer.

Any suggestions for a better way?

Thanks!

DSM

In practice, you can't guarantee equal-sized chunks. The number of rows (N) might be prime, in which case the only equal-sized splits are N chunks of one row each or a single chunk of N rows. Because of this, real-world chunking typically uses a fixed size and allows for a smaller chunk at the end. I tend to pass an array to groupby. Starting from:

>>> df = pd.DataFrame(np.random.rand(15, 5), index=[0]*15)
>>> df[0] = range(15)
>>> df
    0         1         2         3         4
0   0  0.746300  0.346277  0.220362  0.172680
0   1  0.657324  0.687169  0.384196  0.214118
0   2  0.016062  0.858784  0.236364  0.963389
[...]
0  13  0.510273  0.051608  0.230402  0.756921
0  14  0.950544  0.576539  0.642602  0.907850

[15 rows x 5 columns]

where I've deliberately made the index uninformative by setting it to 0, we simply decide on our size (here 10) and integer-divide an array of row positions by it:

>>> df.groupby(np.arange(len(df))//10)
<pandas.core.groupby.DataFrameGroupBy object at 0xb208492c>
>>> for k,g in df.groupby(np.arange(len(df))//10):
...     print(k,g)
...     
0    0         1         2         3         4
0  0  0.746300  0.346277  0.220362  0.172680
0  1  0.657324  0.687169  0.384196  0.214118
0  2  0.016062  0.858784  0.236364  0.963389
[...]
0  8  0.241049  0.246149  0.241935  0.563428
0  9  0.493819  0.918858  0.193236  0.266257

[10 rows x 5 columns]
1     0         1         2         3         4
0  10  0.037693  0.370789  0.369117  0.401041
0  11  0.721843  0.862295  0.671733  0.605006
[...]
0  14  0.950544  0.576539  0.642602  0.907850

[5 rows x 5 columns]
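
To hand the chunks to a parallel map, as in the question, the groups can be collected into a list first. The following is a minimal sketch of that step, not a definitive implementation; the helper name `groupby_chunks` is hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def groupby_chunks(df, size):
    """Return a list of consecutive chunks of at most `size` rows each.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Position-based labels 0, 0, ..., 1, 1, ...; index values are never consulted.
    labels = np.arange(len(df)) // size
    # groupby sorts the integer keys by default, so chunks come out in row order.
    return [chunk for _, chunk in df.groupby(labels)]


# Same shape as the example above: 15 rows with an uninformative index.
df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)
chunks = groupby_chunks(df, 10)
print([len(c) for c in chunks])  # [10, 5]
```

The resulting list could then be passed to something like the question's `dview.map_sync(my_function, chunks)`.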

Methods based on slicing the DataFrame can fail when the index isn't compatible with that, although you can always use .iloc[a:b] to ignore the index values and access data by position.
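
As a rough sketch of the positional approach, the first helper below yields fixed-size chunks with .iloc, and the second splits the rows into a given number of consecutive pieces whose lengths differ by at most one. Both function names are hypothetical, and using np.array_split on the row positions is one possible way to get near-equal sizes, not something the answer above prescribes.

```python
import numpy as np
import pandas as pd


def iter_chunks(df, size):
    """Yield consecutive positional slices of at most `size` rows (hypothetical helper)."""
    for start in range(0, len(df), size):
        # .iloc slices by position, so duplicate or non-integer index values are fine.
        yield df.iloc[start:start + size]


def split_evenly(df, n_chunks):
    """Split into consecutive pieces whose lengths differ by at most one (hypothetical helper)."""
    # Split the row positions, not the DataFrame itself, then slice by position.
    positions = np.array_split(np.arange(len(df)), n_chunks)
    # Skip empty pieces, which occur when n_chunks exceeds the number of rows.
    return [df.iloc[pos[0]:pos[-1] + 1] for pos in positions if len(pos)]


df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)
print([len(c) for c in iter_chunks(df, 10)])  # [10, 5]
print([len(c) for c in split_evenly(df, 4)])  # [4, 4, 4, 3]
```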
