How to efficiently columnize (=pivoting) pandas DataFrame (with groupby)?

geo_so

To give you the context of the question:

I have a decent-sized SQL table (72M rows, 6 GB) with data that can be understood as "column-based", e.g.:

------------------------------
| fk_id | date       | field |
------------------------------
|     1 | 2001-01-02 |    24 |
|     1 | 2001-01-03 |    25 |
|     1 | 2001-01-04 |    21 |
|     1 | 2001-01-05 |    20 |
|     1 | 2001-01-06 |    30 |
|     1 | 2001-01-07 |    33 |
|            ....            |
|     2 | 2001-01-02 |    10 |
|     2 | 2001-01-03 |    15 |
|     2 | 2001-01-04 |    12 |
|     2 | 2001-01-05 |    11 |
|     2 | 2001-01-06 |    10 |
|     2 | 2001-01-07 |    12 |
|            ....            |
|            ....            |
| 12455 | 2015-01-01 |    99 |
| 12456 | 2005-10-10 |    10 |
| 12456 | 2005-10-11 |    10 |
|            ....            |
------------------------------

The desired end result, as a pandas.DataFrame, should look like this: date becomes the index, the foreign keys become the column names, and the values of the field column fill the matrix:

------------------------------------------------------
| date       |     1 |     2 |  .... | 12455 | 12456 | 
------------------------------------------------------
| 2001-01-02 |    24 |    10 |  .... |   NaN |   NaN |
| 2001-01-03 |    25 |    15 |  .... |   NaN |   NaN |
| 2001-01-04 |    21 |    12 |  .... |   NaN |   NaN |
| 2001-01-05 |    20 |    11 |  .... |   NaN |   NaN |
| 2001-01-06 |    30 |    10 |  .... |   NaN |   NaN |
| 2001-01-07 |    33 |    12 |  .... |   NaN |   NaN |
|       .... |    .. |    .. |  .... |  .... |  .... |
| 2005-10-10 |    50 |     4 |  .... |   NaN |    10 |
| 2005-10-11 |    51 |     3 |  .... |   NaN |    10 |
|       .... |    .. |    .. |  .... |  .... |  .... |
| 2015-01-01 |    40 |   NaN |  .... |    50 |    99 |
------------------------------------------------------

So far, I have accomplished this with the following code:

def _split_by_fk(self, df):
    """
    :param df: pandas.DataFrame with a 'date' index and 'fk_id'/'field' columns
    :return: pandas.DataFrame with one column per fk_id
    """
    data = dict()
    for fk_id, group in df.groupby('fk_id'):
        data[fk_id] = group['field']
    return pd.DataFrame(data)

def get_data(self, start, end):
    s = select([daily_data.c.date, daily_data.c.fk_id, daily_data.c.field])\
        .where(and_(end >= daily_data.c.date, daily_data.c.date >= start))\
        .order_by(daily_data.c.fk_id, daily_data.c.date)
    data = pd.read_sql(s, con=db_engine, index_col='date')
    return self._split_by_fk(data)


>>> get_data('1960-01-01', '1989-12-31')

which basically:

  1. Queries the SQL DB via sqlalchemy, directly through the pandas.read_sql function.
  2. Groups the received DataFrame by fk_id.
  3. Iterates over the groupby result object and puts each group in a dictionary.
  4. Converts the dict into a DataFrame.
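The four steps above (minus the SQL query) can be reproduced on a small toy frame with the question's column names; the data values here are illustrative only:

```python
import pandas as pd

# Toy data in the same shape as the SQL result: a 'date' index
# plus 'fk_id' and 'field' columns.
df = pd.DataFrame(
    {
        "fk_id": [1, 1, 1, 2, 2, 2],
        "field": [24, 25, 21, 10, 15, 12],
    },
    index=pd.to_datetime(["2001-01-02", "2001-01-03", "2001-01-04"] * 2),
)
df.index.name = "date"

# Steps 2-4: group by fk_id, collect each group's 'field' Series,
# then rebuild a wide DataFrame (the Series align on the date index).
data = {fk_id: group["field"] for fk_id, group in df.groupby("fk_id")}
wide = pd.DataFrame(data)

print(wide)
```

The dict-of-Series constructor aligns all groups on the union of their date indexes, which is what produces the NaN cells for foreign keys without data on a given date.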

Querying 29 years of daily data, yielding 13,813 columns, takes 4 min 38 s with the above approach (the whole DataFrame takes up 796.5 MB in memory). %lprun shows that most of the time is spent in the read_sql function and the rest in _split_by_fk (excerpt of the output):

% Time   Line Contents
===============================================================
83.8     data = pd.read_sql(s, con=db_engine, index_col='date')
16.2     return self._split_by_fk(data)

My code does not feel very elegant, as I am collecting all groups in a dictionary only to transform them into a DataFrame again.

Now to my actual question: Is there a (more) efficient/pythonic way to "columnize" a pandas.DataFrame in the manner shown above?


PS: I would not be unhappy about pointers and hints in more general directions regarding the handling of such data structures and amounts of data; still, I think it should be possible to solve everything "small data"-style.

BrenBarn

If I understand you right, you can do df.pivot(index='date', columns='fk_id', values='field').
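A minimal sketch of that pivot on toy data (column names as in the question, values illustrative):

```python
import pandas as pd

# Long-format frame with the question's columns: date, fk_id, field.
long_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2001-01-02", "2001-01-03"] * 2),
        "fk_id": [1, 1, 2, 2],
        "field": [24, 25, 10, 15],
    }
)

# One pivot call replaces the groupby/dict/DataFrame round-trip;
# missing (date, fk_id) pairs simply become NaN in the wide frame.
wide = long_df.pivot(index="date", columns="fk_id", values="field")
```

Note that pivot raises a ValueError if a (date, fk_id) pair occurs more than once; in that case pivot_table with an aggregation function is the usual alternative.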

I think that it should be possible to solve everything "small data"-style.

Good luck with that. A DataFrame with 12000 columns is unlikely to perform well.
