Fillna (forward fill) on a large dataframe efficiently with groupby?

trench

What is the most efficient way to forward fill information in a large dataframe?

I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.

Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?

id       start_date   end_date    is_current  location  dimensions...
xyz987   2016-03-11   2016-04-02  Expired       CA      lots_of_stuff
xyz987   2016-04-03   2016-04-21  Expired       NaN     lots_of_stuff
xyz987   2016-04-22          NaN  Current       CA      lots_of_stuff

That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled in on one row but blank on the next. I know that the location has not changed, but the blank value makes the row register as a unique record.

I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?

cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)

There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a

df.fillna(method='ffill', inplace=True)

but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).

Alexander

How about forward filling each group?

 df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
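As a possible alternative to `apply` with a lambda, pandas also provides a built-in `ffill` on groupby objects, which is often faster when there are many small groups (such as 75,000 IDs). The following is a minimal sketch, not a benchmarked solution. The sample rows, the second ID `zzz111`, and the `fill_cols` list are assumptions made up for illustration. It fills only `location`, on the assumption that forward filling every column would also overwrite the open-ended `NaN` in `end_date` on the `Current` row.

```python
import numpy as np
import pandas as pd

# Illustrative data modeled on the question; the "zzz111" rows are hypothetical.
df = pd.DataFrame({
    "id": ["xyz987", "xyz987", "xyz987", "zzz111", "zzz111"],
    "start_date": ["2016-03-11", "2016-04-03", "2016-04-22", "2016-03-01", "2016-03-15"],
    "end_date": ["2016-04-02", "2016-04-21", np.nan, "2016-03-14", np.nan],
    "is_current": ["Expired", "Expired", "Current", "Expired", "Current"],
    "location": ["CA", np.nan, "CA", np.nan, "NY"],
})

# Assumption: only these columns should inherit the previous value.
fill_cols = ["location"]

# Forward fill depends on row order, so order each id chronologically first.
df = df.sort_values(["id", "start_date"])

# Fill within each id only; values never carry over from one id to the next.
filled = df.groupby("id")[fill_cols].ffill()

# Selecting fill_cols again guards against pandas versions that
# also return the grouping column in the result.
df[fill_cols] = filled[fill_cols]

print(df)
```

In this sketch the blank `location` on the second `xyz987` row takes the value from the row above it, while the first `zzz111` row stays blank because there is no earlier row for that ID to copy from.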
