Fillna (forward fill) on a large dataframe efficiently with groupby?

trench

What is the most efficient way to forward fill information in a large dataframe?

I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.

Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?

id       start_date   end_date    is_current  location  dimensions...
xyz987   2016-03-11   2016-04-02  Expired       CA      lots_of_stuff
xyz987   2016-04-03   2016-04-21  Expired       NaN     lots_of_stuff
xyz987   2016-04-22          NaN  Current       CA      lots_of_stuff

That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled in on earlier rows for an ID but blank on the next row. I know the location has not changed, but the blank value makes the row count as a unique record.

I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?

cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)

There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a

df.fillna(method='ffill', inplace=True)

but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
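To make the concern concrete, here is a minimal sketch on made-up data (the second ID and all values are hypothetical, not taken from the real dataset). It shows how an ungrouped forward fill can carry the last value of one ID into the first rows of the next, while a grouped fill does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second ID starts with a missing location.
df = pd.DataFrame({
    "id": ["xyz987", "xyz987", "abc123", "abc123"],
    "location": ["CA", np.nan, np.nan, "NY"],
})

# Ungrouped forward fill: the NaN in the first abc123 row is filled with
# "CA", a value that belongs to xyz987.
leaked = df.ffill()

# Grouped forward fill: values never cross an ID boundary, so the first
# abc123 row stays NaN.
grouped = df.groupby("id")["location"].ffill()
```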

Alexander

How about forward filling each group?

df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
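A possibly faster variant, offered as a sketch and not as part of the original answer: groupby objects have a built-in `ffill()` method, which avoids calling a Python lambda once per group (this may matter with about 75,000 IDs). The sample frame below is made up, loosely modelled on the question's example rows, and the sort step assumes the rows should be in `start_date` order within each ID.

```python
import numpy as np
import pandas as pd

# Toy frame modelled on the question's sample (values are illustrative).
df = pd.DataFrame({
    "id": ["xyz987", "xyz987", "xyz987", "abc123"],
    "start_date": ["2016-03-11", "2016-04-03", "2016-04-22", "2016-01-01"],
    "location": ["CA", np.nan, "CA", np.nan],
})

# Assumption: rows must be in chronological order within each ID.
df = df.sort_values(["id", "start_date"])

# Every column except the grouping key gets filled.
fill_cols = [c for c in df.columns if c != "id"]

# Fill each selected column from the previous row of the same ID and
# write the result back; the assignment aligns on the row index.
df[fill_cols] = df.groupby("id")[fill_cols].ffill()
```

In recent pandas versions the result of `groupby("id").ffill()` leaves out the `id` column itself, which is why the sketch assigns the filled columns back onto the original frame instead of replacing `df` outright.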


edited at 2021-02-28
