I have a dataframe that I would like to filter down to only the rows where a certain column first changes within a group.
For example, my dataframe looks like this:
GROUP | DATE | QUANTITY |
---|---|---|
A | 2020-01-01 | 2 |
A | 2020-01-02 | 2 |
A | 2020-01-03 | 3 |
A | 2020-01-04 | 2 |
B | 2020-01-01 | 1 |
B | 2020-01-04 | 2 |
C | 2020-01-01 | 3 |
C | 2020-01-06 | 2 |
C | 2020-01-07 | 2 |
I would like to be able to produce the table below:
GROUP | DATE | QUANTITY |
---|---|---|
A | 2020-01-01 | 2 |
A | 2020-01-03 | 3 |
A | 2020-01-04 | 2 |
B | 2020-01-01 | 1 |
B | 2020-01-04 | 2 |
C | 2020-01-01 | 3 |
C | 2020-01-06 | 2 |
In other words, within each group, sorted by date, I only want to keep the first row each time QUANTITY changes.
How can I achieve this without resorting to an inefficient for loop?
Convert DATE to a datetime and sort the values. Then, using shift, create a mask that keeps rows where the group changes (i.e. the first row within each group) or where the value changes; this is logically equivalent to keeping the rows within each group where the quantity changes.
import pandas as pd

df['DATE'] = pd.to_datetime(df['DATE'])
df = df.sort_values(['GROUP', 'DATE'])
m = (df['QUANTITY'].ne(df['QUANTITY'].shift())  # Quantity changes
     | df['GROUP'].ne(df['GROUP'].shift()))     # Group changes
df[m]
GROUP DATE QUANTITY
0 A 2020-01-01 2
2 A 2020-01-03 3
3 A 2020-01-04 2
4 B 2020-01-01 1
5 B 2020-01-04 2
6 C 2020-01-01 3
7 C 2020-01-06 2
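For reference, here is a self-contained sketch that rebuilds the example dataframe from the question and applies a variant of the same idea. It is not the code above: it uses a grouped shift, so the previous value is looked up within each group and no separate group-change check is needed. The names prev_qty and result are only illustrative.

```python
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame({
    "GROUP": ["A", "A", "A", "A", "B", "B", "C", "C", "C"],
    "DATE": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
             "2020-01-01", "2020-01-04",
             "2020-01-01", "2020-01-06", "2020-01-07"],
    "QUANTITY": [2, 2, 3, 2, 1, 2, 3, 2, 2],
})
df["DATE"] = pd.to_datetime(df["DATE"])
df = df.sort_values(["GROUP", "DATE"])

# Previous QUANTITY within the same group; NaN on the first row of each group.
prev_qty = df.groupby("GROUP")["QUANTITY"].shift()

# A value never equals NaN, so the first row of each group is always kept.
result = df[df["QUANTITY"].ne(prev_qty)]
print(result)
```

On this data it should select the same rows as the mask above. One caveat for both versions: if QUANTITY itself contains missing values, consecutive NaN rows compare as different and are all kept.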