Group rows in dataframe by assigning values as a column in pandas dataframe

debugcn Published at Dev

Shubham R

Consider this dataframe:

import pandas as pd
   df = pd.DataFrame({
    'id': [458,459,464,469,507,512,516,519,519,615]
})

i want to find the difference of 2nd row - 1st row so i implemented:

df['diff'] = df['id'] - df['id'].shift(-1)
df.fillna(1)

    id    diff
0   458   -1.0
1   459   -5.0
2   464   -5.0
3   469   -38.0
4   507   -5.0
5   512   -4.0
6   516   -3.0
7   519    0.0
8   519   -96.0
9   615    1.0

Now i want to group these diff column in such a way that, whenever the difference between the two rows is greater than 10, make a new column group and set all the above rows to 1, and so on.

As you can see in column diff diffrence between 4th row and 3rd

    id    diff    group
0   458   -1.0     1
1   459   -5.0     1
2   464   -5.0     1
3   469   -38.0    1
4   507   -5.0     2
5   512   -4.0     2
6   516   -3.0     2
7   519    0.0     2
8   519   -96.0    2
9   615    1.0     3

Any ideas how to achieve this?

jezrael

You can use diff, compare and then cumsum boolean mask, last add 1:

print (df['diff'].diff())
0     NaN
1    -4.0
2     0.0
3   -33.0
4    33.0
5     1.0
6     1.0
7     3.0
8   -96.0
9    97.0
Name: diff, dtype: float64

df['group'] = (df['diff'].diff() > 10).cumsum() + 1
print (df)
    id  diff  group
0  458  -1.0      1
1  459  -5.0      1
2  464  -5.0      1
3  469 -38.0      1
4  507  -5.0      2
5  512  -4.0      2
6  516  -3.0      2
7  519   0.0      2
8  519 -96.0      2
9  615   1.0      3

df = df.assign(group=df['diff'].diff().gt(10).cumsum().add(1))
print (df)
    id  diff  group
0  458  -1.0      1
1  459  -5.0      1
2  464  -5.0      1
3  469 -38.0      1
4  507  -5.0      2
5  512  -4.0      2
6  516  -3.0      2
7  519   0.0      2
8  519 -96.0      2
9  615   1.0      3

Collected from the Internet

Please contact [email protected] to delete if infringement.