Is there a way in pandas to groupby and then count unique where another column has a specified value?

debugcn 投稿 Dev

greenstamp :

I have a pandas dataframe with numerous columns. For simplicity, let's say the columns are, 'country', 'time_bucket', 'category' and 'id'. The 'category' can be either 'staff' or 'student'.

import pandas as pd
    data = {'country':  ['A', 'A', 'A', 'B', 'B',],
            'time_bucket': ['8', '8', '8', '8', '9'],
            'category': ['staff', 'staff', 'student','student','staff'],
            'id': ['101', '172', '122', '142', '132'],
            }
        
        df = pd.DataFrame (data, columns = ['country','time_bucket', 'category', 'id'])
df


country time_bucket category    id
0   A      8      staff        101
1   A      8      staff        172
2   A      8      student      122
3   B      8      student      142
4   B      9      staff        132

I want to find out the total number of staff and the total number of students in a country at a particular time interval and add these as new columns.

I can get the total number of people in a country at a particular time interval:

df['persons_count'] = df.groupby(['time_bucket','country'])['id'].transform('nunique')

country time_bucket category    id  persons_count
0   A      8         staff      101    3
1   A      8         staff      172    3
2   A      8         student    122    3
3   B      8         student    142    1
4   B      9         staff      132    1

However, I cannot work out how to take 'type' into account and add this in to my code.

I want something like this:

country time_bucket category    id  staff_count student_count
0   A     8          staff      101     2           1  
1   A     8          staff      172     2           1
2   A     8          student    122     2           1
3   B     8          student    142     0           1
4   B     9          staff      132     1           0

Any advice would be much appreciated!

Adding a new example which shows the need for unique 'id' count

import pandas as pd
data = {'country':  ['A', 'A', 'A', 'A','B', 'B',],
                'time_bucket': ['8', '8', '8', '8', '8','9'],
                'category': ['staff', 'staff', 'student','student','student','staff'],
                'id': ['101', '172', '122', '122','142', '132'],
                }
        
df = pd.DataFrame (data, columns = ['country','time_bucket', 'category', 'id'])
df

country time_bucket category    id
0   A     8         staff       101
1   A     8         staff       172
2   A     8         student     122
3   A     8         student     122
4   B     8         student     142
5   B     9         staff       132

I want something like this:

country time_bucket category    id  staff_count student_count
0   A     8          staff      101     2           1  
1   A     8          staff      172     2           1
2   A     8          student    122     2           1
3   A     8          student    122     2           1
4   B     8          student    142     0           1
5   B     9          staff      132     1           0

Chris :

import pandas as pd
data = {'country':  ['A', 'A', 'A', 'B', 'B',],
    'time_bucket': ['8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student','student','staff'],
    'id': ['101', '172', '122', '142', '132'],
    }

df = pd.DataFrame (data, columns = ['country','time_bucket', 'category', 'id'])


df['persons_count'] = df.groupby(['time_bucket','country', 'category'])['id'].transform('nunique')

df = df.pivot_table(index=['country','time_bucket','id'], columns='category',values='persons_count').fillna(0)

Output

                     category   staff   student
country time_bucket        id       
      A           8       101     2.0       0.0
                          122     0.0       1.0
                          172     2.0       0.0
      B           8       142     0.0       1.0
                  9       132     1.0       0.0

この記事はインターネットから収集されたものであり、転載の際にはソースを示してください。

侵害の場合は、連絡してください[email protected]