I have a pandas dataframe with numerous columns. For simplicity, let's say the columns are, 'country', 'time_bucket', 'category' and 'id'. The 'category' can be either 'staff' or 'student'.
import pandas as pd
data = {'country': ['A', 'A', 'A', 'B', 'B',],
'time_bucket': ['8', '8', '8', '8', '9'],
'category': ['staff', 'staff', 'student','student','staff'],
'id': ['101', '172', '122', '142', '132'],
}
df = pd.DataFrame (data, columns = ['country','time_bucket', 'category', 'id'])
df
country time_bucket category id
0 A 8 staff 101
1 A 8 staff 172
2 A 8 student 122
3 B 8 student 142
4 B 9 staff 132
I want to find out the total number of staff and the total number of students in a country at a particular time interval and add these as new columns.
I can get the total number of people in a country at a particular time interval:
df['persons_count'] = df.groupby(['time_bucket','country'])['id'].transform('nunique')
country time_bucket category id persons_count
0 A 8 staff 101 3
1 A 8 staff 172 3
2 A 8 student 122 3
3 B 8 student 142 1
4 B 9 staff 132 1
However, I cannot work out how to take 'type' into account and add this in to my code.
I want something like this:
country time_bucket category id staff_count student_count
0 A 8 staff 101 2 1
1 A 8 staff 172 2 1
2 A 8 student 122 2 1
3 B 8 student 142 0 1
4 B 9 staff 132 1 0
Any advice would be much appreciated!
Adding a new example which shows the need for unique 'id' count
import pandas as pd
data = {'country': ['A', 'A', 'A', 'A','B', 'B',],
'time_bucket': ['8', '8', '8', '8', '8','9'],
'category': ['staff', 'staff', 'student','student','student','staff'],
'id': ['101', '172', '122', '122','142', '132'],
}
df = pd.DataFrame (data, columns = ['country','time_bucket', 'category', 'id'])
df
country time_bucket category id
0 A 8 staff 101
1 A 8 staff 172
2 A 8 student 122
3 A 8 student 122
4 B 8 student 142
5 B 9 staff 132
I want something like this:
country time_bucket category id staff_count student_count
0 A 8 staff 101 2 1
1 A 8 staff 172 2 1
2 A 8 student 122 2 1
3 A 8 student 122 2 1
4 B 8 student 142 0 1
5 B 9 staff 132 1 0
import pandas as pd
data = {'country': ['A', 'A', 'A', 'B', 'B',],
'time_bucket': ['8', '8', '8', '8', '9'],
'category': ['staff', 'staff', 'student','student','staff'],
'id': ['101', '172', '122', '142', '132'],
}
df = pd.DataFrame (data, columns = ['country','time_bucket', 'category', 'id'])
df['persons_count'] = df.groupby(['time_bucket','country', 'category'])['id'].transform('nunique')
df = df.pivot_table(index=['country','time_bucket','id'], columns='category',values='persons_count').fillna(0)
Output
category staff student
country time_bucket id
A 8 101 2.0 0.0
122 0.0 1.0
172 2.0 0.0
B 8 142 0.0 1.0
9 132 1.0 0.0
この記事はインターネットから収集されたものであり、転載の際にはソースを示してください。
侵害の場合は、連絡してください[email protected]
コメントを追加