我有一个数据框,其中包含许多患者及其六小时内的测量值,但是对于某些患者,并非所有六小时值都已记录。我想要为每个subject-id,在hour列中添加1到6形式的值,如果小时值已经存在,请将其写入相同的值,否则将其留空。
注意(我稍后将使用缺失值技术来处理此空白值。)
subject_id hour value
2 1 23
2 3 15
2 5 28
2 6 11
3 4 18
3 6 22
这是我想要得到的输出
subject_id hour value
2 1 23
2 2
2 3 15
2 4
2 5 28
2 6 11
3 1
3 2
3 3
3 4 18
3 5
3 6 22
任何人都可以帮助我,使任何帮助将不胜感激
使用DataFrame.reindex
有MultiIndex.from_product
:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,7)],
names=['subject_id','hour'])
df = df.set_index(['subject_id','hour']).reindex(mux).reset_index()
print (df)
subject_id hour value
0 2 1 23.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
另一种方法是先创建所有可能的组合product
,然后再DataFrame.merge
进行左联接:
from itertools import product
df1 = pd.DataFrame(list(product(df['subject_id'].unique(), np.arange(1,7))),
columns=['subject_id','hour'])
df = df1.merge(df, how='left')
print (df)
subject_id hour value
0 2 1 23.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
编辑:如果得到错误:
无法处理非唯一的多索引
表示每个subject_id
都有重复的值hour
。
print (df)
subject_id hour value
0 2 1 23 <- duplicate 2, 1
1 2 1 50 <- duplicate 2, 1
2 2 3 15
3 2 5 28
4 2 6 11
5 3 4 18
6 3 6 22
可能的解决方案是汇总sum
或mean
替代set_index
:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,7)],
names=['subject_id','hour'])
df1 = df.groupby(['subject_id','hour']).sum().reindex(mux).reset_index()
print (df1)
subject_id hour value
0 2 1 73.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
详细说明:
print (df.groupby(['subject_id','hour']).sum())
value
subject_id hour
2 1 73
3 15
5 28
6 11
3 4 18
6 22
或删除重复项:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,7)],
names=['subject_id','hour'])
df1 = (df.drop_duplicates(['subject_id','hour'])
.set_index(['subject_id','hour'])
.reindex(mux)
.reset_index())
print (df1)
subject_id hour value
0 2 1 23.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
详细说明:
print (df.drop_duplicates(['subject_id','hour']))
subject_id hour value
0 2 1 23 <- duplicates are removed
2 2 3 15
3 2 5 28
4 2 6 11
5 3 4 18
6 3 6 22
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句