Let's say I have this simplified dataframe with three columns:
ID sample test_result
P1 Normal 9
P1 Normal 18
P2 Normal 7
P2 Normal 16
P3 Normal 2
P3 Normal 11
P1 Tumor 6
P1 Tumor 15
P2 Tumor 5
P2 Tumor 15
P3 Tumor 3
P3 Tumor 12
I want to know how to sum the test_result values for each ID within each sample type (i.e. Normal, Tumor). Then I want to take the difference between the summed Normal and Tumor test_result values.
I have tried using groupby on the sample column and then the diff() method on the test_result column, but that did not work. I guess I need to apply .sum() first, but I am not sure how.
Here is what I have tried:
df.groupby('sample')['test_result'].diff()
The output I am expecting is like:
ID test_result
P1 6 # (the sum of P1 Normal = 27) - (the sum of P1 Tumor = 21)
P2 3
P3 -2
Any idea how to tackle this?
Use groupby with sum and reshape by unstack:
df = df.groupby(['ID','sample'])['test_result'].sum().unstack()
Or pivot_table:
df = df.pivot_table(index='ID',columns='sample', values='test_result', aggfunc='sum')
and then subtract columns:
df['new'] = df['Normal'] - df['Tumor']
print(df)
sample Normal Tumor new
ID
P1 27 21 6
P2 23 20 3
P3 13 15 -2
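For reference, here is a minimal end-to-end sketch that rebuilds the sample data from the question and reduces the result to the two-column shape the question asks for. The variable names `sums` and `result` are only illustrative, and naming the output column `test_result` is an assumption taken from the expected output in the question.

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "ID": ["P1", "P1", "P2", "P2", "P3", "P3"] * 2,
    "sample": ["Normal"] * 6 + ["Tumor"] * 6,
    "test_result": [9, 18, 7, 16, 2, 11, 6, 15, 5, 15, 3, 12],
})

# Sum test_result per (ID, sample), then move the sample level into columns.
sums = df.groupby(["ID", "sample"])["test_result"].sum().unstack()

# Normal minus Tumor per ID, returned as a two-column frame (ID, test_result).
result = (sums["Normal"] - sums["Tumor"]).rename("test_result").reset_index()
print(result)
# test_result per ID: P1 -> 6, P2 -> 3, P3 -> -2
```

Keeping the summed table in a separate variable, rather than overwriting df, leaves the original long-format data available for further checks.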