Filter pandas dataframe by quantile based on the value of another column

debugcn Published at Dev

thoughtsymmetry

I'm having trouble applying some filter to my dataset, can't get my head around on how to do it. I have a dataframe that consist of 7 columns. One column, participant_id that identifies a participant (numerical value unique to each participant), a second column that indicates to which group each partipant belongs to (group), a third column called trial that identifies the trial (each participant performs multiple trials of a task) and four columns var 1, var 2, var 3, var 4 (a numerical value corresponding to four variables recorded of each trial). There are participants with 100+ trials, and participants with around 50 trials.

Short example:

       participant_id        group         trial       var1        var2       var3        var4        
         189                   A             1      -0.231046    0.245615  -0.581238   -0.593562  
         189                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             3      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             4      -0.231046    0.245615  -0.581238   -0.593562
         345                   B             1         NaN       0.245615  -0.581238   -0.593562
         345                   B            378     -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         432                   B            517     -0.231046    0.245615     NaN      -0.593562
         432                   B             2      -0.231046    0.245615     NaN      -0.593562
         432                   B            333     -0.231046    0.245615  -0.581238   -0.593562

Example for 4 participants. Note that the trial number is not unique, as two different participants can have a trial '1', and the same participant also can have multiple trial '1'. (Don't mind that the value of the var columns is the same for every row, in the real dataset that's not the case). The real dataset is 10000 rows.

What I need to do is to keep the top n quantile trials of each unique participant based on the value of var1. I was thinking of using the pandas .quantile(n) function, but I don't know how to tell it that it should analyze each participant separately, and not the whole row to get the quantile.

Any help is appreciated.

Rob Raymond

apply() on a group is passed a Dataframe of each group
have provided example using lambda as well as stand alone function
sample data as noted does not have any feature to demonstrate it's function

df = pd.read_csv(io.StringIO("""       participant_id        group         trial       var1        var2       var3        var4        
         189                   A             1      -0.231046    0.245615  -0.581238   -0.593562  
         189                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             3      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             4      -0.231046    0.245615  -0.581238   -0.593562
         345                   B             1         NaN       0.245615  -0.581238   -0.593562
         345                   B            378     -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         432                   B            517     -0.231046    0.245615     NaN      -0.593562
         432                   B             2      -0.231046    0.245615     NaN      -0.593562
         432                   B            333     -0.231046    0.245615  -0.581238   -0.593562
"""), sep="\s+")

def f(d):
    return d.quantile()

df.groupby("participant_id", as_index=False).apply(f)
df.groupby("participant_id", as_index=False).apply(lambda d: d.quantile())

updated

question was how to utilise pandas quantile capabilities within groups, answered by first part
this specifically shows how to use this approach to filter to rows within quantiles
have generated a random dataset that is same shape for demonstration purposes

# generate a useful dataset to analyse...
s = 100
df = pd.DataFrame({"participant_id":np.random.choice([189,227,345,432],s),
             "trial":np.random.randint(1,6,s),
             "group":np.random.choice(["A","B"],s),
              "var1":np.random.rand(s),
              "var2":np.random.rand(s),
              "var3":np.random.rand(s),
              "var4":np.random.rand(s),
             })

# split into 10 quantile bins and take 0th bin
# split into 10 quantile bins and take 0th bin
df.groupby("participant_id").apply(
    lambda d: d.loc[pd.qcut(d.var1, q=10, retbins=False, labels=False).le(0)]).droplevel(0)

	participant_id	trial	group	var1	var2	var3	var4
13	189	3	A	0.0273875	0.87134	0.555792	0.67094
85	189	1	A	0.0106758	0.352578	0.481009	0.910989
91	189	4	A	0.00835706	0.644102	0.990459	0.816669
4	227	5	B	0.10132	0.870446	0.389972	0.313782
71	227	4	B	0.0221867	0.17566	0.659024	0.910838
74	227	1	A	0.0762526	0.458995	0.492384	0.556408
28	345	5	A	0.130674	0.0336628	0.0429884	0.799307
82	345	1	B	0.234522	0.371108	0.451911	0.54528
93	345	4	B	0.272915	0.594262	0.392285	0.56374
16	432	2	B	0.0403964	0.880132	0.45438	0.0466626
18	432	5	A	0.0884496	0.304541	0.969059	0.949315
22	432	2	B	0.115796	0.0371306	0.631284	0.537881
55	432	2	B	0.102859	0.0416843	0.761466	0.56438

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-06-18

Comments

0 comments

From Dev

Related Related

Article