I'm having trouble applying some filter to my dataset, can't get my head around on how to do it. I have a dataframe that consist of 7 columns. One column, participant_id
that identifies a participant (numerical value unique to each participant), a second column that indicates to which group each partipant belongs to (group
), a third column called trial
that identifies the trial (each participant performs multiple trials of a task) and four columns var 1
, var 2
, var 3
, var 4
(a numerical value corresponding to four variables recorded of each trial). There are participants with 100+ trials, and participants with around 50 trials.
Short example:
participant_id group trial var1 var2 var3 var4
189 A 1 -0.231046 0.245615 -0.581238 -0.593562
189 A 2 -0.231046 0.245615 -0.581238 -0.593562
189 A 3 -0.231046 0.245615 -0.581238 -0.593562
189 A 4 -0.231046 0.245615 -0.581238 -0.593562
345 B 1 NaN 0.245615 -0.581238 -0.593562
345 B 378 -0.231046 0.245615 -0.581238 -0.593562
227 A 1 -0.231046 0.245615 -0.581238 -0.593562
227 A 1 -0.231046 0.245615 -0.581238 -0.593562
227 A 2 -0.231046 0.245615 -0.581238 -0.593562
432 B 517 -0.231046 0.245615 NaN -0.593562
432 B 2 -0.231046 0.245615 NaN -0.593562
432 B 333 -0.231046 0.245615 -0.581238 -0.593562
Example for 4 participants. Note that the trial number is not unique, as two different participants can have a trial '1', and the same participant also can have multiple trial '1'. (Don't mind that the value of the var columns is the same for every row, in the real dataset that's not the case). The real dataset is 10000 rows.
What I need to do is to keep the top n quantile trials of each unique participant based on the value of var1
. I was thinking of using the pandas .quantile(n) function, but I don't know how to tell it that it should analyze each participant separately, and not the whole row to get the quantile.
Any help is appreciated.
apply()
on a group is passed a Dataframe of each grouplambda
as well as stand alone functiondf = pd.read_csv(io.StringIO(""" participant_id group trial var1 var2 var3 var4
189 A 1 -0.231046 0.245615 -0.581238 -0.593562
189 A 2 -0.231046 0.245615 -0.581238 -0.593562
189 A 3 -0.231046 0.245615 -0.581238 -0.593562
189 A 4 -0.231046 0.245615 -0.581238 -0.593562
345 B 1 NaN 0.245615 -0.581238 -0.593562
345 B 378 -0.231046 0.245615 -0.581238 -0.593562
227 A 1 -0.231046 0.245615 -0.581238 -0.593562
227 A 1 -0.231046 0.245615 -0.581238 -0.593562
227 A 2 -0.231046 0.245615 -0.581238 -0.593562
432 B 517 -0.231046 0.245615 NaN -0.593562
432 B 2 -0.231046 0.245615 NaN -0.593562
432 B 333 -0.231046 0.245615 -0.581238 -0.593562
"""), sep="\s+")
def f(d):
return d.quantile()
df.groupby("participant_id", as_index=False).apply(f)
df.groupby("participant_id", as_index=False).apply(lambda d: d.quantile())
# generate a useful dataset to analyse...
s = 100
df = pd.DataFrame({"participant_id":np.random.choice([189,227,345,432],s),
"trial":np.random.randint(1,6,s),
"group":np.random.choice(["A","B"],s),
"var1":np.random.rand(s),
"var2":np.random.rand(s),
"var3":np.random.rand(s),
"var4":np.random.rand(s),
})
# split into 10 quantile bins and take 0th bin
# split into 10 quantile bins and take 0th bin
df.groupby("participant_id").apply(
lambda d: d.loc[pd.qcut(d.var1, q=10, retbins=False, labels=False).le(0)]).droplevel(0)
participant_id | trial | group | var1 | var2 | var3 | var4 | |
---|---|---|---|---|---|---|---|
13 | 189 | 3 | A | 0.0273875 | 0.87134 | 0.555792 | 0.67094 |
85 | 189 | 1 | A | 0.0106758 | 0.352578 | 0.481009 | 0.910989 |
91 | 189 | 4 | A | 0.00835706 | 0.644102 | 0.990459 | 0.816669 |
4 | 227 | 5 | B | 0.10132 | 0.870446 | 0.389972 | 0.313782 |
71 | 227 | 4 | B | 0.0221867 | 0.17566 | 0.659024 | 0.910838 |
74 | 227 | 1 | A | 0.0762526 | 0.458995 | 0.492384 | 0.556408 |
28 | 345 | 5 | A | 0.130674 | 0.0336628 | 0.0429884 | 0.799307 |
82 | 345 | 1 | B | 0.234522 | 0.371108 | 0.451911 | 0.54528 |
93 | 345 | 4 | B | 0.272915 | 0.594262 | 0.392285 | 0.56374 |
16 | 432 | 2 | B | 0.0403964 | 0.880132 | 0.45438 | 0.0466626 |
18 | 432 | 5 | A | 0.0884496 | 0.304541 | 0.969059 | 0.949315 |
22 | 432 | 2 | B | 0.115796 | 0.0371306 | 0.631284 | 0.537881 |
55 | 432 | 2 | B | 0.102859 | 0.0416843 | 0.761466 | 0.56438 |
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments