Filter pandas dataframe by quantile based on the value of another column

thoughtsymmetry

I'm having trouble applying a filter to my dataset and can't get my head around how to do it. I have a dataframe that consists of 7 columns: participant_id, which identifies a participant (a numerical value unique to each participant); group, which indicates the group each participant belongs to; trial, which identifies the trial (each participant performs multiple trials of a task); and four columns var1, var2, var3, var4 (numerical values for four variables recorded in each trial). Some participants have 100+ trials, others around 50.

Short example:

       participant_id        group         trial       var1        var2       var3        var4        
         189                   A             1      -0.231046    0.245615  -0.581238   -0.593562  
         189                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             3      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             4      -0.231046    0.245615  -0.581238   -0.593562
         345                   B             1         NaN       0.245615  -0.581238   -0.593562
         345                   B            378     -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         432                   B            517     -0.231046    0.245615     NaN      -0.593562
         432                   B             2      -0.231046    0.245615     NaN      -0.593562
         432                   B            333     -0.231046    0.245615  -0.581238   -0.593562

Example for 4 participants. Note that the trial number is not unique: two different participants can each have a trial '1', and the same participant can also have multiple trials numbered '1'. (Don't mind that the values of the var columns are the same in every row; in the real dataset that's not the case.) The real dataset has 10000 rows.

What I need to do is keep the top n quantile trials of each unique participant based on the value of var1. I was thinking of using the pandas .quantile(n) function, but I don't know how to tell it to analyze each participant separately rather than computing the quantile over the whole dataset.

Any help is appreciated.

Rob Raymond
  • apply() on a groupby object is passed a DataFrame for each group
  • I have provided examples using a lambda as well as a standalone function
  • the sample data, as noted, has no variation in its values, so it cannot really demonstrate the function
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""       participant_id        group         trial       var1        var2       var3        var4        
         189                   A             1      -0.231046    0.245615  -0.581238   -0.593562  
         189                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             3      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             4      -0.231046    0.245615  -0.581238   -0.593562
         345                   B             1         NaN       0.245615  -0.581238   -0.593562
         345                   B            378     -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         432                   B            517     -0.231046    0.245615     NaN      -0.593562
         432                   B             2      -0.231046    0.245615     NaN      -0.593562
         432                   B            333     -0.231046    0.245615  -0.581238   -0.593562
"""), sep=r"\s+")

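def f(d):
    # d is the sub-DataFrame for one participant; numeric_only=True skips the
    # non-numeric "group" column, which can raise a TypeError on recent pandas versions
    return d.quantile(numeric_only=True)

# the same per-participant quantile (default q=0.5), via a named function and via a lambda
df.groupby("participant_id", as_index=False).apply(f)
df.groupby("participant_id", as_index=False).apply(lambda d: d.quantile(numeric_only=True))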
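The calls above return one row of quantile values per participant; they do not filter any trials. One possible way to turn that into a filter (a minimal sketch, not part of the original answer) is to broadcast each participant's own var1 quantile back onto their rows with transform and compare row by row. The helper name keep_top_quantile, its parameters, and the small demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def keep_top_quantile(df, q=0.75, by="participant_id", col="var1"):
    """Keep rows whose `col` is at or above that group's own q-th quantile.

    Hypothetical helper for illustration; rows where `col` is NaN are dropped
    because the comparison with the threshold evaluates to False.
    """
    # transform returns a Series aligned with df: each row gets its group's threshold
    threshold = df.groupby(by)[col].transform(lambda s: s.quantile(q))
    return df[df[col] >= threshold]


# Small made-up frame: two participants with very different var1 scales
demo = pd.DataFrame({
    "participant_id": [189, 189, 189, 189, 345, 345, 345, 345],
    "trial": [1, 2, 3, 4, 1, 2, 3, 4],
    "var1": [0.1, 0.2, 0.3, 0.4, 10.0, 20.0, np.nan, 40.0],
})

top = keep_top_quantile(demo, q=0.5)
print(top)
```

Because the threshold is computed per participant, the trials of participant 189 are not crowded out by the much larger values of participant 345, which is what would happen with a single quantile over the whole column.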

updated

  • the question was how to use pandas' quantile capabilities within groups, which the first part answers
  • this part specifically shows how to use that approach to filter to rows within quantiles
  • I have generated a random dataset of the same shape for demonstration purposes
import numpy as np
import pandas as pd

# generate a useful dataset to analyse...
s = 100
df = pd.DataFrame({"participant_id":np.random.choice([189,227,345,432],s),
             "trial":np.random.randint(1,6,s),
             "group":np.random.choice(["A","B"],s),
              "var1":np.random.rand(s),
              "var2":np.random.rand(s),
              "var3":np.random.rand(s),
              "var4":np.random.rand(s),
             })

# split var1 into 10 quantile bins per participant and keep bin 0 (the lowest decile);
# compare against bin 9 instead, e.g. .ge(9), to keep the top decile
df.groupby("participant_id").apply(
    lambda d: d.loc[pd.qcut(d.var1, q=10, retbins=False, labels=False).le(0)]).droplevel(0)


    participant_id  trial  group        var1       var2       var3       var4
13             189      3      A   0.0273875    0.87134   0.555792    0.67094
85             189      1      A   0.0106758   0.352578   0.481009   0.910989
91             189      4      A  0.00835706   0.644102   0.990459   0.816669
4              227      5      B     0.10132   0.870446   0.389972   0.313782
71             227      4      B   0.0221867    0.17566   0.659024   0.910838
74             227      1      A   0.0762526   0.458995   0.492384   0.556408
28             345      5      A    0.130674  0.0336628  0.0429884   0.799307
82             345      1      B    0.234522   0.371108   0.451911    0.54528
93             345      4      B    0.272915   0.594262   0.392285    0.56374
16             432      2      B   0.0403964   0.880132    0.45438  0.0466626
18             432      5      A   0.0884496   0.304541   0.969059   0.949315
22             432      2      B    0.115796  0.0371306   0.631284   0.537881
55             432      2      B    0.102859  0.0416843   0.761466    0.56438
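The output above is the lowest decile of var1 for each participant. Since the question asks for the top quantile, here is a hedged variant of the same qcut idea that keeps the highest bin instead. It computes the bin index with transform rather than apply, so the original index and columns are preserved without droplevel. The seed, the reduced column set and the names n_bins, bins and top are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up random data in the spirit of the answer's demo (seed chosen arbitrarily)
rng = np.random.default_rng(0)
s = 100
df = pd.DataFrame({
    "participant_id": rng.choice([189, 227, 345, 432], s),
    "var1": rng.random(s),
})

n_bins = 10

# Bin index of each row within its own participant:
# 0 is the lowest decile, n_bins - 1 the highest
bins = df.groupby("participant_id")["var1"].transform(
    lambda v: pd.qcut(v, q=n_bins, labels=False)
)

# Keep only the rows that fall in the top bin of their participant
top = df[bins == n_bins - 1]
print(top.sort_values(["participant_id", "var1"]))
```

On real data with many tied var1 values, qcut can raise a ValueError about non-unique bin edges. Passing duplicates="drop" avoids the error, but the number of bins may then be smaller than n_bins, so the comparison against n_bins - 1 would need adjusting.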
