Statistics of one hot encoded columns in pandas dataframe

Ross Published at Dev

Ross

I have a pandas dataframe with a column titled "label". It also has three columns titled featureA_1, featureA_2 and featureA_3, which hold the one-hot encoded values of featureA (which can take three unique values). Similarly, it has two columns titled featureB_1 and featureB_2, which hold the one-hot encoded values of featureB (which can take two distinct values).

The following is an example of such a dataframe, which can be generated with:

import pandas as pd
dictt = {
    "label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
    "featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
    "featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
    "featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
    "featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
}

df1 = pd.DataFrame(dictt)

Because of the one-hot encoding, each row in the above dataframe has the value 1 in exactly one of the columns featureA_1, featureA_2, featureA_3 and 0 in the others. Similarly, each row has the value 1 in exactly one of featureB_1 and featureB_2 and 0 in the other.
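A quick way to check that property is to sum each feature's columns row by row and confirm every sum equals 1. This is only a sketch: the helper name `is_one_hot` is made up for illustration, and it assumes the columns of a feature all share the prefix `<feature>_`.

```python
import pandas as pd

df1 = pd.DataFrame({
    "label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
    "featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
    "featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
    "featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
    "featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
})


def is_one_hot(df, prefix):
    """Hypothetical helper: True if the '<prefix>_*' columns contain
    only 0/1 values and every row sums to exactly 1 across them."""
    cols = [c for c in df.columns if c.startswith(prefix + "_")]
    block = df[cols]
    only_binary = block.isin([0, 1]).all().all()
    one_per_row = (block.sum(axis=1) == 1).all()
    return bool(only_binary and one_per_row)


for prefix in ("featureA", "featureB"):
    print(prefix, is_one_hot(df1, prefix))
```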

I want to create a dataframe that holds, for each label, the percentage of entries with each featureA value (featureA_1, featureA_2, featureA_3) and the percentage of entries with each featureB value (featureB_1, featureB_2).

I also want the standard deviation of those percentages across the featureA value types and across the featureB value types.

The following is an example of the dataframe that I would like to have:

What is the most efficient way of doing this? In my actual work, I will have dataframes with millions of rows.

jezrael

Use:

# aggregate the mean per label; the columns hold only 0/1, so the mean is the share of 1s
df = df1.groupby('label').mean().add_suffix('_perc').round(2)

# aggregate std with ddof=0 (population std), because the pandas default is ddof=1
df2 = df.groupby(lambda x: x.split('_')[0], axis=1).std(ddof=0).add_suffix('_std').round(2)

# join together
df = pd.concat([df, df2], axis=1).sort_index(axis=1).reset_index()
print(df)
  label  featureA_1_perc  featureA_2_perc  featureA_3_perc  featureA_std  \
0   cat             0.60              0.2             0.20          0.19   
1   dog             0.67              0.0             0.33          0.27   

   featureB_1_perc  featureB_2_perc  featureB_std  
0             0.40             0.60          0.10  
1             0.67             0.33          0.17
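One caveat, as far as I know: `groupby(..., axis=1)` is deprecated in recent pandas releases (2.1 onward), so the snippet above may warn or fail there. A possible variant, sketched below, groups the transposed frame instead. It also rounds only at the very end, so the standard deviation is computed from the unrounded percentages; that ordering is my own choice rather than part of the answer above. The names `perc`, `std` and `result` are just illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
    "featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
    "featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
    "featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
    "featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
})

# Share of rows per label in which each one-hot column is 1 (left unrounded).
perc = df1.groupby("label").mean().add_suffix("_perc")

# Population std (ddof=0) across the columns of each feature.
# Transposing turns the column names into the index, so a plain row-wise
# groupby on the prefix before the first underscore replaces axis=1.
std = (
    perc.T.groupby(lambda name: name.split("_")[0])
    .std(ddof=0)
    .T.add_suffix("_std")
)

# Join, order the columns alphabetically, and round once at the end.
result = pd.concat([perc, std], axis=1).sort_index(axis=1).round(2).reset_index()
print(result)
```

On the sample data this should give the same rounded figures as the output above. For millions of rows the expensive step is the single `groupby('label').mean()`; the transpose only touches the small aggregated frame (one row per label).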


Edited at 2021-02-22
