I have a Pandas dataframe with a column titled "label". It has three columns titled featureA_1, featureA_2, featureA_3, which hold the one hot encoded values of featureA (which can have three unique values). Similarly, it has two columns titled featureB_1 and featureB_2, which hold the one hot encoded values of featureB (which can have two distinct values).
An example of such a dataframe can be generated using the following:
import pandas as pd
dictt = {
"label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
"featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
"featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
"featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
"featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
"featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
}
df1 = pd.DataFrame(dictt)
Because of one hot encoding, each row in the above dataframe has the value 1 in exactly one of the columns featureA_1, featureA_2, featureA_3 and 0 in the others. Similarly, each row has the value 1 in exactly one of the columns featureB_1 and featureB_2 and 0 in the other.
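The one hot property described above can be checked before aggregating. The following is a minimal sketch, assuming the column prefixes are exactly featureA and featureB; the helper name one_hot_row_sums is hypothetical and not part of pandas:

```python
import pandas as pd

# Same example data as in the question.
df1 = pd.DataFrame({
    "label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
    "featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
    "featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
    "featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
    "featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
})


def one_hot_row_sums(frame, prefixes=("featureA", "featureB")):
    """Hypothetical helper: per-row sum of the columns belonging to each prefix."""
    return pd.DataFrame(
        # filter(regex=...) keeps the columns whose name starts with "<prefix>_".
        {p: frame.filter(regex=f"^{p}_").sum(axis=1) for p in prefixes}
    )


row_sums = one_hot_row_sums(df1)
# For valid one hot data, every row sums to exactly 1 within each feature.
is_valid = bool((row_sums == 1).all().all())
print(is_valid)
```

If is_valid is False, the per-label means of a feature no longer sum to 1 and cannot be read as shares of a single categorical value.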
I want to create a dataframe that holds, for each label, the percentage of entries with each of the feature values featureA_1, featureA_2, featureA_3 and the percentage of entries with each of the feature values featureB_1 and featureB_2.
I also want the standard deviation of those percentages for each feature, i.e. one value across the featureA percentages and one across the featureB percentages.
The desired dataframe has one row per label, with the percentage columns and the standard deviation columns side by side.
What is the most efficient way of doing this? In my actual work, I will have dataframes with millions of rows.
Use:
#aggregate mean for percentages of 1, because only 0, 1 values
df = df1.groupby('label').mean().add_suffix('_perc').round(2)
#aggregate std with ddof=0, because default pandas ddof=1
#transpose before grouping, because groupby(axis=1) is deprecated since pandas 2.1
df2 = df.T.groupby(lambda x: x.split('_')[0]).std(ddof=0).T.add_suffix('_std').round(2)
#join together
df = pd.concat([df, df2],axis=1).sort_index(axis=1).reset_index()
print (df)
  label  featureA_1_perc  featureA_2_perc  featureA_3_perc  featureA_std  \
0   cat             0.60              0.2             0.20          0.19
1   dog             0.67              0.0             0.33          0.27

   featureB_1_perc  featureB_2_perc  featureB_std
0             0.40             0.60          0.10
1             0.67             0.33          0.17
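As a variation on the above, the same steps can be wrapped in a function that computes the standard deviations from the unrounded means and rounds only once at the end. This is a sketch rather than a benchmarked solution: the function name label_percentages and its parameters are made up for illustration, and it assumes every column other than label is a 0/1 column whose feature name is the part before the first underscore:

```python
import pandas as pd

# Same example data as in the question.
df1 = pd.DataFrame({
    "label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
    "featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
    "featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
    "featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
    "featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
})


def label_percentages(frame, label_col="label", decimals=2):
    """Hypothetical wrapper: per-label shares plus their per-feature population std."""
    # The mean of a 0/1 column within a label is the share of rows equal to 1.
    perc = frame.groupby(label_col).mean()
    # Group the (transposed) columns by feature prefix and take the population
    # std (ddof=0) of the unrounded shares; transposing avoids groupby(axis=1).
    std = perc.T.groupby(lambda name: name.split("_")[0]).std(ddof=0).T
    out = pd.concat([perc.add_suffix("_perc"), std.add_suffix("_std")], axis=1)
    # Round once at the end so the std is not computed from rounded shares.
    out = out.sort_index(axis=1).round(decimals)
    return out.rename_axis(label_col).reset_index()


result = label_percentages(df1)
print(result)
```

On the example data this gives the same rounded values as the output above. Both steps are plain groupby aggregations, so no Python-level loop over rows is involved.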