Statistics of one hot encoded columns in pandas dataframe

Ross

I have a pandas dataframe with a column titled "label". It also has three columns titled featureA_1, featureA_2 and featureA_3, which hold the one-hot encoded values of featureA (which can take three unique values). Similarly, it has two columns titled featureB_1 and featureB_2, which hold the one-hot encoded values of featureB (which can take two distinct values).

An example of such a dataframe can be generated using the following:

import pandas as pd
dictt = {
    "label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
    "featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
    "featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
    "featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
    "featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
}

df1 = pd.DataFrame(dictt)

Because of the one-hot encoding, each row in the above dataframe has the value 1 in exactly one of featureA_1, featureA_2, featureA_3 and 0 in the others. Similarly, each row has the value 1 in exactly one of featureB_1 and featureB_2 and 0 in the other.
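The property described above can be checked before computing any statistics. The following is a minimal sketch, assuming the column naming convention from the example (a feature name, an underscore, then a value index); the helper name is_one_hot is hypothetical and not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({
    "label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
    "featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
    "featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
    "featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
    "featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
})


def is_one_hot(frame, prefix):
    """Hypothetical helper: True if every row has exactly one 1
    (and otherwise only 0s) among the columns named '<prefix>_*'."""
    cols = [c for c in frame.columns if c.startswith(prefix + "_")]
    block = frame[cols]
    only_binary = block.isin([0, 1]).all().all()
    exactly_one = (block.sum(axis=1) == 1).all()
    return bool(only_binary and exactly_one)


featureA_ok = is_one_hot(df1, "featureA")
featureB_ok = is_one_hot(df1, "featureB")
```

If either check fails, the per-label means no longer sum to 1 within a feature, so they cannot be read as shares of that feature's values.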

I want to create a dataframe where I will have the percentage of entries in each label with feature values featureA_1, featureA_2, featureA_3 and percentage of entries in each label with feature values featureB_1 and featureB_2.

I also want to have the standard deviations of those percentages of featureA value types and featureB value types.

Following is an example of the dataframe that I desire to have:

[image: desired output dataframe]

What is the most efficient way of doing this? In my actual work, I will have dataframes with millions of rows.

jezrael

Use:

# mean of the 0/1 columns per label gives the fraction of rows equal to 1
df = df1.groupby('label').mean().add_suffix('_perc').round(2)

# std across each feature's columns with ddof=0 (pandas defaults to ddof=1);
# transpose and group on the index, because groupby(axis=1) is deprecated in pandas 2.1+
df2 = (df.T.groupby(lambda x: x.split('_')[0])
           .std(ddof=0).T
           .add_suffix('_std').round(2))

# join together
df = pd.concat([df, df2], axis=1).sort_index(axis=1).reset_index()
print(df)
  label  featureA_1_perc  featureA_2_perc  featureA_3_perc  featureA_std  \
0   cat             0.60              0.2             0.20          0.19   
1   dog             0.67              0.0             0.33          0.27   

   featureB_1_perc  featureB_2_perc  featureB_std  
0             0.40             0.60          0.10  
1             0.67             0.33          0.17  
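As a sanity check on the featureA_std value for the cat label, the arithmetic can be reproduced by hand. This is a small sketch using NumPy directly on the three rounded fractions from the output above; the variable names are illustrative only. It also shows how the result would differ under the pandas default of ddof=1.

```python
import numpy as np

# Rounded featureA fractions for label "cat", taken from the output above
cat_featureA = np.array([0.60, 0.20, 0.20])

# Population standard deviation (ddof=0): sqrt of the mean squared deviation
mean = cat_featureA.mean()
pop_std = np.sqrt(((cat_featureA - mean) ** 2).mean())    # about 0.19

# Sample standard deviation (ddof=1), which pandas would use by default
sample_std = cat_featureA.std(ddof=1)                     # about 0.23
```

The ddof=0 result matches the 0.19 shown in the table, which is why the answer passes ddof=0 explicitly.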
