Pivoting a pandas dataframe with duplicate index values

metersk Published at Dev

metersk

I have a data frame which has rows for each user joining my site and making a purchase.

+---+-----+--------------------+---------+--------+-----+
|   | uid |        msg         |  _time  | gender | age |
+---+-----+--------------------+---------+--------+-----+
| 0 |   1 | confirmed_settings | 1/29/15 | M      |  37 |
| 1 |   1 | sale               | 4/13/15 | M      |  37 |
| 2 |   3 | confirmed_settings | 4/19/15 | M      |  35 |
| 3 |   4 | confirmed_settings | 2/21/15 | M      |  21 |
| 4 |   5 | confirmed_settings | 3/28/15 | M      |  18 |
| 5 |   4 | sale               | 3/15/15 | M      |  21 |
+---+-----+--------------------+---------+--------+-----+

I would like to change the dataframe so that each row is unique for a uid and there is a columns called sale and confirmed_settings which have the timestamp of the action. Note that not every user has a sale, but every user has a confirmed_settings. Like below:

+---+-----+--------------------+---------+---------+--------+-----+
|   | uid | confirmed_settings |  sale   |  _time  | gender | age |
+---+-----+--------------------+---------+---------+--------+-----+
| 0 |   1 | 1/29/15            | 4/13/15 | 1/29/15 | M      |  37 |
| 1 |   3 | 4/19/15            | null    | 4/19/15 | M      |  35 |
| 2 |   4 | 2/21/15            | 3/15/15 | 2/21/15 | M      |  21 |
| 3 |   5 | 3/28/15            | null    | 3/28/15 | M      |  18 |
+---+-----+--------------------+---------+---------+--------+-----+

To do this, I am trying:

df1 = df.pivot(index='uid', columns='msg', values='_time').reset_index()
df1 = df1.merge(df[['uid', 'gender', 'age']].drop_duplicates(), on='uid')

But I get this error: ValueError: Index contains duplicate entries, cannot reshape

How can I pivot a df with duplicate index values to transform my dataframe?

Edit: df1 = df.pivot_table(index='uid', columns='msg', values='_time').reset_index()

gives this error DataError: No numeric types to aggregate but im not even sure that is the right path to go on.

fixxxer

x is the data frame that you have as input :

    uid               msg   _time   gender  age
0   1   confirmed_settings  1/29/15 M       37
1   1   sale                4/13/15 M       37
2   3   confirmed_settings  4/19/15 M       35
3   4   confirmed_settings  2/21/15 M       21
4   5   confirmed_settings  3/28/15 M       18
5   4   sale                3/15/15 M       21

y = x.pivot(index='uid', columns='msg', values='_time')
x.join(y).drop('msg', axis=1)

gives you:

    uid _time   gender  age     confirmed_settings  sale
0   1   1/29/15     M   37                    NaN   NaN
1   1   4/13/15     M   37                1/29/15   4/13/15
2   3   4/19/15     M   35                    NaN   NaN
3   4   2/21/15     M   21                4/19/15   NaN
4   5   3/28/15     M   18                2/21/15   3/15/15
5   4   3/15/15     M   21                3/28/15   NaN

Collected from the Internet

Please contact [email protected] to delete if infringement.