Sample input:
    ID  Timestamp            Event
1   1   2010-05-31 13:43:45  a
2   1   2010-05-31 13:44:25  a
3   1   2010-05-31 13:44:55  a
4   1   2010-05-31 13:45:35  b
5   1   2010-05-31 13:47:05  b
6   1   2010-05-31 13:47:45  a
7   1   2010-05-31 13:49:45  c
8   1   2010-05-31 13:50:45  b
9   1   2010-06-01 09:20:00  c
10  1   2010-06-01 09:22:00  c
11  1   2010-06-01 09:23:20  c
12  1   2010-06-01 09:24:00  a
13  1   2010-06-01 09:24:15  a
14  1   2010-06-01 09:25:00  b
15  1   2010-06-01 09:25:42  b
16  2   2010-05-31 11:25:38  a
17  2   2010-05-31 11:26:00  c
18  2   2010-05-31 11:26:45  a
19  2   2010-05-31 11:27:10  a
20  2   2010-05-31 11:27:37  a
21  2   2010-05-31 11:29:35  b
22  2   2010-05-31 11:30:25  b
..  ..  ...                  ..
Here is pseudocode for what I'm currently doing (looping through the dataframe with itertuples), but it's not working properly:
for row in df.itertuples():
    if the ID matches the previous row's ID:            # e.g. 1 == 1
        if the event matches the previous event:        # e.g. a == a
            calculate the time duration
            if duration < 3600:
                add it cumulatively to a dictionary keyed by event ('a', 'b', etc.)
                store this events dictionary for the ID
            else:
                increment the session count accordingly
        else:  # the event changed
            store the timestamp of the previous event in a second dictionary,
            to be used at the next occurrence of that event;
            at the same time, compute the current event's duration w.r.t.
            the timestamp stored at the previous event change,
            and again, if duration < 3600:
                add it cumulatively to the events dictionary
                store this events dictionary for the ID
            else:
                increment the session count accordingly
    else:  # a new ID begins
        reset the events dictionary to zero values
        reset the event-change dictionary to zero values
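For concreteness, here is a minimal runnable sketch of that loop. The attribution rule in my pseudocode is ambiguous, so this version simply charges each gap between consecutive rows of an ID to the later row's event, and treats any gap of 3600 s or more as the start of a new session (both are simplifying assumptions):

    import pandas as pd
    from collections import defaultdict

    # A few rows of the sample data, for illustration.
    df = pd.DataFrame({
        'ID': [1, 1, 1, 1],
        'Timestamp': ['2010-05-31 13:43:45', '2010-05-31 13:44:25',
                      '2010-05-31 13:45:35', '2010-06-01 09:20:00'],
        'Event': ['a', 'a', 'b', 'c'],
    })
    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    df = df.sort_values(['ID', 'Timestamp'])

    results = {}                    # ID -> {'<event>_time': seconds, 'Session': count}
    prev_id = prev_time = None

    for row in df.itertuples():
        if row.ID != prev_id:                   # new ID: reset state, open session 1
            results[row.ID] = defaultdict(float)
            results[row.ID]['Session'] = 1
            prev_id, prev_time = row.ID, row.Timestamp
            continue
        gap = (row.Timestamp - prev_time).total_seconds()
        if gap < 3600:                          # same session: charge gap to this event
            results[row.ID][row.Event + '_time'] += gap
        else:                                   # long gap: new session, no time charged
            results[row.ID]['Session'] += 1
        prev_time = row.Timestamp

    out = pd.DataFrame(results).T.fillna(0)     # one row per ID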
This is a sample of the output I want (the numbers are illustrative):
    ID  a_time  b_time  c_time  Session
1   1   120     200     100     2
2   2   235     340     145     3
..  ..  ...     ...     ...     ...
There are actually 7 unique events in all; I've only mentioned 3 to keep it less complicated.
Basically, a_time is the sum of the durations of all occurrences of event 'a' for the current ID, where an occurrence only counts if its duration is not greater than 3600 s; otherwise the session count increases instead (by duration/3600, or by just 1 if the date changes as well). b_time, c_time, and so on are defined the same way.
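For example, for ID 1 the first three rows are all 'a' (13:43:45, 13:44:25, 13:44:55), so the gaps of 40 s and 30 s both count toward a_time; the overnight gap between rows 8 and 9 (2010-05-31 13:50:45 to 2010-06-01 09:20:00) is far more than 3600 s and crosses a date boundary, so it increments Session instead of adding to any *_time.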
What I'm hoping for: is there a more Pythonic, or a vectorized, way to do this? Something like

df.groupby(['ID'])['Timestamp'].diff()
I tried many variations of this, but it didn't work: I kept getting the error "no aggregation function available for pd.groupby".
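For the record, diff is a transformation rather than an aggregation, so it does work per group on a sorted datetime column. Here is a hedged vectorized sketch (reusing the df from the sketch above, under the same simplifying assumptions: each gap charged to the later row's event, sessions split at gaps >= 3600 s):

    # Seconds since the previous row of the same ID; NaN on each ID's first row.
    df['gap'] = df.groupby('ID')['Timestamp'].diff().dt.total_seconds()

    in_session = df['gap'] < 3600     # NaN compares as False, so first rows drop out

    # Per-event time: sum the in-session gaps, one column per event.
    times = (df[in_session]
             .pivot_table(index='ID', columns='Event', values='gap', aggfunc='sum')
             .fillna(0)
             .add_suffix('_time'))

    # Sessions per ID: 1 + the number of gaps that exceeded the threshold.
    sessions = (df['gap'] >= 3600).groupby(df['ID']).sum() + 1

    out = times.join(sessions.rename('Session'))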
If I didn't need the time logs, I'd just use

df.groupby(['ID'])['Event'].value_counts().unstack().fillna(0)

to get the count of each unique event for each ID.
The output I actually get from the pseudocode approach is just the first row repeated (the numbers below are placeholders, not calculated values), and I can only see three of the events:

    ID  a_time  b_time  c_time
1   1   120     200     100
As mentioned above, using groupby with ['Timestamp'].diff() gives me the error "no aggregation function available for pd.groupby"; I also tried sort_values and apply, but always got the same error.
I solved it myself. I figured out that I could just use:

df.at[index, col] = df.at[index, col] + duration

Basically, since a DataFrame can be visualized as a matrix, I can still access it cell by cell, which is how I solved it.
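Roughly, the idea is to preallocate a results frame and add each duration cell by cell; the setup below is only an illustration of that pattern (res and the column naming are not the exact code):

    # Hypothetical results frame: one row per ID, one zeroed column per event.
    res = pd.DataFrame(0.0,
                       index=df['ID'].unique(),
                       columns=[e + '_time' for e in df['Event'].unique()])

    # Inside the row loop, each computed duration is then added cell by cell:
    #     res.at[row.ID, row.Event + '_time'] += duration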