Sample input:
    ID  Timestamp            Event
1   1   2010-05-31 13:43:45  a
2   1   2010-05-31 13:44:25  a
3   1   2010-05-31 13:44:55  a
4   1   2010-05-31 13:45:35  b
5   1   2010-05-31 13:47:05  b
6   1   2010-05-31 13:47:45  a
7   1   2010-05-31 13:49:45  c
8   1   2010-05-31 13:50:45  b
9   1   2010-06-01 09:20:00  c
10  1   2010-06-01 09:22:00  c
11  1   2010-06-01 09:23:20  c
12  1   2010-06-01 09:24:00  a
13  1   2010-06-01 09:24:15  a
14  1   2010-06-01 09:25:00  b
15  1   2010-06-01 09:25:42  b
16  2   2010-05-31 11:25:38  a
17  2   2010-05-31 11:26:00  c
18  2   2010-05-31 11:26:45  a
19  2   2010-05-31 11:27:10  a
20  2   2010-05-31 11:27:37  a
21  2   2010-05-31 11:29:35  b
22  2   2010-05-31 11:30:25  b
..  ..  ...                  ..
Here is pseudocode for what I'm currently doing (looping through the dataframe with itertuples), but it's not working properly:
for row in df.itertuples():
    if the ID matches the previous row's ID:            # e.g. 1 == 1
        if the event matches the previous event:        # e.g. a == a
            calculate the time duration
            if duration < 3600:
                add it cumulatively to a dictionary keyed by event ('a', 'b', etc.)
                store this events dictionary for the ID
            else:
                increment the session count accordingly
        else:  # the event changed
            store the timestamp of the previous event in a second dictionary,
            to be used at the next occurrence of that event;
            at the same time, compute the current event's duration w.r.t.
            the timestamp stored at the previous event change,
            and again, if duration < 3600:
                add it cumulatively to the events dictionary
                store this events dictionary for the ID
            else:
                increment the session count accordingly
    else:  # a new ID begins
        reset the events dictionary to zero values
        reset the event-change dictionary to zero values
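For concreteness, here is a minimal runnable sketch of that loop. The attribution rule in my pseudocode is ambiguous, so this version simply charges each gap between consecutive rows of an ID to the later row's event, and treats any gap of 3600 s or more as the start of a new session (both are simplifying assumptions):

    import pandas as pd
    from collections import defaultdict

    # A few rows of the sample data, for illustration.
    df = pd.DataFrame({
        'ID': [1, 1, 1, 1],
        'Timestamp': ['2010-05-31 13:43:45', '2010-05-31 13:44:25',
                      '2010-05-31 13:45:35', '2010-06-01 09:20:00'],
        'Event': ['a', 'a', 'b', 'c'],
    })
    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    df = df.sort_values(['ID', 'Timestamp'])

    results = {}                    # ID -> {'<event>_time': seconds, 'Session': count}
    prev_id = prev_time = None

    for row in df.itertuples():
        if row.ID != prev_id:                   # new ID: reset state, open session 1
            results[row.ID] = defaultdict(float)
            results[row.ID]['Session'] = 1
            prev_id, prev_time = row.ID, row.Timestamp
            continue
        gap = (row.Timestamp - prev_time).total_seconds()
        if gap < 3600:                          # same session: charge gap to this event
            results[row.ID][row.Event + '_time'] += gap
        else:                                   # long gap: new session, no time charged
            results[row.ID]['Session'] += 1
        prev_time = row.Timestamp

    out = pd.DataFrame(results).T.fillna(0)     # one row per ID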
This is a sample of the output I want (the numbers are illustrative):
    ID  a_time  b_time  c_time  Session
1   1   120     200     100     2
2   2   235     340     145     3
..  ..  ...     ...     ...     ...
There are actually 7 unique events in all; I've only mentioned 3 to keep it less complicated.
Basically, a_time is the sum of the durations of all occurrences of event 'a' for the current ID, where an occurrence only counts if its duration is not greater than 3600 s; otherwise the session count increases instead (by duration/3600, or by just 1 if the date changes as well). b_time, c_time, and so on are defined the same way.
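For example, for ID 1 the first three rows are all 'a' (13:43:45, 13:44:25, 13:44:55), so the gaps of 40 s and 30 s both count toward a_time; the overnight gap between rows 8 and 9 (2010-05-31 13:50:45 to 2010-06-01 09:20:00) is far more than 3600 s and crosses a date boundary, so it increments Session instead of adding to any *_time.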
What I'm hoping for: is there a more Pythonic, or a vectorized, way to do this? Something like

df.groupby(['ID'])['Timestamp'].diff()
I tried many variations of this, but it didn't work: I kept getting the error "no aggregation function available for pd.groupby".
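For the record, diff is a transformation rather than an aggregation, so it does work per group on a sorted datetime column. Here is a hedged vectorized sketch (reusing the df from the sketch above, under the same simplifying assumptions: each gap charged to the later row's event, sessions split at gaps >= 3600 s):

    # Seconds since the previous row of the same ID; NaN on each ID's first row.
    df['gap'] = df.groupby('ID')['Timestamp'].diff().dt.total_seconds()

    in_session = df['gap'] < 3600     # NaN compares as False, so first rows drop out

    # Per-event time: sum the in-session gaps, one column per event.
    times = (df[in_session]
             .pivot_table(index='ID', columns='Event', values='gap', aggfunc='sum')
             .fillna(0)
             .add_suffix('_time'))

    # Sessions per ID: 1 + the number of gaps that exceeded the threshold.
    sessions = (df['gap'] >= 3600).groupby(df['ID']).sum() + 1

    out = times.join(sessions.rename('Session'))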
If I didn't need the time logs, I'd just use

df.groupby(['ID'])['Event'].value_counts().unstack().fillna(0)

to get the count of each unique event for each ID.
The output I actually get from the pseudocode approach is just the first row repeated (the numbers below are placeholders, not calculated values), and I can only see three of the events:

    ID  a_time  b_time  c_time
1   1   120     200     100
As mentioned above, using groupby with ['Timestamp'].diff() gives me the error "no aggregation function available for pd.groupby"; I also tried sort_values and apply, but always got the same error.
I solved it myself. I figured out that I could just use:

df.at[index, col] = df.at[index, col] + duration

Basically, since a DataFrame can be visualized as a matrix, I can still access it cell by cell, which is how I solved it.
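Roughly, the idea is to preallocate a results frame and add each duration cell by cell; the setup below is only an illustration of that pattern (res and the column naming are not the exact code):

    # Hypothetical results frame: one row per ID, one zeroed column per event.
    res = pd.DataFrame(0.0,
                       index=df['ID'].unique(),
                       columns=[e + '_time' for e in df['Event'].unique()])

    # Inside the row loop, each computed duration is then added cell by cell:
    #     res.at[row.ID, row.Event + '_time'] += duration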