I'm trying to solve a pandas problem that consists of summing up the "duration" column, but only across rows of consecutive events for each ("id", "user_num") pair. I've tried several approaches using groupby or a boolean column with cumsum, but it always accumulates values across non-consecutive events (rows).
I have a data sample:
id user_num user_key start_time end_time type duration
1185ba 413 221528661 2020-12-13 15:15:00.000 2020-12-13 15:34:09.000 L 1149000.0
1185ba 403 22251339 2020-12-13 15:34:09.000 2020-12-13 15:49:00.000 L 891000.0
1185ba 403 22421785 2020-12-13 15:49:00.000 2020-12-13 16:10:00.000 L 1260000.0
6666uj 403 22621565 2020-12-13 16:10:00.000 2020-12-13 16:37:00.000 L 1620000.0
1185ba 490 221493747 2020-12-13 16:37:05.000 2020-12-13 16:49:00.000 L 715000.0
1185ba 490 221063609 2020-12-13 16:49:00.000 2020-12-13 16:50:45.000 L 105000.0
1185ba 413 221528653 2020-12-13 16:50:45.000 2020-12-13 16:51:06.463 L 21463.0
1185ba 11 12470010 2020-12-13 16:51:06.463 2020-12-13 16:55:00.000 NL 233537.0
1185ba 413 221528654 2020-12-13 16:55:00.000 2020-12-13 17:00:00.000 L 300000.0
1185ba 413 562281418 2020-12-13 17:00:00.000 2020-12-13 17:25:00.000 L 1500000.0
1185ba 413 221528638 2020-12-13 17:25:00.000 2020-12-13 17:31:30.013 L 390013.0
This example shows what I want to get:
id user_num user_key start_time end_time type duration (sum)
1185ba 413 221528661 2020-12-13 15:15:00.000 2020-12-13 15:34:09.000 L 1149000.0
1185ba 403 22251339 2020-12-13 15:34:09.000 2020-12-13 15:49:00.000 L 891000.0
1185ba 403 22421785 2020-12-13 15:49:00.000 2020-12-13 16:10:00.000 L 2151000.0
6666uj 403 22621565 2020-12-13 16:10:00.000 2020-12-13 16:37:00.000 L 1620000.0
1185ba 490 221493747 2020-12-13 16:37:05.000 2020-12-13 16:49:00.000 L 715000.0
1185ba 490 221063609 2020-12-13 16:49:00.000 2020-12-13 16:50:45.000 L 820000.0
1185ba 413 221528653 2020-12-13 16:50:45.000 2020-12-13 16:51:06.463 L 21463.0
1185ba 11 12470010 2020-12-13 16:51:06.463 2020-12-13 16:55:00.000 NL 233537.0
1185ba 413 221528654 2020-12-13 16:55:00.000 2020-12-13 17:00:00.000 L 300000.0
1185ba 413 562281418 2020-12-13 17:00:00.000 2020-12-13 17:25:00.000 L 1800000.0
1185ba 413 221528638 2020-12-13 17:25:00.000 2020-12-13 17:31:30.013 L 2190013.0
How can I get a table like this? I've had a hard time even searching for this. For example, user_num 413 with id 1185ba starts at 15:15, but some other events occur in between, until we reach the row
1185ba 413 221528653 2020-12-13 16:50:45.000 2020-12-13 16:51:06.463 L 21463.0
and the running sum restarts. The key point is the (id, user_num) pair: as long as a pair's events are consecutive, their durations must be summed; once the streak is broken, the running sum for that pair resets.
The main idea is to identify runs of consecutive events after grouping by id and user_num, and then apply cumsum within each run.
首先我们加载数据
import pandas as pd
from io import StringIO
data = StringIO(
"""
id user_num user_key start_time end_time type duration
1185ba 413 221528661 2020-12-13 15:15:00.000 2020-12-13 15:34:09.000 L 1149000.0
1185ba 403 22251339 2020-12-13 15:34:09.000 2020-12-13 15:49:00.000 L 891000.0
1185ba 403 22421785 2020-12-13 15:49:00.000 2020-12-13 16:10:00.000 L 1260000.0
6666uj 403 22621565 2020-12-13 16:10:00.000 2020-12-13 16:37:00.000 L 1620000.0
1185ba 490 221493747 2020-12-13 16:37:05.000 2020-12-13 16:49:00.000 L 715000.0
1185ba 490 221063609 2020-12-13 16:49:00.000 2020-12-13 16:50:45.000 L 105000.0
1185ba 413 221528653 2020-12-13 16:50:45.000 2020-12-13 16:51:06.463 L 21463.0
1185ba 11 12470010 2020-12-13 16:51:06.463 2020-12-13 16:55:00.000 NL 233537.0
1185ba 413 221528654 2020-12-13 16:55:00.000 2020-12-13 17:00:00.000 L 300000.0
1185ba 413 562281418 2020-12-13 17:00:00.000 2020-12-13 17:25:00.000 L 1500000.0
1185ba 413 221528638 2020-12-13 17:25:00.000 2020-12-13 17:31:30.013 L 390013.0
""")
df = pd.read_csv(data, sep=r'\s\s+', engine='python')  # two or more spaces delimit columns
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
Next, we add a column that flags gaps in the time sequence within each (id, user_num) group:
df2 = (df.groupby(['id','user_num'], sort=False, as_index = False)
.apply(lambda g:
g.assign(new_seq_start = (g['start_time'] != g['end_time'].shift()))
)
.reset_index(drop = True)
)
df2
This gives:
id user_num user_key start_time end_time type duration new_seq_start
-- ------ ---------- ---------- -------------------------- -------------------------- ------ ---------- ---------------
0 1185ba 413 221528661 2020-12-13 15:15:00 2020-12-13 15:34:09 L 1149000.0 True
1 1185ba 413 221528653 2020-12-13 16:50:45 2020-12-13 16:51:06.463000 L 21463.0 True
2 1185ba 413 221528654 2020-12-13 16:55:00 2020-12-13 17:00:00 L 300000.0 True
3 1185ba 413 562281418 2020-12-13 17:00:00 2020-12-13 17:25:00 L 1500000.0 False
4 1185ba 413 221528638 2020-12-13 17:25:00 2020-12-13 17:31:30.013000 L 390013.0 False
5 1185ba 403 22251339 2020-12-13 15:34:09 2020-12-13 15:49:00 L 891000.0 True
6 1185ba 403 22421785 2020-12-13 15:49:00 2020-12-13 16:10:00 L 1260000.0 False
7 6666uj 403 22621565 2020-12-13 16:10:00 2020-12-13 16:37:00 L 1620000.0 True
8 1185ba 490 221493747 2020-12-13 16:37:05 2020-12-13 16:49:00 L 715000.0 True
9 1185ba 490 221063609 2020-12-13 16:49:00 2020-12-13 16:50:45 L 105000.0 False
10 1185ba 11 12470010 2020-12-13 16:51:06.463000 2020-12-13 16:55:00 NL 233537.0 True
Then we use this new column, new_seq_start, to define a counter, group_idx, that increments whenever there is a gap:
df2['group_idx'] = df2.groupby(['id','user_num'])['new_seq_start'].cumsum()
df2
So we get:
id user_num user_key start_time end_time type duration new_seq_start group_idx
-- ------ ---------- ---------- -------------------------- -------------------------- ------ ---------- --------------- -----------
0 1185ba 413 221528661 2020-12-13 15:15:00 2020-12-13 15:34:09 L 1149000.0 True 1
1 1185ba 413 221528653 2020-12-13 16:50:45 2020-12-13 16:51:06.463000 L 21463.0 True 2
2 1185ba 413 221528654 2020-12-13 16:55:00 2020-12-13 17:00:00 L 300000.0 True 3
3 1185ba 413 562281418 2020-12-13 17:00:00 2020-12-13 17:25:00 L 1500000.0 False 3
4 1185ba 413 221528638 2020-12-13 17:25:00 2020-12-13 17:31:30.013000 L 390013.0 False 3
5 1185ba 403 22251339 2020-12-13 15:34:09 2020-12-13 15:49:00 L 891000.0 True 1
6 1185ba 403 22421785 2020-12-13 15:49:00 2020-12-13 16:10:00 L 1260000.0 False 1
7 6666uj 403 22621565 2020-12-13 16:10:00 2020-12-13 16:37:00 L 1620000.0 True 1
8 1185ba 490 221493747 2020-12-13 16:37:05 2020-12-13 16:49:00 L 715000.0 True 1
9 1185ba 490 221063609 2020-12-13 16:49:00 2020-12-13 16:50:45 L 105000.0 False 1
10 1185ba 11 12470010 2020-12-13 16:51:06.463000 2020-12-13 16:55:00 NL 233537.0 True 1
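This counter works because cumsum treats booleans as integers: every True (a new-run flag) bumps the counter by one, while False rows leave it flat. A minimal sketch, independent of the data above:

```python
import pandas as pd

# Each True marks the start of a new run; cumsum turns the flags
# into a run counter (True counts as 1, False as 0).
flags = pd.Series([True, True, True, False, False])
print(flags.cumsum().tolist())  # → [1, 2, 3, 3, 3]
```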
Now we can group by the original columns plus group_idx, apply cumsum, and sort back into the original order:
df2['duration (sum)'] = df2.groupby(['id','user_num','group_idx'])['duration'].cumsum()
df2.sort_values('start_time')
Done:
id user_num user_key start_time end_time type duration new_seq_start group_idx duration (sum)
-- ------ ---------- ---------- -------------------------- -------------------------- ------ ---------- --------------- ----------- ----------------
0 1185ba 413 221528661 2020-12-13 15:15:00 2020-12-13 15:34:09 L 1149000.0 True 1 1149000.0
5 1185ba 403 22251339 2020-12-13 15:34:09 2020-12-13 15:49:00 L 891000.0 True 1 891000.0
6 1185ba 403 22421785 2020-12-13 15:49:00 2020-12-13 16:10:00 L 1260000.0 False 1 2151000.0
7 6666uj 403 22621565 2020-12-13 16:10:00 2020-12-13 16:37:00 L 1620000.0 True 1 1620000.0
8 1185ba 490 221493747 2020-12-13 16:37:05 2020-12-13 16:49:00 L 715000.0 True 1 715000.0
9 1185ba 490 221063609 2020-12-13 16:49:00 2020-12-13 16:50:45 L 105000.0 False 1 820000.0
1 1185ba 413 221528653 2020-12-13 16:50:45 2020-12-13 16:51:06.463000 L 21463.0 True 2 21463.0
10 1185ba 11 12470010 2020-12-13 16:51:06.463000 2020-12-13 16:55:00 NL 233537.0 True 1 233537.0
2 1185ba 413 221528654 2020-12-13 16:55:00 2020-12-13 17:00:00 L 300000.0 True 3 300000.0
3 1185ba 413 562281418 2020-12-13 17:00:00 2020-12-13 17:25:00 L 1500000.0 False 3 1800000.0
4 1185ba 413 221528638 2020-12-13 17:25:00 2020-12-13 17:31:30.013000 L 390013.0 False 3 2190013.0