I'm trying to solve a pandas problem that consists of summing up the "duration" column, but only across rows of consecutive events for each ("id", "user_num") pair. I've tried several approaches using groupby or a boolean column with cumsum, but it always accumulates values across non-consecutive events (rows).
I have a data sample:
id user_num user_key start_time end_time type duration
1185ba 413 221528661 2020-12-13 15:15:00.000 2020-12-13 15:34:09.000 L 1149000.0
1185ba 403 22251339 2020-12-13 15:34:09.000 2020-12-13 15:49:00.000 L 891000.0
1185ba 403 22421785 2020-12-13 15:49:00.000 2020-12-13 16:10:00.000 L 1260000.0
6666uj 403 22621565 2020-12-13 16:10:00.000 2020-12-13 16:37:00.000 L 1620000.0
1185ba 490 221493747 2020-12-13 16:37:05.000 2020-12-13 16:49:00.000 L 715000.0
1185ba 490 221063609 2020-12-13 16:49:00.000 2020-12-13 16:50:45.000 L 105000.0
1185ba 413 221528653 2020-12-13 16:50:45.000 2020-12-13 16:51:06.463 L 21463.0
1185ba 11 12470010 2020-12-13 16:51:06.463 2020-12-13 16:55:00.000 NL 233537.0
1185ba 413 221528654 2020-12-13 16:55:00.000 2020-12-13 17:00:00.000 L 300000.0
1185ba 413 562281418 2020-12-13 17:00:00.000 2020-12-13 17:25:00.000 L 1500000.0
1185ba 413 221528638 2020-12-13 17:25:00.000 2020-12-13 17:31:30.013 L 390013.0
This example shows what I want to get:
id user_num user_key start_time end_time type duration (sum)
1185ba 413 221528661 2020-12-13 15:15:00.000 2020-12-13 15:34:09.000 L 1149000.0
1185ba 403 22251339 2020-12-13 15:34:09.000 2020-12-13 15:49:00.000 L 891000.0
1185ba 403 22421785 2020-12-13 15:49:00.000 2020-12-13 16:10:00.000 L 2151000.0
6666uj 403 22621565 2020-12-13 16:10:00.000 2020-12-13 16:37:00.000 L 1620000.0
1185ba 490 221493747 2020-12-13 16:37:05.000 2020-12-13 16:49:00.000 L 715000.0
1185ba 490 221063609 2020-12-13 16:49:00.000 2020-12-13 16:50:45.000 L 820000.0
1185ba 413 221528653 2020-12-13 16:50:45.000 2020-12-13 16:51:06.463 L 21463.0
1185ba 11 12470010 2020-12-13 16:51:06.463 2020-12-13 16:55:00.000 NL 233537.0
1185ba 413 221528654 2020-12-13 16:55:00.000 2020-12-13 17:00:00.000 L 300000.0
1185ba 413 562281418 2020-12-13 17:00:00.000 2020-12-13 17:25:00.000 L 1800000.0
1185ba 413 221528638 2020-12-13 17:25:00.000 2020-12-13 17:31:30.013 L 2190013.0
How can I get a table like this? I've had a hard time even searching for this. For example, user_num 413 with id 1185ba starts at 15:15, but some other events occur in between, until we reach the row
1185ba 413 221528653 2020-12-13 16:50:45.000 2020-12-13 16:51:06.463 L 21463.0
and the running sum restarts. The key point is the (id, user_num) pair: as long as a pair's events are consecutive, their durations must be summed; once the streak is broken, the running sum for that pair resets.
The main idea is to identify runs of consecutive events after grouping by id and user_num, and then apply cumsum within each run.
首先我们加载数据
import pandas as pd
from io import StringIO
data = StringIO(
"""
id user_num user_key start_time end_time type duration
1185ba 413 221528661 2020-12-13 15:15:00.000 2020-12-13 15:34:09.000 L 1149000.0
1185ba 403 22251339 2020-12-13 15:34:09.000 2020-12-13 15:49:00.000 L 891000.0
1185ba 403 22421785 2020-12-13 15:49:00.000 2020-12-13 16:10:00.000 L 1260000.0
6666uj 403 22621565 2020-12-13 16:10:00.000 2020-12-13 16:37:00.000 L 1620000.0
1185ba 490 221493747 2020-12-13 16:37:05.000 2020-12-13 16:49:00.000 L 715000.0
1185ba 490 221063609 2020-12-13 16:49:00.000 2020-12-13 16:50:45.000 L 105000.0
1185ba 413 221528653 2020-12-13 16:50:45.000 2020-12-13 16:51:06.463 L 21463.0
1185ba 11 12470010 2020-12-13 16:51:06.463 2020-12-13 16:55:00.000 NL 233537.0
1185ba 413 221528654 2020-12-13 16:55:00.000 2020-12-13 17:00:00.000 L 300000.0
1185ba 413 562281418 2020-12-13 17:00:00.000 2020-12-13 17:25:00.000 L 1500000.0
1185ba 413 221528638 2020-12-13 17:25:00.000 2020-12-13 17:31:30.013 L 390013.0
""")
df = pd.read_csv(data, sep=r'\s\s+', engine='python')  # two or more spaces delimit columns
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
Next, we add a column that flags gaps in the time sequence within each (id, user_num) group:
df2 = (df.groupby(['id','user_num'], sort=False, as_index = False)
.apply(lambda g:
g.assign(new_seq_start = (g['start_time'] != g['end_time'].shift()))
)
.reset_index(drop = True)
)
df2
This gives:
id user_num user_key start_time end_time type duration new_seq_start
-- ------ ---------- ---------- -------------------------- -------------------------- ------ ---------- ---------------
0 1185ba 413 221528661 2020-12-13 15:15:00 2020-12-13 15:34:09 L 1149000.0 True
1 1185ba 413 221528653 2020-12-13 16:50:45 2020-12-13 16:51:06.463000 L 21463.0 True
2 1185ba 413 221528654 2020-12-13 16:55:00 2020-12-13 17:00:00 L 300000.0 True
3 1185ba 413 562281418 2020-12-13 17:00:00 2020-12-13 17:25:00 L 1500000.0 False
4 1185ba 413 221528638 2020-12-13 17:25:00 2020-12-13 17:31:30.013000 L 390013.0 False
5 1185ba 403 22251339 2020-12-13 15:34:09 2020-12-13 15:49:00 L 891000.0 True
6 1185ba 403 22421785 2020-12-13 15:49:00 2020-12-13 16:10:00 L 1260000.0 False
7 6666uj 403 22621565 2020-12-13 16:10:00 2020-12-13 16:37:00 L 1620000.0 True
8 1185ba 490 221493747 2020-12-13 16:37:05 2020-12-13 16:49:00 L 715000.0 True
9 1185ba 490 221063609 2020-12-13 16:49:00 2020-12-13 16:50:45 L 105000.0 False
10 1185ba 11 12470010 2020-12-13 16:51:06.463000 2020-12-13 16:55:00 NL 233537.0 True
Then we use this new column, new_seq_start, to define a counter, group_idx, that increments whenever there is a gap:
df2['group_idx'] = df2.groupby(['id','user_num'])['new_seq_start'].cumsum()
df2
So we get:
id user_num user_key start_time end_time type duration new_seq_start group_idx
-- ------ ---------- ---------- -------------------------- -------------------------- ------ ---------- --------------- -----------
0 1185ba 413 221528661 2020-12-13 15:15:00 2020-12-13 15:34:09 L 1149000.0 True 1
1 1185ba 413 221528653 2020-12-13 16:50:45 2020-12-13 16:51:06.463000 L 21463.0 True 2
2 1185ba 413 221528654 2020-12-13 16:55:00 2020-12-13 17:00:00 L 300000.0 True 3
3 1185ba 413 562281418 2020-12-13 17:00:00 2020-12-13 17:25:00 L 1500000.0 False 3
4 1185ba 413 221528638 2020-12-13 17:25:00 2020-12-13 17:31:30.013000 L 390013.0 False 3
5 1185ba 403 22251339 2020-12-13 15:34:09 2020-12-13 15:49:00 L 891000.0 True 1
6 1185ba 403 22421785 2020-12-13 15:49:00 2020-12-13 16:10:00 L 1260000.0 False 1
7 6666uj 403 22621565 2020-12-13 16:10:00 2020-12-13 16:37:00 L 1620000.0 True 1
8 1185ba 490 221493747 2020-12-13 16:37:05 2020-12-13 16:49:00 L 715000.0 True 1
9 1185ba 490 221063609 2020-12-13 16:49:00 2020-12-13 16:50:45 L 105000.0 False 1
10 1185ba 11 12470010 2020-12-13 16:51:06.463000 2020-12-13 16:55:00 NL 233537.0 True 1
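This counter works because cumsum treats booleans as integers: every True (a new-run flag) bumps the counter by one, while False rows leave it flat. A minimal sketch, independent of the data above:

```python
import pandas as pd

# Each True marks the start of a new run; cumsum turns the flags
# into a run counter (True counts as 1, False as 0).
flags = pd.Series([True, True, True, False, False])
print(flags.cumsum().tolist())  # → [1, 2, 3, 3, 3]
```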
Now we can group by the original columns plus group_idx, apply cumsum, and sort back into the original order:
df2['duration (sum)'] = df2.groupby(['id','user_num','group_idx'])['duration'].cumsum()
df2.sort_values('start_time')
Done:
id user_num user_key start_time end_time type duration new_seq_start group_idx duration (sum)
-- ------ ---------- ---------- -------------------------- -------------------------- ------ ---------- --------------- ----------- ----------------
0 1185ba 413 221528661 2020-12-13 15:15:00 2020-12-13 15:34:09 L 1149000.0 True 1 1149000.0
5 1185ba 403 22251339 2020-12-13 15:34:09 2020-12-13 15:49:00 L 891000.0 True 1 891000.0
6 1185ba 403 22421785 2020-12-13 15:49:00 2020-12-13 16:10:00 L 1260000.0 False 1 2151000.0
7 6666uj 403 22621565 2020-12-13 16:10:00 2020-12-13 16:37:00 L 1620000.0 True 1 1620000.0
8 1185ba 490 221493747 2020-12-13 16:37:05 2020-12-13 16:49:00 L 715000.0 True 1 715000.0
9 1185ba 490 221063609 2020-12-13 16:49:00 2020-12-13 16:50:45 L 105000.0 False 1 820000.0
1 1185ba 413 221528653 2020-12-13 16:50:45 2020-12-13 16:51:06.463000 L 21463.0 True 2 21463.0
10 1185ba 11 12470010 2020-12-13 16:51:06.463000 2020-12-13 16:55:00 NL 233537.0 True 1 233537.0
2 1185ba 413 221528654 2020-12-13 16:55:00 2020-12-13 17:00:00 L 300000.0 True 3 300000.0
3 1185ba 413 562281418 2020-12-13 17:00:00 2020-12-13 17:25:00 L 1500000.0 False 3 1800000.0
4 1185ba 413 221528638 2020-12-13 17:25:00 2020-12-13 17:31:30.013000 L 390013.0 False 3 2190013.0