How do I create a new column from the difference of maximum values per index?

Posted by Daniel Arges in Dev

Daniel Arges

Take the following MultiIndex DataFrame:

index_1   index_2   cum_value
0         2020-01      100.00
0         2020-02       50.00 
0         2020-03      -50.00
0         2020-04      150.00
0         2020-05      200.00    
1         2020-01       25.00
1         2020-02       50.00
1         2020-03     -100.00
1         2020-04       50.00
1         2020-05      200.00
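
For anyone who wants to reproduce the frame above, here is a minimal sketch of one way to build it. It assumes index_2 holds plain strings such as '2020-01' (the question does not say whether they are strings or periods) and that both index levels carry the names shown.

```python
import pandas as pd

# Assumption: index_2 is stored as plain strings; the question does not give the dtype.
months = ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05']
index = pd.MultiIndex.from_product([[0, 1], months], names=['index_1', 'index_2'])

df = pd.DataFrame(
    {'cum_value': [100.0, 50.0, -50.0, 150.0, 200.0,
                   25.0, 50.0, -100.0, 50.0, 200.0]},
    index=index,
)
print(df)
```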

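I need to create a new_col that holds, for each index_1, the difference between cum_value at the end of each month and the maximum cum_value reached in the previous months within that same index_1.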

The result should look like this:

index_1   index_2   cum_value   new_col
0         2020-01      100.00    100.00 --> first positive value on index_1 [0]
0         2020-02       50.00      0.00
0         2020-03      -50.00      0.00
0         2020-04      150.00     50.00 --> (150 - 100)
0         2020-05      200.00     50.00 --> (200 - 150)
1         2020-01       25.00     25.00 --> first positive value on index_1 [1]
1         2020-02       50.00     25.00 --> (50 - 25)
1         2020-03     -100.00      0.00
1         2020-04       50.00      0.00
1         2020-05      200.00    150.00 --> (200 - 50)

The first row with a positive value must show that value in new_col. I do not need negative maximums.

The rationale is to compute the marginal value on which some taxes have to be paid.
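
One way to read the rule described above is as a "high-water mark": each month contributes only the amount by which it exceeds the best earlier month, and the mark starts at zero because negative maximums are ignored. The plain-Python sketch below is only an illustration of that reading for a single index_1 group; the function name marginal_values is hypothetical and does not appear in the question.

```python
def marginal_values(values):
    """Return, for each value, its increase over the previous high-water mark."""
    high_water = 0.0  # negative maximums are ignored, so the mark starts at zero
    result = []
    for value in values:
        # Only the part above the previous maximum counts; never below zero.
        result.append(max(value - high_water, 0.0))
        high_water = max(high_water, value)
    return result


group_0 = marginal_values([100.0, 50.0, -50.0, 150.0, 200.0])
group_1 = marginal_values([25.0, 50.0, -100.0, 50.0, 200.0])
print(group_0)  # [100.0, 0.0, 0.0, 50.0, 50.0]
print(group_1)  # [25.0, 25.0, 0.0, 0.0, 150.0]
```

Both lists match the new_col values requested in the expected result.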

Shubham Sharma

Code

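# Running maximum of cum_value within each index_1 group
c = df.groupby(level=0)['cum_value'].cummax()
# True where the value is non-negative and equals the running maximum
m = df['cum_value'].ge(c) & df['cum_value'].ge(0)
# Difference between consecutive retained values within each group
df['new_col'] = df.loc[m, 'cum_value'].groupby(level=0).diff()
# Fill NaN with cum_value, then set rows outside the mask to 0
df['new_col'] = df['new_col'].fillna(df['cum_value']).mask(~m, 0)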

Explanation

Group the DataFrame on level=0, i.e. index_1, and transform the cum_value column with cummax to compute the cumulative maximum within each level=0 group:

>>> c

index_1  index_2
0        2020-01    100.0
         2020-02    100.0
         2020-03    100.0
         2020-04    150.0
         2020-05    200.0
1        2020-01     25.0
         2020-02     50.0
         2020-03     50.0
         2020-04     50.0
         2020-05    200.0
Name: cum_value, dtype: float64
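
To see why the grouping matters, the sketch below compares the per-group running maximum with a plain cummax over the whole column. The Series is rebuilt here so the snippet runs on its own; storing index_2 as strings is an assumption.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[0, 1], ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05']],
    names=['index_1', 'index_2'],
)
s = pd.Series(
    [100.0, 50.0, -50.0, 150.0, 200.0, 25.0, 50.0, -100.0, 50.0, 200.0],
    index=index,
    name='cum_value',
)

per_group = s.groupby(level=0).cummax()  # restarts for each index_1
whole_column = s.cummax()                # carries 200.0 from group 0 into group 1

print(per_group.loc[1].tolist())     # [25.0, 50.0, 50.0, 50.0, 200.0]
print(whole_column.loc[1].tolist())  # [200.0, 200.0, 200.0, 200.0, 200.0]
```

Without the groupby, every row of the second group would be compared against the maximum of the first group.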

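Now compare the cum_value column with the cumulative maximum computed above to create a boolean mask. Note that only non-negative values of cum_value are considered. The idea behind the mask is that it is True when the current month's value is greater than or equal to the maximum of the previous months, and False otherwise.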

>>> m

index_1  index_2
0        2020-01     True
         2020-02    False
         2020-03    False
         2020-04     True
         2020-05     True
1        2020-01     True
         2020-02     True
         2020-03    False
         2020-04     True
         2020-05     True
Name: cum_value, dtype: bool
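
The second half of the mask, ge(0), only makes a difference when a group starts with negative values, which does not happen in the question's data. The sketch below uses a hypothetical three-row group to show the effect; the numbers are invented purely for illustration.

```python
import pandas as pd

# Hypothetical group whose first values are negative (not part of the question's data).
index = pd.MultiIndex.from_tuples(
    [(0, '2020-01'), (0, '2020-02'), (0, '2020-03')],
    names=['index_1', 'index_2'],
)
s = pd.Series([-10.0, -5.0, 20.0], index=index, name='cum_value')

c = s.groupby(level=0).cummax()
at_max = s.ge(c)      # True on every row: each value is a new running maximum
m = at_max & s.ge(0)  # the extra check drops the negative maximums

print(at_max.tolist())  # [True, True, True]
print(m.tolist())       # [False, False, True]
```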

Since we are only interested in the values of cum_value that satisfy the condition above, we can use the boolean mask to filter them:

>>> df.loc[m, 'cum_value']

index_1  index_2
0        2020-01    100.0
         2020-04    150.0
         2020-05    200.0
1        2020-01     25.0
         2020-02     50.0
         2020-04     50.0
         2020-05    200.0
Name: cum_value, dtype: float64

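Now group the filtered values on level=0, i.e. index_1, and apply diff to the cum_value column to compute the difference between the current value and the previous maximum: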

>>> df.loc[m, 'cum_value'].groupby(level=0).diff()

index_1  index_2
0        2020-01      NaN
         2020-04     50.0
         2020-05     50.0
1        2020-01      NaN
         2020-02     25.0
         2020-04      0.0
         2020-05    150.0
Name: cum_value, dtype: float64

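Finally, fill the NaN values in the newly created new_col with the corresponding values from cum_value, and mask the values that do not satisfy condition m with 0: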

>>> df
                 cum_value  new_col
index_1 index_2                    
0       2020-01      100.0    100.0
        2020-02       50.0      0.0
        2020-03      -50.0      0.0
        2020-04      150.0     50.0
        2020-05      200.0     50.0
1       2020-01       25.0     25.0
        2020-02       50.0     25.0
        2020-03     -100.0      0.0
        2020-04       50.0      0.0
        2020-05      200.0    150.0
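
As a cross-check, the same numbers can be obtained with a shorter formulation: floor the per-group running maximum at zero and take its per-group difference. This is an alternative sketch, not the method used in the answer; the column name alt_col and the variable high_water are made up for the example, and it has only been checked against the question's data.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[0, 1], ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05']],
    names=['index_1', 'index_2'],
)
df = pd.DataFrame(
    {'cum_value': [100.0, 50.0, -50.0, 150.0, 200.0,
                   25.0, 50.0, -100.0, 50.0, 200.0]},
    index=index,
)

# Running maximum per index_1, floored at zero so negative maximums are ignored.
high_water = df.groupby(level=0)['cum_value'].cummax().clip(lower=0)

# Month-to-month increase of that maximum; the first row of each group
# has no predecessor, so it falls back to the maximum itself.
df['alt_col'] = high_water.groupby(level=0).diff().fillna(high_water)

print(df)
```

On this data alt_col comes out as 100, 0, 0, 50, 50 for the first group and 25, 25, 0, 0, 150 for the second, the same as new_col above.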
