Using itertools, melt, and groupby correctly to count event pairs per attribute value with Pandas


kspr

I have a table in the following format:

  Id   |   Sequence   |   Attribute A  |  Attribute B |
  ID1       [A,B,C,D]         A1              B1        
  ID2       [A,B,F,G]         A2              B3            
  ID3       [A,B,C,D]         A1              B1
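
For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: it assumes `Sequence` holds Python lists of event labels and that the column names are exactly the headers shown.

```python
import pandas as pd

# Assumed construction of the sample table; column names follow the headers above.
df = pd.DataFrame({
    "Id": ["ID1", "ID2", "ID3"],
    "Sequence": [["A", "B", "C", "D"], ["A", "B", "F", "G"], ["A", "B", "C", "D"]],
    "Attribute A": ["A1", "A2", "A1"],
    "Attribute B": ["B1", "B3", "B1"],
})
print(df)
```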

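I want to count the number of unique IDs for each combination of event pair and attribute value.

The final table should look like this: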

  Pair    |  Attribute Type | Attribute Value   | ID Count
  (A,B)        Attribute A          A1              2        # Event A happens before event B in 2 unique IDs where Attribute A has the value A1.
  (A,C)        Attribute A          A1              2
  (A,D)        Attribute A          A1              2
  (B,C)        Attribute A          A1              2
  (B,D)        Attribute A          A1              2
  (C,D)        Attribute A          A1              2
  (A,B)        Attribute A          A2              1
  (A,F)        Attribute A          A2              1 
  (A,G)        Attribute A          A2              1 
  (B,F)        Attribute A          A2              1
  (B,G)        Attribute A          A2              1
  (F,G)        Attribute A          A2              1
  (A,B)        Attribute B          B1              2
  (A,C)        Attribute B          B1              2
  (A,D)        Attribute B          B1              2
  (B,C)        Attribute B          B1              2
  (B,D)        Attribute B          B1              2
  (C,D)        Attribute B          B1              2
  (A,B)        Attribute B          B3              1
  (A,F)        Attribute B          B3              1 
  (A,G)        Attribute B          B3              1 
  (B,F)        Attribute B          B3              1
  (B,G)        Attribute B          B3              1
  (F,G)        Attribute B          B3              1

What is the right way to do this? In practice, I will have more than two attributes.

This is how far I have got:

import itertools

df['Sequence Combs'] = df['Sequence'].apply(lambda x: list(itertools.combinations(x, 2)))

  Id   |   Sequence   |          Sequence Combs                |   Attribute A  |  Attribute B |
  ID1       [A,B,C,D]   [(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)]           A1              B1        
  ID2       [A,B,F,G]   [(A,B),(A,F),(A,G),(B,F),(B,G),(F,G)]           A2              B3              
  ID3       [A,B,C,D]   [(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)]           A1              B1
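
A note on why these pairs can be read as "x happens before y": `itertools.combinations` emits tuples in the order the elements appear in its input, so the first element of each pair always sits earlier in the sequence than the second. A small sketch using one row of the sample data:

```python
from itertools import combinations

sequence = ["A", "B", "C", "D"]  # one row's Sequence value from the sample table

# combinations() keeps the input order, so the first element of each pair
# always occurs earlier in the sequence than the second one.
pairs = list(combinations(sequence, 2))
print(pairs)
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
```

One caveat: if an event repeats within a sequence, the same pair can show up more than once for a single row, and pairs such as `('A', 'A')` become possible.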

Then I explode it:

df = df.explode('Sequence Combs')

This gives me the following:

  Id   |   Sequence   |Sequence Combs|  Attribute A  |  Attribute B |
  ID1       [A,B,C,D]       (A,B)           A1              B1        
  ID1       [A,B,C,D]       (A,C)           A1              B1        
  ID1       [A,B,C,D]       (A,D)           A1              B1        
  ID1       [A,B,C,D]       (B,C)           A1              B1        
  ID1       [A,B,C,D]       (B,D)           A1              B1        
  ID1       [A,B,C,D]       (C,D)           A1              B1        
  ...          ...           ..             ..              ..

But I am not sure how to proceed from here. Any ideas?

Dani Mesejo

You could do it like this:

from itertools import combinations

# helper that returns the list of 2-combinations of a sequence, in order of appearance
def combs(seq):
    return list(combinations(seq, r=2))

# new DataFrame in which the Sequence column holds the list of 2-combinations
res = df.assign(Sequence=df['Sequence'].apply(combs))

# explode to one row per pair, then melt the attribute columns into long format
res = res.explode('Sequence').melt(id_vars=['Id', 'Sequence'], var_name='Attribute Type', value_name='Attribute Value')

# finally group by all the columns but Id, and count
res = res.groupby(['Sequence', 'Attribute Type', 'Attribute Value'])['Id'].count()

print(res)

Output

Sequence  Attribute Type  Attribute Value
(A, B)    Attribute A     A1                 2
                          A2                 1
          Attribute B     B1                 2
                          B3                 1
(A, C)    Attribute A     A1                 2
          Attribute B     B1                 2
(A, D)    Attribute A     A1                 2
          Attribute B     B1                 2
(A, F)    Attribute A     A2                 1
          Attribute B     B3                 1
(A, G)    Attribute A     A2                 1
          Attribute B     B3                 1
(B, C)    Attribute A     A1                 2
          Attribute B     B1                 2
(B, D)    Attribute A     A1                 2
          Attribute B     B1                 2
(B, F)    Attribute A     A2                 1
          Attribute B     B3                 1
(B, G)    Attribute A     A2                 1
          Attribute B     B3                 1
(C, D)    Attribute A     A1                 2
          Attribute B     B1                 2
(F, G)    Attribute A     A2                 1
          Attribute B     B3                 1
Name: Id, dtype: int64
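
One thing to keep in mind: `count()` counts rows, so if an event repeats inside a sequence, the same pair appears several times for the same Id and is counted each time. If the goal is strictly the number of unique IDs, `nunique()` may be the better fit. A minimal sketch on hypothetical data (not the sample table) where ID1 repeats event A:

```python
from itertools import combinations
import pandas as pd

# Hypothetical data: ID1 repeats event A, so the pair (A, B) is produced twice for it.
df = pd.DataFrame({
    "Id": ["ID1", "ID2"],
    "Sequence": [["A", "A", "B"], ["A", "B"]],
    "Attribute A": ["A1", "A1"],
})

# Same pipeline as above: pairs per row, one row per pair, attributes in long format.
long = (
    df.assign(Sequence=df["Sequence"].apply(lambda s: list(combinations(s, 2))))
      .explode("Sequence")
      .melt(id_vars=["Id", "Sequence"], var_name="Attribute Type", value_name="Attribute Value")
)

keys = ["Sequence", "Attribute Type", "Attribute Value"]
row_counts = long.groupby(keys)["Id"].count()    # counts every row, duplicates included
id_counts = long.groupby(keys)["Id"].nunique()   # counts each Id at most once per group

print(row_counts)
print(id_counts)
```

On the sample table from the question both give the same numbers, since no event repeats within a sequence.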

If you want to match the expected output more closely, replace the final groupby step above with the following (it runs on the melted frame, not on the counted result):

# group by every column except Id, count, and rename the count column
res = res.groupby(['Sequence', 'Attribute Type', 'Attribute Value'], as_index=False)['Id'].count().rename(columns={'Id': 'ID Count'}).sort_values('Attribute Type')

print(res)

Output

   Sequence Attribute Type Attribute Value  ID Count
0    (A, B)    Attribute A              A1         2
1    (A, B)    Attribute A              A2         1
20   (C, D)    Attribute A              A1         2
4    (A, C)    Attribute A              A1         2
6    (A, D)    Attribute A              A1         2
18   (B, G)    Attribute A              A2         1
8    (A, F)    Attribute A              A2         1
10   (A, G)    Attribute A              A2         1
22   (F, G)    Attribute A              A2         1
12   (B, C)    Attribute A              A1         2
16   (B, F)    Attribute A              A2         1
14   (B, D)    Attribute A              A1         2
21   (C, D)    Attribute B              B1         2
19   (B, G)    Attribute B              B3         1
17   (B, F)    Attribute B              B3         1
11   (A, G)    Attribute B              B3         1
13   (B, C)    Attribute B              B1         2
9    (A, F)    Attribute B              B3         1
7    (A, D)    Attribute B              B1         2
5    (A, C)    Attribute B              B1         2
3    (A, B)    Attribute B              B3         1
2    (A, B)    Attribute B              B1         2
15   (B, D)    Attribute B              B1         2
23   (F, G)    Attribute B              B3         1
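
Since the question mentions having more than two attributes, the steps could also be wrapped in a small helper. This is a sketch under a few assumptions, not the answer's original code: the function name `count_event_pairs`, its parameters `id_col` and `seq_col`, and the `Pair` column name are made up for illustration, every column other than the Id and Sequence columns is treated as an attribute, and it uses `nunique()` rather than `count()` so that each Id is counted once per group.

```python
from itertools import combinations
import pandas as pd

def count_event_pairs(df, id_col="Id", seq_col="Sequence"):
    """Count distinct IDs per (event pair, attribute type, attribute value).

    Assumption of this sketch: every column other than id_col and seq_col
    is an attribute column.
    """
    # ordered 2-combinations of each row's event sequence
    pairs = df[seq_col].apply(lambda s: list(combinations(s, 2)))
    long = (
        df.drop(columns=seq_col)
          .assign(Pair=pairs)
          .explode("Pair")  # one row per (Id, pair)
          .melt(id_vars=[id_col, "Pair"], var_name="Attribute Type", value_name="Attribute Value")
    )
    # grouping by attribute first orders the rows like the table in the question
    return (
        long.groupby(["Attribute Type", "Attribute Value", "Pair"])[id_col]
            .nunique()
            .reset_index(name="ID Count")
            [["Pair", "Attribute Type", "Attribute Value", "ID Count"]]
    )

# sample table from the question
df = pd.DataFrame({
    "Id": ["ID1", "ID2", "ID3"],
    "Sequence": [["A", "B", "C", "D"], ["A", "B", "F", "G"], ["A", "B", "C", "D"]],
    "Attribute A": ["A1", "A2", "A1"],
    "Attribute B": ["B1", "B3", "B1"],
})
result = count_event_pairs(df)
print(result)
```

Adding further attribute columns to `df` should need no code changes, because `melt` picks up every column that is not listed in `id_vars`.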
