I have a table in the following format:
Id | Sequence | Attribute A | Attribute B |
ID1 [A,B,C,D] A1 B1
ID2 [A,B,F,G] A2 B3
ID3 [A,B,C,D] A1 B1
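For anyone who wants to reproduce this, the table above could be built roughly as follows. This is a minimal sketch that assumes the Sequence column holds Python lists of event labels and that the attribute values are plain strings; the question does not say how the data is actually loaded.

```python
import pandas as pd

# Sample data mirroring the table in the question.
# Assumption: each Sequence cell is a Python list of event labels.
df = pd.DataFrame({
    "Id": ["ID1", "ID2", "ID3"],
    "Sequence": [
        ["A", "B", "C", "D"],
        ["A", "B", "F", "G"],
        ["A", "B", "C", "D"],
    ],
    "Attribute A": ["A1", "A2", "A1"],
    "Attribute B": ["B1", "B3", "B1"],
})

print(df)
```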
I want to count the number of unique IDs for each pair of events and each attribute value.
The final table should look like this:
Pair | Attribute Type | Attribute Value | ID Count
(A,B) Attribute A A1 2 # Event A happens before event B in 2 unique IDs where Attribute A has the value A1.
(A,C) Attribute A A1 2
(A,D) Attribute A A1 2
(B,C) Attribute A A1 2
(B,D) Attribute A A1 2
(C,D) Attribute A A1 2
(A,B) Attribute A A2 1
(A,F) Attribute A A2 1
(A,G) Attribute A A2 1
(B,F) Attribute A A2 1
(B,G) Attribute A A2 1
(F,G) Attribute A A2 1
(A,B) Attribute B B1 2
(A,C) Attribute B B1 2
(A,D) Attribute B B1 2
(B,C) Attribute B B1 2
(B,D) Attribute B B1 2
(C,D) Attribute B B1 2
(A,B) Attribute B B3 1
(A,F) Attribute B B3 1
(A,G) Attribute B B3 1
(B,F) Attribute B B3 1
(B,G) Attribute B B3 1
(F,G) Attribute B B3 1
What is the right way to do this? In practice I will have more than 2 attributes.
This is how far I have got:
import itertools
df['Sequence Combs'] = df['Sequence'].apply(lambda x: list(itertools.combinations(x,2)))
Id | Sequence | Sequence Combs | Attribute A | Attribute B |
ID1 [A,B,C,D] [(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)] A1 B1
ID2 [A,B,F,G] [(A,B),(A,F),(A,G),(B,F),(B,G),(F,G)] A2 B3
ID3 [A,B,C,D] [(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)] A1 B1
Then I explode the new column:
df = df.explode('Sequence Combs')
I get the following:
Id | Sequence | Sequence Combs | Attribute A | Attribute B |
ID1 [A,B,C,D] (A,B) A1 B1
ID1 [A,B,C,D] (A,C) A1 B1
ID1 [A,B,C,D] (A,D) A1 B1
ID1 [A,B,C,D] (B,C) A1 B1
ID1 [A,B,C,D] (B,D) A1 B1
ID1 [A,B,C,D] (C,D) A1 B1
... ... .. .. ..
But I am not sure how to proceed from here. Any ideas?
You can do it like this:
from itertools import combinations
# helper that returns the list of 2-combinations (ordered pairs) of a sequence
combs = lambda x: list(combinations(x, r=2))
# create a new DataFrame in which the Sequence column holds the list of 2-combinations
res = df.assign(seq=df['Sequence'].apply(combs)).drop('Sequence', axis=1).rename(columns={'seq' : 'Sequence'})
# explode, then melt
res = res.explode('Sequence').melt(id_vars=['Id', 'Sequence'], var_name='Attribute Type', value_name='Attribute Value')
# finally group by all the columns but Id, and count
res = res.groupby(['Sequence', 'Attribute Type', 'Attribute Value'])['Id'].count()
print(res)
Output:
Sequence Attribute Type Attribute Value
(A, B) Attribute A A1 2
A2 1
Attribute B B1 2
B3 1
(A, C) Attribute A A1 2
Attribute B B1 2
(A, D) Attribute A A1 2
Attribute B B1 2
(A, F) Attribute A A2 1
Attribute B B3 1
(A, G) Attribute A A2 1
Attribute B B3 1
(B, C) Attribute A A1 2
Attribute B B1 2
(B, D) Attribute A A1 2
Attribute B B1 2
(B, F) Attribute A A2 1
Attribute B B3 1
(B, G) Attribute A A2 1
Attribute B B3 1
(C, D) Attribute A A1 2
Attribute B B1 2
(F, G) Attribute A A2 1
Attribute B B3 1
Name: Id, dtype: int64
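One caveat about the counting step: `count()` counts rows, not distinct IDs. For the sample data the two agree, because no sequence repeats an event. If a sequence could contain the same event more than once, the same pair would show up several times for one Id, and `nunique()` would be the closer match to "number of unique IDs". The sketch below illustrates the difference on made-up data (the repeated sequence and the single attribute column are assumptions for illustration only, not part of the question).

```python
from itertools import combinations

import pandas as pd

# Hypothetical data: ID1 repeats events, so the pair (A, B) occurs 3 times for it.
df = pd.DataFrame({
    "Id": ["ID1", "ID2"],
    "Sequence": [["A", "B", "A", "B"], ["A", "B"]],
    "Attribute A": ["A1", "A1"],
})

# Same pipeline as in the answer: pairs, explode, melt.
pairs = df.assign(Sequence=df["Sequence"].apply(lambda s: list(combinations(s, 2))))
long = pairs.explode("Sequence").melt(
    id_vars=["Id", "Sequence"],
    var_name="Attribute Type",
    value_name="Attribute Value",
)

grouped = long.groupby(["Sequence", "Attribute Type", "Attribute Value"])["Id"]
row_counts = grouped.count().reset_index(name="Rows")          # counts every occurrence
unique_counts = grouped.nunique().reset_index(name="ID Count")  # counts distinct Ids

print(row_counts)
print(unique_counts)
```

With a single attribute column each pair appears in exactly one output row, so the two results can be compared pair by pair.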
If you want to match the expected output exactly, replace the final groupby step with the following:
# finally group by all the columns but Id, count, and rename the count column
res = res.groupby(['Sequence', 'Attribute Type', 'Attribute Value'], as_index=False)['Id'].count().rename(columns={'Id' : 'Id Count'}).sort_values('Attribute Type')
print(res)
Output:
Sequence Attribute Type Attribute Value Id Count
0 (A, B) Attribute A A1 2
1 (A, B) Attribute A A2 1
20 (C, D) Attribute A A1 2
4 (A, C) Attribute A A1 2
6 (A, D) Attribute A A1 2
18 (B, G) Attribute A A2 1
8 (A, F) Attribute A A2 1
10 (A, G) Attribute A A2 1
22 (F, G) Attribute A A2 1
12 (B, C) Attribute A A1 2
16 (B, F) Attribute A A2 1
14 (B, D) Attribute A A1 2
21 (C, D) Attribute B B1 2
19 (B, G) Attribute B B3 1
17 (B, F) Attribute B B3 1
11 (A, G) Attribute B B3 1
13 (B, C) Attribute B B1 2
9 (A, F) Attribute B B3 1
7 (A, D) Attribute B B1 2
5 (A, C) Attribute B B1 2
3 (A, B) Attribute B B3 1
2 (A, B) Attribute B B1 2
15 (B, D) Attribute B B1 2
23 (F, G) Attribute B B3 1
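Since the question mentions that there will be more than 2 attributes, the steps above could be wrapped in a small helper. The following is only a sketch: the function name `pair_attribute_counts` and the output column names `Pair` and `ID Count` are made up here, and it assumes every column other than the Id and Sequence columns is an attribute. It uses `nunique()` so that an Id is counted once per pair, and it relies on `groupby` sorting its keys to order the rows by attribute type, then value, then pair.

```python
from itertools import combinations

import pandas as pd


def pair_attribute_counts(df, id_col="Id", seq_col="Sequence"):
    """Count distinct Ids per (pair, attribute type, attribute value).

    Assumption: every column except id_col and seq_col is an attribute.
    """
    # Ordered 2-combinations of each sequence (earlier event first).
    pairs = df[seq_col].apply(lambda s: list(combinations(s, 2)))
    long = (
        df.drop(columns=seq_col)
        .assign(Pair=pairs)
        .explode("Pair")  # one row per (Id, pair)
        .melt(
            id_vars=[id_col, "Pair"],
            var_name="Attribute Type",
            value_name="Attribute Value",
        )  # one row per (Id, pair, attribute)
    )
    counts = (
        long.groupby(["Attribute Type", "Attribute Value", "Pair"])[id_col]
        .nunique()
        .reset_index(name="ID Count")
    )
    return counts[["Pair", "Attribute Type", "Attribute Value", "ID Count"]]


# Usage with the sample data from the question.
df = pd.DataFrame({
    "Id": ["ID1", "ID2", "ID3"],
    "Sequence": [["A", "B", "C", "D"], ["A", "B", "F", "G"], ["A", "B", "C", "D"]],
    "Attribute A": ["A1", "A2", "A1"],
    "Attribute B": ["B1", "B3", "B1"],
})
result = pair_attribute_counts(df)
print(result)
```

Adding further attribute columns to `df` should need no code changes, because `melt` picks up every column not listed in `id_vars`.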