I'm starting with this DataFrame:
import pandas as pd

df = pd.DataFrame(
    [
        ["a", "aa", "2020-12-20", 10],
        ["a", "ab", "2020-12-26", 11],
        ["a", "aa", "2020-12-22", 10],
        ["b", "bb", "2020-12-25", 111],
        ["c", "bb", "2020-12-20", 20],
        ["d", "dd", "2020-12-05", 1111],
    ],
    columns=["cat", "user", "date", "value"],
)
# Parse the date strings so the column can be resampled later.
df["date"] = pd.to_datetime(df["date"])
 | cat | user | date | value |
---|---|---|---|---|
0 | a | aa | 2020-12-20 00:00:00 | 10 |
1 | a | ab | 2020-12-26 00:00:00 | 11 |
2 | a | aa | 2020-12-22 00:00:00 | 10 |
3 | b | bb | 2020-12-25 00:00:00 | 111 |
4 | c | bb | 2020-12-20 00:00:00 | 20 |
5 | d | dd | 2020-12-05 00:00:00 | 1111 |
Next, I run the following aggregation:
gb = (
    df.set_index("date")
    .groupby("cat")
    .resample("W")
    .agg(
        {"value": "sum", "user": ["nunique", lambda x: x.unique()]}
    )
    .rename({"<lambda>": "unqiue_users"}, axis=1)
)
This produces a table with MultiIndex columns:
                value    user
                  sum nunique unqiue_users
cat date
a   2020-12-20     10       1           aa
    2020-12-27     21       2     [aa, ab]
b   2020-12-27    111       1           bb
c   2020-12-20     20       1           bb
d   2020-12-06   1111       1           dd
Finally, I'm trying to run an aggregation on that result, for example:
gb.groupby(level=0)[["value", "sum"]].mean()
I can't figure out how to "access" columns that have a MultiIndex. Any ideas?
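To select a column from MultiIndex columns, use a tuple; here the tuple is wrapped in a one-element list, so the result stays a DataFrame: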
print (gb.groupby(level=0)[[("value", "sum")]].mean())
      value
        sum
cat
a      15.5
b     111.0
c      20.0
d    1111.0
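To make the difference between the three ways of indexing concrete, here is a minimal sketch on a small stand-in frame. The names `toy`, `as_series`, `as_frame` and `raised` are made up for illustration; only the column labels mirror `gb` above.

```python
import pandas as pd

# Toy frame with two-level columns, shaped like `gb` (hypothetical data).
cols = pd.MultiIndex.from_tuples([("value", "sum"), ("user", "nunique")])
idx = pd.MultiIndex.from_tuples(
    [("a", "2020-12-20"), ("a", "2020-12-27"), ("b", "2020-12-27")],
    names=["cat", "date"],
)
toy = pd.DataFrame([[10, 1], [21, 2], [111, 1]], index=idx, columns=cols)

# A bare tuple names exactly one column and returns a Series.
as_series = toy[("value", "sum")]

# A list holding that tuple returns a one-column DataFrame.
as_frame = toy[[("value", "sum")]]

# A list of plain strings is looked up in the first column level only.
# "sum" is not a first-level label, so the lookup fails with KeyError.
try:
    toy[["value", "sum"]]
    raised = False
except KeyError:
    raised = True

print(type(as_series).__name__, type(as_frame).__name__, raised)
```

This is why `[["value", "sum"]]` in the question fails: it asks for two separate top-level columns rather than the single column `("value", "sum")`.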
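Or, more simply, use mean with the level argument (this argument was deprecated in pandas 1.3 and removed in 2.0, so this form only runs on older versions):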
print (gb[[("value", "sum")]].mean(level=0))
      value
        sum
cat
a      15.5
b     111.0
c      20.0
d    1111.0
To get a Series instead, select with the tuple directly and omit the nested list:
print (gb[("value", "sum")].mean(level=0))
cat
a      15.5
b     111.0
c      20.0
d    1111.0
Name: (value, sum), dtype: float64
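On pandas 2.0 and later, where `mean(level=0)` is no longer available, grouping on the index level gives the same per-category means. The sketch below uses a hand-built Series (`weekly_sum` is a hypothetical name) holding the weekly sums shown above, so it runs without the earlier steps.

```python
import pandas as pd

# Weekly sums per category, copied from the output above (hypothetical name).
idx = pd.MultiIndex.from_tuples(
    [
        ("a", "2020-12-20"),
        ("a", "2020-12-27"),
        ("b", "2020-12-27"),
        ("c", "2020-12-20"),
        ("d", "2020-12-06"),
    ],
    names=["cat", "date"],
)
weekly_sum = pd.Series([10, 21, 111, 20, 1111], index=idx, name="weekly_sum")

# Group on the "cat" index level instead of passing level= to mean().
per_cat_mean = weekly_sum.groupby(level="cat").mean()
print(per_cat_mean)
```

Applied to the real result, the same idea would presumably read `gb[("value", "sum")].groupby(level=0).mean()`.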
To avoid MultiIndex columns in the first place, your solution can be changed to use named aggregation:
gb = (
    df.set_index("date")
    .groupby(["cat", pd.Grouper(freq="W")])
    .agg(
        val=("value", "sum"),
        nuniq=("user", "nunique"),
        unqiue_users=("user", lambda x: x.unique()),
    )
)
print (gb)
                 val  nuniq unqiue_users
cat date
a   2020-12-20    10      1           aa
    2020-12-27    21      2     [ab, aa]
b   2020-12-27   111      1           bb
c   2020-12-20    20      1           bb
d   2020-12-06  1111      1           dd
print (gb['val'].mean(level=0))
cat
a      15.5
b     111.0
c      20.0
d    1111.0
Name: val, dtype: float64
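For reference, here is a self-contained sketch of the named-aggregation route end to end, written so that it should also run on pandas 2.x. Two things differ from the code above and are my own choices: the unique-users lambda is left out to keep the example small, and the final step uses `groupby(level="cat")` in place of `mean(level=0)`.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["a", "aa", "2020-12-20", 10],
        ["a", "ab", "2020-12-26", 11],
        ["a", "aa", "2020-12-22", 10],
        ["b", "bb", "2020-12-25", 111],
        ["c", "bb", "2020-12-20", 20],
        ["d", "dd", "2020-12-05", 1111],
    ],
    columns=["cat", "user", "date", "value"],
)
df["date"] = pd.to_datetime(df["date"])

# Named aggregation: each keyword becomes a flat output column name.
gb = (
    df.set_index("date")
    .groupby(["cat", pd.Grouper(freq="W")])
    .agg(
        val=("value", "sum"),
        nuniq=("user", "nunique"),
    )
)

# Columns are a plain Index now, so an ordinary string selects "val".
per_cat_mean = gb.groupby(level="cat")["val"].mean()
print(per_cat_mean)
```

Because the columns are flat, no tuples are needed when selecting `val` for the second aggregation.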