How to determine values through cumulative comparison using data.table

Posted by user338714 in Dev

user338714

I am looking for a way to determine the parent of each subgroup in the data.table below.

    Group SubGroup Level Parent
 1:     A       A1     0     NA
 2:     A       A2     1     A1
 3:     A       A3     1     A1
 4:     A       A4     2     A3
 5:     A       A5     3     A4
 6:     A       A6     3     A4
 7:     A       A7     3     A4
 8:     A       A8     2     A3
 9:     A       A9     2     A3
10:     A      A10     2     A3

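This is the calculation I am currently using, but I wonder whether there is a better way. My actual data set contains multiple groups, so I would also like to add a by= argument to the calculation. It can be assumed that the parent of a row is the subgroup with the largest row index that is smaller than the current row's index and whose level is lower than the current row's level.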

library(data.table)

tmp = data.table(Group = "A", SubGroup = paste0("A", 1:10),
                 Level = c(0, 1, 1, 2, 3, 3, 3, 2, 2, 2))
# For each row x, take the last earlier row whose Level is lower than Level[x]
tmp[, Parent := sapply(1:nrow(tmp), function(x)
  tmp[, SubGroup[(suppressWarnings(max(which(Level[1:x] < Level[x]))))]])]
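
The rule stated above (the parent is the nearest preceding row with a lower level) can also be written as a function of two plain vectors, which is one way to let by = Group apply naturally. The following is a minimal sketch, not a tuned solution: the helper name nearest_lower_parent and the second group "B" are made up for illustration.

```r
library(data.table)

# Hypothetical helper: for each position i, return the subgroup at the
# closest earlier position whose level is strictly lower, or NA if none.
nearest_lower_parent <- function(subgroup, level) {
  vapply(seq_along(level), function(i) {
    idx <- which(level[seq_len(i - 1L)] < level[i])
    if (length(idx)) subgroup[idx[length(idx)]] else NA_character_
  }, character(1))
}

# Illustrative data with two groups (group "B" is invented for this sketch)
tmp <- data.table(Group = rep(c("A", "B"), each = 5),
                  SubGroup = c(paste0("A", 1:5), paste0("B", 1:5)),
                  Level = c(0, 1, 1, 2, 3, 0, 1, 2, 2, 1))

# The helper only sees the rows of one group at a time
tmp[, Parent := nearest_lower_parent(SubGroup, Level), by = Group]
```

Because the helper scans all earlier rows for every row, its cost grows quadratically with the size of a group, so it is best suited to small or moderately sized groups.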

eddi

dt = data.table(Group = "A", SubGroup = paste0("A", 1:11),
                Level = c(0, 1, 1, 2, 3, 3, 3, 2, 2, 2, 3))

# label each run of consecutive rows that share a Level, so that a row
# can only be matched to a parent from an earlier run
dt[, rowGroup := cumsum(c(0, diff(Level) != 0)), by = Group]

# get the parent for each Level and rowGroup
parents = dt[, .(Level = Level[.N] + 1, Parent = SubGroup[.N]), by = .(Group, rowGroup)]

setkey(parents, Group, Level, rowGroup)
setkey(dt, Group, Level, rowGroup)

# rolling join: match on Group and Level exactly, and roll back to the
# closest previous rowGroup
parents[dt, roll = TRUE][order(Group, rowGroup)]
#    Group rowGroup Level Parent SubGroup
# 1:     A        0     0     NA       A1
# 2:     A        1     1     A1       A2
# 3:     A        1     1     A1       A3
# 4:     A        2     2     A3       A4
# 5:     A        3     3     A4       A5
# 6:     A        3     3     A4       A6
# 7:     A        3     3     A4       A7
# 8:     A        4     2     A3       A8
# 9:     A        4     2     A3       A9
#10:     A        4     2     A3      A10
#11:     A        5     3    A10      A11
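
To store the result as a Parent column for several groups at once, the same rolling join can be expressed with on= instead of setkey(), which leaves the row order of dt untouched. This is a sketch under two assumptions: a child is always exactly one level below its parent (as in the example data), and the second group "B" is invented purely for illustration.

```r
library(data.table)

# Illustrative data: group "A" as above, plus a made-up group "B"
dt <- data.table(Group = rep(c("A", "B"), times = c(11, 5)),
                 SubGroup = c(paste0("A", 1:11), paste0("B", 1:5)),
                 Level = c(0, 1, 1, 2, 3, 3, 3, 2, 2, 2, 3,
                           0, 1, 2, 2, 1))

# Run id of consecutive equal levels, restarted within each Group
dt[, rowGroup := cumsum(c(0, diff(Level) != 0)), by = Group]

# The last SubGroup of each run is the candidate parent for rows
# exactly one level deeper
parents <- dt[, .(Level = Level[.N] + 1, Parent = SubGroup[.N]),
              by = .(Group, rowGroup)]

# Rolling join on the last column (rowGroup); the result has one row
# per row of dt, in dt's original order
joined <- parents[dt, on = .(Group, Level, rowGroup), roll = TRUE]

dt[, Parent := joined$Parent]
dt[, rowGroup := NULL]   # drop the temporary run id
```

If levels can jump by more than one step between a parent and its child, this join no longer follows the rule given in the question, and a direct scan for the nearest preceding lower level would be needed instead.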