How do I conditionally select the top value per group without comparing every value?

DN1

My data looks like this:

  Group Gene      Score     direct_count   secondary_count 
    1   AQP11    0.5566507       4               5
    1   CLNS1A   0.2811747       0               2
    1   RSF1     0.5469924       3               6
    2   CFDP1    0.4186066       1               2
    2   CHST6    0.4295135       1               3
    3   ACE      0.634           1               1
    3   NOS2     0.6345          1               1
    4   Gene1    0.7             0               1
    4   Gene2    0.61            1               0
    4   Gene3    0.62            0               1

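I group the genes by the Group column, then select the best gene in each group according to these conditions:

If the top-scoring gene has a score difference >0.05 from all other genes in the group, select the top-scoring gene.
If the score difference between the top-scoring gene and any other gene in the group is <0.05, select the gene with the higher direct_count, choosing only among the genes that are within <0.05 of the top-scoring gene in the group.
If direct_count is the same, select the gene with the highest secondary_count.
If all counts are the same, select all genes within <0.05 distance of each other.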

Example output:

 Group Gene      Score     direct_count   secondary_count 
    1   AQP11    0.5566507       4               5  #highest direct_count
    2   CHST6    0.4295135       1               3  #highest secondary_count after matching direct_count
    3   ACE      0.634           1               1  #ACE and NOS2 have matching counts
    3   NOS2     0.6345          1               1
    4   Gene1    0.7             0               1  #highest score by >0.05 difference

Currently I am trying to code this with:

df<- setDT(df)
new_df <- df[, 
   {d = dist(Score, method = 'manhattan')
   if (any(d > 0.05)) 
     ind = which.max(d)
   else if (sum(max(direct_count) == direct_count) == 1L) 
     ind = which.max(direct_count)
   else if (sum(max(secondary_count) == secondary_count) == 1L) 
     ind = which.max(secondary_count)
   else 
     ind = which((outer(direct_count, direct_count, '==') & outer(secondary_count, secondary_count, '=='))[1, ])
   
   .SD[ind]
   }
   , by = Group]

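However, I am struggling to adapt my first else if statement to handle my second condition, which should choose only among genes within <0.05 distance of the top-scoring gene. At the moment it compares all genes in the group, so the gene with the largest count is selected even if its score is 0.1 while the top-scoring gene has 0.7, as long as some other gene in the group scores, say, 0.68 and therefore meets the <0.05 distance requirement.

Essentially, I want conditions 2 to 4 to consider only the genes within <0.05 distance of the top-scoring gene in each group.

Input data: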

structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), Gene = c("AQP11", 
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2", "Gene1","Gene2","Gene3"), Score = c(0.5566507, 
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345, 0.7, 0.62, 0.61), direct_count = c(4L, 
0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L), secondary_count = c(5L, 2L, 6L, 2L, 
3L, 1L, 1L, 0L, 0L, 1L)), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"))

Edit:

The reason I am asking is that one particular group is not behaving as I expect:

  Group Gene         Score      direct_count     secondary_count
1   2    CFDP1        0.5517401        1                  62
2   2    CHST6        0.5989186        1                   6
3   2    RNU6-758P    0.5644914        0                   1
4   2    Gene1        0.5672916        0                   1
5   2    TMEM170A     0.6167083        0                   2

CHST6 has the highest direct_count among the genes within <0.05 of the top-scoring gene in this group, yet Gene1 is still being selected.

Second example input data:

structure(list(Group = c(2L, 2L, 2L, 2L, 2L), Gene = c("CFDP1", 
"CHST6", "RNU6-758P", "Gene1", "TMEM170A"), Score = c(0.551740109920502, 
0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006
), direct_count = c(1, 1, 0, 0, 0), secondary_count = c(62, 
6, 1, 1, 2)), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))

Edo

You can reach your goal with two different solutions: dplyr and data.table.

You don't need any complicated ifelse conditions.
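The reason no branching is needed can be sketched in base R: the rules behave like three filters applied in sequence, and rule 1 is just the case where the first filter leaves a single gene. The helper name `pick_best` and its `tol` argument are introduced here for illustration only and are not part of the original answer.

```r
# Hypothetical helper: applies the three filters in order to the rows of
# ONE group and returns the rows that survive.
pick_best <- function(g, tol = 0.05) {
  # Candidates: genes whose score is within `tol` of the group's top score
  g <- g[max(g$Score) - g$Score < tol, , drop = FALSE]
  # Rule 2: among candidates, keep the highest direct_count
  g <- g[g$direct_count == max(g$direct_count), , drop = FALSE]
  # Rules 3 and 4: keep the highest secondary_count; exact ties all survive
  g[g$secondary_count == max(g$secondary_count), , drop = FALSE]
}

# Group 4 as given in the question's input data
g4 <- data.frame(
  Gene = c("Gene1", "Gene2", "Gene3"),
  Score = c(0.7, 0.62, 0.61),
  direct_count = c(0L, 1L, 0L),
  secondary_count = c(0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Gene1 leads by more than 0.05, so it is the only candidate and the
# count filters have nothing left to decide.
pick_best(g4)$Gene
```

Because Gene2 and Gene3 are dropped by the first filter, their higher counts never come into play, which is the behaviour rule 1 asks for.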

Solution

Input

dt <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
                     Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"),
                     Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345),
                     direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L),
                     secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L)),
                row.names = c(NA, -7L),
                class = c("data.table", "data.frame"))

DPLYR

library(dplyr)

dt %>% 
  group_by(Group) %>% 
  filter((max(Score) - Score)<0.05) %>% 
  slice_max(direct_count, n = 1) %>% 
  slice_max(secondary_count, n = 1) %>% 
  ungroup()
#> # A tibble: 4 x 5
#>   Group Gene  Score direct_count secondary_count
#>   <int> <chr> <dbl>        <int>           <int>
#> 1     1 AQP11 0.557            4               5
#> 2     2 CHST6 0.430            1               3
#> 3     3 ACE   0.634            1               1
#> 4     3 NOS2  0.634            1               1
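Both ACE and NOS2 come back for group 3 because `slice_max()` keeps ties by default (`with_ties = TRUE`), which is what covers the "all counts are the same" rule. A small sketch of that behaviour, using a two-row tibble `g3` made up here for illustration:

```r
library(dplyr)

# Two genes with identical direct_count, as in group 3
g3 <- tibble(
  Gene = c("ACE", "NOS2"),
  direct_count = c(1L, 1L)
)

# Default: every row tied for the maximum is kept
kept_ties <- slice_max(g3, direct_count, n = 1)
nrow(kept_ties)  # 2

# With ties disabled, exactly one row is returned
no_ties <- slice_max(g3, direct_count, n = 1, with_ties = FALSE)
nrow(no_ties)    # 1
```

If you ever wanted exactly one gene per group, `with_ties = FALSE` would be the switch to flip, at the cost of an arbitrary pick among tied genes.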

DATA.TABLE

library(data.table)

dt <- dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1]
dt <- dt[dt[, .I[direct_count == max(direct_count)], by = Group]$V1]
dt <- dt[dt[, .I[secondary_count == max(secondary_count)], by = Group]$V1]
dt
#>    Group  Gene     Score direct_count secondary_count
#> 1:     1 AQP11 0.5566507            4               5
#> 2:     2 CHST6 0.4295135            1               3
#> 3:     3   ACE 0.6340000            1               1
#> 4:     3  NOS2 0.6345000            1               1
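The input used above leaves out Group 4 from the question. As a hedged sketch, the same three filters can also be run inside a single `by = Group` call, which is close in shape to the code in the question and does not overwrite the original table. The names `full`, `cand` and `res` are introduced here for illustration.

```r
library(data.table)

# Full input from the question, including Group 4
full <- data.table(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6",
           "ACE", "NOS2", "Gene1", "Gene2", "Gene3"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135,
            0.634, 0.6345, 0.7, 0.62, 0.61),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L, 0L, 0L, 1L)
)

res <- full[, {
  # Candidates: within 0.05 of the group's top score
  cand <- .SD[max(Score) - Score < 0.05]
  # Highest direct_count among the candidates
  cand <- cand[direct_count == max(direct_count)]
  # Highest secondary_count among what is left; exact ties are all kept
  cand[secondary_count == max(secondary_count)]
}, by = Group]

res$Gene  # AQP11, CHST6, ACE, NOS2, Gene1
```

Each filter sees only the rows that survived the previous one, so the count comparisons never involve genes that are far from the top score.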

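Your edit

Regarding the specific problem at the end of your question: both approaches select CHST6, which is exactly what you expect from the rules you wrote.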

dt <- structure(list(Group = c(2L, 2L, 2L, 2L, 2L), 
               Gene = c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A"), 
               Score = c(0.551740109920502,  0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006),
               direct_count = c(1, 1, 0, 0, 0), 
               secondary_count = c(62, 6, 1, 1, 2)), 
          row.names = c(NA, -5L), 
          class = c("data.table", 
                    "data.frame"))


########## DPLYR

library(dplyr)

dt %>% 
  group_by(Group) %>% 
  filter((max(Score) - Score)<0.05) %>% 
  slice_max(direct_count, n = 1) %>% 
  slice_max(secondary_count, n = 1) %>% 
  ungroup()
#> # A tibble: 1 x 5
#>   Group Gene  Score direct_count secondary_count
#>   <int> <chr> <dbl>        <dbl>           <dbl>
#> 1     2 CHST6 0.599            1               6


########## DATATABLE

library(data.table)

dt <- dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1]
dt <- dt[dt[, .I[direct_count == max(direct_count)], by = Group]$V1]
dt <- dt[dt[, .I[secondary_count == max(secondary_count)], by = Group]$V1]
dt
#>    Group  Gene     Score direct_count secondary_count
#> 1:     2 CHST6 0.5989186            1               6
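As for why the code in the question returns Gene1 for this group, a likely explanation is that `dist()` produces one value per pair of rows, so `which.max(d)` is a position in the list of pairs rather than a row number. The base R sketch below reproduces that on the second example; the variable names `d`, `ind` and `near_top` are chosen here for illustration.

```r
# Second example group from the question
gene  <- c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A")
score <- c(0.551740109920502, 0.598918557167053, 0.564491391181946,
           0.567291617393494, 0.616708278656006)
direct_count <- c(1, 1, 0, 0, 0)

# dist() yields one value per PAIR of rows (10 pairs for 5 rows)
d <- as.vector(dist(score, method = "manhattan"))
length(d)            # 10

# The largest pairwise distance is the 4th pair (CFDP1 vs TMEM170A) ...
ind <- which.max(d)  # 4
# ... and using that pair position as a row index lands on row 4
gene[ind]            # "Gene1"

# Distance from the top score gives one value per row instead
near_top <- max(score) - score < 0.05
gene[near_top]       # "CHST6" "Gene1" "TMEM170A"
gene[near_top][which.max(direct_count[near_top])]  # "CHST6"
```

Since `any(d > 0.05)` is true for this group, the first branch of the original code fires and `.SD[ind]` picks row 4, which happens to be Gene1.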
