R: How to remove rows from a data frame if the row contains a certain value (across many columns)


John Abraham

I have a data frame with many variables (columns) and rows (observations). I want to remove every row that contains the value 1.

I know I can do this:

    test <- data.frame("X"=1:5, "Y"=c(1,1,1,4,5))

    test[test$X>1 & test$Y>1, ]

and get:

      X Y
    4 4 4
    5 5 5

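But I don't want to write data$var1 > 1 & data$var2 > 1 ... for 20 or 50 variables to do something this simple.

Is there a way to get the same result without writing so much?

Edit: Big oof: the three methods suggested here do not all leave the same number of observations. Is this a bug? Could it be some effect of how they interact with NA?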

Method 1)

    df[!apply(df[, myCols], 1, function(x) any(x == 1)),]

    > any(df == 1)
    [1] TRUE

Method 2)

    removeRowsWithOnes <- function(df) {
      rowsToRemoveIndices <- rowSums(df == 1) > 0
      return(df[!rowsToRemoveIndices,])
    }

    > any(df == 1)
    [1] NA

Method 3) (removes a different number of rows than Method 2)

    require(tidyverse)

    df %>% 
        filter(
            across(everything(), ~ . != 1)
        )

    > any(df == 1)
    [1] NA

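Edit 2: After adding NAs to df:

    df <- data.frame("x"=c(1,NA,2,2,3,NA), "y"=c(NA,1,1,4,NA,NA))

       x  y
    1  1 NA
    2 NA  1
    3  2  1
    4  2  4
    5  3 NA
    6 NA NA

Only Method 3) yields the expected result (in the variant with `| is.na(.)` from the answer below; the plain `. != 1` filter also drops rows 5 and 6, because filter() discards rows whose condition evaluates to NA):

       x  y
    1  2  4
    2  3 NA
    3 NA NA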

Edit 3:

See @Jonas's comment:

To make both methods work, you can add na.rm = TRUE to the rowSums and any calls. By default this option is set to na.rm = FALSE (see the documentation).
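As a rough sketch of what that comment suggests, here are Methods 1 and 2 with na.rm = TRUE applied to the sample df from Edit 2. The names has_one_rowsums and has_one_apply are made up for this illustration, and the apply call runs over all columns rather than a myCols subset.

```r
# Sample data from Edit 2, including NAs.
df <- data.frame(x = c(1, NA, 2, 2, 3, NA), y = c(NA, 1, 1, 4, NA, NA))

# Method 2 with na.rm = TRUE: NA comparisons are skipped when counting 1s per row.
has_one_rowsums <- rowSums(df == 1, na.rm = TRUE) > 0

# Method 1 with na.rm = TRUE: any() ignores NA instead of returning NA.
has_one_apply <- apply(df, 1, function(x) any(x == 1, na.rm = TRUE))

# Both vectors flag the same rows (names are dropped before comparing).
same_flags <- identical(unname(has_one_rowsums), unname(has_one_apply))

# Keep the rows without a 1; rows that contain only NA or non-1 values survive.
result <- df[!has_one_rowsums, ]
print(result)
```

Base R subsetting keeps the original row names (4, 5, 6 here), whereas the dplyr output shown in Edit 2 renumbers them from 1.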

Serkan

Another possible answer is to use the tidyverse:

    require(tidyverse)

    df %>% 
        filter(
            across(everything(), ~ . != 1)
        )

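This keeps the rows in which every variable in the data frame differs from 1.

Note: if your data contains NAs, this method also removes the rows that contain them. I therefore suggest the following extension: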

    df %>% 
        filter(
            across(everything(), ~ . != 1 | is.na(.))
        )

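This keeps every row whose values all differ from 1 without dropping rows that contain NA. Otherwise you may remove rows that should have been kept (depending on what you are doing).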
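The reason the extension matters can be seen on a single vector in base R. This is only an illustrative sketch; x, plain and extended are hypothetical names, not part of the answer's code.

```r
# One value that differs from 1, one NA, and one value equal to 1.
x <- c(2, NA, 1)

# Comparing NA with anything yields NA, not TRUE or FALSE.
plain <- x != 1

# OR-ing with is.na() turns that NA into TRUE, so the entry counts as "keep".
extended <- x != 1 | is.na(x)

print(plain)
print(extended)
```

filter() keeps only the rows where the condition is TRUE, so an NA in the plain comparison causes the row to be dropped, while the extended condition lets it through.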

Comparing run times

Using Jonas's example, I tried to benchmark all the solutions.

    # Using rowSums
    removeRowsWithOnes <- function(df) {
        rowsToRemoveIndices <- rowSums(df == 1) > 0
        return(df[!rowsToRemoveIndices,])
    }

    # Using apply
    removeRowsWithOnes2 <- function(df) {
        df[!apply(df, 1, function(x) any(x == 1)),]
    }

    # Using tidyverse
    removeRowsWithOnes3 <- function(df) {
        df %>% 
            filter(
                across(everything(), ~ . != 1 | is.na(.))
            )
    }

Benchmark

    n <- 1e5
    set.seed(5555)
    bigSampleData <- do.call("cbind", lapply(LETTERS, function(nam)
        setNames(data.frame(sample(1:1000, n, replace = TRUE)), nam)))
    microbenchmark::microbenchmark(
        removeRowsWithOnes(bigSampleData),
        removeRowsWithOnes2(bigSampleData),
        removeRowsWithOnes3(bigSampleData),
        times = 10
    )

Results

    Unit: milliseconds
                                   expr       min        lq      mean    median        uq      max neval cld
      removeRowsWithOnes(bigSampleData)  35.57471  40.54827  77.64570  41.06107  60.34422 217.3363    10  b 
     removeRowsWithOnes2(bigSampleData) 217.34171 222.35136 227.90565 227.05570 229.02625 240.9274    10   c
     removeRowsWithOnes3(bigSampleData)  17.42338  22.24363  23.34607  22.88563  23.72934  32.0293    10 a
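Timings like these only compare like with like if the functions return the same rows. A small sanity check along those lines might look as follows. It is a sketch under assumptions: it covers only the two base R functions so that it runs without dplyr, it uses a smaller NA-free sample than the benchmark, and smallSampleData, res1 and res2 are names introduced here.

```r
# The two base R functions from the benchmark, repeated so the sketch is self-contained.
removeRowsWithOnes <- function(df) {
    rowsToRemoveIndices <- rowSums(df == 1) > 0
    df[!rowsToRemoveIndices, ]
}
removeRowsWithOnes2 <- function(df) {
    df[!apply(df, 1, function(x) any(x == 1)), ]
}

# Smaller NA-free sample with the same 26 columns (n = 1000 is an arbitrary choice).
n <- 1000
set.seed(5555)
smallSampleData <- as.data.frame(
    setNames(lapply(LETTERS, function(nam) sample(1:1000, n, replace = TRUE)), LETTERS)
)

res1 <- removeRowsWithOnes(smallSampleData)
res2 <- removeRowsWithOnes2(smallSampleData)

# TRUE if both approaches kept exactly the same rows and values.
same_result <- isTRUE(all.equal(res1, res2))
print(same_result)
```

On data containing NAs the two functions would need na.rm = TRUE before such a comparison is meaningful.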
