Although the answer to my second question solves the MWE perfectly, with my real-world data I need to do something different, and I wonder if anyone can help.
So, this time my starting point is a data frame (named plusminus_df) made of 5 elements (which could actually be anywhere from 1 to n), in the following format:
> markers=LETTERS[1:5]
> plusminus_df <- expand.grid(lapply(seq(markers), function(x) c("+","-")), stringsAsFactors=FALSE)
> names(plusminus_df)=LETTERS[1:5]
> head(plusminus_df)
A B C D E
1 + + + + +
2 - + + + +
3 + - + + +
4 - - + + +
5 + + - + +
6 - + - + +
So this is just the data frame of all the +/- combinations of the 5 markers (note this is a variable number). At this point, I need to extract the internal lower-order combinations of 1, 2, 3 and 4 markers (note these are variable numbers too), keeping the same data frame structure (which in this case means I need to include NAs).
So my expected result would be something like this:
> final_df
A B C D E
1 + <NA> <NA> <NA> <NA>
2 - <NA> <NA> <NA> <NA>
3 + - <NA> <NA> <NA>
4 - - <NA> <NA> <NA>
5 + + <NA> <NA> <NA>
6 - + <NA> <NA> <NA>
7 + - - <NA> <NA>
8 - - - <NA> <NA>
9 + + + <NA> <NA>
10 - + + <NA> <NA>
11 + - + <NA> <NA>
12 - - + <NA> <NA>
13 + + - <NA> <NA>
14 - + - <NA> <NA>
15 + - - - <NA>
16 - - - - <NA>
17 + + + + <NA>
...
n + + + + +
n+1 - + + + +
n+2 + - + + +
n+3 - - + + +
n+4 + + - + +
n+5 - + - + +
...
That is, all the possible combinations using 1 marker (+ and -), then 2, 3 and 4 markers, and finally 5 markers (same as the original), with the unused markers filled with NA.
So the answer to my second question nicely built the desired final data frame from scratch, using only the original markers vector. But in my real-world case I actually start from a filtered list of 5-marker combinations retrieved in the plusminus_df form shown above... What would be the most direct and efficient way to obtain the desired data frame from it, without dealing with messy nested loops?
I'm not sure I understand exactly what you are looking for, but judging from the second question it seems you want all cross-combinations of the columns within a data.frame.
Disclaimer: the two answers already provided are more readable; here I focus on speed.
When you perform what is usually referred to as a cross-join (or full outer join), efficiency quickly becomes a problem as n grows. To gain efficiency, it helps to break the problem down into smaller sub-problems and find a solution for each. Since we need all unique combinations over the column value sets including the null set (value = NA), we can reduce this problem to two sub-problems: finding the unique values of each column (null set included), and cross-joining those sets.
Using this idea, we can quickly put together a simple solution with expand.grid, unique, and lapply. The only tricky part is including the null set, but we can do this by indexing the rows of the data.frame with an NA in addition to all of its row indices, which yields an extra all-NA row.
# Create null-set-included data.frame
nullset_df <- plusminus_df[c(NA, seq_len(nrow(plusminus_df))), ]
# Find all unique elements, including null set
unique_df <- lapply(nullset_df, unique)
# Combine all unique sets
expand.grid(unique_df)
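To make the mechanics concrete: for the 5-column plusminus_df from the question, each column's unique set becomes c(NA, "+", "-"), so the cross-join has 3^5 = 243 rows, covering the 2^5 = 32 original rows plus every NA-padded sub-combination. A small self-contained check (rebuilding the question's input; column names Var1..Var5 here are just expand.grid defaults, not the question's A..E):

```r
# Rebuild the question's data frame of all +/- combinations over 5 markers
plusminus_df <- expand.grid(rep(list(c("+", "-")), 5), stringsAsFactors = FALSE)

# Indexing with NA in front of all row indices prepends an all-NA row
nullset_df <- plusminus_df[c(NA, seq_len(nrow(plusminus_df))), ]

# Each column now contributes three unique values: NA, "+", "-"
unique_df <- lapply(nullset_df, unique)
lengths(unique_df)   # 3 3 3 3 3

# Cross-join of the unique sets: 3^5 = 243 rows
result <- expand.grid(unique_df, stringsAsFactors = FALSE)
nrow(result)         # 243
```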
Or, as a function:
nullgrid.expand <- function(df, ...)
expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)
This is quite fast (benchmarks and performance plots are compared later), but I'd like to go one step further. The data.table package is known for its high-performance functions, one of which is CJ, which performs a cross-join. Below is an implementation using CJ:
library(data.table)
nullgrid.expand.dt <- function(df, ...)
do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
sorted = FALSE,
unique = TRUE))
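As a sanity check (a sketch, reusing the question's plusminus_df construction), both functions should produce the same 243 unique rows, just possibly in a different order; data.table's fsetequal compares the two results while ignoring row order:

```r
library(data.table)

nullgrid.expand <- function(df, ...)
  expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)

nullgrid.expand.dt <- function(df, ...)
  do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
                       sorted = FALSE,
                       unique = TRUE))

# The question's input: all +/- combinations over 5 markers
plusminus_df <- expand.grid(rep(list(c("+", "-")), 5), stringsAsFactors = FALSE)

a <- as.data.table(nullgrid.expand(plusminus_df, stringsAsFactors = FALSE))
b <- nullgrid.expand.dt(plusminus_df)

fsetequal(a, b)   # TRUE: same set of rows regardless of order
```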
This function requires vector inputs, forcing one to use do.call (or similar), which makes the function slightly less readable. But does the performance improve? To test this, I ran a microbenchmark on the two functions above and on the functions provided in the existing answers (code below); the results are visualized in the violin plot below:
From the plot it is quite clear that @paul's answer outperforms @ekoam's answer, but the two functions above both outperform the provided answers in terms of speed. However, the question said that the input might have any number of dimensions, so there is also the question of how well each function scales with the number of columns and the number of unique values (here we only have "+" and "-", but what if we had more?). For this I reran the benchmark for n_columns = 3, 4, ..., 10
and n_values = 2, 4, ..., 10
. The two results are visualized with smooth curves below.
First we'll visualize the time as a function of number of columns. Note that the y
axis is on logarithmic scale (base 10) for easier comparison.
From the visualization it is quite clear that, with increasing number of columns, the choice of method becomes very important. The suggestion by @ekoam becomes very slow, primarily because it delays a call to unique
till the very end. The remaining 3 methods are all much faster, while nullgrid.expand.dt
becomes more than 10 times faster in comparison to the remaining methods once we get more than 8 columns of data.
Next, let's look at the timing as a function of the number of unique values in each column (number of columns fixed at 5).
Again we see a similar picture. Except for a single possible outlier for nullgrid.expand
, which seems to become slower than @paul's answer as the number of unique values increases, we see that nullgrid.expand.dt
remains the fastest, although here it seems to only be saving (exp(4) - exp(3.6)) / exp(3.6) ~ 50 %
(or twice as fast) compared to @paul's answer by the time we reach 10 unique values.
Please note that I did not have enough RAM to run the benchmark for number of unique values or columns greater than the ones shown.
We have seen that there are many ways to solve the problem, but as the number of columns and unique values grows, the choice of method becomes ever more important. By using optimized libraries, we can greatly reduce the time required to obtain the cross-join of all column values, with minimal effort. With Rcpp we could likely reduce the runtime even further, but that is beyond the scope of my answer.
# Setup:
set.seed(1234)
library(tidyverse)
library(data.table)
library(microbenchmark)
nullgrid.expand <- function(df, ...)
expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)
nullgrid.expand.dt <- function(df, ...)
do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
sorted = FALSE,
unique = TRUE))
markers=LETTERS[1:5]
plusminus_df <- expand.grid(lapply(seq(markers), function(x) c("+","-")), stringsAsFactors=FALSE)
names(plusminus_df)=LETTERS[1:5]
bm <- microbenchmark(
nullgrid.expand = nullgrid.expand(plusminus_df),
nullgrid.expand.dt = nullgrid.expand.dt(plusminus_df),
ekoam = unique(bind_rows(apply(
plusminus_df, 1L,
function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
))),
paul = {
plusminus_df %>%
add_row() %>%
map(unique) %>%
expand.grid()
},
control = list(warmup = 5)
)
library(ggplot2)
autoplot(bm) + ggtitle('comparison between cross-join')
time_function <- function(n = 5, j = 2){
idx <- seq_len(n)
df <- do.call(CJ, args = c(lapply(idx, function(x) as.character(seq_len(j))),
sorted = FALSE,
unique = TRUE))
names(df) <- as.character(idx)
microbenchmark(
nullgrid.expand = nullgrid.expand(df),
nullgrid.expand.dt = nullgrid.expand.dt(df),
ekoam = unique(bind_rows(apply(
df, 1L,
function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
))),
paul = {
df %>%
add_row() %>%
map(unique) %>%
expand.grid()
},
times = 10,
control = list(warmup = 5)
)
}
res <- lapply(seq(3, 10), time_function)
for(i in seq_along(res)){
res[[i]]$n <- seq(3, 10)[i]
}
ggplot(rbindlist(res), aes(x = n, y = log(time / 10^4, base = 10), col = expr)) +
geom_smooth(se = FALSE) +
ggtitle('time-comparison given number of columns') +
labs(y = 'log(ms)', x = 'n')
ggsave('so_2.png')
res <- lapply(c(seq(2, 10, 2)), time_function, n = 5)
for(i in seq_along(res)){
res[[i]]$n <- seq(2, 10, 2)[i]
}
ggplot(rbindlist(res), aes(x = n, y = log(time / 10^4, base = 10), col = expr)) +
geom_smooth(se = FALSE) +
ggtitle('time-comparison given number of unique values') +
labs(y = 'log(ms)', x = 'n unique values per column')
ggsave('so_3.png')