如何从数字字符中选择最大数值？

debugcn 发表于 Dev

DN1

我有一个按Gene列分组的数据集。分组到每一行的一些值只是.,为了删除它们，所以每行和每列只保留几个数字字符。

为此编码：

#Group by Gene:
data <- setDT(df2)[, lapply(.SD, paste, collapse = ", "), by = Genes]

#Remove ., from anywhere in the dataframe
dat <- data.frame(lapply(data, function(x) {
  gsub("\\.,|\\.$|\\,$|(, .$)", "", x)
}))

删除之前.,和分组之后的数据Gene如下：

Gene    col1                     col2                  col3           col4
ACE     0.3, 0.4, 0.5, 0.5       .                      ., ., .        1, 1, 1, 1, 1
NOS2    ., .                     .                      ., ., ., .     0, 0, 0, 0, 0
BRCA1   .                                               ., .           1, 1, 1, 1, 1
HER2    .                        0.1, ., .,  0.2, 0.1   .              1, 1, 1, 1, 1

删除.,我的数据后看起来像：

Gene    col1                 col2               col3     col4
ACE     0.3, 0.4, 0.5, 0.5                               1, 1, 1, 1, 1
NOS2                                                     0, 0, 0, 0, 0
BRCA1                                                    1, 1, 1, 1, 1
HER2                         0.1,      0.2, 0.1          1, 1, 1, 1, 1

我现在正在尝试选择每行和每列的最小值或最大值。

预期示例输出：

Gene    col1                 col2            col3    col4
ACE     0.5                                           1
NOS2                                                  0
BRCA1                                                 1
HER2                          0.1                     1

#For col1 I need the max value per row (so for ACE 0.5 is selected)
#For col2 I need the min value per row

需要注意的是，我的实际数据是100列和20,000行-不同的列需要每个选定基因的最大值或最小值。

然而，随着我使用的代码，我只得到了预期的输出col4和我的其他列重复选择的值的两倍（我让我0.5, 0.5和0.1, 0.1我想不通为什么）。

我用来选择最小值/最大值的代码是：

#Max value per feature and row
max2 = function(x) if(all(is.na(x))) NA else max(x,na.rm = T)
getmax = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x)max2(as.numeric(x)) ) %>%
  unlist() 

#Min value per feature and row
min2 = function(x) if(all(is.na(x))) NA else min(x,na.rm = T)
getmin = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x)min2(as.numeric(x)) ) %>%
  unlist() 

data <- dt %>%
  mutate_at(names(dt)[2],getmax)

data <- dt %>%
  mutate_at(names(dt)[3],getmin)

data <- dt %>%
  mutate_at(names(dt)[4],getmax)

这些选择功能为什么不适用于我的所有列？所有列都是字符类。我也想知道我是否甚至需要删除.,，是否可以直接跳到选择每行和每列的最大/最小值？

输入数据示例：

structure(list(Gene = c("ACE", "NOS2", "BRCA1", "HER2"), col1 = c("0.3, 0.4, 0.5, 0.5", 
"", "", ""), col2 = c("", "", "", "  0.1,      0.2 0.,1"), col3 = c(NA, 
NA, NA, NA), col4 = c("                         1, 1, 1, 1, 1", 
"                                     0, 0, 0, 0, 0", "                                     1, 1, 1, 1, 1", 
"     1, 1, 1, 1, 1")), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"))

ekoam

您可以使用type.convert并将其参数设置na.strings为"."。您可能还想使用该range功能一次拍摄最小和最大。

假设你data.table看起来像这样

> dt
    Gene               col1                 col2       col3          col4
1:   ACE 0.3, 0.4, 0.5, 0.5                    .    ., ., . 1, 1, 1, 1, 1
2:  NOS2               ., .                    . ., ., ., . 0, 0, 0, 0, 0
3: BRCA1                  .                            ., . 1, 1, 1, 1, 1
4:  HER2                  . 0.1, ., .,  0.2, 0.1          . 1, 1, 1, 1, 1

考虑这样的功能

library(data.table)
library(stringr)

get_range <- function(x) {
  x <- type.convert(str_split(x, ",\\s+", simplify = TRUE), na.strings = ".")
  x <- t(apply(x, 1L, function(i) {
    i <- i[!is.na(i)]
    if (length(i) < 1L) c(NA_real_, NA_real_) else range(i)
  }))
  dimnames(x)[[2L]] <- c("min", "max")
  x
}

那你就可以

dt[, c(Gene = .(Gene), lapply(.SD, get_range)), .SDcols = -"Gene"]

输出量

    Gene col1.min col1.max col2.min col2.max col3.min col3.max col4.min col4.max
1:   ACE      0.3      0.5       NA       NA       NA       NA        1        1
2:  NOS2       NA       NA       NA       NA       NA       NA        0        0
3: BRCA1       NA       NA       NA       NA       NA       NA        1        1
4:  HER2       NA       NA      0.1      0.2       NA       NA        1        1

请注意，Gene由于该函数get_range已经矢量化，因此无需这样做。

本文收集自互联网，转载请注明来源。

如有侵权，请联系[email protected] 删除。