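For each row of a data frame, I want to find the second most frequently occurring value and the least frequently occurring value. How can I do this?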
df:
label v1 v2 v3 v4 v5 v6
5 3 3 3 6 6 8
5 7 1 1 1 7 0
5 3 5 6 6 6 5
I want to consider all columns except "label".
Expected output:
second most occurring   least occurring
6                       8
7                       0
5                       3
Edit: I updated the example after the answer was accepted, to reduce confusion.
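Here is another dplyr solution that is more readable and copes with NAs and with several values tied for second largest. It also lets you use dplyr's selection syntax to pick multiple columns.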
library(dplyr)
dat <- read.table(text = 'label v1 v2 v3 v4 v5 v6
5 3 3 3 2 2 1
5 2 1 1 1 2 0
5 3 5 6 6 6 5', header = TRUE)
second_largest <- function(x, na.rm = TRUE) {
  if (na.rm) { x <- x[!is.na(x)] }            # drop NA values
  # dense_rank() ranks ascending (smallest = 1), so rank desc(x) to count from the largest
  candidates <- x[dense_rank(desc(x)) == 2]   # all values tied for second largest
  if (length(candidates) == 0) { return(NA) } # no second-largest value, e.g. all values equal
  return(max(candidates))                     # keep one value out of the tied candidates
}
df <- dat %>%
  mutate(
    second_largest = select(., v1:v6) %>% apply(1, second_largest, na.rm = TRUE), # apply second_largest to every row
    min = select(., v1:v6) %>% apply(1, min, na.rm = TRUE)                        # apply min to every row
  )
#   label v1 v2 v3 v4 v5 v6 second_largest min
# 1     5  3  3  3  2  2  1              2   1
# 2     5  2  1  1  1  2  0              1   0
# 3     5  3  5  6  6  6  5              5   3
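If you want to cross-check the second-largest column without dplyr, something like the following should work. It is only a minimal base R sketch, assuming "second largest" means the second largest distinct value in a row; `second_largest_base` is a hypothetical name, not part of the answer above.

```r
# Hypothetical helper (not from the answer): second largest distinct value, base R only.
second_largest_base <- function(x) {
  distinct <- sort(unique(x), decreasing = TRUE)  # sort() drops NA by default
  if (length(distinct) < 2) return(NA)            # no second-largest value exists
  distinct[2]
}

second_largest_base(c(3, 3, 3, 2, 2, 1))  # row 1 of dat -> 2
second_largest_base(c(3, 5, 6, 6, 6, 5))  # row 3 of dat -> 5
second_largest_base(c(4, 4, NA))          # only one distinct value -> NA
```

Working on distinct values sidesteps the tie question entirely, since duplicates are collapsed before picking the second element.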
One thing to note:
In the apply() call, the 1 means the function is applied to each row.
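To make the margin argument concrete, here is a small sketch on a made-up matrix (the first two rows of the example data, chosen only for illustration):

```r
# Two rows, six columns; values taken from rows 1 and 2 of the example data.
m <- matrix(c(3, 3, 3, 2, 2, 1,
              2, 1, 1, 1, 2, 0), nrow = 2, byrow = TRUE)

row_min <- apply(m, 1, min)  # MARGIN = 1: one result per row    -> 1 0
col_min <- apply(m, 2, min)  # MARGIN = 2: one result per column -> 2 1 1 1 2 0
```

With MARGIN = 1 you get one value per row, which is what lets the result be stored as a new column.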
Update
If you want the second most frequent value instead, you can simply plug in a new function.
second_most_frequent <- function(x, is_numeric = TRUE) {
  out <- x %>%
    table() %>%                                 # create a frequency table (values become character names)
    as.data.frame(stringsAsFactors = FALSE) %>%
    arrange(desc(Freq)) %>%                     # sort by frequency, descending
    .[, 1] %>%                                  # select the first column (the values)
    .[2]                                        # take the second most frequent (WARNING: doesn't check for ties)
  if (is_numeric) { out <- as.numeric(out) }
  return(out)
}
df <- df %>%
  mutate(
    second_most_freq = select(., v1:v6) %>% apply(1, second_most_frequent, is_numeric = TRUE)
  )
#   label v1 v2 v3 v4 v5 v6 second_largest min second_most_freq
# 1     5  3  3  3  2  2  1              2   1                2
# 2     5  2  1  1  1  2  0              1   0                2
# 3     5  3  5  6  6  6  5              5   3                5
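The question also asks for the least frequent value, and the function above does not check for ties. A possible way to cover both is sketched below in base R. `frequency_summary` is a hypothetical helper, not part of the original answer, and it assumes numeric input and that you want every tied value returned rather than a single one.

```r
# Hypothetical helper: second most frequent and least frequent value(s) of a vector.
frequency_summary <- function(x) {
  freq   <- table(x)                  # table() drops NA by default
  values <- as.numeric(names(freq))   # distinct values, ascending
  counts <- as.vector(freq)           # how often each value occurs
  count_levels <- sort(unique(counts), decreasing = TRUE)  # distinct counts, high to low
  list(
    second_most = if (length(count_levels) >= 2) values[counts == count_levels[2]] else NA,
    least       = values[counts == min(counts)]
  )
}

# Rows v1..v6 from the question's example data
rows <- list(c(3, 3, 3, 6, 6, 8), c(7, 1, 1, 1, 7, 0), c(3, 5, 6, 6, 6, 5))
second_most <- sapply(rows, function(r) frequency_summary(r)$second_most)  # 6 7 5
least       <- sapply(rows, function(r) frequency_summary(r)$least)        # 8 0 3

# With ties, all tied values come back:
frequency_summary(c(1, 1, 1, 2, 2, 3, 3, 4))$second_most  # 2 3
```

On the question's data this reproduces the expected output. When ties are possible the result can have more than one element per row, so you would need to decide how to collapse it (for example with max() or by taking the first element) before storing it in a column.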