I have a data set with many columns. For example, here are three of them:
df<-structure(list(V1 = structure(c(5L, 1L, 7L, 3L, 2L, 4L, 6L, 6L
), .Label = c("CPSIAAAIAAVNALHGR", "DLNYCFSGMSDHR", "FPEHELIVDPQR",
"IADPDAVKPDDWDEDAPSK", "LWADHGVQACFGR", "WGEAGAEYVVESTGVFTTMEK",
"YYVTIIDAPGHR"), class = "factor"), V2 = structure(c(5L, 2L,
7L, 3L, 4L, 6L, 1L, 1L), .Label = c("", "CPSIAAAIAAVNALHGR",
"GCITIIGGGDTATCCAK", "HVGPGVLSMANAGPNTNGSQFFICTIK", "LLELGPKPEVAQQTR",
"MVCCSAWSEDHPICNLFTCGFDR", "YYVTIIDAPGHR"), class = "factor"),
V3 = structure(c(4L, 3L, 2L, 4L, 3L, 1L, 1L, 1L), .Label = c("",
"AVCMLSNTTAIAEAWAR", "DLNYCFSGMSDHR", "FPEHELIVDPQR"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -8L))
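- For the first column, we ignore all the other columns: we just keep its unique strings and count how many there are.
- For the second column, we keep the unique strings and remove those that already appear in the first column.
- For the third column, we keep the unique strings and remove those that appear in the first or second column.

This continues for as many columns as we have.
For example, for this data we would get the following: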
Column 1                Column 2                      Column 3
LWADHGVQACFGR
CPSIAAAIAAVNALHGR       LLELGPKPEVAQQTR               AVCMLSNTTAIAEAWAR
YYVTIIDAPGHR            GCITIIGGGDTATCCAK
FPEHELIVDPQR            HVGPGVLSMANAGPNTNGSQFFICTIK
DLNYCFSGMSDHR           MVCCSAWSEDHPICNLFTCGFDR
IADPDAVKPDDWDEDAPSK
WGEAGAEYVVESTGVFTTMEK
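The rule described above can be written out as an explicit loop, which may make the intent easier to check before reaching for a packaged solution. This is only a minimal sketch on a small made-up data frame: the names `toy`, `seen` and `kept` are hypothetical and do not come from the question, and the columns are assumed to be character vectors with empty strings marking blanks.

```r
# Hypothetical toy data: short strings instead of the peptide sequences.
toy <- data.frame(
  V1 = c("a", "b", "c", "c"),
  V2 = c("d", "b", "e", ""),
  V3 = c("c", "f", "", ""),
  stringsAsFactors = FALSE
)

seen <- character(0)  # strings already claimed by earlier columns
kept <- list()        # surviving strings, one element per column

for (col in names(toy)) {
  vals <- unique(toy[[col]])                  # unique within this column
  vals <- vals[vals != "" & !vals %in% seen]  # drop blanks and earlier strings
  kept[[col]] <- vals
  seen <- c(seen, vals)                       # later columns must avoid these
}

counts <- lengths(kept)  # number of surviving strings per column
```

Here `kept` holds `a, b, c` for `V1`, `d, e` for `V2` and `f` for `V3`, so `counts` is 3, 2 and 1. The same loop should apply to the real data once its factor columns are converted to character.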
Here is a solution using tidyverse:
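library(tidyverse)
df1 <- df %>%
  gather(var, string) %>%                         # long format, stacked column by column
  filter(string != '' & !duplicated(string)) %>%  # drop blanks and strings seen earlier
  group_by(var) %>%
  mutate(cnt = seq(n())) %>%                      # row index within each column
  spread(var, string) %>%                         # back to wide, padded with NA
  select(-cnt)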
This gives the following (the printout still shows the helper column cnt, i.e. the result before the final select(-cnt)):
# A tibble: 7 x 4
    cnt V1                    V2                          V3
* <int> <chr>                 <chr>                       <chr>
1     1 LWADHGVQACFGR         LLELGPKPEVAQQTR             AVCMLSNTTAIAEAWAR
2     2 CPSIAAAIAAVNALHGR     GCITIIGGGDTATCCAK           <NA>
3     3 YYVTIIDAPGHR          HVGPGVLSMANAGPNTNGSQFFICTIK <NA>
4     4 FPEHELIVDPQR          MVCCSAWSEDHPICNLFTCGFDR     <NA>
5     5 DLNYCFSGMSDHR         <NA>                        <NA>
6     6 IADPDAVKPDDWDEDAPSK   <NA>                        <NA>
7     7 WGEAGAEYVVESTGVFTTMEK <NA>                        <NA>
You can use colSums to get the number of strings in each column:
colSums(!is.na(df1))
#V1 V2 V3
# 7 4 1
A similar approach in base R, which keeps the strings in a list, is:
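df[] <- lapply(df, as.character)                     # factor columns to character
d1 <- stack(df)                                      # long format: values, ind
d1 <- d1[d1$values != '' & !duplicated(d1$values),]  # drop blanks and repeats
l1 <- unstack(d1, values ~ ind)                      # list of strings per column
lengths(l1)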
#V1 V2 V3
# 7 4 1
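If a rectangular table like the one in the question is wanted rather than a list, the list elements can be padded with NA to a common length. The sketch below uses a small hypothetical stand-in for `l1` (the names `n` and `padded` are also made up for illustration), so it runs on its own.

```r
# Hypothetical stand-in for the list l1 built above.
l1 <- list(V1 = c("a", "b", "c"), V2 = c("d", "e"), V3 = "f")

n <- max(lengths(l1))  # length of the longest column

# Indexing past the end of a vector yields NA, which pads the short columns.
padded <- as.data.frame(
  lapply(l1, function(x) x[seq_len(n)]),
  stringsAsFactors = FALSE
)

counts <- colSums(!is.na(padded))  # same counts as lengths(l1)
```

Counting the non-missing cells of `padded` with colSums should then agree with `lengths(l1)`.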