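I have a very large dataset, and I want to check whether a list contains the same string value more than once. Consider the following data frame: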
df1 <- structure(list(id = 1:3,
book_id = c("[\"19167120\",\"book\", \"237494310\",\"195166798\",\"book\",\"book.a\"]",
"[\"19167120\",\"237494310\",\"story\",\"book\",\"19167120\"]", "[]")),
.Names = c("id", "book_id"),
class = "data.frame",
row.names = c(NA, -3L))
which looks like this:
id book_id
1 1 ["19167120","book", "237494310","195166798","book","book.a"]
2 2 ["19167120","237494310","story","book","19167120"]
3 3 []
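What I want to do is check whether any list in book_id contains a string value more than once. For example, the string value "19167120" is repeated in the second row. I then want to extract the repeated values for each cell and remove the extra copies from each cell.
Output: two separate data frames: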
id book_id duplicate
1: 1 ["19167120", "book", "237494310", "195166798", "book", "book.a"] "book"
2: 2 ["19167120", "237494310", "story", "book", "19167120"] "19167120"
3: 3 [] 0
id book_id
1: 1 ["19167120", "book", "237494310", "195166798", "book.a"]
2: 2 ["19167120", "237494310", "story", "book"]
3: 3 []
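I know I should use anyDuplicated() and unique() to get my answer, but I have been working with them and could not solve the problem.
Edit: Gregor's first suggestion gives the result below, but I would appreciate it if someone could help me get the output I described above: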
id book_id
1: 1 "19167120"
2: 1 "237494310"
3: 1 "195166798"
4: 2 "19167120"
5: 2 "237494310"
6: 2 "19167120"
> unique(df1)
id book_id
1: 1 "19167120"
2: 1 "237494310"
3: 1 "195166798"
4: 2 "19167120"
5: 2 "237494310"
> duplicated(df1)
[1] FALSE FALSE FALSE FALSE FALSE TRUE
Here is an alternative approach, based on the idea of starting with a "long" dataset and working from there.
Here is your long dataset.
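library(splitstackshape)
# Split book_id on commas into one row per value ("long" format),
# then strip the square brackets from each value.
x <- cSplit(df1, "book_id", ",", "long")[, book_id := gsub(
  "[][]", "", book_id)]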
Here we add a "duped" column holding the duplicated values:
x[, duped := paste(unique(book_id[duplicated(book_id)],
collapse = ", ")), by = id]
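To see what the duplicated()/unique() combination does inside each group, here is a minimal base-R sketch on plain vectors. The vectors v and v2 are made up for illustration; v mirrors the second row of the question's data.

```r
# Values from row 2 of the question (brackets and quotes removed).
v <- c("19167120", "237494310", "story", "book", "19167120")

# duplicated() flags the second and later occurrences of a value.
dup_flags <- duplicated(v)

# Indexing by those flags and taking unique() lists each repeated value once.
repeated <- unique(v[dup_flags])

# When several distinct values repeat, passing collapse to paste()
# joins them into a single string.
v2 <- c("a", "b", "a", "b", "a")
joined <- paste(unique(v2[duplicated(v2)]), collapse = ", ")
```

Here repeated is "19167120" and joined is "a, b". Note that collapse is an argument of paste(), so it has to sit inside the paste() call to have any effect.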
Now we can easily create your first desired output:
dupedX <- x[, list(book_id = sprintf("[%s]", paste(book_id, collapse = ", ")),
duped = paste(unique(duped), collapse = ", ")), by = id]
dupedX
# id book_id duped
# 1: 1 ["19167120", "237494310", "195166798"] NA
# 2: 2 ["19167120", "237494310", "19167120"] "19167120"
# 3: 3 [] NA
And your second:
uniqueX <- x[, list(book_id = sprintf(
"[%s]", paste(unique(book_id), collapse = ", "))), by = id]
uniqueX
# id book_id
# 1: 1 ["19167120", "237494310", "195166798"]
# 2: 2 ["19167120", "237494310"]
# 3: 3 []
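For comparison, both outputs can also be sketched in base R without reshaping to long format first. This is an illustrative alternative rather than the approach above: the helpers parse_cell() and format_cell() are hypothetical names, and an empty string (not the 0 requested in the question) marks rows that have no duplicates.

```r
df1 <- data.frame(
  id = 1:3,
  book_id = c('["19167120","book", "237494310","195166798","book","book.a"]',
              '["19167120","237494310","story","book","19167120"]',
              '[]'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn one "[...]" cell into a character vector.
parse_cell <- function(cell) {
  trimws(strsplit(gsub("[][]", "", cell), ",")[[1]])
}

# Hypothetical helper: rebuild the "[...]" string from a vector of items.
format_cell <- function(items) {
  sprintf("[%s]", paste(items, collapse = ", "))
}

items <- lapply(df1$book_id, parse_cell)

# First output: the original list plus the values that occur more than once.
with_dupes <- data.frame(
  id = df1$id,
  book_id = vapply(items, format_cell, character(1)),
  duplicate = vapply(
    items,
    function(x) paste(unique(x[duplicated(x)]), collapse = ", "),
    character(1)
  ),
  stringsAsFactors = FALSE
)

# Second output: each list with the repeated values removed.
deduped <- data.frame(
  id = df1$id,
  book_id = vapply(items, function(x) format_cell(unique(x)), character(1)),
  stringsAsFactors = FALSE
)
```

Working cell by cell keeps the code short, but on a very large dataset the grouped data.table approach above may scale better. That is an expectation, not something measured here.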