Subsetting a data frame based on a list column


Chris Ruehlemann

I have a data frame with speech data in its first column, Turn:

test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
          "she'd've been so happy cos with all this stuff goin' on",
          "but we're in this together, because y' know things happens",
          "so you can't, cos well, ah because you know why!",
          "not now because it's too late!"), stringsAsFactors = FALSE)

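I want to subset the data frame to those rows in which "cos" and/or "because" is preceded by at least four words. To that end, I compute the positions of "cos" and "because" within each Turn: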

test$Index <- sapply(strsplit(test$Turn, " "), function(x) which(x == 'cos'|x == 'because'))
test
                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6
4           so you can't, cos well, ah because you know why!  4, 7
5                             not now because it's too late!     3

One row contains more than one index, so Index is a list column. That is why my attempt at subsetting fails:

test[test$Index >= 5,]
Error in `[.data.frame`(test, test$Index >= 5, ) : 
  (list) object cannot be coerced to type 'double'
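
The comparison fails because sapply() cannot simplify results of unequal length, so Index ends up as a list column rather than a numeric vector. A minimal sketch for inspecting this, rebuilding the example data from above so that it runs on its own:

```r
# Rebuild the example data from the question.
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"), stringsAsFactors = FALSE)

# Positions of "cos"/"because" per row; row 4 has two hits,
# so sapply() returns a list instead of a vector.
test$Index <- sapply(strsplit(test$Turn, " "),
                     function(x) which(x == "cos" | x == "because"))

is.list(test$Index)   # is the column a list?
lengths(test$Index)   # number of hits per row
```

Any row for which lengths() reports more than one hit is a row that a plain numeric comparison cannot handle.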

How can test be subset while ignoring the second listed Index value?

Expected result:

test
                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6

I'd be grateful for any answers, including ones that skip the detour via indices and subset with a regex pattern instead.

EDIT:

A solution within the sapply paradigm is actually quite simple: select only the first value of each list element:

sapply(test$Index, function(x) x[1])
[1] 8 5 6 4 3
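
Building on that, the first index per row can be used directly in the row filter. This is a sketch under the assumption that only the first occurrence of "cos"/"because" matters; the names first_idx and result are illustrative and do not appear in the original post:

```r
# Rebuild the example data and the list column from the question.
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"), stringsAsFactors = FALSE)
test$Index <- sapply(strsplit(test$Turn, " "),
                     function(x) which(x == "cos" | x == "because"))

# First keyword position per row; vapply() guarantees an integer vector.
# A row without any keyword would yield NA here.
first_idx <- vapply(test$Index, function(x) x[1], integer(1))

# which() drops NA comparisons, so rows without a keyword are simply excluded.
result <- test[which(first_idx >= 5), ]
result
```

vapply() is used instead of sapply() so that the result type is fixed even for unexpected input.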

Wiktor Stribiżew

I hope this gives you an idea:

test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
          "she'd've been so happy cos with all this stuff goin' on",
          "but we're in this together, because y' know things happens",
          "so you can't, cos well, ah because you know why!",
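          "not now because it's too late!"), stringsAsFactors = FALSE)
# Skip turns where cos/because is among the first four tokens; otherwise match it after four words.
rx <- "^\\s*(?:\\S+\\s+){0,3}(?:cos|because)\\b.*(*SKIP)(*F)|(?:\\S+[\\s,]+){4}\\b(cos|because)\\b"
# test has a single column, so [rows, ] drops to a character vector of the matching turns.
Turn <- test[grepl(rx, test$Turn, perl=TRUE),]
# Take the text before the first cos/because and count its words; +1 gives the keyword's position.
split <- strsplit(Turn, "\\b(cos|because)\\b")
Index <- sapply(split, function(x) lengths(strsplit(trimws(x[[1]]), "\\s+"))+1)
test <- data.frame(Turn, Index, stringsAsFactors = FALSE)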
test

Output:

                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6

See the R demo and the main regex demo.

Regex details:

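^\s*(?:\S+\s+){0,3}(?:cos|because)\b.*(*SKIP)(*F) - matches the start of the string, then zero to three words, then cos or because as a whole word and the rest of the string, and then discards that match and skips past the consumed text
| - or
(?:\S+[\s,]+){4}\b(cos|because)\b - matches cos or because as a whole word preceded by four words.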
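
To see what each alternative contributes, the two halves can be tested separately against the sample turns. This is only a sketch: the names turns, skip_part and keep_part are introduced here for illustration and are not part of the answer above.

```r
turns <- c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!")

# First alternative on its own: cos/because with at most three words before it.
skip_part <- "^\\s*(?:\\S+\\s+){0,3}(?:cos|because)\\b"
# Second alternative on its own: cos/because preceded by four words, anywhere.
keep_part <- "(?:\\S+[\\s,]+){4}\\b(cos|because)\\b"
# Full pattern: strings hit by skip_part are rejected outright via (*SKIP)(*F).
rx <- paste0(skip_part, ".*(*SKIP)(*F)|", keep_part)

early <- grepl(skip_part, turns, perl = TRUE)  # turns with an early keyword
late  <- grepl(keep_part, turns, perl = TRUE)  # turns with a keyword after four words
keep  <- grepl(rx, turns, perl = TRUE)         # turns the full pattern retains

data.frame(early, late, keep)
```

The fourth turn shows why the first alternative is needed: its second keyword is preceded by more than four words, so the second alternative alone would keep it even though "cos" already appears as the fourth word.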