I have a data frame with speech data in its first column, Turn:
test <- data.frame(
Turn = c("Hi. I'm you an' you are me cos",
"she'd've been so happy cos with all this stuff goin' on",
"but we're in this together, because y' know things happens",
"so you can't, cos well, ah because you know why!",
"not now because it's too late!"), stringsAsFactors = F)
I'd like to subset the data frame to those rows in which at least four words precede cos and/or because. To this end, I compute the indices of cos and because in each Turn:
test$Index <- sapply(strsplit(test$Turn, " "), function(x) which(x == 'cos'|x == 'because'))
test
Turn Index
1 Hi. I'm you an' you are me cos 8
2 she'd've been so happy cos with all this stuff goin' on 5
3 but we're in this together, because y' know things happens 6
4 so you can't, cos well, ah because you know why! 4, 7
5 not now because it's too late! 3
One row has more than one index, which is why my attempt at subsetting this way fails:
test[test$Index >= 5,]
Error in `[.data.frame`(test, test$Index >= 5, ) :
(list) object cannot be coerced to type 'double'
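The error can be traced to the type of the Index column: because one row yields two positions, sapply cannot simplify its result to a plain vector and returns a list instead, and a list holding a length-2 element cannot be compared with a number. A minimal sketch for checking this in base R (the name idx is only used here for illustration):

```r
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"),
  stringsAsFactors = FALSE)

# Positions of 'cos' / 'because' in each turn, splitting on single spaces
idx <- sapply(strsplit(test$Turn, " "),
              function(x) which(x == "cos" | x == "because"))

is.list(idx)  # TRUE: the result was not simplified to a vector
lengths(idx)  # 1 1 1 2 1: row 4 contributes two positions
```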
How can test be subset by ignoring the second value listed in Index?
Expected result:
test
Turn Index
1 Hi. I'm you an' you are me cos 8
2 she'd've been so happy cos with all this stuff goin' on 5
3 but we're in this together, because y' know things happens 6
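I'd be grateful for any answer, including ones that skip the detour via indices and use a regex pattern for subsetting instead.
EDIT:
A solution within the sapply paradigm is actually quite simple: select only the first value of each list element: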
sapply(test$Index, function(x) x[1])
[1] 8 5 6 4 3
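Building on that, the subsetting itself might look like the sketch below. The names first_idx and result are illustrative, and the NA guard is an assumption for turns that contain neither cos nor because (for those, x[1] is NA):

```r
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"),
  stringsAsFactors = FALSE)

test$Index <- sapply(strsplit(test$Turn, " "),
                     function(x) which(x == "cos" | x == "because"))

# Position of the first 'cos' / 'because' per row (NA if a row has none)
first_idx <- sapply(test$Index, function(x) x[1])

# A first hit at position 5 or later means at least four words precede it
result <- test[!is.na(first_idx) & first_idx >= 5, ]
result
```

With the sample data this keeps rows 1 to 3, matching the expected result above.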
I hope this gives you an idea:
test <- data.frame(
Turn = c("Hi. I'm you an' you are me cos",
"she'd've been so happy cos with all this stuff goin' on",
"but we're in this together, because y' know things happens",
"so you can't, cos well, ah because you know why!",
"not now because it's too late!"), stringsAsFactors = F)
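rx <- "^\\s*(?:\\S+\\s+){0,3}(?:cos|because)\\b.*(*SKIP)(*F)|(?:\\S+[\\s,]+){4}\\b(cos|because)\\b"
# Keep the turns that match; test has a single column here, so this returns a character vector
Turn <- test[grepl(rx, test$Turn, perl=TRUE),]
# Split each kept turn at cos/because; the first piece is the text before the first hit
split <- strsplit(Turn, "\\b(cos|because)\\b")
# Index = number of words before the first hit, plus one
Index <- sapply(split, function(x) lengths(strsplit(trimws(x[[1]]), "\\s+"))+1)
test <- data.frame(Turn, Index, stringsAsFactors = F)
test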
Output:
Turn Index
1 Hi. I'm you an' you are me cos 8
2 she'd've been so happy cos with all this stuff goin' on 5
3 but we're in this together, because y' know things happens 6
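Regex details:
^\s*(?:\S+\s+){0,3}(?:cos|because)\b.*(*SKIP)(*F) - matches the start of the string, then zero to three words, then cos or because as a whole word, then the rest of the string, and then discards that match: (*SKIP)(*F) makes the alternative fail and stops the engine from retrying inside the text it has just consumed
| - or
(?:\S+[\s,]+){4}\b(cos|because)\b - matches cos or because as a whole word preceded by four words.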
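To see the (*SKIP)(*F) idiom in isolation, here is a small toy sketch that is unrelated to the speech data; the string and the replacement are made up for illustration. The left alternative consumes what should be ignored and then fails, so only the right alternative can produce matches:

```r
x <- 'cat "cat" cat'

# Left alternative: consume a double-quoted chunk, then (*SKIP)(*F) discards it.
# Right alternative: match 'cat' in whatever text was not discarded.
out <- gsub('"[^"]*"(*SKIP)(*F)|cat', "dog", x, perl = TRUE)
out
# [1] "dog \"cat\" dog"
```

The regex above applies the same trick: turns in which cos or because appears within the first four words are consumed and discarded by the left alternative, so the right alternative never gets a chance to match them.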