Extract a substring by position in a pipe

Posted by debugcn in Dev

至少

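I would like to extract a substring from every row of the id column of a tibble. I am always interested in the region between the 1st and 3rd space of the original id. The resulting substrings, here Zoe Boston and Jane Rome, should go into a new column, name.

I tried to get the positions of the spaces in each id with str_locate_all and then pass those positions to str_sub. However, I cannot extract the positions correctly.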

data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)", "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147") ) %>% 
   mutate(coor =  str_locate_all(id, "\\s"),
   name = str_sub(id, start = coor[[1]], end = coor[[3]] ) )

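Ronak Shah

You can use a regular expression to extract what you need.

Assuming you have stored the tibble in data, you can use sub to extract the first and second words that follow the id number.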

sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', data$id)
#[1] "Zoe Boston" "Jane Rome"

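^# - starts with a hash

\\w+ - a word (one or more word characters; here, the id number)

\\s - a whitespace character

( - start of the capture group

\\w+ - a word

\\s - followed by a whitespace character

\\w+ - another word

) - end of the capture group

.* - the rest of the string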
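Since the question asks for a new column inside a pipe, the same regular expression can also be applied within mutate(). The following is a minimal sketch, assuming the sample data from the question; the name result is only illustrative.

```r
library(dplyr)

# Same sample data as in the question
data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
                      "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147"))

# sub() is vectorised, so it can be applied to the whole id column at once
result <- data %>%
  mutate(name = sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', id))

result$name
#[1] "Zoe Boston" "Jane Rome"
```

Note that sub returns the input unchanged when the pattern does not match, so an id that does not follow this layout would be copied into name as-is.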

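The str_locate_all approach is a bit more involved: it first returns the positions of all whitespace characters, then you need to pick the end of the first whitespace and the start of the third, and finally use str_sub to extract the text between those two positions.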

library(dplyr)
library(stringr)
library(purrr)

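data %>%
   mutate(coor =  str_locate_all(id, "\\s"), 
          start = map_dbl(coor, `[`, 1) + 1,   # character after the 1st whitespace
          end = map_dbl(coor, `[`, 3) - 1,     # character before the 3rd whitespace
          name = str_sub(id, start, end)) %>%
   select(id, name)                            # drop the helper columns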

# A tibble: 2 x 2
#  id                                                          name      
#  <chr>                                                       <chr>     
#1 #1265746 Zoe Boston 58962 st. Victory cont_1.0)             Zoe Boston
#2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
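To see why indexing with 1 and 3 works, it may help to look at what str_locate_all returns for a single id. This is a small sketch using the first id from the question; the variable names spaces, first_space, and third_space are only illustrative.

```r
library(stringr)

id <- "#1265746 Zoe Boston 58962 st. Victory cont_1.0)"

# str_locate_all returns a list with one matrix per input string;
# each matrix has one row per match and the columns "start" and "end"
spaces <- str_locate_all(id, "\\s")[[1]]

# A matrix is indexed column by column, so elements 1 and 3 are the
# start positions of the first and third whitespace
first_space <- spaces[1]   # 9
third_space <- spaces[3]   # 20

name <- str_sub(id, first_space + 1, third_space - 1)
name
#[1] "Zoe Boston"
```

This positional indexing relies on every id containing at least three spaces. With fewer matches, element 3 would fall into the end column instead, so indexing explicitly with spaces[3, "start"] is a safer choice when the input is less regular.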
