Lookaround regular expression pattern in R

damico

I am stuck on creating a regular expression pattern that splits the content of my data frame columns without losing any of the elements. I have to use the separate() function from the tidyr package, as this is part of a longer processing pipeline. Since I don't want to lose any elements of the string, I am developing a lookahead/lookbehind expression.

The strings that need to be split can follow one of the following patterns:

  • only letters (e.g. 'abcd')
  • letters-dash-numbers (e.g. 'abcd-123')
  • letters-numbers (e.g. 'abcd1234')

The column content should be split into at most 3 columns, one column per group.

I would like to split every time the element type changes, that is, after the letters and after the dash. There can be one or more letters and one or more numbers, but only ever one dash. Strings that contain only letters don't need to be split.

Here is what I have tried:

library(tidyr) 
myDat = data.frame(drugName = c("ab-1234", 'ab-1234', 'ab-1234',
                                'placebo', 'anotherdrug', 'andanother',
                                'xyz123', 'xyz123', 'placebo', 'another',
                                'omega-3', 'omega-3', 'another', 'placebo'))
drugColNames = paste0("X", 1:3) 

# This pattern doesn't split strings that consist only of letters and numbers, e.g. "xyz123" is not split after the letters.
pat = '(?=-[0-9+])|(?<=[a-z+]-)'

# This pattern splits at all the right places, but the last group (the numbers) is broken up instead of being kept together.
# pat = '(?=-[0-9+]|[0-9+])|(?<=[a-z+]-)'

splitDat = separate(myDat, drugName,
         into = drugColNames,
         sep = pat)

The output from the splitting should be:

"ab-1234" --> "ab" "-" "1234"
"xyz123" --> "xyz" "123"
"omega-3" --> "omega" "-" "3"

Thanks a lot for helping out with this. :)

Ronak Shah

Since there is no fixed separator, it is easier to use extract() here, which also avoids the need for regex lookarounds.

tidyr::extract(myDat, drugName, drugColNames, '([a-z]+)(-)?(\\d+)?', remove = FALSE)

#      drugName          X1 X2   X3
#1      ab-1234          ab  - 1234
#2      ab-1234          ab  - 1234
#3      ab-1234          ab  - 1234
#4      placebo     placebo        
#5  anotherdrug anotherdrug        
#6   andanother  andanother        
#7       xyz123         xyz     123
#8       xyz123         xyz     123
#9      placebo     placebo        
#10     another     another        
#11     omega-3       omega  -    3
#12     omega-3       omega  -    3
#13     another     another        
#14     placebo     placebo        
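
If the pipeline really has to stay with separate(), a lookaround pattern along the following lines might work. This is only a sketch and not part of the answer above: the pattern `pat` and the vector `drugs` are my own assumptions for illustration. The check is done in base R by marking every zero-width split position with "|", so it runs without tidyr. Note also that in the patterns from the question, the `+` inside the brackets (`[0-9+]`, `[a-z+]`) is a literal plus sign rather than a quantifier.

```r
# Hypothetical lookaround pattern (an assumption, not from the original answer):
#   (?<=[a-z])(?=[-0-9])  boundary after the letters, before a dash or a digit
#   (?<=-)(?=[0-9])       boundary after the dash, before the digits
pat <- "(?<=[a-z])(?=[-0-9])|(?<=-)(?=[0-9])"

drugs <- c("ab-1234", "xyz123", "omega-3", "placebo")

# Mark each zero-width split position with "|" so the boundaries are visible.
marked <- gsub(pat, "|", drugs, perl = TRUE)
print(marked)
# "ab|-|1234" "xyz|123"   "omega|-|3" "placebo"

# Split on the marker to list the pieces of each string.
pieces <- strsplit(marked, "|", fixed = TRUE)
names(pieces) <- drugs
print(pieces)
```

The same pattern could then be tried as `sep = pat` in separate(), probably together with `fill = "right"` so that rows with fewer than three pieces are padded with NA instead of raising a warning. One difference to keep in mind: separate() fills the columns from left to right, so for "xyz123" the digits would end up in X2 rather than in X3 as in the extract() output.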
