Finding and extracting words that include punctuation expressions in R

Fabian Reyes

I'm stuck trying to extract, from a large text (around 17000 documents), words that contain punctuation expressions. For example:

"...urine bag tubing and the vent jutting above the summit also strapped with the
 white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The 
 aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A 
 cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This 
 prospective double blind,...[95] c(c(Introduction, Silicosis is a fibrotic"

I would like to extract words like the following:

 [1] c(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>
 [2] c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>
 [3] c(PATIENTS & METHODS,
 [4] c(c(Introduction

but not, for example, words like "cross-sectional", "2013.", "2)", or "(inability". This is only the first step; my goal is to get to this:

"...urine bag tubing and the vent jutting above the summit also strapped with the
 white plaster tapeFigure 2), \n\n AIMS AND OBJECTIVES, The aim of this 
 study is to ... MATERIALS AND METHODS, A cross-sectional study with a ...
 surgeries.n), \n\n PATIENTS AND METHODS, This prospective double blind,...
 [95] Introduction Silicosis is a fibrotic"

To extract these words without also grabbing other words that include punctuation (like "surgeries.n)"), I noticed that they always start with or include the "c(" expression. But I had some trouble with the regex:

grep("c(", test)
    Error in grep("c(", test) : 
    invalid regular expression 'c(', reason 'Missing ')''

I also tried:

grep("c\\(", test, value = T)

But this returns the whole text file. I have also tried str_match from the dap package, but I can't seem to get the pattern (regex) right. Do you have any recommendations?

Tensibai

If I understood your problem correctly (I'm unsure whether your second text is the expected output or just a step), I would go with gsub like this:

gsub("(c\\(|<\\/?sc>)","",text)

The regex (first parameter) will match c( or <sc> or </sc> and replace them with nothing, thus cleaning the text as you expect (again, if I understood correctly your expectation).
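
To make this concrete, here is a minimal sketch run on a shortened version of the text from the question. The variable names (sample_text, cleaned, cleaned_and) are only illustrative, and the last step, which turns " & " into " AND ", is my assumption based on the desired output in the question, not something the gsub call above does:

```r
# Shortened, hypothetical sample modelled on the text in the question
sample_text <- "tape 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim of ... c(PATIENTS & METHODS, This ... [95] c(c(Introduction, Silicosis"

# Remove every literal "c(" and every <sc> or </sc> tag
cleaned <- gsub("(c\\(|<\\/?sc>)", "", sample_text)

# Optional extra step (assumption): spell out " & " as " AND ".
# fixed = TRUE treats the pattern as plain text, not as a regex.
cleaned_and <- gsub(" & ", " AND ", cleaned, fixed = TRUE)

cat(cleaned_and, "\n")
```

Note that this removes c( wherever it occurs, so a word that happens to end in "c" right before an opening parenthesis would lose those two characters as well. That should be checked against the real corpus.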

More on the regex involved:

  • (|) is the structure for an OR condition (alternation)
  • c\\( will match a literal c( anywhere in the text
  • <\\/?sc> will match <sc> or </sc>: the ? after the / means the / can occur 0 or 1 times, so it's optional.
  • The double \\ is there so that, after the R interpreter has removed the first backslash, there's still a backslash left to tell the regex interpreter we want to match a literal ( and a literal /
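
If you really want to extract the matches instead of cleaning the text, the escaping rules above can be combined with regmatches and gregexpr from base R. This is only a sketch under assumptions: the vector x is made up, and the pattern "(c\\()+[^,]+" assumes each wanted token starts with one or more c( and ends just before the next comma, as in your example. It also shows why grep(..., value = TRUE) gave you the whole text back: grep returns the elements of the vector that contain a match, not the matched substrings, so a text stored as one long string comes back whole.

```r
# Hypothetical input: three short documents
x <- c("white plaster tape 2), c(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim",
       "a cross-sectional study with no markers",
       "[95] c(c(Introduction, Silicosis is a fibrotic")

# fixed = TRUE matches "c(" as plain text, so no escaping is needed
hits <- grep("c(", x, fixed = TRUE)
print(hits)   # 1 3

# value = TRUE returns the whole matching elements, not the matched parts
whole <- grep("c\\(", x, value = TRUE)

# gregexpr locates every match; regmatches pulls out the matched substrings
tokens <- unlist(regmatches(x, gregexpr("(c\\()+[^,]+", x)))
print(tokens)
# "c(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>" "c(c(Introduction"
```

Words such as "cross-sectional" or "2)" are left alone because they never contain c(.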
