Finding and extracting words that include punctuation expressions in R

debugcn Published at Dev

Fabian Reyes

I'm stuck trying to extract, from a large text (around 17,000 documents), words that contain punctuation expressions. For example:

"...urine bag tubing and the vent jutting above the summit also strapped with the
 white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The 
 aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A 
 cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This 
 prospective double blind,...[95] c(c(Introduction, Silicosis is a fibrotic"

I would like to extract words like the following:

 [1] c(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>
 [2] c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>
 [3] c(PATIENTS & METHODS,
 [4] c(c(Introduction

but not, for example, words like "cross-sectional", "2013.", "2)", or "(inability". This is only the first step; my goal is to get to this:

"...urine bag tubing and the vent jutting above the summit also strapped with the
 white plaster tapeFigure 2), \n\n AIMS AND OBJECTIVES, The aim of this 
 study is to ... MATERIALS AND METHODS, A cross-sectional study with a ...
 surgeries.n), \n\n PATIENTS AND METHODS, This prospective double blind,...
 [95] Introduction Silicosis is a fibrotic"

To extract these words without grabbing every word that includes punctuation (like "surgeries.n)"), I noticed that they always start with or include the expression "c(". But I had some trouble with the regex:

grep("c(", test)
    Error in grep("c(", test) : 
    invalid regular expression 'c(', reason 'Missing ')''

I also tried:

grep("c\\(", test, value = T)

But this returns the whole text file. I have also used str_match from the stringr package, but I can't seem to get the pattern (regex) right. Do you have any recommendations?

Tensibai

If I understood your problem correctly (I'm not sure whether your second text is the expected output or just an intermediate step), I would use gsub like this:

gsub("(c\\(|<\\/?sc>)","",text)

The regex (the first argument) matches c(, <sc>, or </sc> and replaces each match with an empty string, cleaning the text as you expect (again, if I understood your expectation correctly).
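
To see both steps on a small input, here is a minimal sketch in base R. The sample string is a shortened, hypothetical version of the text in the question, and the names headings, clean_headings, and clean_text are made up for illustration. It assumes a heading ends at the first comma, which may not hold for every document. It also shows why grep(..., value = TRUE) returned everything: grep works per element of the character vector, so a single long string that contains a match is returned whole, while regmatches with gregexpr returns only the matched pieces.

```r
# Hypothetical sample, shortened from the text in the question
text <- "tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim ... surgeries.n), \n\nc(PATIENTS & METHODS, This prospective ... [95] c(c(Introduction, Silicosis is"

# fixed = TRUE treats "c(" as a literal string, so no escaping is needed
has_marker <- grepl("c(", text, fixed = TRUE)

# Extract only the matched pieces: one or more "c(" followed by
# everything up to (not including) the next comma (assumed heading end)
headings <- regmatches(text, gregexpr("(c\\()+[^,]+", text))[[1]]

# Strip the "c(" prefixes and the <sc>/</sc> tags from the extracted headings,
# using the same pattern as the gsub call in the answer
clean_headings <- gsub("(c\\(|<\\/?sc>)", "", headings)

# The same substitution applied to the full text
clean_text <- gsub("(c\\(|<\\/?sc>)", "", text)

print(headings)
print(clean_headings)
```

Note that the pattern removes every "c(" it finds, including one that happens to sit at the end of an ordinary word directly before a parenthesis, so it is worth checking the extracted headings before cleaning all 17,000 documents.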
