I'm stuck trying to extract, from a large text (around 17,000 documents), words that contain punctuation. For example:
"...urine bag tubing and the vent jutting above the summit also strapped with the
white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The
aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A
cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This
prospective double blind,...[95] c(c(Introduction, Silicosis is a fibrotic"
I would like to extract words like the following:
[1] c(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>
[2] c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>
[3] c(PATIENTS & METHODS,
[4] c(c(Introduction
but not words like "cross-sectional", "2013.", "2)", or "(inability". This is only the first step; the goal is to get to this:
"...urine bag tubing and the vent jutting above the summit also strapped with the
white plaster tapeFigure 2), \n\n AIMS AND OBJECTIVES, The aim of this
study is to ... MATERIALS AND METHODS, A cross-sectional study with a ...
surgeries.n), \n\n PATIENTS AND METHODS, This prospective double blind,...
[95] Introduction Silicosis is a fibrotic"
To extract these words without grabbing other words that include punctuation (like "surgeries.n)"), I noticed that they always start with or include the expression "c(". But I had some trouble with the regex:
grep("c(", test)
Error in grep("c(", test) :
invalid regular expression 'c(', reason 'Missing ')''
I also tried:
grep("c\\(", test, value = T)
But this returns the whole text file. I have also tried str_match from the stringr package, but I can't seem to get the pattern (regex) right. Any recommendations?
If I understood your problem correctly (I'm not sure whether your second text is the expected output or just a step), I would use gsub like this:
gsub("(c\\(|<\\/?sc>)","",text)
The regex (first parameter) will match c( or <sc> or </sc> and replace them with nothing, thus cleaning the text as you expect (again, if I understood your expectation correctly).
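As a quick check of what that call does, here is a minimal sketch run on a shortened sample. The sample string is hypothetical, cut down from the text in the question, and the variable names are only for illustration.

```r
# Hypothetical sample, shortened from the text in the question.
text <- "tape), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim ... c(c(Introduction, Silicosis"

# Remove every "c(", "<sc>" and "</sc>" in a single pass.
cleaned <- gsub("(c\\(|<\\/?sc>)", "", text)

# cat() shows the result with the newlines rendered rather than escaped.
cat(cleaned, "\n")
```

Note that this only strips the markers. Anything else, such as turning "&" into "AND", would need its own substitution, and a "c(" that happens to occur inside ordinary text would be removed as well.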
More on the regex involved:
(|) is the structure for an OR condition.
c\\( will match a literal c( anywhere in the text.
<\\/?sc> will match <sc> or </sc>, as the ? after the / means it can be there 0 or 1 times, so it's optional.
The \\ are there so that, after the R interpreter has removed the first backslash, there's still a backslash left to tell the regex engine we want to match a literal ( and a literal /.
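The two layers of escaping can be seen directly in R. The sketch below also shows one possible way to pull out the headings themselves, which may explain why grep appeared to return the whole file: grep returns every element of the vector that contains a match, so a single long string comes back in full. The sample string and the rule that a heading ends at the next comma are assumptions made for illustration, not something the question confirms.

```r
# Hypothetical sample, shortened from the text in the question.
x <- "surgeries.n), \n\nc(PATIENTS & METHODS, This prospective ... [95] c(c(Introduction, Silicosis"

# The R string "c\\(" holds three characters: c, backslash, (
pattern <- "c\\("
nchar(pattern)      # 3
cat(pattern, "\n")  # what the regex engine receives: c\(

# fixed = TRUE treats the pattern as plain text, so no escaping is needed.
same <- identical(gsub(pattern, "", x), gsub("c(", "", x, fixed = TRUE))

# regmatches() returns only the matched pieces rather than whole elements.
# Assumption: a heading runs from its "c(" marker(s) up to the next comma.
headings <- regmatches(x, gregexpr("(c\\()+[^,]+", x))[[1]]
print(headings)
```

If the headings in the real data do not reliably end at a comma, the `[^,]+` part would need adjusting.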