R: Does failed regex pattern matching originate in file conversion or in use of the tm package?

Posted by Brigitte in Dev

Brigitte

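As a relative novice in R and programming, my first ever question in this forum is about regex pattern matching, specifically line breaks. First some background. I am trying to pre-process a corpus of texts in R before processing them further on the NLP platform GATE. I convert the original pdf files to text as follows (the text files, unfortunately, end up in the same folder):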

dest <- "./MyFolderWithPDFfiles"
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Program Files (x86)/xpdfbin-win-3.04/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE))

Then, after loading the tm package and physically (!) moving the text files to another folder, I create a corpus:

TextFiles <- "./MyFolderWithTXTfiles"
EU <- Corpus(DirSource(TextFiles))

Next I want to perform a series of custom transformations to clean the texts. I succeeded in replacing a simple string as follows:

ReplaceText <- content_transformer(function(x, from, to) gsub(from, to, x, perl=T))
EU2 <- tm_map(EU, ReplaceText, "Table of contents", "TOC")

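However, one pattern is giving me trouble: a 1-3 digit page number followed by two line breaks and a page break (form feed). I want to replace it with a blank: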

EU2 <- tm_map(EU, ReplaceText, "[0-9]{1,3}\n\n\f", " ")

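The ([0-9]{1,3}) part and \f each match on their own. The line breaks do not. If I copy text from one of the original .txt files into the RegExr online tool and test the expression "[0-9]{1,3}\n\n\f", it matches. So the line breaks do exist in the original .txt file.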

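But when I view one of the .txt files as read into the EU corpus in R, there appear to be no line breaks, even though the lines are obviously broken before the page break: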

[3] "PROGRESS TOWARDS ACCESSION"
[4] "1"
[5] ""
[6] "\fTable of contents"

Seeing this, I tried other patterns, for example one to detect one or more whitespace characters ("[0-9]{1,3}\s*\f"), but no pattern worked.

So my questions are:

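1. Am I converting and reading the files into R correctly? If so, what has happened to the line breaks?
2. If having no line breaks is normal, how can I pattern match the character on line 5? Is that not a blank?
3. (A tangential concern:) When converting the pdf files, is there code that will put them directly into a new folder?
4. Apologies for asking, but how can one print or inspect only a few lines of a text object? The tm commands and head(EU) print the entire object, and each document is a very long text.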

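I know my questions must appear elementary, perhaps even stupid, but one has to start somewhere, and extensive searching has not turned up a resource that comprehensively explains how to use regexes to modify text objects in R. I am quite frustrated and hope that someone here will take pity and help me.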

Thanks for any advice you can offer. Brigitte

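P.S. I don't think it is possible to upload attachments in this forum, so here is a link to one of the original PDF documents: http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf
Because the document is so long, I created a snippet of the first three pages of the TXT document, read it into an R corpus ('EU') and printed it to the console. Here it is: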

dput(EU[[2]])
structure(list(content = c("REGULAR REPORT", "FROM THE COMMISSION ON", 
"CZECH REPUBLIC'S", "PROGRESS TOWARDS ACCESSION ***********************", 
"1", "", "\fTable of contents", "A. Introduction", "a) Preface The Context of the Progress Report", 
"b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations", 
"B. Criteria for membership", "1. Political criteria", "1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures", 
"1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities", 
"1.3. General evaluation", "2. Economic criteria", "2.1. Introduction 2.2. Economic developments since the Commission published its Opinion", 
"Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation", 
"3. Ability to assume the obligations of Membership", "3.1. Internal Market without frontiers General framework The Four Freedoms Competition", 
"3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual", 
"3.3. Economic and Fiscal Affairs Economic and Monetary Union", 
"2", "", "\fTaxation Statistics "), meta = structure(list(author = character(0), 
    datetimestamp = structure(list(sec = 50.1142621040344, min = 33L, 
        hour = 15L, mday = 3L, mon = 10L, year = 114L, wday = 1L, 
        yday = 306L, isdst = 0L), .Names = c("sec", "min", "hour", 
    "mday", "mon", "year", "wday", "yday", "isdst"), class = c("POSIXlt", 
    "POSIXt"), tzone = "GMT"), description = character(0), heading = character(0), 
    id = "CZ1998ProgressSnippet.txt", language = "en", origin = character(0)), .Names = c("author", 
"datetimestamp", "description", "heading", "id", "language", 
"origin"), class = "TextDocumentMeta")), .Names = c("content", 
"meta"), class = c("PlainTextDocument", "TextDocument"))

Ben

Yes, working with text in R is not always a smooth experience! But with some effort (perhaps more than you would like) you can get a lot done quickly.

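If you can share one of your PDF files or the output of dput(EU), that might help to identify exactly how to capture your page numbers with a regex. It would also add a reproducible example to your question, which matters for questions here because it lets people test their answers and make sure they solve your specific problem.
There is no need to put the PDF and text files in separate folders. Instead you can use a pattern like so: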
```
# pattern is a regular expression: escape the dot and anchor at the end of the name
EU <- Corpus(DirSource(pattern = "\\.txt$"))
```

This reads only the text files and ignores the PDF files.
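If you would still prefer the text files to land in their own folder at conversion time, pdftotext accepts an optional second argument naming the output text file. The following is only a sketch under assumptions: the folder names and the executable path are taken from the question and may differ on your machine, `build_cmd` is a hypothetical helper, and the lines that actually run the conversion are left commented out.

```r
# Folder names and executable path as used in the question; adjust as needed
pdf_dir   <- "./MyFolderWithPDFfiles"
txt_dir   <- "./MyFolderWithTXTfiles"
pdftotext <- "C:/Program Files (x86)/xpdfbin-win-3.04/bin64/pdftotext.exe"

# Hypothetical helper: build the command  pdftotext "<in>.pdf" "<out_dir>/<in>.txt"
build_cmd <- function(pdf, out_dir, exe = pdftotext) {
  txt <- file.path(out_dir, sub("\\.pdf$", ".txt", basename(pdf), ignore.case = TRUE))
  paste(shQuote(exe, type = "cmd"),
        shQuote(pdf, type = "cmd"),
        shQuote(txt, type = "cmd"))
}

# Only files whose names end in .pdf (the pattern is a regular expression)
pdfs <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE, ignore.case = TRUE)
cmds <- vapply(pdfs, build_cmd, character(1), out_dir = txt_dir)

# Uncomment to run the conversion; system() waits for each file by default
# dir.create(txt_dir, showWarnings = FALSE)
# for (cmd in cmds) system(cmd)
```

Building the command strings separately from running them makes it easy to print `cmds` and check the paths before anything is converted.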

There is no 'snippet view' method in tm, which is annoying. I often just use names(EU) and EU[[1]] for a quick look.
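One way to look at only a few lines is to index the document's content directly. The sketch below uses a mock list that mirrors the structure visible in the dput() output in the question (a `content` character vector with one element per line); with a real corpus the assumption is that `EU[[2]]$content` can be used in place of `doc$content`. The `peek` helper is hypothetical, not part of tm.

```r
# Mock document with the same shape as the dput() output in the question
doc <- list(content = c("REGULAR REPORT", "FROM THE COMMISSION ON",
                        "CZECH REPUBLIC'S", "PROGRESS TOWARDS ACCESSION",
                        "1", "", "\fTable of contents"))

# First three lines only
first_lines <- head(doc$content, 3)
print(first_lines)

# Hypothetical helper: show a range of lines together with their line numbers
peek <- function(lines, from = 1, to = 5) {
  idx <- seq(from, min(to, length(lines)))
  data.frame(line = idx, text = lines[idx], stringsAsFactors = FALSE)
}

# Lines 4 to 7, which include the page number, the empty line and the form feed
print(peek(doc$content, 4, 7))
```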

UPDATE

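With the data you have just added, I would suggest a slightly tangential approach: do the regex work before passing the data into the tm package format, like so: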

# get the PDF
download.file("http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf", "my_pdf.pdf", method = "wget")

# get the file name of the PDF
myfiles <- list.files(path = getwd(), pattern = "pdf", full.names = TRUE)

# convert to text (note my pdftotext is in a different location to yours)
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE))

# read plain text into R
x1 <- readLines("my_pdf.txt")

# make into a single string
x2 <- paste(x1, collapse = " ")
# do some regex...
x3 <- gsub("Table of contents", "TOC", x2)
x4 <- gsub("[0-9]{1,3}  \f", "", x3)

# convert to corpus for text mining operations
x5 <- Corpus(VectorSource(x4))

Using the data snippet you provided via dput, the output from this method is:

inspect(x5)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
REGULAR REPORT FROM THE COMMISSION ON CZECH REPUBLIC'S PROGRESS TOWARDS ACCESSION *********************** TOC A. Introduction a) Preface The Context of the Progress Report b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations B. Criteria for membership 1. Political criteria 1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures 1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities 1.3. General evaluation 2. Economic criteria 2.1. Introduction 2.2. Economic developments since the Commission published its Opinion Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation 3. Ability to assume the obligations of Membership 3.1. Internal Market without frontiers General framework The Four Freedoms Competition 3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual 3.3. Economic and Fiscal Affairs Economic and Monetary Union Taxation Statistics
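To see why the original pattern never matched, it may help to replay the problem on the four lines around the first page break. As the dput() output in the question shows, the content is a character vector with one element per line, so the newline characters themselves are no longer in the data and gsub() is applied to each line separately. The sketch below is a minimal illustration on that snippet only; the second option (collapsing with "\n" instead of a space) is an alternative I am assuming you may prefer because it keeps your original pattern unchanged.

```r
# The lines around the first page break, one element per line
x1 <- c("PROGRESS TOWARDS ACCESSION", "1", "", "\fTable of contents")

# No single element contains "\n", so the multi-line pattern cannot match
matched_per_line <- any(grepl("[0-9]{1,3}\n\n\f", x1))

# Option 1: collapse with spaces, then match number + two spaces + form feed
x_space   <- paste(x1, collapse = " ")
res_space <- gsub("[0-9]{1,3}  \f", "", x_space)

# Option 2: collapse with "\n", so the original pattern works as written
x_nl   <- paste(x1, collapse = "\n")
res_nl <- gsub("[0-9]{1,3}\n\n\f", " ", x_nl, perl = TRUE)

print(matched_per_line)
print(res_space)
print(res_nl)
```

In option 1 the two spaces come from the empty line: joining "1", "" and "\fTable of contents" with a single space puts one space on each side of the empty string.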


Edited on 2021-02-14
