R tm package and Cyrillic text

Dmitriy Selivanov

I am trying to do some text mining on Russian text with the tm package and have run into some issues.

Preprocessing speed depends heavily on encoding:

library(tm)
# a Russian text and an English text of similar size
rus_txt <- paste(readLines('http://lib.ru/LITRA/PUSHKIN/dubrowskij.txt', encoding = 'cp1251'), collapse = ' ')
object.size(rus_txt)
eng_txt <- paste(readLines('http://www.gutenberg.org/cache/epub/1112/pg1112.txt', encoding = 'UTF-8'), collapse = ' ')
object.size(eng_txt)
# text sizes nearly identical
rus_txt_utf8 <- iconv(rus_txt, to = 'UTF-8')
# timings below are user, system, elapsed (seconds)
system.time(rus_txt_lower <- tolower(rus_txt_utf8))  # Russian text converted to UTF-8
#3.17         0.00         3.19
system.time(eng_txt_lower <- tolower(eng_txt))       # English text
#0.03         0.00         0.03
system.time(rus_txt_lower <- tolower(rus_txt))       # Russian text as read, without conversion
#0.07         0.00         0.08

tolower() on the unconverted Russian text is about 40 times faster than on the UTF-8 copy, and on large corpora the difference was up to 500 times!

Let's try to tokenize some text (this function is used in TermDocumentMatrix):
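When comparing timings like these, it can help to know exactly which encoding a string is in before it reaches tolower(). The following is a minimal sketch, not taken from the original post: it builds a short CP1251 byte string by hand (so it needs no network access), converts it with an explicit `from` argument instead of relying on the locale, and inspects the result. The sample word and the variable names are illustrative assumptions.

```r
# "связи" as raw CP1251 bytes; in CP1251 the letter 'я' is the single byte 0xFF
cp1251_bytes <- as.raw(c(0xF1, 0xE2, 0xFF, 0xE7, 0xE8))
word_cp1251  <- rawToChar(cp1251_bytes)   # no encoding mark is attached here

# Name the source encoding explicitly instead of relying on the current locale
word_utf8 <- iconv(word_cp1251, from = "CP1251", to = "UTF-8")

Encoding(word_utf8)          # the encoding mark R attached to the result
validUTF8(word_utf8)         # TRUE when the bytes form valid UTF-8
nchar(word_utf8, "chars")    # number of characters
nchar(word_utf8, "bytes")    # number of bytes (two per Cyrillic letter in UTF-8)
```

With the encoding settled up front, system.time(tolower(x)) can be run on the UTF-8 and native copies of a larger text to check whether the slowdown reproduces on a given platform.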

some_text<-"Несколько  лет  тому  назад  в  одном  из своих  поместий жил старинный
русской барин, Кирила Петрович Троекуров. Его богатство, знатный род и связи
давали ему большой вес в губерниях, где  находилось его имение.  Соседи рады
были угождать малейшим его прихотям; губернские чиновники трепетали  при его
имени;  Кирила  Петрович принимал знаки  подобострастия как надлежащую дань;
дом его  всегда был полон  гостями, готовыми тешить  его барскую праздность,
разделяя  шумные,  а  иногда  и  буйные  его  увеселения.  Никто  не  дерзал
отказываться от его приглашения, или в известные  дни не являться  с должным
почтением в село  Покровское."
scan_tokenizer(some_text)
#[1] "Несколько"  "лет"        "тому"       "назад"      "в"          "одном"      "из"         "своих"     
# [9] "поместий"   "жил"        "старинный"  "русской"    "барин,"     "Кирила"     "Петрович"   "Троекуров."
#[17] "Его"        "богатство," "знатный"    "род"        "и"          "св" 

Oops... It seems the R core function scan() treats the Russian lower-case letter 'я' as EOF. I tried different encodings, but I haven't found a way to fix this.

OK, let's try to remove punctuation:
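One possible explanation, offered as an assumption rather than something the post confirms: in CP1251 the letter 'я' is the single byte 0xFF, the same bit pattern as the -1 that C code commonly uses for EOF when a char is sign-extended. The sketch below checks the byte value and shows a whitespace tokenizer built on strsplit(), which is plain base R and does not go through scan(). The helper name `ws_tokenizer` is made up for this illustration.

```r
# 'я' is U+044F; the escape keeps this script independent of the file encoding
ya <- "\u044f"
ya_bytes <- iconv(ya, from = "UTF-8", to = "CP1251", toRaw = TRUE)[[1]]
print(ya_bytes)   # the raw CP1251 byte(s) for 'я'

# Hypothetical stand-in for scan_tokenizer(): split on runs of whitespace
ws_tokenizer <- function(x) {
  tokens <- strsplit(enc2utf8(x), "[[:space:]]+")[[1]]
  tokens[nzchar(tokens)]   # drop the empty token that leading whitespace leaves
}

# "и связи  давали" (note the double space)
tokens <- ws_tokenizer("\u0438 \u0441\u0432\u044f\u0437\u0438  \u0434\u0430\u0432\u0430\u043b\u0438")
print(tokens)
```

Like scan_tokenizer(), this keeps punctuation attached to words ("барин," stays as one token), so it is only a drop-in for the whitespace-splitting behaviour.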

removePunctuation("жил старинный русской барин, Кирила Петрович Троекуров")
#"жил старинный русской барин Кирила Петрови Троекуров"

Hmm... where is the letter 'ч'? With UTF-8 encoding this works fine, but it took some time to figure that out. I also had a performance issue with the removeWords() function, but I can't reproduce it. The main question is: how do I read and tokenize texts containing the letter 'я'? My locale:

Sys.getlocale()
#[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
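Since punctuation removal is reported above to behave correctly once the text is UTF-8, the same step can be sketched in base R. This is an illustration under stated assumptions, not tm's implementation: `strip_punct` is a hypothetical helper, and it relies on PCRE's UTF-8 mode, where POSIX classes such as [[:punct:]] match only ASCII characters unless (*UCP) is set.

```r
# Hypothetical helper: strip ASCII punctuation from text converted to UTF-8.
# With perl = TRUE and no (*UCP) flag, [[:punct:]] does not match Cyrillic
# letters, so they are never candidates for removal.
strip_punct <- function(x) {
  gsub("[[:punct:]]+", "", enc2utf8(x), perl = TRUE)
}

# "барин, Петрович."
cleaned <- strip_punct("\u0431\u0430\u0440\u0438\u043d, \u041f\u0435\u0442\u0440\u043e\u0432\u0438\u0447.")
print(cleaned)
```

The trade-off is that non-ASCII punctuation (for example « and ») is left in place; it would need to be listed explicitly in the pattern.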
Dmitriy Selivanov

1) Question: how to read and tokenize texts containing the letter 'я'? Answer: write your own tokenizer and use it. For example:

my_tokenizer <- function(x)
{
  # convert to UTF-8, then split on runs of whitespace and punctuation
  strsplit(iconv(x, to = 'UTF-8'), split = '([[:space:]]|[[:punct:]])+', perl = FALSE)[[1]]
}
TDM <- TermDocumentMatrix(corpus, control = list(tokenize = my_tokenizer, weighting = weightTf, wordLengths = c(3, 10)))
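To see what a tokenizer of this shape returns before handing it to TermDocumentMatrix(), it can be called directly on a string. The variant below is a sketch with two deliberate changes, both assumptions on my part: it uses enc2utf8(x) instead of iconv(x, to='UTF-8'), because iconv() ignores encoding marks and always converts from the `from` encoding, and it uses perl = TRUE. The name `my_tokenizer_utf8` and the sample sentence are illustrative.

```r
# Hypothetical variant of the tokenizer above
my_tokenizer_utf8 <- function(x) {
  tokens <- strsplit(enc2utf8(x), "([[:space:]]|[[:punct:]])+", perl = TRUE)[[1]]
  tokens[nzchar(tokens)]   # drop an empty first token if x starts with a separator
}

# "Его богатство, знатный род и связи", written with escapes so the
# script does not depend on the encoding of the source file
sample_text <- paste0(
  "\u0415\u0433\u043e \u0431\u043e\u0433\u0430\u0442\u0441\u0442\u0432\u043e, ",
  "\u0437\u043d\u0430\u0442\u043d\u044b\u0439 \u0440\u043e\u0434 \u0438 ",
  "\u0441\u0432\u044f\u0437\u0438"
)
tokens <- my_tokenizer_utf8(sample_text)

# Mimic wordLengths = c(3, 10): keep tokens of 3 to 10 characters
len  <- nchar(tokens, "chars")
kept <- tokens[len >= 3 & len <= 10]
print(kept)
```

The word containing 'я' comes through whole, and the one-letter word "и" is dropped by the length filter, which is what wordLengths = c(3, 10) is meant to do in the TermDocumentMatrix() call.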

2) Performance depends heavily on... the performance of the tolower function. Maybe this is a bug, I don't know, but every time you call it you have to convert your text to the native encoding using enc2native (if your text is not in English, of course).

doc.corpus <- Corpus(VectorSource(enc2native(textVector)))

Moreover, after all text preprocessing on your corpus you have to convert it again, because TermDocumentMatrix and many other functions in the tm package use tolower internally:

tm_map(doc.corpus, enc2native)

So your full flow will look something like this:

createCorp <- function(textVector)
{
  # convert to the native encoding first so that tolower is fast
  doc.corpus <- Corpus(VectorSource(enc2native(textVector)))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("russian"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "russian")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  # convert again, because TermDocumentMatrix calls tolower internally
  return(tm_map(doc.corpus, enc2native))
}
my_tokenizer <- function(x)
{
  strsplit(iconv(x, to = 'UTF-8'), split = '([[:space:]]|[[:punct:]])+', perl = FALSE)[[1]]
}
# textVector: your character vector of documents
corpus <- createCorp(textVector)
TDM <- TermDocumentMatrix(corpus, control = list(tokenize = my_tokenizer, weighting = weightTf, wordLengths = c(3, 10)))
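As a sanity check on the matrix this flow produces, the term counts for a single document can be approximated in base R without tm. This is only a sketch under assumptions: `term_counts` is a hypothetical helper, it skips stop-word removal and stemming, and it reuses the length limits from wordLengths = c(3, 10).

```r
# Hypothetical helper: term frequencies for one document, base R only
term_counts <- function(doc, min_len = 3, max_len = 10) {
  doc    <- tolower(enc2utf8(doc))
  tokens <- strsplit(doc, "([[:space:]]|[[:punct:]])+", perl = TRUE)[[1]]
  len    <- nchar(tokens, "chars")
  table(tokens[len >= min_len & len <= max_len])
}

# "его имение, его прихотям" (already lower case)
doc <- paste0(
  "\u0435\u0433\u043e \u0438\u043c\u0435\u043d\u0438\u0435, ",
  "\u0435\u0433\u043e \u043f\u0440\u0438\u0445\u043e\u0442\u044f\u043c"
)
counts <- term_counts(doc)
print(counts)
```

If a word containing 'я' or 'ч' shows up here but is truncated or missing in the TermDocumentMatrix, the tokenizer or the encoding conversion in the tm flow is the first place to look.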
