I have a data frame corpus in R that looks like the sample below. I want to create n-grams (up to 5-grams) using a loop or a function. At the moment I am doing it manually, one n at a time:
Sample corpus structure:
{"colleagues were also at the other two events in aberystwyth and flint and by all accounts had a great time", "the lineup was whittled down to a more palatable five in when the bing crosby souffle going my way bested both gaslight and double indemnity proving oscar voters have always had a taste for pabulum", "felt my first earthquake today whole building at work was shaking", "she is the kind of mother friend and woman i aspire everyday to be", "she was processed and released pending a court appearance", "watching some sunday night despite the sadness i have been feeling i also feel very blessed and happy to be carrying another miracle", "every night when we listen to poohs heartbeat our hearts feel so much happiness and peace"}
onegram <- NGramTokenizer(corpusdf, Weka_control(min = 1, max = 1))
onegram <- data.frame(table(onegram))
onegram <- onegram[order(onegram$Freq, decreasing = TRUE), ]
colnames(onegram) <- c("Word", "Freq")
onegram[1:15, ]

bigram <- NGramTokenizer(corpusdf, Weka_control(min = 2, max = 2, delimiters = tokendelim))
bigram <- data.frame(table(bigram))
bigram <- bigram[order(bigram$Freq, decreasing = TRUE), ]
colnames(bigram) <- c("Word", "Freq")
bigram[1:15, ]
Any ideas?
I didn't know the NGramTokenizer function and couldn't get it to work, so here is a solution in quanteda. It produces a separate frequency table per iteration (gram_1 for onegrams, gram_2 for bigrams, and so on):
corpusdf <- data.frame(text = c("colleagues were also at the other two events in aberystwyth and flint and by all accounts had a great time", "the lineup was whittled down to a more palatable five in when the bing crosby souffle going my way bested both gaslight and double indemnity proving oscar voters have always had a taste for pabulum", "felt my first earthquake today whole building at work was shaking", "she is the kind of mother friend and woman i aspire everyday to be", "she was processed and released pending a court appearance", "watching some sunday night despite the sadness i have been feeling i also feel very blessed and happy to be carrying another miracle", "every night when we listen to poohs heartbeat our hearts feel so much happiness and peace"),
stringsAsFactors = FALSE)
library("quanteda")
tokens <- tokens(corpusdf$text, what = "word")
for (n in seq_len(5)) {
  temp <- tokens_ngrams(tokens, n = n, skip = 0L, concatenator = "_")
  temp <- data.frame(table(unlist(temp)),
                     stringsAsFactors = FALSE)
  colnames(temp) <- c("Word", "Freq")
  temp <- temp[order(temp$Freq, decreasing = TRUE), ]
  assign(paste0("gram_", n), temp)
}
head(gram_2)
The output looks like this:
> head(gram_2)
Word Freq
53 had_a 2
101 to_be 2
1 a_court 1
2 a_great 1
3 a_more 1
4 a_taste 1
Update: after realizing that NGramTokenizer belongs to the RWeka package rather than tm, @phiver's answer worked for me:
ngrams <- RWeka::NGramTokenizer(corpusdf, Weka_control(min=1, max=5))
ngrams <- data.frame(table(ngrams),
stringsAsFactors = FALSE)
ngrams <- ngrams[order(ngrams$Freq, decreasing = TRUE),]
head(ngrams)
However, this mixes all the n-grams together, which makes little sense if you want to rank them by frequency (the onegrams will naturally come out on top). So here is a loop solution:
for (n in seq_len(5)) {
  temp <- RWeka::NGramTokenizer(corpusdf, Weka_control(min = n, max = n))
  temp <- data.frame(table(unlist(temp)),
                     stringsAsFactors = FALSE)
  colnames(temp) <- c("Word", "Freq")
  temp <- temp[order(temp$Freq, decreasing = TRUE), ]
  assign(paste0("gram_", n), temp)
}
head(gram_2)
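If you cannot install RWeka (it needs Java) or quanteda, the same per-n counting can be done in base R. This is a minimal sketch, not either package's method: the helper `count_ngrams` is a hypothetical name, and it splits on whitespace, slides a length-n window over the words with `embed()`, and tabulates the joined n-grams.

```r
# Base-R n-gram counter (sketch; count_ngrams is a made-up helper name).
count_ngrams <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, "\\s+"), function(words) {
    if (length(words) < n) return(character(0))
    # embed() returns the windows with columns in reverse order, so flip them
    apply(embed(words, n)[, n:1, drop = FALSE], 1, paste, collapse = "_")
  }))
  out <- as.data.frame(table(grams), stringsAsFactors = FALSE)
  colnames(out) <- c("Word", "Freq")
  out[order(out$Freq, decreasing = TRUE), ]
}

texts <- c("she is the kind of mother", "she is happy to be here")
head(count_ngrams(texts, 2))
```

Called on `corpusdf$text`, this gives the same Word/Freq data frames as the loops above, one per value of n.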