I have a data frame corpus in R that looks like the sample below. I want to create n-grams (up to 5-grams) using a loop or a function. At the moment I am doing it manually, one n at a time:
Sample corpus structure:
{"colleagues were also at the other two events in aberystwyth and flint and by all accounts had a great time", "the lineup was whittled down to a more palatable five in when the bing crosby souffle going my way bested both gaslight and double indemnity proving oscar voters have always had a taste for pabulum", "felt my first earthquake today whole building at work was shaking", "she is the kind of mother friend and woman i aspire everyday to be", "she was processed and released pending a court appearance", "watching some sunday night despite the sadness i have been feeling i also feel very blessed and happy to be carrying another miracle", "every night when we listen to poohs heartbeat our hearts feel so much happiness and peace"}
onegram <- NGramTokenizer(corpusdf, Weka_control(min = 1, max = 1))
onegram <- data.frame(table(onegram))
onegram <- onegram[order(onegram$Freq, decreasing = TRUE), ]
colnames(onegram) <- c("Word", "Freq")
onegram[1:15, ]

bigram <- NGramTokenizer(corpusdf, Weka_control(min = 2, max = 2, delimiters = tokendelim))
bigram <- data.frame(table(bigram))
bigram <- bigram[order(bigram$Freq, decreasing = TRUE), ]
colnames(bigram) <- c("Word", "Freq")
bigram[1:15, ]
Any ideas?
I didn't know the NGramTokenizer function and couldn't get it to work, so here is a solution in quanteda. It produces a separate frequency table per iteration (gram_1 for onegrams, gram_2 for bigrams, and so on):
corpusdf <- data.frame(text = c("colleagues were also at the other two events in aberystwyth and flint and by all accounts had a great time", "the lineup was whittled down to a more palatable five in when the bing crosby souffle going my way bested both gaslight and double indemnity proving oscar voters have always had a taste for pabulum", "felt my first earthquake today whole building at work was shaking", "she is the kind of mother friend and woman i aspire everyday to be", "she was processed and released pending a court appearance", "watching some sunday night despite the sadness i have been feeling i also feel very blessed and happy to be carrying another miracle", "every night when we listen to poohs heartbeat our hearts feel so much happiness and peace"),
stringsAsFactors = FALSE)
library("quanteda")
tokens <- tokens(corpusdf$text, what = "word")
for (n in seq_len(5)) {
  temp <- tokens_ngrams(tokens, n = n, skip = 0L, concatenator = "_")
  temp <- data.frame(table(unlist(temp)),
                     stringsAsFactors = FALSE)
  colnames(temp) <- c("Word", "Freq")
  temp <- temp[order(temp$Freq, decreasing = TRUE), ]
  assign(paste0("gram_", n), temp)
}
head(gram_2)
The output looks like this:
> head(gram_2)
Word Freq
53 had_a 2
101 to_be 2
1 a_court 1
2 a_great 1
3 a_more 1
4 a_taste 1
Update: after realizing that NGramTokenizer belongs to the RWeka package rather than tm, @phiver's answer worked for me:
ngrams <- RWeka::NGramTokenizer(corpusdf, Weka_control(min=1, max=5))
ngrams <- data.frame(table(ngrams),
stringsAsFactors = FALSE)
ngrams <- ngrams[order(ngrams$Freq, decreasing = TRUE),]
head(ngrams)
However, this mixes all the n-grams together, which makes little sense if you want to rank them by frequency (the onegrams will naturally come out on top). So here is a loop solution:
for (n in seq_len(5)) {
  temp <- RWeka::NGramTokenizer(corpusdf, Weka_control(min = n, max = n))
  temp <- data.frame(table(unlist(temp)),
                     stringsAsFactors = FALSE)
  colnames(temp) <- c("Word", "Freq")
  temp <- temp[order(temp$Freq, decreasing = TRUE), ]
  assign(paste0("gram_", n), temp)
}
head(gram_2)
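If you cannot install RWeka (it needs Java) or quanteda, the same per-n counting can be done in base R. This is a minimal sketch, not either package's method: the helper `count_ngrams` is a hypothetical name, and it splits on whitespace, slides a length-n window over the words with `embed()`, and tabulates the joined n-grams.

```r
# Base-R n-gram counter (sketch; count_ngrams is a made-up helper name).
count_ngrams <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, "\\s+"), function(words) {
    if (length(words) < n) return(character(0))
    # embed() returns the windows with columns in reverse order, so flip them
    apply(embed(words, n)[, n:1, drop = FALSE], 1, paste, collapse = "_")
  }))
  out <- as.data.frame(table(grams), stringsAsFactors = FALSE)
  colnames(out) <- c("Word", "Freq")
  out[order(out$Freq, decreasing = TRUE), ]
}

texts <- c("she is the kind of mother", "she is happy to be here")
head(count_ngrams(texts, 2))
```

Called on `corpusdf$text`, this gives the same Word/Freq data frames as the loops above, one per value of n.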