Wrong values in a data frame column in R

Poisson

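I have been stuck on this problem for three days, and I would really appreciate help finding a solution:

To run sentiment analysis on text, I store a list of words and their positive and negative polarity in a data frame: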

         word positive.polarity negative.polarity
1 interesting                 1                 0
2      boring                 0                 1

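Then, for each of these words in the data frame, I want to know whether its context (the context is the group of 3 words preceding the word) contains a booster word or a negation word: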

- booster_words <- c("more","enough", "a lot", "as", "so")
- negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")

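I want to create a new column, positive.ponderate.polarity, that contains the positive polarity value + 4 if the context has both a booster word and a negation word, and the positive polarity value + 9 if the context has only a booster word (no negation word in the context).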
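The rule above can be sketched in isolation for a single word whose context has already been split into words. The helper name ponderate and its arguments are hypothetical, introduced only for illustration, and leaving the polarity unchanged when no booster is present is an assumption rather than something stated in the rule.

```r
booster_words <- c("more", "enough", "a lot", "as", "so")
negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")

# Hypothetical helper: weighted positive polarity of ONE word,
# given the words of its context as a character vector.
ponderate <- function(positive_polarity, context_words) {
  has_booster  <- any(context_words %in% booster_words)
  has_negation <- any(context_words %in% negative_words)
  if (has_booster && has_negation) {
    positive_polarity + 4
  } else if (has_booster) {
    positive_polarity + 9
  } else {
    positive_polarity  # assumption: unchanged when there is no booster
  }
}

ponderate(0, c("was", "not", "so"))      # booster + negation: 0 + 4
ponderate(1, c("course", "was", "so"))   # booster only: 1 + 9
ponderate(1, c("The", "course", "was"))  # no booster: stays 1
```

Note that multi-word entries such as "a lot" and "non plus" can never match a single context word in this sketch; that limitation is shared with any approach that splits the context on spaces.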

Here is the code:

calcPolarity <- function(sentiment_DF,sentences){
     booster_words <- c("more","enough", "a lot", "as", "so")
     negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")
     reduce_words <- c("peu", "presque", "moins", "seulement")
     # pre-allocate the polarity result vector with size = number of sentences
     polarity <- rep.int(0,length(sentences))

     # loop per sentence
     for(i in 1:length(polarity)){
         sentence <- sentences[i]

         # separate each sentence in words using regular expression 
        wordsOfASentence <- unlist(regmatches(sentence,gregexpr("[[:word:]]+",sentence,perl=TRUE)))

         # get the rows of sentiment_DF corresponding to the words in the sentence using match
         # N.B. if a word occurs twice, there will be two equal rows 
         # (but I think it's correct since in this way you count its polarity twice)
         subDF <- sentiment_DF[match(wordsOfASentence,sentiment_DF$word,nomatch = 0),]


         # Find (number) of matching word. 
         wordOfInterest <- wordsOfASentence[which(wordsOfASentence %in% levels(sentiment_DF$word))]  # No multigrepl, so working with duplicates instead. eg interesting
         regexOfInterest <- paste0("([^\\s]+\\s){0,3}", wordOfInterest, "(\\s[^\\s]+){0,3}")

         # extract a context of 3 words before the word in the dataframe
        context <-  stringr::str_extract(sentence, regexOfInterest)
         names(context) <- wordOfInterest  # Helps in forloop

         print(context)
         for(i in 1:length(context)){
             if(any(unlist(strsplit(context[i], " ")) %in% booster_words))

             {
                 print(booster_words)
                 if(any(unlist(strsplit(context[i], " ")) %in% negative_words))

                 {
                     subDF$positive.ponderate.polarity <- subDF$positive.polarity + 4

                 }
                 else 
                 {
                     subDF$positive.ponderate.polarity <- subDF$positive.polarity + 9

                 }
             }
         }



         # Debug option
         print(subDF)

         # calculate the total polarity of the sentence and store in the vector
         polarity[i] <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity)

    }
     return(polarity)
 }

 sentiment_DF <- data.frame(word=c('interesting','boring','pretty'),
                            positive.polarity=c(1,0,1),
                            negative.polarity=c(0,1,0))
 sentences <- c("The course was interesting, but the professor was not so boring")
 result <- calcPolarity(sentiment_DF,sentences)

When I run it on this sentence:

"The course was interesting, but the professor was not so boring"

I get this result:

         word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting                 1                 0                           5
2      boring                 0                 1                           4

But this is not correct. The correct result is:

         word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting                 1                 0                           1
2      boring                 0                 1                           4

I don't understand why I get the wrong values. Any ideas that could help me?

Thanks

Edit:

For example, if I have this data frame:

         word positive.polarity negative.polarity positive.ponderate.polarity negative.ponderate.polarity
1 interesting                 1                 0                           1                           1
2      boring                 0                 1                           4                           2

The result should be: (1+4) - (1+2) = 2

乔克

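I found the error. In a case like this, it helps to debug line by line and print the initial variables, the result of each if statement, or a marker showing which branch of an if/else was taken.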

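Here your initial value subDF$positive.polarity is c(1,0), a vector of length 2, which is the number of sentiment_DF words found in the sentence: c("interesting", "boring").

When i = 1, context = "The course was interesting". There is no booster and no negation word, so subDF$positive.polarity is c(1,0) and subDF$positive.ponderate.polarity is NULL.

When i = 2, context = "was not so boring". There is a booster and a negation word, and subDF$positive.polarity is c(1,0). You add 4 to both elements, when you want to add 4 only to the second element, the one corresponding to "boring". That is why subDF$positive.ponderate.polarity is c(5,4), which is what gets returned.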
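The difference between assigning the whole column and assigning one element can be seen on a bare vector. This is a minimal sketch; the variable names whole_column and ponderate are made up for the illustration, and the values are the ones from the example sentence.

```r
positive.polarity <- c(1, 0)  # rows: "interesting", "boring"

# What the original code does: the scalar 4 is added to every element
whole_column <- positive.polarity + 4
print(whole_column)  # 5 4

# What was intended: change only the element for the current context (i = 2)
i <- 2
ponderate <- positive.polarity
ponderate[i] <- positive.polarity[i] + 4
print(ponderate)  # 1 4
```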

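The trick here is that the lengths of subDF$positive.polarity and subDF$positive.ponderate.polarity depend on the number of sentiment_DF words in the sentence. The corrected code, with debugging output, is below. Here are the fixes: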

A. Initialize it to the same length

 subDF$positive.ponderate.polarity <- subDF$positive.polarity

B. Index with i, so that you add the value only to the element corresponding to the current context element, not to all elements

  subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 4
  subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 9

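C. There is one thing I did not fix, because I'm not sure how you want to treat it... What if the context is "The course was so boring"? There is a booster and no negation word, so it goes to the else branch and adds 9. Is that positive.ponderate.polarity? Shouldn't it be negative.ponderate.polarity?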

calcPolarity(sentiment_DF, "The course was so boring")
    word positive.polarity negative.polarity positive.ponderate.polarity
2 boring                 0                 1                           9

D. Other cases to check:

calcPolarity(sentiment_DF, "The course was interesting, but the professor was not so boring")
         word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting                 1                 0                           1
2      boring                 0                 1                           4

calcPolarity(sentiment_DF, "The course was so interesting")
         word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting                 1                 0                          10
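On point C, one possible treatment is to apply the boost to whichever polarity the word itself carries. This is only a sketch under that assumption: the helper boost_by_sign is hypothetical, and the question does not say how boosted negative words should be scored.

```r
# Assumption: a boost goes to the polarity the word already has,
# so "so boring" raises the negative score rather than the positive one.
boost_by_sign <- function(positive, negative, boost) {
  c(positive = positive + boost * (positive > 0),
    negative = negative + boost * (negative > 0))
}

boost_by_sign(positive = 0, negative = 1, boost = 9)  # "so boring"
boost_by_sign(positive = 1, negative = 0, boost = 9)  # "so interesting"
```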

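Edit to correct the polarity result mentioned in the comments: the polarity output is c(0,5) with the original line polarity[i] <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity). The inner loop reuses i, and because you have 2 context phrases, the final i is 2. So polarity[1] keeps your initial value 0 and the result of your sum is assigned to polarity[2], which is 5, leaving you with c(0, 5). Instead, remove the [i]; it should just be polarity <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity)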
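The effect of the reused loop variable can be reproduced without any of the sentiment logic. The sketch below uses the placeholder value 5 in place of the real sum. The second half shows an alternative that is not what this answer's code does: giving the inner loop its own index (here j), in which case the element-wise fix B would index with j instead of i.

```r
polarity <- rep.int(0, 1)          # one sentence, so length 1

for (i in seq_along(polarity)) {   # outer loop: i = 1
  for (i in 1:2) {                 # inner loop reuses i and leaves it at 2
  }
  polarity[i] <- 5                 # writes to polarity[2], extending the vector
}
print(polarity)  # 0 5

# Alternative (assumption, not the answer's approach): a separate inner index
polarity_j <- rep.int(0, 1)
for (i in seq_along(polarity_j)) {
  for (j in 1:2) {                 # i is left untouched
  }
  polarity_j[i] <- 5
}
print(polarity_j)  # 5
```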

Here is the corrected code:

calcPolarity <- function(sentiment_DF, sentences) {
  booster_words <- c("more", "enough", "a lot", "as", "so")
  negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")
  reduce_words <- c("peu", "presque", "moins", "seulement")  # defined but not used yet
  # pre-allocate the polarity result vector with size = number of sentences
  polarity <- rep.int(0, length(sentences))

  # loop per sentence
  for (i in seq_along(sentences)) {
    sentence <- sentences[i]

    # separate each sentence in words using regular expression
    wordsOfASentence <- unlist(regmatches(sentence, gregexpr("[[:word:]]+", sentence, perl = TRUE)))

    # get the rows of sentiment_DF corresponding to the words in the sentence using match
    # N.B. if a word occurs twice, there will be two equal rows
    # (but I think it's correct since in this way you count its polarity twice)
    subDF <- sentiment_DF[match(wordsOfASentence, sentiment_DF$word, nomatch = 0), ]
    print(subDF)

    # Words of the sentence that appear in sentiment_DF (duplicates are kept).
    # %in% against the column works for both factor and character columns;
    # levels() returns NULL for a character column, the data.frame() default since R 4.0.
    wordOfInterest <- wordsOfASentence[wordsOfASentence %in% sentiment_DF$word]
    regexOfInterest <- paste0("([^\\s]+\\s){0,3}", wordOfInterest, "(\\s[^\\s]+){0,3}")

    # extract a context of up to 3 words before the word
    # (as written, the regex also takes up to 3 words after it)
    context <- stringr::str_extract(sentence, regexOfInterest)
    names(context) <- wordOfInterest  # Helps in forloop

    # Fix A: initialize to the same length, once, before looping over the contexts
    print("initial")
    print(subDF$positive.polarity)
    subDF$positive.ponderate.polarity <- subDF$positive.polarity
    print(subDF$positive.ponderate.polarity)

    # N.B. this inner loop reuses i, so afterwards i is the number of contexts
    for (i in seq_along(context)) {
      print(paste("i:", i))
      print(context)
      contextWords <- unlist(strsplit(context[i], " "))

      if (any(contextWords %in% booster_words)) {
        print(booster_words)
        print("if level 1")
        print(subDF$positive.polarity)
        if (any(contextWords %in% negative_words)) {
          # Fix B: index with i so only the current word's row changes
          subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 4
          print("if level 2A")
          print(subDF$positive.ponderate.polarity)
        } else {
          print("if level 2B")
          subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 9
          print(subDF$positive.ponderate.polarity)
        }

        print("level 2 result")
        print(subDF$positive.ponderate.polarity)
      }
      print("level 1 result")
      print(subDF$positive.ponderate.polarity)
    }
  }
  # Debug option
  print(subDF)

  # calculate the total polarity of the sentence (no [i], see the edit above).
  # As written this covers one sentence per call; negative.ponderate.polarity
  # is never created here, so its sum is 0.
  polarity <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity)

  return(polarity)
}

sentiment_DF <- data.frame(word = c('interesting', 'boring', 'pretty'),
                           positive.polarity = c(1, 0, 1),
                           negative.polarity = c(0, 1, 0))
calcPolarity(sentiment_DF, "The course was interesting, but the professor was not so boring")
calcPolarity(sentiment_DF, "The course was so interesting")
calcPolarity(sentiment_DF, "The course was so boring")
