Remove emoji from string in R

the_darkside

I have a list of tweets, many of which contain emojis that need to be removed. What would be the most effective method for doing this in R?

I tried the following, which is supposed to replace every word beginning with "\" with an empty string, but I get this error:

some_tweets <- gsub("\\\w+ *", "", some_tweets)
Error: '\w' is an unrecognized escape in character string starting ""\\\w"

Here is a sample of the data:

> head(some_tweets)
[1] "ஆமா நான் பாக்கவே இல்லை \U0001f625\U0001f625\U0001f625"                               
[2] "எனக்கு அனுப்பலாமே \U0001f913\U0001f913\U0001f913"                                  
[3] "அவர் ஏன்டா ப்ளாக் பண்ணார் \U0001f602\U0001f602\U0001f602\U0001f602"                        
[4] "ஆமா"                                                                           
[5] "RT : சும்மார்றா சுன்னி.. ~ ஆதவன்"                                                      
[6] "கைலியை எல்லாம் லூஸ் பண்ணிகிட்டு உக்காந்து இருக்கேன் அடுத்து போடுங்கயா \U0001f608\U0001f608\U0001f608"


> dput(head(some_tweets))
c("ஆமா நான் பாக்கவே இல்லை \U0001f625\U0001f625\U0001f625", 
"எனக்கு அனுப்பலாமே \U0001f913\U0001f913\U0001f913", 
"அவர் ஏன்டா ப்ளாக் பண்ணார் \U0001f602\U0001f602\U0001f602\U0001f602", 
"ஆமா", "RT : சும்மார்றா சுன்னி.. ~ ஆதவன்", 
"கைலியை எல்லாம் லூஸ் பண்ணிகிட்டு உக்காந்து இருக்கேன் அடுத்து போடுங்கயா \U0001f608\U0001f608\U0001f608"
)

alistaire

Check out regular-expressions.info on Unicode, which has a thorough explanation of Unicode in regex. The part that matters here is that you can match Unicode characters with \p{xx}, where xx is the abbreviation of the general category they belong to (e.g. L for letters, M for marks). Here, it seems your emoji are in the So (shorthand for Other_Symbol) and Cn (shorthand for Unassigned) categories, so we can sub them out with:

gsub('\\p{So}|\\p{Cn}', '', some_tweets, perl = TRUE)
## [1] "ஆமா நான் பாக்கவே இல்லை "                                       
## [2] "எனக்கு அனுப்பலாமே "                                           
## [3] "அவர் ஏன்டா ப்ளாக் பண்ணார் "                                       
## [4] "ஆமா"                                                        
## [5] "RT : சும்மார்றா சுன்னி.. ~ ஆதவன்"                               
## [6] "கைலியை எல்லாம் லூஸ் பண்ணிகிட்டு உக்காந்து இருக்கேன் அடுத்து போடுங்கயா "
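
The result above still carries the space that sat in front of the emoji. Here is a minimal sketch, assuming you also want that trailing whitespace removed. The helper name strip_emoji and the sample vector x are invented for illustration and are not part of the original data. Keep in mind that \p{So} is broader than emoji: it also matches symbols such as ©.

```r
# Hypothetical helper: drop Other_Symbol / Unassigned code points, then trim.
strip_emoji <- function(x) {
  cleaned <- gsub("\\p{So}|\\p{Cn}", "", x, perl = TRUE)
  trimws(cleaned)  # remove the whitespace left behind where the emoji were
}

# Made-up sample input, not taken from the tweets above
x <- c("hello \U0001f625\U0001f625", "no emoji here")
strip_emoji(x)
```

If the spacing inside the tweets matters, skip trimws() and handle the whitespace separately.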

Note that you need to set perl = TRUE, as the \p{xx} notation is not supported by R's default POSIX 1003.2 extended regular expressions; see ?base::regex and ?grep.
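
As for the error in the question: inside an R string literal every regex backslash has to be doubled, so "\\w" is the regex \w, while "\\\w" ends in \w, which is not a valid string escape. On top of that, \U0001f625 is only how R prints a single code point, so there is no literal backslash in the data to match. A small sketch of both points, using an invented one-element vector x:

```r
# Made-up sample string: two letters, a space, and one emoji
x <- "ok \U0001f625"

# The emoji is a single character, not a backslash followed by text
n_chars <- nchar(x)

# Fixed-string search for a literal backslash finds nothing
has_backslash <- grepl("\\", x, fixed = TRUE)

# "\\w" is the correctly escaped form of the regex \w; it targets
# word characters, so it is not a way to pick out the emoji
without_words <- gsub("\\w+ *", "", x)

n_chars
has_backslash
```

This is why a category-based pattern such as \p{So} is the more direct route than trying to match the printed escape sequence.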
