Automatically escape unicode characters

hadley

How can you display a unicode string, say:

x <- "•"

using its escaped equivalent?

y <- "\u2022"

identical(x, y)
# [1] TRUE

(I'd like to be able to do this because CRAN packages must contain only ASCII, but sometimes you want to use Unicode in an error message or similar.)

Xin Yin

After digging into some documentation about iconv, I think you can accomplish this using only the base package. But you need to pay extra attention to the encoding of the string.

On a system with UTF-8 encoding:

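# output of stringi's stri_escape_unicode(), for reference
> stri_escape_unicode("你好世界")
[1] "\\u4f60\\u597d\\u4e16\\u754c"

# base R: the same code units via iconv (use big endian)
> x <- "你好世界"
> iconv(x, "UTF-8", "UTF-16BE", toRaw=T)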
[[1]]
[1] 4f 60 59 7d 4e 16 75 4c

> x <- "•"
> iconv(x, "UTF-8", "UTF-16BE", toRaw=T)    
[[1]]
[1] 20 22

But, if you are on a system with latin1 encoding, things may go wrong.

> x <- "•"
> y <- "\u2022"
> identical(x, y)
[1] FALSE
> stri_escape_unicode(x)
[1] "\\u0095" # <- oops!

# culprit
> Encoding(x)
[1] "latin1"

# and it causes problems for iconv
> iconv(x, Encoding(x), "Unicode")
Error in iconv(x, Encoding(x), "Unicode") : 
  unsupported conversion from 'latin1' to 'Unicode' in codepage 1252
> iconv(x, Encoding(x), "UTF-16BE")
Error in iconv(x, Encoding(x), "UTF-16BE") : 
  embedded nul in string: '\0•'

It is safer to convert the string to UTF-8 before converting it to UTF-16:

> iconv(enc2utf8(enc2native(x)), "UTF-8", "UTF-16BE", toRaw=T)
[[1]]
[1] 20 22

EDIT: The enc2native() step can cause problems on some systems for strings that are already UTF-8, because characters the native encoding cannot represent are replaced with <U+XXXX> escapes. It is probably safer to check the encoding before conversion.

> Encoding("•")
[1] "latin1"
> enc2native("•")
[1] "•"
> enc2native("\u2022")
[1] "•"
# on Windows with a default latin1 encoding
> Encoding("测试") 
[1] "UTF-8"
> enc2native("测试") 
[1] "<U+6D4B><U+8BD5>"   # <- BAD! 
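The check suggested in the EDIT could be sketched as below. `to_utf8` is a hypothetical helper name, not part of base R, and it assumes a single string. As far as I can tell, `enc2utf8()` already leaves UTF-8-marked strings alone, so the explicit check mainly documents the intent of skipping `enc2native()`.

```r
# Hypothetical helper (not in base R): normalise one string to UTF-8
# without going through enc2native().
to_utf8 <- function(x) {
  if (Encoding(x) == "UTF-8") {
    x                 # already UTF-8: leave untouched
  } else {
    enc2utf8(x)       # latin1-marked or native strings are converted
  }
}

# Build a latin1-marked string from an escaped byte so the example does not
# depend on the encoding of the source file.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"

y <- to_utf8(x)
Encoding(y)
# [1] "UTF-8"
charToRaw(y)
# [1] 66 61 c3 a7 69 6c 65
```

The latin1 byte e7 becomes the two UTF-8 bytes c3 a7, and a string that is already marked UTF-8 passes through unchanged.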

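For some characters or languages, UTF-16 may not be enough: code points above U+FFFF are encoded as a surrogate pair, so the UTF-16 bytes no longer spell out the code point. So you should probably use UTF-32, since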

The UTF-32 form of a character is a direct representation of its codepoint.
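To make the difference concrete, here is a small sketch comparing the two encodings for a character outside the Basic Multilingual Plane. The choice of U+1F600 is just an illustration, and it assumes your platform's iconv supports both "UTF-16BE" and "UTF-32BE".

```r
# U+1F600 lies above U+FFFF, so UTF-16 needs a surrogate pair for it.
x <- "\U0001F600"

utf16 <- iconv(x, "UTF-8", "UTF-16BE", toRaw = TRUE)[[1]]
utf32 <- iconv(x, "UTF-8", "UTF-32BE", toRaw = TRUE)[[1]]

utf16
# [1] d8 3d de 00   (surrogates D83D + DE00, not the code point itself)
utf32
# [1] 00 01 f6 00   (the code point 0x0001F600, byte for byte)
```

With UTF-32BE every character is exactly four bytes, and reading them as one big-endian integer gives the code point directly.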

Based on the trial and error above, here is probably a safer escape function we can write:

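unicode_escape <- function(x, endian="big") {
  # expects a single string; normalise anything not marked UTF-8
  if (Encoding(x) != 'UTF-8') {
    x <- enc2utf8(enc2native(x))
  }
  to.enc <- ifelse(endian == 'big', 'UTF-32BE', 'UTF-32LE')

  # iconv(toRaw=TRUE) returns a list of raw vectors; take the byte values
  bytes <- as.integer(unlist(iconv(x, "UTF-8", to.enc, toRaw=TRUE)))
  # there may be some better way to do this.
  # one column per character, four bytes each
  runes <- matrix(bytes, nrow=4)
  # reorder little-endian bytes so the most significant byte comes first
  if (endian != 'big') runes <- runes[4:1, , drop=FALSE]
  escaped <- apply(runes, 2, function(rb) {
    # combine the four bytes into the code point
    cp <- as.integer(sum(rb * 256^(3:0)))
    if (cp < 128) {
      rawToChar(as.raw(cp))        # ASCII is kept as-is
    } else if (cp <= 0xFFFF) {
      sprintf("\\u%04x", cp)       # zero-padded, so U+0100 gives \u0100
    } else {
      sprintf("\\U%08x", cp)       # beyond the BMP, R needs \U
    }
  })
  paste(escaped, collapse="")
}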

Tests:

> unicode_escape("•••ERROR!!!•••")
[1] "\\u2022\\u2022\\u2022ERROR!!!\\u2022\\u2022\\u2022"
> unicode_escape("Hello word! 你好世界!")
[1] "Hello word! \\u4f60\\u597d\\u4e16\\u754c!"
> "\u4f60\u597d\u4e16\u754c"
[1] "你好世界"
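As a cross-check, the same idea can be sketched without byte arithmetic by using base R's `utf8ToInt()`, which returns the code points directly. This is an alternative technique, not the function above; `escape_codepoints` is a hypothetical name, and the sketch assumes a single string that contains no double quotes or backslashes (those are not escaped here, so the round trip through `parse()` would fail for them).

```r
# Hypothetical alternative: escape via code points from utf8ToInt().
escape_codepoints <- function(x) {
  cps <- utf8ToInt(enc2utf8(x))            # one integer per character
  esc <- vapply(cps, function(cp) {
    if (cp < 128L) {
      intToUtf8(cp)                        # ASCII stays as-is
    } else if (cp <= 0xFFFFL) {
      sprintf("\\u%04x", cp)               # BMP: \uXXXX
    } else {
      sprintf("\\U%08x", cp)               # supplementary planes: \UXXXXXXXX
    }
  }, character(1))
  paste(esc, collapse = "")
}

# The input is written with escapes so this example is itself pure ASCII.
x <- "\u2022\u2022\u2022ERROR!!!\u2022\u2022\u2022"
escaped <- escape_codepoints(x)
escaped
# [1] "\\u2022\\u2022\\u2022ERROR!!!\\u2022\\u2022\\u2022"

# Round trip: parsing the escaped text as an R string literal should give
# back the original string.
back <- eval(parse(text = paste0('"', escaped, '"')))
identical(back, x)
# [1] TRUE
```

The round trip is the property the question asks for: the escaped, ASCII-only form is `identical()` to the original string once R parses it.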
