What is a unicode string?

Stevanus Iskandar Published at Dev

What exactly is a unicode string?

What's the difference between a regular string and unicode string?

What is utf-8?

I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?

i18n Strings (Unicode)

> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'

## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1'  ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8')             ## Convert bytes back to a unicode string
> t == ustring                      ## It's the same as the original, yay!
True

Files Unicode

import codecs

f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    # here line is a *unicode* string
    print(repr(line))
f.close()

tom

This answer is about Python 2. In Python 3, str is a Unicode string.

Python's str type is a sequence of 8-bit characters (bytes). The English alphabet fits comfortably in that range, but 256 possible values are not enough to cover symbols such as ±, ♠, Ω and ℑ along with the rest of the world's scripts.

Unicode is a standard for working with a wide range of characters. Each symbol has a codepoint (a number), and these codepoints can be encoded (converted to a sequence of bytes) using a variety of encodings.
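To make the codepoint/encoding distinction concrete, here is a minimal sketch. It assumes Python 3, where str already holds codepoints and encode() returns bytes; the variable names are only illustrative.

```python
# Python 3 sketch: each character has a codepoint, which is just an integer.
text = u'A\u018e\xf1'
codepoints = [ord(ch) for ch in text]
print(codepoints)  # [65, 398, 241]

# The same codepoint turns into different bytes under different encodings.
e_reversed = u'\u018e'
utf8_bytes = e_reversed.encode('utf-8')       # two bytes: 0xC6 0x8E
utf16_bytes = e_reversed.encode('utf-16-be')  # two bytes: 0x01 0x8E
print(repr(utf8_bytes), repr(utf16_bytes))
```

The codepoint (398, or U+018E) stays the same in both cases; only the byte representation chosen by the encoding differs.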

UTF-8 is one such encoding. The low codepoints are encoded using a single byte, and higher codepoints are encoded as sequences of bytes.
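A quick way to see the variable-width behaviour is to encode a few characters and count the bytes. This is a sketch for Python 3; the sample characters are my own choice, not part of the question.

```python
# Python 3 sketch: UTF-8 uses more bytes as the codepoint gets higher.
samples = [u'A', u'\xf1', u'\u018e', u'\u2660', u'\U0001f600']

byte_lengths = []
for ch in samples:
    encoded = ch.encode('utf-8')
    byte_lengths.append(len(encoded))
    # Print only ASCII so the output is safe on any terminal.
    print('U+%04X -> %d byte(s)' % (ord(ch), len(encoded)))

# byte_lengths is [1, 2, 2, 3, 4]
```

ASCII characters keep their familiar single-byte values, which is why plain English text looks identical in ASCII and UTF-8.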

Python's unicode type is a collection of codepoints. The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.

The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each codepoint to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high codepoints and are encoded as a sequence of two bytes rather than a single byte.
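The character and byte counts can be checked directly. The sketch below assumes Python 3, where the result of encode() is a bytes object rather than a Python 2 str; the string itself is the one from the question.

```python
# Python 3 sketch: 20 codepoints become 22 bytes once encoded as UTF-8.
ustring = u'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

print(len(ustring))  # 20 characters (codepoints)
print(len(s))        # 22 bytes
print(repr(s))       # b'A unicode \xc6\x8e string \xc3\xb1'
```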

When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.

The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original codepoints by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
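As a rough illustration of the round trip, assuming Python 3 (where bytes.decode('utf-8') plays the role of unicode(s, 'utf-8')):

```python
# Python 3 sketch: decoding reverses encoding, as long as the codec matches.
ustring = u'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

t = s.decode('utf-8')
print(t == ustring)  # True

# Decoding with the wrong codec does not fail here, but it misreads every
# byte as its own character, so the result no longer matches the original.
wrong = s.decode('latin-1')
print(len(wrong))        # 22, one character per byte
print(wrong == ustring)  # False
```

This is why the encoding has to be known (or agreed on) before bytes can be turned back into text.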

The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
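For comparison, a Python 3 sketch of the same idea using io.open() with an explicit encoding. The temporary directory and the file contents are assumptions made so the example is self-contained; only the file name foo.txt comes from the question.

```python
import io
import os
import tempfile

# Hypothetical location for foo.txt so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')

# Writing: the text is encoded to UTF-8 bytes on the way to disk.
with io.open(path, 'w', encoding='utf-8', newline='\n') as f:
    f.write(u'A unicode \u018e string \xf1\n')

# Reading in text mode: bytes are decoded back into Unicode strings.
with io.open(path, 'r', encoding='utf-8') as f:
    lines = [line for line in f]

# Reading in binary mode shows what is actually stored in the file.
with io.open(path, 'rb') as f:
    raw = f.read()

print(len(lines[0]))  # 21 characters, including the newline
print(len(raw))       # 23 bytes on disk
```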

Edited at 2020-11-18