What is a unicode string?

Stevanus Iskandar

What exactly is a unicode string?

What's the difference between a regular string and a unicode string?

What is utf-8?

I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?

i18n Strings (Unicode)

> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'

## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1'  ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8')             ## Convert bytes back to a unicode string
> t == ustring                      ## It's the same as the original, yay!
True

Files Unicode

import codecs

f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    # here line is a *unicode* string
    pass
tom

This answer is about Python 2. In Python 3, str is a Unicode string.

Python's str type is a collection of 8-bit characters. The English alphabet can be represented using these 8-bit characters, but symbols such as ±, ♠, Ω and ℑ cannot.

Unicode is a standard for working with a wide range of characters. Each symbol has a codepoint (a number), and these codepoints can be encoded (converted to a sequence of bytes) using a variety of encodings.
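To make the codepoint/encoding distinction concrete, here is a minimal sketch. It assumes Python 3 (where str is already Unicode and bytes is the 8-bit type), and the choice of Ω and of the two encodings is only for illustration.

```python
# Python 3 sketch: str holds codepoints, bytes holds encoded data.
symbol = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

code_point = ord(symbol)                   # the number assigned to the symbol
utf8_bytes = symbol.encode("utf-8")        # one possible byte sequence
utf16_bytes = symbol.encode("utf-16-be")   # another encoding, different bytes

print(hex(code_point))  # 0x3a9
print(utf8_bytes)       # b'\xce\xa9'
print(utf16_bytes)      # b'\x03\xa9'
```

The codepoint stays the same; only the byte representation changes with the encoding.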

UTF-8 is one such encoding. Codepoints up to U+007F (the ASCII range) are encoded using a single byte, and higher codepoints are encoded as sequences of two to four bytes.
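A quick way to see the variable width of UTF-8 is to encode a few characters and count the bytes. This is a sketch assuming Python 3; the sample characters are arbitrary picks from different codepoint ranges.

```python
# Python 3 sketch: UTF-8 uses more bytes for higher codepoints.
samples = ["A", "\xf1", "\u2660", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+%04X -> %d byte(s)" % (ord(ch), len(encoded)))

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```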

Python's unicode type is a collection of codepoints. The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
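The character count can be checked directly. The sketch below assumes Python 3, where the u prefix is optional and a plain string literal is already a Unicode string.

```python
# Python 3 sketch: len() on a Unicode string counts codepoints.
ustring = "A unicode \u018e string \xf1"

char_count = len(ustring)
# Collect the codepoints that fall outside the ASCII range.
non_ascii = [hex(ord(c)) for c in ustring if ord(c) > 127]

print(char_count)  # 20
print(non_ascii)   # ['0x18e', '0xf1']
```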

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.

The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each codepoint to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high codepoints and are encoded as a sequence of two bytes rather than a single byte.
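The byte count can be verified with a short sketch. It assumes Python 3, where encode() returns a bytes object rather than a Python 2 str.

```python
# Python 3 sketch: encoding 20 codepoints yields 22 UTF-8 bytes.
ustring = "A unicode \u018e string \xf1"
s = ustring.encode("utf-8")

print(type(s).__name__)  # bytes
print(len(ustring))      # 20
print(len(s))            # 22
print(s)                 # b'A unicode \xc6\x8e string \xc3\xb1'
```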

When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.
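The difference between one character and its two bytes shows up when indexing. This sketch assumes Python 3, where indexing or slicing bytes exposes the individual byte values as integers.

```python
# Python 3 sketch: one codepoint in the text, two byte values in the encoding.
ustring = "A unicode \u018e string \xf1"
s = ustring.encode("utf-8")

char_at_10 = ustring[10]       # a single character (codepoint U+018E)
bytes_at_10 = list(s[10:12])   # the two bytes that encode it

print(hex(ord(char_at_10)))  # 0x18e
print(bytes_at_10)           # [198, 142], i.e. 0xc6 and 0x8e
```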

The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original codepoints by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
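As a sketch of the round trip, assuming Python 3: the equivalent of unicode(s, 'utf-8') there is s.decode('utf-8'). The Latin-1 decode is added only to illustrate that the bytes must be parsed with the encoding that produced them.

```python
# Python 3 sketch: decoding reverses encoding, if the same encoding is used.
ustring = "A unicode \u018e string \xf1"
s = ustring.encode("utf-8")

t = s.decode("utf-8")          # parses the two-byte sequences back into codepoints
wrong = s.decode("latin-1")    # treats every byte as its own character

print(t == ustring)      # True
print(wrong == ustring)  # False
print(len(wrong))        # 22
```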

The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
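For comparison, here is a sketch of the same idea in Python 3, where the built-in open() accepts an encoding argument. The temporary directory and file contents are assumptions made so the example is self-contained.

```python
# Python 3 sketch: open() with an encoding decodes the file's bytes into str.
import os
import tempfile

text = "A unicode \u018e string \xf1"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "foo.txt")  # hypothetical file for the demo

    # Write the text to disk as UTF-8 encoded bytes.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Text mode with the matching encoding yields Unicode strings.
    with open(path, encoding="utf-8") as f:
        lines = [line for line in f]

    # Binary mode yields the raw, still-encoded bytes.
    with open(path, "rb") as f:
        raw = f.read()

print(lines == [text])  # True
print(len(raw))         # 22
```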
