What exactly is a unicode string?
What's the difference between a regular string and unicode string?
What is utf-8?
I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?
i18n Strings (Unicode)
> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'
## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1' ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8') ## Convert bytes back to a unicode string
> t == ustring ## It's the same as the original, yay!
True
Files Unicode
import codecs
f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    # here line is a *unicode* string
    pass
This answer is about Python 2. In Python 3, str is a Unicode string.
Python's str type is a collection of 8-bit characters. The English alphabet can be represented using these 8-bit characters, but symbols such as ±, ♠, Ω and ℑ cannot.
Unicode is a standard for working with a wide range of characters. Each symbol has a codepoint (a number), and these codepoints can be encoded (converted to a sequence of bytes) using a variety of encodings.
UTF-8 is one such encoding. The low codepoints are encoded using a single byte, and higher codepoints are encoded as sequences of bytes.
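As a rough illustration of the variable-length encoding described above, here is a small sketch. It assumes Python 3 (where str is already Unicode), and the helper name utf8_length is made up for this example.

```python
def utf8_length(ch):
    """Return how many bytes UTF-8 uses to encode a single character."""
    return len(ch.encode('utf-8'))

# 'A', n-with-tilde (U+00F1), reversed E (U+018E), spade suit (U+2660)
for ch in ['A', '\xf1', '\u018e', '\u2660']:
    print('U+%04X -> %d byte(s)' % (ord(ch), utf8_length(ch)))
```

The ASCII letter takes one byte, the two characters from the question take two bytes each, and the spade symbol takes three.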
Python's unicode type is a collection of codepoints. The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.
The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each codepoint to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high codepoints and are encoded as a sequence of two bytes rather than a single byte.
When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.
The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original codepoints by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
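For comparison, the same round trip can be sketched in Python 3. This is an assumed translation of the question's code, not part of it: str plays the role of unicode, encode() returns a bytes object, and bytes.decode() stands in for unicode(s, 'utf-8').

```python
ustring = 'A unicode \u018e string \xf1'  # Python 3: str is already Unicode
s = ustring.encode('utf-8')               # bytes: b'A unicode \xc6\x8e string \xc3\xb1'
t = s.decode('utf-8')                     # back to str, like unicode(s, 'utf-8') in Python 2

print(len(ustring))   # 20 codepoints
print(len(s))         # 22 bytes
print(t == ustring)   # True
```

The two lengths differ by two because each of the two non-ASCII characters needs one extra byte in UTF-8.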
The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
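A minimal sketch of the same idea in Python 3, where the built-in open() accepts an encoding argument directly. The temporary file path and the sample text are assumptions made so that the example is self-contained.

```python
import os
import tempfile

text = 'A unicode \u018e string \xf1'

# Write the text to a throwaway file as UTF-8 bytes (path is made up for this sketch).
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading with the same encoding yields str (Unicode) lines.
with open(path, 'r', encoding='utf-8') as f:
    lines = [line for line in f]

# Reading in binary mode yields the raw, still-encoded bytes.
with open(path, 'rb') as f:
    raw = f.read()

print(lines[0] == text)
print(len(raw))
```

Opening the file in text mode with the right encoding gives back the original characters, while binary mode exposes the UTF-8 bytes that are actually stored on disk.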