std::string and UTF-8 encoded Unicode

Virus721

If I understand correctly, it is possible to use both string and wstring to store Unicode text in a variable-width encoding.

  • With char, ASCII characters take a single byte, some Chinese characters take 3 or 4, etc. This means that str[3] doesn't necessarily point to the 4th character.

  • With wchar_t it is the same thing, but the minimum number of bytes used per character is always 2 (instead of 1 for char), and a 3 or 4 byte wide character will take 2 wchar_t.

Right?

So, what if I want to use string::find_first_of() or string::compare(), etc. with such a variable-width encoded string? Will it work? Does the string class handle the fact that characters have a variable size? Or should I only use them as dumb, feature-less byte arrays, in which case I'd rather go for a wchar_t[] buffer.

If std::string doesn't handle that, second question: are there libraries providing string classes that handle UTF-8 encoding, so that str[3] actually points to the 4th character (which would be a byte sequence of length 1 to 4)?

Sorin

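You are talking about Unicode. Unicode assigns each character a code point in the range U+0000 to U+10FFFF, so a fixed-width representation (UTF-32) needs 32 bits per character. Since that wastes memory for most text, there are more compact encodings. UTF-8 is one such encoding: it uses bytes as code units and maps each code point to 1, 2, 3 or 4 bytes. UTF-16 is another: it uses 16-bit code units and maps each code point to 1 or 2 units (2 or 4 bytes); only the code points that need 4 bytes in UTF-8 take 2 units. You can use either encoding with either string or wstring, although the usual pairing is UTF-8 in a string and UTF-16 in a wstring on platforms where wchar_t is 16 bits wide (such as Windows). UTF-8 tends to be more compact for English text and numbers.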
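To make the byte counts concrete, here is a minimal sketch of the UTF-8 bit layout. The names encode_utf8 and utf16_units are made up for this illustration and are not part of the standard library; the encoder does not reject surrogate code points, so treat it as a demonstration rather than a production encoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Sketch only: surrogate code points (U+D800..U+DFFF) are not rejected.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && "beyond the Unicode range");
    std::string out;
    if (cp < 0x80) {                     // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {             // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {           // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                             // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: number of 16-bit code units UTF-16 needs
// for the same code point (2 means a surrogate pair).
std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}
```

For instance, encode_utf8(U'A') is 1 byte long, encode_utf8(0x4E2D) (the character 中) is the 3 bytes E4 B8 AD, and encode_utf8(0x1F600) is 4 bytes long; utf16_units returns 1, 1 and 2 for the same three inputs.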

Some things will work regardless of the encoding and type used; others will not. Every function that needs to know where one character ends and the next begins will be broken, because the 5th character is not always the 5th entry in the underlying array. It might look like it works with certain examples, but it will eventually break. string::compare will work for byte-wise equality, but do not expect alphabetical ordering from it; that is language dependent. On a UTF-8 string, string::find_first_of will work when every character you search for is ASCII, but with non-ASCII characters it matches individual bytes rather than whole characters, so it can report a match in the middle of a different character. Whether that happens depends on the data, which makes the resulting bugs very hard to find.
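The following sketch illustrates these points on the UTF-8 bytes of "naïve". The helpers sequence_length, count_code_points and find_first_char_of are hypothetical names written for this example, and they assume the input is valid UTF-8; check_examples only records, as assertions, what the standard string functions do with such data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of the sequence starting at lead byte b.
// Bytes that cannot start a sequence are stepped over one at a time.
std::size_t sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical helper: count code points by skipping continuation bytes (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical character-aware variant of find_first_of: byte offset of the
// first whole character of text that is also a whole character of set.
std::size_t find_first_char_of(const std::string& text, const std::string& set) {
    for (std::size_t i = 0; i < text.size(); ) {
        const std::size_t len = sequence_length(static_cast<unsigned char>(text[i]));
        for (std::size_t j = 0; j < set.size(); ) {
            const std::size_t set_len = sequence_length(static_cast<unsigned char>(set[j]));
            if (text.compare(i, len, set, j, set_len) == 0) return i;
            j += set_len;
        }
        i += len;
    }
    return std::string::npos;
}

void check_examples() {
    const std::string text = "na\xC3\xAFve";  // "naïve": 5 characters, 6 bytes
    assert(text.size() == 6);
    assert(count_code_points(text) == 5);
    // text[3] is the second byte of 'ï', not the 4th character ('v').
    assert((static_cast<unsigned char>(text[3]) & 0xC0) == 0x80);
    assert(text[4] == 'v');
    // Searching for an ASCII character is safe.
    assert(text.find_first_of("v") == 4);
    // Searching for 'é' (C3 A9) falsely matches the lead byte of 'ï' (C3 AF).
    assert(text.find_first_of("\xC3\xA9") == 2);
    assert(find_first_char_of(text, "\xC3\xA9") == std::string::npos);
    assert(find_first_char_of(text, "\xC3\xAF") == 2);
    // Byte-wise ordering: "z" sorts before "é", which is not alphabetical.
    assert(std::string("z") < std::string("\xC3\xA9"));
}
```

Note that plain string::find with a whole encoded character as the needle does not suffer from the false match, because it compares the complete byte sequence; it is the "set of characters" functions that treat each byte separately.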

The best thing is to find a library that handles this for you and to ignore the type underneath (unless you have strong reasons to pick one or the other).
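To show the kind of work such a library does, here is a rough sketch that decodes UTF-8 into a std::u32string, where each element is one code point and indexing therefore works per character. decode_utf8 is a hypothetical helper, not a library function, and it assumes valid input: it does not check for overlong forms, surrogates or stray continuation bytes, which a real library would.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into one char32_t per code point.
// Sketch only: assumes the input is valid UTF-8.
std::u32string decode_utf8(const std::string& utf8) {
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else                         { cp = b & 0x07; extra = 3; }
        assert(i + extra < utf8.size() && "truncated sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With the 11-byte input "a\xC3\xA9\xE4\xB8\xAD\xF0\x9F\x98\x80z", the result has 5 elements and element [3] is U+1F600, the 4th character. Even then, one code point is not always what a user perceives as one character (combining marks, for example), which is another reason to prefer a dedicated Unicode library.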
