Is it ok to write the unicode characters inside string literals in C++

FrozenHeart Published at Dev

FrozenHeart

Is it ok to write the following code?

const char* str = "§some-text";

Will str contain the correct UTF-8 representation of the § character if the source files was saved in a UTF-8 encoding?

Or is the only way to write it is to use u8-prefixed string literals?

Simple

Whether you can use Unicode characters in your source code (not just in string literals) is implementation-defined. The only way to be portable is to stick to characters in the "basic source character set" and use u8"\u00a7some-text".

[lex.phases]/1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

The "basic source character set" is:

The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:

a b c d e f g h i j k l m n o p q r s t u v w x y z

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

0 1 2 3 4 5 6 7 8 9

_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " ’

Collected from the Internet

Please contact [email protected] to delete if infringement.