How do I properly handle  in UTF-8 XML?

Ehryk

I'm having some issues representing , a valid UTF-16 construct and also apparently valid in a Windows filename, in XML for consumption in SQL Server XML (2012).

Take this for example:

DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0"?><FileName>풜〣&#xFFFF;&#xFFFF;</FileName>'

-- Result: XML parsing: line 1, character 41, illegal xml character

This is, however, legal Unicode (a "Unicode Noncharacter" '￿'): http://www.fileformat.info/info/unicode/char/ffff/index.htm

So, I tried this:

DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0" encoding="utf-16"?><FileName>풜〣&#xFFFF;&#xFFFF;</FileName>'

-- Result: XML parsing: line 1, character 59, illegal xml character

So - how am I supposed to represent this file name accurately in XML? I can't just remove the characters, they are indeed &#xFFFF; characters as reported by Get-ChildItem, and I will need to retain a handle to this file.

I can get the XML parsing to work by replacing &#xFFFF; with &#xEF;&#xBF;&#xBF; which is the UTF-8 representation of \uFFFF according to this link I then try to take this XML and insert it to a nvarchar column, and I need this to be the proper representation of the filename.

DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0"?><FileName>풜〣&#xEF;&#xBF;&#xBF;&#xEF;&#xBF;&#xBF;</FileName>'
SELECT F.Item.value('.', 'nvarchar(2000)') FROM @Xml.nodes('//FileName') as F(Item)

-- Returns 풜〣ï¿¿ï¿¿ (not correct)
Solomon Rutzky

&#xFFFF; (i.e. decimal 65535) is not a legal character, even according to the link provided in the question that states "This is, however, legal UTF-16". That link shows it to be illegal, non-character, and cannot be represented in any way (according to their test pages).

Also, according to Unicode.org:

Noncharacters
These codes are intended for process-internal uses.

FFFE ￾ <not a character>
• may be used to detect byte order by contrast with FEFF 
→ FEFF = zero width no-break space

FFFF ￿ <not a character>

And, according to W3C, the list of valid characters is:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */


In order to get this into XML (at least in terms of the SQL Server XML datatype), you need to first replace any occurrences of &#xFFFE; and &#xFFFF; with a custom escape sequence, such as \uFFFE; and \uFFFF; resectively. Then, when converting back to NVARCHAR you can simply replace \uFFFE; with NCHAR(65534) and \uFFFF; with NCHAR(65535) respectively.

OR, you can Base64 Encode the values (rather easy on the app code side) and decode on the way out. If you need to access this on the database side, you can create your own SQLCLR functions to Base64 Encode and Decode, or just grab the Free version of the SQL# library (which I am the author of) which includes Convert_ToBase64 and Convert_FromBase64 and can be used as follows:

DECLARE @Encoded NVARCHAR(200),
        @Decoded NVARCHAR(200);

SET @Encoded =
    SQL#.Convert_ToBase64(CONVERT(VARBINARY(200), N'f' + NCHAR(65535) + N'g'), 'None');

SELECT CONVERT(XML, N'<test>' + @Encoded + N'</test>');

SET @Decoded = SQL#.Convert_FromBase64(@Encoded);
SELECT @Encoded AS [Encoded],
       @Decoded AS [Decoded],
       DATALENGTH(@Decoded) AS [NumBytes], -- 6 bytes = 3 characters (most of the time)
       UNICODE(SUBSTRING(@Decoded, 2, 1)) AS [TaDa!] -- get value of middle character

Returns:

<test>ZgD//2cA</test>

and then:

Encoded     Decoded     NumBytes    TaDa!
ZgD//2cA    fg          6           65535

It seems that it should not be used as part of a filename (I know that is not your doing), or perhaps it is really a valid character that is being read incorrectly by Get-ChildItem.

I will need to retain a handle to this file.

Is it possible to rename the file to remove the invalid characters prior to importing the names/info into SQL Server? Just a thought.


And just FYI, you are not allowed to change the encoding via the xml declaration, at least not so easily: SQL Server 2008R2 and creating XML document

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

How do I handle utf-8 vs. punycode issues in Django's csrf middleware?

From Java

How do I properly use std::string on UTF-8 in C++?

From Dev

How do I avoid double UTF-8 encoding in XML::LibXML

From Dev

How do I properly convert a UTF-8 encoded char array to a Go string when using a C-library in Go?

From Dev

How can I fix inconsistent display of UTF-8 data in Microsoft SQL Server Management Studio v18 and properly handle this data?

From Dev

How do I handle Visual Studio shutdown properly?

From Dev

How to set UTF-8 properly in Hibernate?

From Dev

How do I commit with a utf-8 message file?

From Dev

How do I unescape multiple byte character utf8

From Dev

How do I convert UTF-8 special characters in Bash?

From Dev

How do I get MinTTY working with UTF8

From Dev

How do I handle optional XML attributes in VBA?

From Dev

How do I properly read a Node Value in an XML file?

From Dev

How do i parse a xml comment properly in python

From Dev

How do I properly read a Node Value in an XML file?

From Dev

How do I properly register a protocol handler on Windows 8?

From Dev

How do I write UTF-8 data to a UTF-16LE file using PHP?

From Dev

How do I properly handle resources when emitting from a Roslyn workspace?

From Dev

How do I handle file saves properly in NetBeans platform project (plugin)

From Dev

How do I properly handle multiple VirtualHosts on one server with only one having SSL?

From Dev

How do I properly handle garbage collection and avoid memory leaks with a collection of custom object pointers...?

From Dev

How convert xml string UTF8 to UTF16?

From Dev

How to return xml as UTF-8 instead of UTF-16

From Dev

How convert xml string UTF8 to UTF16?

From Dev

How do I properly dispose and free the memory used for V8.Net.V8Engine instances?

From Dev

How to handle utf-8 text with Python 3?

From Dev

How to properly set utf8 encoding with jdbc and MySQL?

From Dev

How to properly set the charset to UTF-8 in PHP 5.6?

From Dev

How do I handle exceptions?

Related Related

  1. 1

    How do I handle utf-8 vs. punycode issues in Django's csrf middleware?

  2. 2

    How do I properly use std::string on UTF-8 in C++?

  3. 3

    How do I avoid double UTF-8 encoding in XML::LibXML

  4. 4

    How do I properly convert a UTF-8 encoded char array to a Go string when using a C-library in Go?

  5. 5

    How can I fix inconsistent display of UTF-8 data in Microsoft SQL Server Management Studio v18 and properly handle this data?

  6. 6

    How do I handle Visual Studio shutdown properly?

  7. 7

    How to set UTF-8 properly in Hibernate?

  8. 8

    How do I commit with a utf-8 message file?

  9. 9

    How do I unescape multiple byte character utf8

  10. 10

    How do I convert UTF-8 special characters in Bash?

  11. 11

    How do I get MinTTY working with UTF8

  12. 12

    How do I handle optional XML attributes in VBA?

  13. 13

    How do I properly read a Node Value in an XML file?

  14. 14

    How do i parse a xml comment properly in python

  15. 15

    How do I properly read a Node Value in an XML file?

  16. 16

    How do I properly register a protocol handler on Windows 8?

  17. 17

    How do I write UTF-8 data to a UTF-16LE file using PHP?

  18. 18

    How do I properly handle resources when emitting from a Roslyn workspace?

  19. 19

    How do I handle file saves properly in NetBeans platform project (plugin)

  20. 20

    How do I properly handle multiple VirtualHosts on one server with only one having SSL?

  21. 21

    How do I properly handle garbage collection and avoid memory leaks with a collection of custom object pointers...?

  22. 22

    How convert xml string UTF8 to UTF16?

  23. 23

    How to return xml as UTF-8 instead of UTF-16

  24. 24

    How convert xml string UTF8 to UTF16?

  25. 25

    How do I properly dispose and free the memory used for V8.Net.V8Engine instances?

  26. 26

    How to handle utf-8 text with Python 3?

  27. 27

    How to properly set utf8 encoding with jdbc and MySQL?

  28. 28

    How to properly set the charset to UTF-8 in PHP 5.6?

  29. 29

    How do I handle exceptions?

HotTag

Archive