On Jsoup identity transform


fodon

I want to see if I can get back the original string with Jsoup after a transform.

Document doc = Jsoup.parse("<html><body><span>&rarr;</span></body></html>");
String str = doc.toString();
System.out.println(str);

I'd like the output to be equivalent HTML (formatting aside). Here the &rarr; entity is not preserved in the output. Which function should I use?

Davide Pastore

Pete Houston put me on the right track in issue 660.

You can do it using:

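// Entities is org.jsoup.nodes.Entities.
// Output ASCII only, so non-ASCII characters are written as entities.
doc
  .outputSettings()
  .charset("ascii")
  .escapeMode(Entities.EscapeMode.extended);
String str = doc.toString();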

Output would be:

<html>
 <head></head>
 <body>
  <span>&srarr;</span>
 </body>
</html>

However, the output differs slightly from the input (&srarr; instead of &rarr;) because:

According to the HTML5 named character references (http://www.w3.org/TR/2011/WD-html5-20110113/named-character-references.html), &srarr; is the same character as &rarr;, and also the same as &RightArrow;, &ShortRightArrow; ...

As you know, jsoup performs validation (escaping/unescaping) while processing input, mapping the HTML entities defined in entities-*.properties. Since several entity names represent the same Unicode value (U+2192 in your case), the mapping is done with the first match, if I'm not mistaken.
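
The first-match behaviour can be illustrated without jsoup. The following is only a minimal sketch, not jsoup's actual code: the class name EntityFirstMatch, the two tables, and the order in which the names are listed are all assumptions chosen for illustration. It shows why decoding one name and re-encoding the resulting character can give back a different, but equivalent, name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not how jsoup is implemented.
class EntityFirstMatch {
    // A hand-picked subset of the HTML5 names for U+2192.
    // The listing order is an assumption made for this example.
    static final Map<String, Integer> NAME_TO_CODEPOINT = new LinkedHashMap<>();
    static {
        NAME_TO_CODEPOINT.put("srarr", 0x2192);
        NAME_TO_CODEPOINT.put("rarr", 0x2192);
        NAME_TO_CODEPOINT.put("RightArrow", 0x2192);
        NAME_TO_CODEPOINT.put("ShortRightArrow", 0x2192);
    }

    // Reverse table: the first name listed for a code point wins.
    static final Map<Integer, String> CODEPOINT_TO_NAME = new LinkedHashMap<>();
    static {
        for (Map.Entry<String, Integer> e : NAME_TO_CODEPOINT.entrySet()) {
            CODEPOINT_TO_NAME.putIfAbsent(e.getValue(), e.getKey());
        }
    }

    // Unescape: entity name (without '&' and ';') to code point.
    static int decode(String name) {
        Integer codePoint = NAME_TO_CODEPOINT.get(name);
        if (codePoint == null) {
            throw new IllegalArgumentException("Unknown entity: " + name);
        }
        return codePoint;
    }

    // Escape: code point to "&name;", or the raw character if no name is known.
    static String encode(int codePoint) {
        String name = CODEPOINT_TO_NAME.get(codePoint);
        return name == null
                ? new String(Character.toChars(codePoint))
                : "&" + name + ";";
    }

    public static void main(String[] args) {
        int codePoint = decode("rarr");        // 0x2192
        System.out.println(encode(codePoint)); // prints &srarr;
    }
}
```

In this sketch the round trip loses the original name because the character itself carries no record of which name produced it. Comparing input and output by decoded code point, rather than by entity name, treats &rarr; and &srarr; as equal.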
