Subject: Improve $doc->toString() to encode document correctly
The current implementation of toString() leaves the document in Perl's internal Unicode encoding in many cases. This is hard for users to deal with, because people likely do not know how to fix the problem and resort to random hacks that seem to work but could be wrong. For instance, the simple task of printing many documents to STDOUT tends to trigger "Wide character in print" warnings, and the end result that goes to STDOUT might be corrupt XML.

Use of "toFH(\*STDOUT)" somewhat works around this issue, but it's not convenient when you aren't actually writing to a file and need the document as a string, maybe to pass to other XML-expecting APIs. Or maybe you are implementing XML-DSig and need to calculate SHA-1 hashes of documents. (I know that you most often do canonicalization with XML-DSig, and this fixes the encoding to UTF-8, but this is not always true.)

My argument, however, is that this must work just like toFH(\*STDOUT) works:

    print STDOUT $xml->toString()

The solution appears to be something along these lines: if the document that comes out has Perl's Unicode flag set, then you must Encode::encode() it to UTF-8. I am not 100% sure of the correctness of the solution, but it appears to do the right thing.

For instance, if ISO-8859-15 is used to describe a document with a euro sign, then the result has the UTF-8 flag off (it looks like ISO-8859-1), and the character a4 is put where the euro symbol should be, making Perl replace it with ? as it attempts to convert ISO-8859-1 \xa4 to its ISO-8859-15 equivalent, which does not exist:

    use Encode;
    use XML::LibXML;
    my $x = XML::LibXML->new->parse_string("<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?><x>\xa4</x>");
    print Encode::encode("ISO-8859-15", $x->toString);

and the output is:

    <?xml version="1.0" encoding="ISO-8859-15"?>
    <x>?</x>

Now, let's try the same with UTF-8:

    my $x = XML::LibXML->new->parse_string("<?xml version=\"1.0\" encoding=\"UTF-8\"?><x>\xe2\x82\xac</x>");
    print Encode::encode("ISO-8859-15", $x->toString);

Outputs:

    <?xml version="1.0" encoding="UTF-8"?>
    <x>€</x>

Ugh! Perl sees the euro symbol as a single character, instead of the original sequence of 3 octets! What this means is that to correctly stringify this document, UTF-8 encoding needs to be performed when the string has Encode::is_utf8() on. In other words, turning the Unicode flag off "fixes" it so that we work the same way regardless of what the original encoding of the document is!

Why this works is that Encode::is_utf8 apparently stays off as long as the characters put into the document have char values less than 256. This means that regardless of document content, it all gets written the same way to regular, encoding-unaware filehandles. When higher characters are present in the stream, the flag somehow gets turned on, and chaos ensues, because the fact that the document contains these high characters now requires a different treatment! Corrupt documents may result. Warnings about prints of wide characters occur. This is no good at all.

There are more small issues: sometimes $doc->getEncoding is not defined, which basically means that the XML version is 1.0 and the encoding is UTF-8 (according to the documentation and the XML specification). However, when outputting, UTF-8 is not assumed.
Without an encoding declaration:

    print XML::LibXML->new->parse_string("<x>\xe2\x82\xac</x>")->toString;

which prints:

    <?xml version="1.0"?>
    <x>€</x>

I do think that getEncoding() should probably still return UTF-8, because formally this is true for a prologless XML file, and also for an XML file that lacks encoding information. I realize the method was probably meant as an accessor, to find out what the value in the "encoding" field is, but if we aren't concerned about the text representation of XML, we don't really care what was in the original file; we care about what the file's content, interpreted as XML, _means_. So I do not like the fact that XML::LibXML pretends that there is "no" encoding, because text strings are always in some encoding, and UTF-8 is assumed according to the XML spec. And this is clearly what it is doing.

My take on this is that XML::LibXML should either put the prolog there and declare the encoding as UTF-8 honestly, or just return UTF-8 from getEncoding() and omit the prolog. As it stands, it says that there is no encoding (this is impossible; the fact that it doesn't appear in the prolog is irrelevant), and it changes the document by adding the prolog when the input did not actually have a prolog! So this is quite possibly the worst possible way to treat it.

I wonder if the piece of documentation saying that "$doc->setEncoding() is unsafe" is true any more. Maybe it depends on the libxml2 version? It would appear that XML::LibXML performs character reference substitutions as appropriate, and everything works just fine as I'm testing it.

Let's just fix this mess so that toString() properly encodes the document to UTF-8 when the Unicode bit is on and hands out octets. Let us remove the mention of setEncoding() being unsafe, because it seems perfectly safe.
And getEncoding() should return UTF-8, never undef. The missing prolog problem could be handled by always adding a prolog to the output and explicitly choosing UTF-8 as the document's encoding, which is what the XML standard implies the document's content is.
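
To make the proposal concrete, here is a minimal sketch of the "encode only when the flag is on" rule, written as a standalone helper so it can be run without XML::LibXML. The name to_octets is hypothetical and not part of any module; the two sample strings merely stand in for what toString() might return in the two cases described above. It rests on the assumption made in this report, namely that a flag-off string already holds octets in the right encoding.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of XML::LibXML): always hand back octets.
sub to_octets {
    my ($string) = @_;
    # Assumption from this report: a string with the UTF-8 flag on holds
    # decoded characters and must be encoded; a string with the flag off is
    # treated as already being octets and is passed through untouched.
    return Encode::is_utf8($string)
        ? Encode::encode('UTF-8', $string)
        : $string;
}

# Stand-in for a serialized document that contains U+20AC (flag on).
my $chars  = "<x>\x{20ac}</x>";
my $octets = to_octets($chars);

# Stand-in for a serialized document with only chars below 256 (flag off).
my $latin  = "<x>\xa4</x>";
my $same   = to_octets($latin);

printf "%d characters in, %d octets out\n", length $chars, length $octets;
# 8 characters in, 10 octets out
```

With something like this in place, "print STDOUT to_octets($doc->toString)" should no longer trigger the wide character warning, since the string handed to print never has the flag on.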