
This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 7645
Status: resolved
Worked: 5 hours (300 min)
Priority: 0/
Queue: XML-LibXML

People
Owner: phish [...] cpan.org
Requestors: torsten.hilbrich [...] gmx.net
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 1.58
Fixed in: (no value)



Subject: Default document encoding is not utf-8
I found a problem in the XML::LibXML module concerning the encoding of the resulting XML document (as output by toString). The documentation for createDocument says that the default encoding is an implicitly defined utf-8 (as the XML 1.0 standard defines).

The following code first creates a string containing the single character ä (U+00E4). The first document is output using the default encoding (which according to the documentation should be implicitly utf-8), the second part sets the document encoding explicitly to utf-8 before outputting it. In both cases the document is sent to stdout. The binmode statement makes sure that perl is capable of utf-8 output on stdout.

Here is the output I get from this code:

<?xml version="1.0"?>
<test contents="&#xE4;"/>
<?xml version="1.0" encoding="utf-8"?>
<test contents="ä"/>

As you can see the 'ä' in the first output is iso-8859-1 encoded instead of the expected utf-8. The second output is correct.

Here is my example code to reproduce the bug:

############################################################
binmode(STDOUT, ':utf8');

# the small letter a with diaeresis (ä) as an example
my $in = pack('U', 0x00e4);

use XML::LibXML;

my $doc = XML::LibXML::Document->new();
my $node = XML::LibXML::Element->new('test');
$node->setAttribute(contents => $in);
$doc->setDocumentElement($node);

# First output
print $doc->toString(1);

# Second output
$doc->setEncoding('utf-8');
print $doc->toString(1);
############################################################

Versions of the libraries:

libc6        2.3.2ds1
libxml2      2.6.11
XML::LibXML  1.58

Here is information about perl and its system environment:

$ perl -v
This is perl, v5.8.4 built for i386-linux-thread-multi
...

$ uname -a
Linux myrkr 2.6.7 #1 Sat Sep 4 20:20:27 CEST 2004 i686 GNU/Linux

$ locale
LANG=de_DE.UTF-8
LC_CTYPE=de_DE.UTF-8
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES=POSIX
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=

If you need more information about my system please tell me.

Torsten

From: Torsten.Hilbrich [...] gmx.net
It seems the generated HTML does not quote the special characters:

> <?xml version="1.0"?>
> <test contents="&#xE4;"/>
> <?xml version="1.0" encoding="utf-8"?>
> <test contents="ä"/>

The output should be read as (quoting the ampersand character):

<?xml version="1.0"?>
<test contents="&amp;#xE4;"/>
<?xml version="1.0" encoding="utf-8"?>
<test contents="ä"/>

Torsten
From: reporter
> As you can see the 'ä' in the first output is iso-8859-1 encoded
> instead of the expected utf-8. The second output is correct.

I have additional information. It seems the character reference in the first output is correct XML syntax and is also correctly transformed to the ä character on parsing. So the only remaining issue is that the output is not utf-8 but rather ASCII, with character references used for all non-ASCII characters. This should possibly be documented but cannot be considered a real bug.
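
The observation above can be checked with a small round trip. The following is only a sketch, assuming a reasonably current XML::LibXML; the variable names are made up for illustration. It parses both serializations from the report and compares the attribute values that come back, which should both be the single character U+00E4.

```perl
use strict;
use warnings;
use XML::LibXML;

# The two serializations from the report. The second one is written as raw
# UTF-8 bytes (0xC3 0xA4) so that this script itself stays pure ASCII.
my $with_char_ref = qq{<?xml version="1.0"?>\n<test contents="&#xE4;"/>\n};
my $with_utf8     = qq{<?xml version="1.0" encoding="utf-8"?>\n<test contents="\xC3\xA4"/>\n};

my $parser = XML::LibXML->new;

# Parse each document and read the attribute back as a Perl character string.
my $from_ref  = $parser->parse_string($with_char_ref)
                       ->documentElement->getAttribute('contents');
my $from_utf8 = $parser->parse_string($with_utf8)
                       ->documentElement->getAttribute('contents');

# Show the code point each parse produced, then compare the two values.
printf "char ref: U+%04X\n", ord $from_ref;
printf "utf-8:    U+%04X\n", ord $from_utf8;
print $from_ref eq $from_utf8
    ? "identical after parsing\n"
    : "different after parsing\n";
```

If both parses yield the same string, the two outputs differ only in how the character is written on the wire, not in the document they describe.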
The problem is not related to XML::LibXML but to libxml2. It is fixed in libxml2 2.6.15, and maybe earlier, but I have not yet tested against other versions.

Christian
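
To see which libxml2 a given installation uses, something like the following may help. It is a sketch that assumes the version helpers documented in current XML::LibXML releases (LIBXML_DOTTED_VERSION and LIBXML_RUNTIME_VERSION); they may not exist in a release as old as 1.58. The 2.6.15 threshold is simply the version named above as known to be fixed, not a confirmed lower bound.

```perl
use strict;
use warnings;
use XML::LibXML;

# libxml2 version XML::LibXML was compiled against, as "MAJOR.MINOR.MICRO".
my $compiled = XML::LibXML::LIBXML_DOTTED_VERSION();

# libxml2 version actually loaded at run time (a version string in
# libxml2's own numeric form, which may carry a suffix).
my $runtime = XML::LibXML::LIBXML_RUNTIME_VERSION();

print "compiled against libxml2 $compiled\n";
print "running with libxml2     $runtime\n";

# Compare the compiled-against version with 2.6.15, component by component.
my @have = split /\./, $compiled;
my @want = (2, 6, 15);
my $cmp  = 0;
for my $i (0 .. 2) {
    $cmp ||= ($have[$i] || 0) <=> $want[$i];
}

print $cmp >= 0
    ? "at or after 2.6.15, the version reported as fixed\n"
    : "older than 2.6.15, the default-encoding behaviour may still occur\n";
```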

