This queue is for tickets about the XML-Twig CPAN distribution.

Report information

The Basics
Id: 25261
Status: rejected
Priority: Low/Low
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: ddascalescu [...] gmail.com
Cc:
AdminCc:

BugTracker
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Parsing UTF-8 XMLs reduces Latin chars to bytes if keep_encoding => 0
Date: Fri, 2 Mar 2007 22:32:46 -0800
To: bug-XML-Twig@rt.cpan.org, "Michel Rodriguez" <mirod@xmltwig.com>
From: "Dan Dascalescu" <ddascalescu@gmail.com>
After parsing UTF8 XMLs, the ->text method of XML::Twig::Elt seems to encode Latin characters in the iso-8859-1 encoding. $twig->print dumps the correct byte sequence. The test I included in the attachment uses the following characters: ß, á, à.

Hope that helps,
Dan Dascalescu

    #! perl -w
    use strict;
    use XML::Twig;

    sub hex_dump($) {
        my $input = shift;
        my $result = "Input: <\n$input\n>\nHex dump:\n";
        while ($input =~ /./gs) {
            $result .= "<$&>" . sprintf "%02X ", ord($&);
        }
        return $result;
    }

    my $filename = shift;
    open my $file_out, '>:raw', "$filename.out.xml";

    # parse the UTF-8-encoded XML
    my $twig= XML::Twig->new(
        keep_encoding => 0  # the default; '1' fixes the issue
    );
    $twig->parsefile($filename);

    # dump element text
    print $file_out "Element text dump:\n";
    foreach my $elt ($twig->get_xpath('//seg')) {
        print $file_out ($elt->text), "\n";
    }

    # dump twig
    print $file_out "\n\nTwig print:\n";
    $twig->print($file_out);

    # read the XML file with the UTF-8 discipline, and pass it through to the ':raw' output file
    open my $file_in, '<:utf8', $filename or die $!;
    undef $/;
    print $file_out "\n\nPass-through:\n", <$file_in>;

    __END__

XML file:

    <?xml version='1.0' encoding='UTF-8' ?>
    <tmx>
      <seg>Latin chars: Schließen, á, à</seg>
      <seg>Thai char: ว</seg>
      <seg>Russian stuff: Обновить</seg>
    </tmx>


On Sat Mar 03 01:33:12 2007, ddascalescu@gmail.com wrote:
> After parsing UTF8 XMLs, the ->text method of XML::Twig::Elt seems to
> encode Latin characters in the iso-8859-1 encoding. $twig->print dumps
> the correct byte sequence. The test I included in the attachment uses
> the following characters: ß, á, à.
Hi Dan,

Indeed, it looks like the utf8 flag is not set on the string created by the text method. I have no idea why. I have to write some tests, with and without the keep_encoding option, to figure out exactly in which case the flag needs to be set.

__
mirod
Closing the report, a few years late.

In order to print utf8 characters, you need to specify the encoding when you open the file. Writing

    open my $file_out, '>:utf8', "$filename.out.xml";

instead of

    open my $file_out, '>:raw', "$filename.out.xml";

does the right thing. I believe this is normal Perl behaviour.

__
mirod
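
The difference between the two open calls can be seen without XML::Twig. The sketch below is only an illustration, not part of the original test. It assumes the string returned by ->text behaves like any other decoded Perl character string, and writes such a string to an in-memory filehandle instead of a file. The helpers bytes_written and hex_bytes are hypothetical names introduced here.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text through the given PerlIO layer into an
# in-memory buffer and return the bytes that would have reached a file.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return $buffer;
}

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# The three characters from the report (ß, á, à) as a character string.
my $text = "\x{DF}\x{E1}\x{E0}";

print "raw:  ", hex_bytes(bytes_written(':raw',  $text)), "\n";  # DF E1 E0
print "utf8: ", hex_bytes(bytes_written(':utf8', $text)), "\n";  # C3 9F C3 A1 C3 A0
```

With ':raw', each character below 256 goes out as a single byte, which matches the iso-8859-1 output described in the report. With ':utf8', the same string is encoded as UTF-8 on output.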

