Subject: charset decoding is broken
On 0.15, when reading from a stream, XML::SAX::PurePerl does not decode the first 4096 bytes into the Perl internal UTF-8 representation, although it sets the filehandle PerlIO encoding to UTF-8. This is a regression from 0.12.

The problem is that the Unicode version of XML::SAX::PurePerl::Reader::switch_encoding_string() uses Encode::from_to(), which does not set the Perl internal UTF-8 flag. Replacing this with e.g. Encode::decode() fixes the bug. This affects the bytes that are read into the buffer before the PerlIO encoding is set.

With the fix to SAX/PurePerl/Reader/UnicodeExt.pm, there's a test failure from t/14encoding.t. It turns out that there are bugs in XML::SAX::PurePerl::Productions: the $NameChar regexp shouldn't use $Letter, since that contains beginning and end anchors (^ and $). In fact, it looks like the $Letter production is unused now, and $NameChar shouldn't have any anchors either. (It also looks like the binding of the anchors is broken, since /^a|b$/ means (/^a/ || /b$/), not /^(a|b)$/.)

I'm attaching a proposed patch that adds a testcase for these issues and fixes them for me. The tests pass for me on 0.12 and fail on 0.15. I haven't tested on an old non-Unicode Perl; this is on Perl 5.8.8 on Debian Etch (4.0).

I'm a bit uneasy that switch_encoding_string() can't be called twice now without a fatal error, but I'm not sure what the best thing to do is. Maybe just make it a no-op if the new charset is UTF-8 and Encode::is_utf8 is set? I suppose it has never worked if the charset is not UTF-8 on the second call...

FWIW, this issue has caused Debian bug #405186.

Please let me know if you need more information; I'll be happy to help in any way I can.

Cheers,
-- 
Niko Tyni email@example.com
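For readers following along, here is a minimal standalone sketch of the two behaviours described in the report above. It is not taken from the attached patch or from the XML::SAX source; the variable names and sample data are made up for illustration. It only shows that Encode::from_to() leaves an octet string with the UTF-8 flag off while Encode::decode() returns a character string, and that /^a|b$/ anchors each alternative separately.

```perl
use strict;
use warnings;
use Encode ();

# Illustrative data only: e-acute (U+00E9) as a single Latin-1 byte.
my $octets = "\xE9";

# from_to() converts in place. The result is still a string of octets
# (here the two UTF-8 bytes 0xC3 0xA9) with the internal UTF-8 flag off.
Encode::from_to($octets, 'iso-8859-1', 'UTF-8');
my $from_to_len  = length $octets;
my $from_to_flag = utf8::is_utf8($octets) ? 1 : 0;

# decode() returns a character string: one character, flag on.
my $copy        = $octets;
my $chars       = Encode::decode('UTF-8', $copy);
my $decode_len  = length $chars;
my $decode_flag = utf8::is_utf8($chars) ? 1 : 0;

printf "from_to: length=%d utf8_flag=%d\n", $from_to_len, $from_to_flag;
printf "decode:  length=%d utf8_flag=%d\n", $decode_len,  $decode_flag;

# Alternation binds more loosely than the anchors, so /^a|b$/ means
# "starts with a" OR "ends with b". Grouping is needed to anchor both.
my $ungrouped = qr/^a|b$/;
my $grouped   = qr/^(?:a|b)$/;

my $ungrouped_hit = ("ab" =~ $ungrouped) ? 1 : 0;
my $grouped_hit   = ("ab" =~ $grouped)   ? 1 : 0;

printf "ungrouped matches 'ab': %d\n", $ungrouped_hit;
printf "grouped matches 'ab':   %d\n", $grouped_hit;
```

The string "ab" is used because it is neither "a" nor "b", so a correctly anchored single-character pattern should reject it, while the ungrouped form accepts it via the /^a/ branch.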