This queue is for tickets about the HTML-Parser CPAN distribution.

Report information

The Basics
Id: 15068
Status: resolved
Priority: Low/Low

People
Owner: Nobody in particular
Requestors: martin [...] snowplow.org

BugTracker
Severity: Important
Broken in: 3.45
Fixed in: (no value)

Subject: HTML::Parser can't handle certain large characters
HTML::Parser apparently has trouble with some strings that have the UTF-8 flag set when the UTF-8 encoding of a character contains the byte 0xA0. I believe this is because 0xA0 is marked as a space in hctype.h, and at several points in the code space characters are stepped over. When processing UTF-8 data, this leads to a partial UTF-8 character being passed along to other methods.

This problem can be fixed by modifying hctype.h so that character 160 is not a space, but I'm uncertain of the other consequences of that change.

The following code demonstrates the problem. Note that the only character it has a problem with is \x{0420}, whose UTF-8 encoding includes the byte 0xA0.

#!perl
use HTML::Parser;
use strict;

my $prsr = HTML::Parser->new;
my $htmltxt = <<EOF;
<html lang="en">
<head>
<title>Minimal HTML Document</title>
</head>
<body>
<p>This is a Russian letter: \x{041E}</p>
<p>This is another Russian letter: \x{041F}</p>
<p>And another: \x{0420}</p>
<p>And another: \x{0421}</p>
<p>And another: \x{0422}</p>
</body>
</html>
EOF

for my $c (split(//,$htmltxt)) {
    local $SIG{__WARN__} = sub {
        printf STDERR 'Character %04x%s',ord($c),":\n";
        print STDERR @_;
    };
    $prsr->parse($c);
}
$prsr->eof;
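
The claim that only one of the five test characters contains the byte 0xA0 can be checked independently of HTML::Parser. The following is a minimal sketch using the core Encode module; the helper names utf8_bytes and has_a0_byte are made up for this illustration and are not part of any library.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Code points used in the report's test document.
my @code_points = (0x041E, 0x041F, 0x0420, 0x0421, 0x0422);

# Hypothetical helper: the UTF-8 bytes of a code point, as a list of integers.
sub utf8_bytes {
    my ($cp) = @_;
    return unpack('C*', encode('UTF-8', chr($cp)));
}

# Hypothetical helper: number of 0xA0 bytes in the UTF-8 encoding (0 if none).
sub has_a0_byte {
    my ($cp) = @_;
    return scalar grep { $_ == 0xA0 } utf8_bytes($cp);
}

# Print each code point with its encoded bytes, flagging any 0xA0 byte.
for my $cp (@code_points) {
    printf "U+%04X => %s%s\n", $cp,
        join(' ', map { sprintf('%02X', $_) } utf8_bytes($cp)),
        has_a0_byte($cp) ? '  <-- contains 0xA0' : '';
}
```

Only U+0420 is flagged: it encodes as the two bytes D0 A0, while the neighbouring letters encode as D0 9E, D0 9F, D0 A1 and D0 A2.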
This problem is now fixed in CVS. \xA0 is no longer considered space.
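
The effect of the fix can be illustrated with a small byte-level model. This is only a sketch under stated assumptions: the two lookup tables and trim_trailing are hypothetical stand-ins for a scanner that steps over space bytes, not HTML::Parser's actual code or the contents of hctype.h.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical space tables: one that counts 0xA0 as space, one that does not.
my %SPACE_WITH_A0    = map { ($_ => 1) } (0x09, 0x0A, 0x0C, 0x0D, 0x20, 0xA0);
my %SPACE_WITHOUT_A0 = map { ($_ => 1) } (0x09, 0x0A, 0x0C, 0x0D, 0x20);

# Hypothetical scanner step: drop trailing bytes that the table calls space.
sub trim_trailing {
    my ($bytes, $is_space) = @_;
    my $end = length($bytes);
    $end-- while $end > 0 && $is_space->{ ord(substr($bytes, $end - 1, 1)) };
    return substr($bytes, 0, $end);
}

# True if the byte string is well-formed UTF-8 (works on a copy).
sub is_valid_utf8 {
    my ($copy) = @_;
    return utf8::decode($copy) ? 1 : 0;
}

my $bytes = encode('UTF-8', "\x{0420}");    # the two bytes D0 A0

my $before_fix = trim_trailing($bytes, \%SPACE_WITH_A0);
my $after_fix  = trim_trailing($bytes, \%SPACE_WITHOUT_A0);

printf "0xA0 is space:     %d byte(s) left, valid UTF-8: %d\n",
    length($before_fix), is_valid_utf8($before_fix);
printf "0xA0 is not space: %d byte(s) left, valid UTF-8: %d\n",
    length($after_fix), is_valid_utf8($after_fix);
```

With 0xA0 in the table, the continuation byte is stripped and a lone D0 lead byte remains, which is the kind of partial character described in the report. Without it, both bytes survive.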

