Report information
Id: 15068
Status: resolved
Left: 0 min
Priority: 0/0
Queue: HTML-Parser

Owner: Nobody
Requestors: martin [...] snowplow.org
Cc:
AdminCc:

Severity: Important
Broken in: 3.45
Fixed in: (no value)



History
#   Fri Oct 14 23:04:26 2005 guest - Ticket created  
Subject: HTML::Parser can't handle certain large characters
[text/plain 1.2k]
HTML::Parser apparently has trouble with some strings that have the UTF-8 flag set if the UTF-8 encoding of a character contains the byte 0xA0. I believe this is because 0xA0 is marked as a space in hctype.h, and the code steps over space characters at several points. When processing UTF-8 data, this leads to a partial UTF-8 character being passed along to other methods. The problem can be fixed by modifying hctype.h so that character 160 is not a space, but I'm uncertain of the other consequences of that change.

The following code demonstrates the problem. Note that the only character it has a problem with is \x{0420}, whose UTF-8 encoding (0xD0 0xA0) includes the byte 0xA0.

#!perl

use HTML::Parser;
use strict;

my $prsr = HTML::Parser->new;

my $htmltxt = <<EOF;
<html lang="en">
<head>
<title>Minimal HTML Document</title>
</head>
<body>
<p>This is a Russian letter: \x{041E}</p>
<p>This is another Russian letter: \x{041F}</p>
<p>And another: \x{0420}</p>
<p>And another: \x{0421}</p>
<p>And another: \x{0422}</p>
</body>
</html>
EOF

for my $c (split(//,$htmltxt)) {
    local $SIG{__WARN__} = sub {
        printf STDERR 'Character %04x%s',ord($c),":\n";
        print STDERR @_;
    };
    $prsr->parse($c);
}
$prsr->eof;
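
The claim that only \x{0420} has 0xA0 in its UTF-8 encoding can be checked independently of HTML::Parser. The following is a minimal sketch, assuming the core Encode module is available; utf8_hex_bytes is a hypothetical helper name introduced here for illustration and is not part of HTML::Parser.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of HTML::Parser): return the UTF-8
# bytes of a code point as a list of two-digit uppercase hex strings.
sub utf8_hex_bytes {
    my ($codepoint) = @_;
    my $octets = encode('UTF-8', chr($codepoint));
    return map { sprintf('%02X', ord($_)) } split(//, $octets);
}

# Print the encoding of each Cyrillic letter used in the test document
# and flag any whose encoding contains the byte 0xA0.
for my $cp (0x041E .. 0x0422) {
    my @bytes = utf8_hex_bytes($cp);
    my $flag  = (grep { $_ eq 'A0' } @bytes) ? '  <- contains 0xA0' : '';
    printf "U+%04X => %s%s\n", $cp, join(' ', @bytes), $flag;
}
```

Of the five letters, only U+0420 is flagged, which matches the single character that triggers the warning in the script above.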

#   Mon Oct 24 06:11:10 2005 GAAS - Correspondence added  
[text/plain 71b]
This problem is now fixed in CVS. \xA0 is no longer considered space.
#   Mon Oct 24 08:34:28 2005 GAAS - Status changed from 'new' to 'resolved'