This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 19478
Status: rejected
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: Ben.Evans [...] morganstanley.com
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 3.51
Fixed in: (no value)

Attachments


Subject: HTML-Parser does not recognise <meta http-equiv=""> for charsets
Download (untitled) / with headers
text/plain 898b
HTML::Parser does not appear to handle non-Western encodings when the encoding is specified via a <meta> tag. A good way to reproduce this is with HTML in the ISO-2022-JP charset; see the attached sample HTML.

The issue is that ISO-2022-JP, when in one of its Japanese modes, may contain a byte of value 60 (ASCII '<') as part of a 2-byte character. If the parser is not charset-aware, the Japanese text is silently swallowed into a broken tag name, probably until the next byte of value 62 ('>') or the end of the line, where a sane HTML parser would decide it has been fed badly mangled tag soup and reset for the next line. This is the observed behaviour of HTML::Parser.

The main use case is parsing HTML when it is not known ahead of time which charset the HTML is written in.

Behaviour demonstrated on perl 5.8.4 on Red Hat Linux AS 3.0.
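The byte-level collision described above can be seen with a minimal sketch that uses only the core Encode module. The two-byte sequence 0x3C 0x2B is a hypothetical example chosen for illustration; it is not taken from the attached file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# ISO-2022-JP bytes: ESC $ B switches to JIS X 0208, ESC ( B back to ASCII.
# The two bytes 0x3C 0x2B in between encode a single kanji, yet the first
# byte has the same value as ASCII '<'.
my $bytes = "\e\$B\x3C\x2B\e(B";

# A byte-oriented scan finds a '<' that is not markup.
my $lt_in_bytes = index($bytes, '<') >= 0 ? 1 : 0;

# After decoding, the text is one character and contains no '<'.
my $text = decode('iso-2022-jp', $bytes);
my $lt_in_text = index($text, '<') >= 0 ? 1 : 0;

printf "bytes contain '<': %d, decoded text contains '<': %d, length: %d\n",
    $lt_in_bytes, $lt_in_text, length $text;
```

Any consumer that scans the undecoded bytes for '<' will therefore see a tag start inside the Japanese text.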
Subject: hy-decode2.html
Download hy-decode2.html
text/html 213b

test$B%F%9%H(B

$B5!

$BH>3Q%+%?%+%J(B

 

Download (untitled) / with headers
text/plain 117b
The text to be parsed needs to be decoded before it is passed to HTML::Parser. Use the Encode module to do that.
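A minimal sketch of that approach, assuming the charset is already known to be ISO-2022-JP (for example from an HTTP header or from a separate first pass over the <meta> tag, neither of which is shown here). The input string and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser ();

# Hypothetical input: a <p> element whose ISO-2022-JP content contains the
# byte 0x3C inside a two-byte character.
my $bytes = "<p>test\e\$B\x3C\x2B\e(B</p>";

# Decode the raw bytes into Perl characters before parsing.
my $html = decode('iso-2022-jp', $bytes);

my @tags;
my $text = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    # Record each start tag name and accumulate the decoded text.
    start_h => [ sub { push @tags, shift }, 'tagname' ],
    text_h  => [ sub { $text .= shift },    'dtext'   ],
);
$parser->parse($html);
$parser->eof;

binmode STDOUT, ':encoding(UTF-8)';
print "tags: @tags\n";
print "text: $text\n";
```

Because the parser only ever sees decoded characters, the only '<' characters left in its input are the ones that belong to the markup.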

