Report information
Id: 7014
Status: resolved
Left: 0 min
Priority: 0/0
Queue: HTML-Parser

Owner: Nobody
Requestors: jgmyers [...] proofpoint.com
Cc:
AdminCc:

Severity: Normal
Broken in: 3.36
Fixed in: (no value)




History
#   Mon Jul 19 19:30:13 2004 guest - Ticket created  
Subject: multiple bugs handling non-ASCII characters
[text/plain 413b]
HTML-Parser fails to handle non-ASCII characters in the HTML file being parsed. Except in decode_entities(), it neither examines nor copies the UTF8 flag. Following a Unicode entity, decode_entities() in UNICODE_ENTITIES mode fails to convert ISO-8859-1 to UTF-8, leading to a result that is not utf8::valid(). hparser.c also has hash lookup code that is not UTF8 safe.

The attached patch fixes all this.
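
To illustrate the decode_entities() failure described above, here is a minimal Perl sketch that reproduces the byte-level problem by hand, using only core modules. It is not the code from hparser.c or from the attached patch (whose contents are not shown here), and the variable names are made up for the example. It assumes the decoder emits UTF-8 for the entity and then appends the remaining ISO-8859-1 bytes unchanged.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical input tail: ISO-8859-1 bytes for "caf" plus e-acute (0xE9).
my $latin1_tail = "caf\xE9";

# UTF-8 bytes for U+263A, standing in for a decoded Unicode entity.
my $entity_utf8 = encode_utf8("\x{263A}");

# Faulty approach: append the Latin-1 bytes to the UTF-8 bytes as they are.
my $mixed = $entity_utf8 . $latin1_tail;

# Safer approach: convert the Latin-1 tail to UTF-8 before appending.
my $fixed = $entity_utf8 . encode_utf8(decode("ISO-8859-1", $latin1_tail));

# utf8::decode() works in place and returns false for malformed UTF-8,
# so run it on copies of the two buffers.
my ($mixed_copy, $fixed_copy) = ($mixed, $fixed);
my $mixed_ok = utf8::decode($mixed_copy);
my $fixed_ok = utf8::decode($fixed_copy);

print "mixed buffer is valid UTF-8: ", ($mixed_ok ? "yes" : "no"), "\n";
print "fixed buffer is valid UTF-8: ", ($fixed_ok ? "yes" : "no"), "\n";
```

The lone 0xE9 byte at the end of the mixed buffer is a UTF-8 lead byte with no continuation bytes, which is why a buffer built that way cannot be valid once it is flagged as UTF-8.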

[application/octet-stream 17.9k]
Message body not shown because it is too large or is not plain text.
#   Wed Jul 21 21:49:42 2004 jgmyers[...]proofpoint.com - Correspondence added  
Date: Wed, 21 Jul 2004 17:53:52 -0700
From: John Gardiner Myers <jgmyers[...]proofpoint.com>
To: bug-HTML-Parser[...]rt.cpan.org
Subject: Re: [cpan #7014] AutoReply: multiple bugs handling non-ASCII characters
[text/plain 702b]
With the previous patch applied, one can remove one of the documented bugs.

diff -ru HTML-Parser-3.36/Parser.pm HTML-Parser-3.36-work/Parser.pm
--- HTML-Parser-3.36/Parser.pm 2004-04-01 04:05:52.000000000 -0800
+++ HTML-Parser-3.36-work/Parser.pm 2004-07-21 15:32:57.000000000 -0700
@@ -996,10 +996,6 @@

 =head1 BUGS
 
-Unicode strings are not parsed correctly. A workaround is to encode
-them as UTF-8 before passing them to the HTML::Parser. The C<Encode>
-module can do that.
-
 The <style> and <script> sections do not end with the first "</", but
 need the complete corresponding end tag. MSIE avoids terminating a
 <script> section if the </script> occurs inside quotes. HTML::Parser
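
For reference, the workaround named in the removed BUGS paragraph (encode the string as UTF-8 before handing it to HTML::Parser) might look like the sketch below. It is an illustration rather than text from the module's documentation: the sample markup and variable names are made up, and it assumes the text chunks are collected as bytes and decoded once parsing has finished.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);
use HTML::Parser ();

# Hypothetical input: a Perl character string with non-ASCII characters.
my $html = "<p>caf\x{E9} \x{263A}</p>";

# Collect the raw (undecoded) text of each text event as bytes.
my @chunks;
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @chunks, shift }, "text" ],
);

# Workaround: give the parser UTF-8 encoded bytes, not the character string.
$parser->parse(encode_utf8($html));
$parser->eof;

# Join first, then decode, so a multi-byte character is never decoded in pieces.
my $text = decode_utf8(join "", @chunks);
```

Decoding after the join, rather than inside the handler, avoids depending on where the parser happens to split text between events.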

#   Fri Sep 03 10:14:24 2004 TOMI - Correspondence added  
From: Tom Insam
[text/plain 222b]
The original patch modified an auto-generated file. I've removed that part from the patch and
integrated the documentation change from the previous comment. This applies cleanly and passes
tests for me on Darwin (Mac OS X 10.3).

[application/octet-stream 18k]
Message body not shown because it is too large or is not plain text.
#   Fri Sep 03 10:15:25 2004 TOMI - Correspondence added  
From: Tom Insam
[text/plain 26b]
Also, I have a test case.

[application/x-troff 694b]
Message body not shown because it is too large or is not plain text.
#   Tue Nov 02 13:53:00 2004 guest - Correspondence added  
Subject: Revised fix
From: jgmyers[...]proofpoint.com
[text/plain 133b]
The previous patch had an uninitialized variable which would, in some
situations, cause the result to be gratuitously upgraded to UTF-8.
[application/octet-stream 17.9k]
Message body not shown because it is too large or is not plain text.
#   Wed Nov 17 09:49:18 2004 GAAS - Correspondence added  
[text/plain 98b]
I have now uploaded HTML-Parser-3.39_90 with the proposed
patch in it. Please give it a spin.

#   Wed Nov 17 14:05:35 2004 guest - Correspondence added  
From: jgmyers[...]proofpoint.com
[text/plain 29b]
Remove completed TODO item.


[application/octet-stream 720b]
Message body not shown because it is too large or is not plain text.
#   Mon Nov 29 08:51:54 2004 GAAS - Status changed from 'new' to 'resolved'