
This queue is for tickets about the libwww-perl CPAN distribution.

Report information
The Basics
Id: 20274
Status: resolved
Priority: 0/
Queue: libwww-perl

People
Owner: Nobody in particular
Requestors: imacat [...] mail.imacat.idv.tw
Cc: scop [...] cpan.org
AdminCc:

Bug Information
Severity: Normal
Broken in: 5.805
Fixed in: (no value)

Attachments


Subject: HTML::HeadParser Complaints for Parsing Undecoded UTF-8
Hi. This is imacat from Taiwan.

I got warnings when using LWP::UserAgent on web sites with UTF-8 pages. I have tried to dig into the code. It seems that HTML::HeadParser is not satisfied with undecoded UTF-8 data. I do not know why HTML::HeadParser is not satisfied. I attempted to make a patch to solve this, and the warnings are gone. But I do not know if this patch (parsing raw undecoded UTF-8) is a good idea. Maybe you can look into this issue.

I have attached my patch. The error log is below. Please tell me if there is any problem. Thank you.

    imacat@rinse /tmp % cat /tmp/test.pl
    #! /usr/bin/perl -w
    use LWP::UserAgent;
    use vars qw($UA $url $r);
    $UA = new LWP::UserAgent;
    $url = "http://zh.wikipedia.org/";
    $r = $UA->get($url);
    print "$url " . $r->status_line . "\n";
    imacat@rinse /tmp % /tmp/test.pl
    Parsing of undecoded UTF-8 will give garbage when decoding entities at /home/imacat/lib/perl5/LWP/Protocol.pm line 115.
    http://zh.wikipedia.org/ 200 OK
    imacat@rinse /tmp %
Sorry I forgot to attach my patch. Here it is. Sorry for the disturbance.
    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    diff -u -r libwww-perl-5.805.orig/lib/LWP/Protocol.pm libwww-perl-5.805/lib/LWP/Protocol.pm
    - --- libwww-perl-5.805.orig/lib/LWP/Protocol.pm 2004-11-12 21:34:10.000000000 +0800
    +++ libwww-perl-5.805/lib/LWP/Protocol.pm 2006-07-05 00:45:05.000000000 +0800
    @@ -104,6 +104,7 @@
         if ($parse_head && $response->content_type eq 'text/html') {
             require HTML::HeadParser;
             $parser = HTML::HeadParser->new($response->{'_headers'});
    +        $parser->utf8_mode(1);
         }
         my $content_size = 0;
    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.4.3 (GNU/Linux)

    iD8DBQFEqptii9gubzC5S1wRAvHqAJ4zsxBTvoVFV+9MVX9cDK1rz0SgRgCeLbOM
    LWBd8tpfdtF/yELWyPAsTQo=
    =sfar
    -----END PGP SIGNATURE-----
Well, I have reviewed the HTML::Parser POD and its code again. I believe my patch on $parser->utf8_mode(1) is the correct answer. Could you please fix it? Thank you.
From: imacat [...] mail.imacat.idv.tw
Hi. This is imacat from Taiwan. Here is a revised patch that works with Perl older than 5.8, which does not have UTF-8 mode, and with HTML::Parser older than 3.40, which does not have utf8_mode. The previous patch does not work in either of those cases. Please use this patch instead of the previous one. Thank you.
Attachment: libwww-perl-5.805-u8parse-2.diff.asc (application/octet-stream, 775b). Message body not shown because it is not plain text.
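
The body of the revised attachment is not shown in this ticket, so what follows is not the applied patch. It is only a minimal sketch of one way the compatibility check described above could be written, assuming a version test on `$]` and a `can` lookup are acceptable guards. The helper name `enable_utf8_mode_if_supported` is hypothetical and does not appear in libwww-perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from the ticket's attachment): switch a
# parser object to utf8_mode only when the running Perl and the parser
# class both support it. Returns 1 if utf8_mode was enabled, 0 otherwise.
sub enable_utf8_mode_if_supported {
    my ($parser) = @_;

    # Per the ticket, Perl older than 5.8 has no UTF-8 mode.
    return 0 unless $] >= 5.008;

    # Per the ticket, HTML::Parser older than 3.40 has no utf8_mode method.
    return 0 unless $parser->can('utf8_mode');

    $parser->utf8_mode(1);
    return 1;
}

# Usage: runs only if HTML::HeadParser (and its dependencies) can be loaded.
eval {
    require HTML::HeadParser;
    my $parser  = HTML::HeadParser->new;
    my $enabled = enable_utf8_mode_if_supported($parser);
    print "utf8_mode ", ($enabled ? "enabled" : "not available"), "\n";
    1;
};
```

Checking for the method at run time avoids hard-coding a minimum HTML::Parser version, at the cost of silently leaving the warning in place on older installations.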

Applied. In 5.806.

