This queue is for tickets about the HTML-Strip CPAN distribution.

Report information
The Basics
Id: 42834
Status: open
Priority: 0/
Queue: HTML-Strip

People
Owner: Nobody in particular
Requestors: eugenek [...] 45-98.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 1.06
Fixed in: (no value)



Subject: HTML::Strip breaks UTF-8
HTML::Strip breaks UTF-8; see the attached file. 1.html contains correct UTF-8 text in Russian. Just run 1.pl and it outputs broken UTF-8. The result is the same with perl v5.8.5 on RHEL 4 and v5.8.8 on Etch.
Attachment: html_strip_bug_utf8.zip (application/zip, 1.8k)

From: eugenek [...] 45-98.org
Fixed test case. It looks like the "—" character is the cause of this bug!
Attachment: more_test.zip (application/zip, 1.2k)

From: pat [...] aers.ca
On Tue Jan 27 12:13:58 2009, gnudist wrote:
> Fixed test case. Looks like "—" thing is the reason of this bug!

I am still seeing "broken" UTF-8. Or, more specifically, double-encoded UTF-8.

In the attached example, there are two UTF-8 3-byte characters, and they both turn into 6-byte sequences on return.

Original: E2 80 99 (RIGHT SINGLE QUOTATION MARK)
Returns as: C3 A2 C2 80 C2 99

Original: E2 80 9D (RIGHT DOUBLE QUOTATION MARK)
Returns as: C3 A2 C2 80 C2 9D
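
The byte patterns reported here match what you would get if each UTF-8 byte were misread as a Latin-1 character and then encoded as UTF-8 a second time. The following is only a language-neutral sketch of that arithmetic (written in Python rather than Perl, and not taken from HTML::Strip itself); the helper names are hypothetical, and the Latin-1 misreading is an assumption about the cause, not a confirmed diagnosis.

```python
# Illustration only: reproduce the byte patterns reported in this ticket.
# Assumption: the corruption comes from treating each UTF-8 byte as a
# Latin-1 character and encoding the result as UTF-8 a second time.


def double_encode(utf8_bytes: bytes) -> bytes:
    """Misread UTF-8 bytes as Latin-1 text, then encode as UTF-8 again."""
    return utf8_bytes.decode("latin-1").encode("utf-8")


def undo_double_encode(mangled: bytes) -> bytes:
    """Reverse the mistake: decode UTF-8, then map each char back to one byte."""
    return mangled.decode("utf-8").encode("latin-1")


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, as used in the report."""
    return " ".join(f"{b:02X}" for b in data)


if __name__ == "__main__":
    for original in (bytes.fromhex("E28099"), bytes.fromhex("E2809D")):
        mangled = double_encode(original)
        print(hex_bytes(original), "->", hex_bytes(mangled))
        # The damage is reversible as long as no bytes were dropped.
        assert undo_double_encode(mangled) == original
```

Under that assumption, each 3-byte character grows to 6 bytes because every byte at or above 0x80 becomes a 2-byte UTF-8 sequence, which is consistent with the sizes reported in this message.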
Attachment: broken.tar (application/octet-stream, 10k)

Will there be a fix someday for this?
Workaround:

----- BEGIN CODE -----
use strict;
use warnings;
use open ':std', ':locale';

use LWP::UserAgent qw( );
use HTML::Strip qw( );
use HTML::Entities qw( decode_entities );

my $url = $ARGV[0];
my $ua = LWP::UserAgent->new();
my $response = $ua->get($url);
die $response->status_line() if !$response->is_success();
my $decoded_html = $response->decoded_content();

# Leave entities alone inside HTML::Strip; decode them afterwards.
my $hs = HTML::Strip->new( decode_entities => 0 );

# Hand HTML::Strip UTF-8 bytes, then turn its output back into characters.
utf8::encode( my $utf8_html = $decoded_html );
my $utf8_text = $hs->parse( $utf8_html );
utf8::decode( my $decoded_text = $utf8_text );

$decoded_text = decode_entities($decoded_text);
print $decoded_text;
----- END CODE -----
On Thu 16 July 2009 12:20:41, plyn wrote:
> On Tue Jan 27 12:13:58 2009, gnudist wrote:
> > Fixed test case. Looks like "—" thing is the reason of this bug!
>
> I am still seeing "broken" UTF-8. Or, more specifically Double Encoded
> UTF-8.
>
> In the attached example, there are two UTF-8 3 byte characters, and they
> both turn into 6 byte characters on return.
>
> Original: E2 80 99 (RIGHT SINGLE QUOTATION MARK)
> Returns as: C3 A2 C2 80 C2 99
>
> Original: E2 80 9D (RIGHT DOUBLE QUOTATION MARK)
> Returns as: C3 A2 C2 80 C2 9D

Easily confirmed:

$ perl -wle 'use utf8; use HTML::Strip; my $str = "←↓→"; print "utf8_flag: " . utf8::is_utf8($str); my $str2 = HTML::Strip->new()->parse($str); print "utf8_flag: " . utf8::is_utf8($str2);'
utf8_flag: 1
utf8_flag:

Workaround for real code:

use Encode;
use utf8;
use HTML::Strip;

my $str = "←↓→";
my $utf8_was_on = Encode::is_utf8($str);
my $str2 = HTML::Strip->new()->parse($str);
$utf8_was_on && ($HTML::Strip::VERSION <= 1.06) && Encode::_utf8_on($str2);

From: ashley [...] netspot.com.au
None of the workarounds work in my case. See my attached test script. If you comment out the "use encoding 'utf8'" line, encode_utf8() returns the correct string (s²). With the "use encoding 'utf8'" line in place, however, I can't get the correct string, even after trying all of the above workarounds. Using HTML::Entities to decode the entities has the same problem!
Attachment: testhtmlstrip.pl (text/x-perl, 284b)

use encoding 'utf8';
use Encode;
use HTML::Strip;

my $htmlstrip = HTML::Strip->new();
my $match = {};
$text = 's&sup2;';
$text = $htmlstrip->parse($text);
print "not encoded: " . $text . "\n";
print "encoded: " . encode_utf8($text) . "\n";
print "STRIPPED TEXT: " . $text . "\n";

Subject: possible workaround - HTML::Strip breaks UTF-8
I discussed this in detail with Zefram and ilmari. Here's a possible workaround, which seems to work at least in my case: https://gist.github.com/910818
RT-Send-CC: ashley [...] netspot.com.au, pat [...] aers.ca
On Fri Apr 08 18:03:42 2011, OSFAMERON wrote:
> I discussed this in detail with Zefram and ilmari. Here's a possible
> workaround, which seems to work at least in my case:
>
> https://gist.github.com/910818
and here's a github repo with that workaround