This queue is for tickets about the HTML-Format CPAN distribution.

Report information
The Basics
Id: 9700
Status: open
Priority: 0/
Queue: HTML-Format

People
Owner: nigel.metheringham [...] gmail.com
Requestors: lulu [...] lululand.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 2.04
Fixed in: (no value)



Subject: FormatText.pm corrupts multi-byte Unicode characters
An HTML file containing multi-byte Unicode text will have some of the text corrupted.

I have attached a sample HTML file that demonstrates the problem.

I am running Perl 5.8.5, Linux FC3, i686, using HTML-Format 2.0.4.
Attachment: file1.html (text/html, 5.4k)

Here's another Unicode test.
Spanish:  ¿Dónde Está la Unicode?
 
French:  Il a été affligé par une maladie grave à 13 ans.
 
German:  Bleigießen, Wörterbuch über
 
Norwegian:  Från och med 1/1 2005 är det fri entré till museets utställningar.
 
Swedish:  atomvåpen vært større
 
 
Chinese:  十峰中文学校
 
Vietnamese:  giá sản phẩm TV kỹ thuật số 
 
 Arabic: بيانات صحفية حكومي
 
 
 
 
From: lulu [...] lululand.com
The problem, according to comments in the sub function of Formatter.pm, comes from a tr that attempts to handle soft hyphens. Commenting out that line fixes the problem. I think it is better to leave multi-byte characters intact than to translate soft hyphens.

I have attached a patch. It applies on top of the patch I previously submitted for bug #9602.

[guest - Thu Jan 13 22:41:18 2005]:
> An HTML file containing multi-byte Unicode text will have some of the
> text corrupted.
>
> I have attached a sample HTML file that demonstrates the problem.
>
> I am running Perl 5.8.5, Linux FC3, i686, using HTML-Format 2.0.4.
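For anyone following along, here is a minimal sketch of why that tr can damage UTF-8 input. It assumes the text reaches the formatter as undecoded octets; the sample string and variable names are made up for illustration and do not come from the module. In UTF-8, "à" (U+00E0) is the byte pair C3 A0, so a byte-level tr that maps \xA0 to a space splits the character in two.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sample text (an assumption for illustration): "à 13 ans", where "à" is U+00E0.
my $chars = "\x{E0} 13 ans";

# As UTF-8 octets, "à" becomes the two bytes C3 A0.
my $bytes = encode('UTF-8', $chars);

# The tr from the patch applied to undecoded octets: the A0 continuation
# byte is turned into a space, leaving a lone C3 lead byte behind.
(my $on_bytes = $bytes) =~ tr/\xA0\xAD/ /d;

# The same tr applied to a decoded character string leaves "à" alone,
# because U+00E0 is neither U+00A0 nor U+00AD.
(my $on_chars = decode('UTF-8', $bytes)) =~ tr/\xA0\xAD/ /d;

# Print everything as hex so the terminal encoding does not matter.
printf "octets before:       %s\n", unpack('H*', $bytes);
printf "octets after tr:     %s\n", unpack('H*', $on_bytes);
printf "characters after tr: %s\n", unpack('H*', encode('UTF-8', $on_chars));
```

The byte \xAD is deleted in the same way, so characters such as "í" (C3 AD) would lose their second byte instead of gaining a space.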
Attachment: 9700.patch (text/x-diff, 470b)
--- FormatText.pm  2005-01-13 19:33:09.000000000 -0800
+++ FormatText.pm.sav  2005-01-13 19:35:21.000000000 -0800
@@ -188,10 +188,7 @@
     my $self = shift;
     my $text = shift;
 
-    # uncomment the following if you want soft-hyphen translation.
-    # (according to Formatter.pm)
-    # however, it will corrupt multi-byte unicode characters.
-#    $text =~ tr/\xA0\xAD/ /d;
+    $text =~ tr/\xA0\xAD/ /d;
 
     if (defined $self->{vspace}) {
         if ($self->{out}) {
From: lulu [...] lululand.com
I created the previous patch incorrectly. Attached is the corrected version. My sincere apologies.
Attachment: 9700.patch (text/x-diff, 470b)
--- FormatText.pm.sav  2005-01-13 19:35:21.000000000 -0800
+++ FormatText.pm  2005-01-13 19:33:09.000000000 -0800
@@ -188,7 +188,10 @@
     my $self = shift;
     my $text = shift;
 
-    $text =~ tr/\xA0\xAD/ /d;
+    # uncomment the following if you want soft-hyphen translation.
+    # (according to Formatter.pm)
+    # however, it will corrupt multi-byte unicode characters.
+#    $text =~ tr/\xA0\xAD/ /d;
 
     if (defined $self->{vspace}) {
         if ($self->{out}) {
From: martin.ferrari [...] gmail.com
On Sat Jan 15 01:45:03 2005, guest wrote:
> I created the previous patch incorrectly. Attached is the corrected
> version. My sincere apologies.

From what I understand, this is a bug in HTML::TreeBuilder, which doesn't set the utf8 flag when reading utf8 content. See this example:

$ perl -Iblib/lib -e '
use encoding "utf-8", STDOUT => "utf-8";
use utf8;
use HTML::Element;
use HTML::FormatText;
$e = new HTML::Element("p");
$e->push_content("fóo");
print utf8::is_utf8($e->as_XML) ? "is" : "is not"," UTF-8\n";
print HTML::FormatText->format_string($e->as_XML);'
is UTF-8
fóo
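If that diagnosis holds, one caller-side workaround would be to decode the input into a character string before it reaches the parser, and to encode again on output. The sketch below uses only the core Encode module and stands in for that boundary; the sample octets and variable names are assumptions, and the hand-off to HTML::TreeBuilder and HTML::FormatText is deliberately left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 octets as they might come from a file or socket (sample data):
# "fóo", a no-break space (C2 A0), then "bar" containing a soft hyphen (C2 AD).
my $octets = "f\xC3\xB3o\xC2\xA0ba\xC2\xADr";

# Decode once at the input boundary so the rest of the code sees characters.
my $text = decode('UTF-8', $octets);

print utf8::is_utf8($octets) ? "is" : "is not", " flagged before decoding\n";
print utf8::is_utf8($text)   ? "is" : "is not", " flagged after decoding\n";

# On a decoded string the tr only touches real U+00A0 and U+00AD characters:
# the no-break space becomes a plain space and the soft hyphen is deleted.
$text =~ tr/\xA0\xAD/ /d;

# Encode once at the output boundary.
my $out = encode('UTF-8', $text);
printf "result octets: %s\n", unpack('H*', $out);
```

With decoded input the "ó" survives, which matches the behaviour shown in the one-liner above where the string is already flagged as UTF-8.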
Is this still an issue with current perls and/or current HTML::TreeBuilder? [a failing test would be really useful here] If I don't hear anything back on this I'll close it down - I've just taken on maintenance of this module and am trying to clear the RT queue.
Subject: [rt.cpan.org #9700] Problem still exists in 2.11
Date: Thu, 13 Nov 2014 17:55:04 +0700
From: Pongtawat Chippimolchai <pongtawat.c [...] gmail.com>
I just ran into the problem described by this bug in HTML-Format 2.11. FormatText still corrupts Thai UTF-8 content because the tr line is still there. It could easily be solved by commenting out that tr line.

HTML-Format 2.11, Perl 5.14.2
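To illustrate why Thai text would be hit, here is a small sketch that lists the code points in a range whose UTF-8 encoding contains one of the two bytes the tr touches. The helper name affected_code_points is hypothetical, and the range U+0E01 to U+0E5B is simply the assigned part of the Thai block; the sketch assumes, as above, that the formatter sees undecoded octets.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the code points in [$from, $to] whose UTF-8
# encoding contains the byte A0 or AD, i.e. the bytes matched by
# tr/\xA0\xAD/ /d when it runs on undecoded octets.
sub affected_code_points {
    my ($from, $to) = @_;
    return grep { encode('UTF-8', chr($_)) =~ /[\xA0\xAD]/ } $from .. $to;
}

# Assigned part of the Thai block.
my @thai = affected_code_points(0x0E01, 0x0E5B);

printf "U+%04X encodes as %s\n", $_, unpack('H*', encode('UTF-8', chr($_)))
    for @thai;
```

Any text containing the listed characters would lose or gain a byte when passed through the tr as raw octets, which is consistent with the corruption reported here.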