
This queue is for tickets about the HTML-Format CPAN distribution.

Report information
The Basics
Id: 69426
Status: open
Priority: 0/
Queue: HTML-Format

People
Owner: Nobody in particular
Requestors: jik [...] kamens.brookline.ma.us
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 2.05
Fixed in: (no value)



Subject: &rsquo; in HTML input yields garbage character in PostScript output
Test script:

---cut here---
#!/usr/bin/perl
use HTML::TreeBuilder;
use HTML::FormatPS;
$html = "<html><body>it&rsquo;s an apostrophe</body></html>";
$tree = HTML::TreeBuilder->new_from_content($html);
$formatter = HTML::FormatPS->new();
$ps = $formatter->format($tree);
binmode STDOUT;
print $ps;
---cut here---

Redirect the output of the script to test.ps and then view test.ps and you'll see that there's a garbage character where the apostrophe is supposed to be.
This should be fixed in 2.08. See https://github.com/nigelm/html-format/commit/58fc839da0a0102d80c43acc1376347c7e56153e
From jik [...] kamens.us Wed Jul 13 17:19:01 2011
You fixed &rsquo;, but it looks like you didn't fix &rdquo; or &ldquo;, and I don't know whether you fixed &rdquo;. Is it possible to do a more comprehensive fix that covers all the HTML entities that could cause problems? Thanks.
From jik [...] kamens.us Wed Jul 13 17:19:42 2011
Sorry, I meant to say I don't know whether you fixed &lsquo;
On Wed Jul 13 17:19:43 2011, jik@kamens.us wrote:
> Sorry, I meant to say I don't know whether you fixed &lsquo;
&lsquo; is fixed in 2.08.

The double quote sets cannot be fixed without just mapping both the opening and closing (left/right) quote sets to &quot;, which would have people screaming about that too.

The PostScript output uses Latin-1 encoding. If you look at the Latin-1 character set - http://www.utoronto.ca/web/HTMLdocs/NewHTML/iso_table.html - you will see that there is only one double quote character. So to make this work correctly we would have to do one of the following:

- change the PostScript encoding (along with the embedded code font encoding vector)
- use a hacked Latin-1 encoding with 2 glyphs replaced with double quote chars
- special-case the double quote chars so the string is rendered differently

Any of these is a bit of a hack (the best one is just making it handle Unicode throughout, but that's a ton of work and would mean a huge boilerplate encoding vector). Alternative solutions welcome, but I don't think there is a reasonable fix.
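
The Latin-1 limitation described above can be checked directly. What follows is only a minimal sketch using the core Encode module; the helper name in_latin1 and the small table of entities are made up for illustration and are not part of HTML::Format.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of HTML::Format): returns 1 if $char can be
# encoded as ISO-8859-1 (Latin-1), 0 otherwise.
sub in_latin1 {
    my ($char) = @_;
    my $copy = $char;    # encode() may modify its input when a CHECK is given
    # FB_CROAK makes encode() die on a character the target encoding lacks.
    my $ok = eval { Encode::encode('iso-8859-1', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

# Code points of the quote entities discussed in this ticket.
my %quotes = (
    quot  => "\x{22}",      # plain ASCII double quote
    lsquo => "\x{2018}",    # left single quotation mark
    rsquo => "\x{2019}",    # right single quotation mark
    ldquo => "\x{201C}",    # left double quotation mark
    rdquo => "\x{201D}",    # right double quotation mark
);

for my $name (sort keys %quotes) {
    printf "&%s; U+%04X %s\n", $name, ord( $quotes{$name} ),
        in_latin1( $quotes{$name} ) ? 'in Latin-1' : 'not in Latin-1';
}
```

Only &quot; should be reported as representable, which is why each option in the list above involves stepping outside plain Latin-1 in some way.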
From jik [...] kamens.us Thu Jul 14 13:49:46 2011
Any of the options you listed is better than what happens now, which is that &ldquo; and &rdquo; show up as garbage characters.
On Thu Jul 14 13:49:47 2011, jik@kamens.us wrote:
> Any of the options you listed is better than what happens now, which is
> that &ldquo; and &rdquo; show up as garbage characters.
The unmappable characters should now be replaced by ? chars; the Encode to latin1 should do that. However, I have changed all the double quote code points to map to ", which is wrong, but it is the best that can be done without significant re-architecting.

I would love someone to do the work of reimplementing the whole thing in Unicode throughout, but I took this on as a basic maintainer and do not intend to get into serious rewrite work.

2.09 has just been uploaded.
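
The behaviour described above can be sketched roughly as follows. This is not the actual 2.09 code: the helper name to_latin1_lossy and the exact set of double quote code points (U+201C through U+201F) are assumptions made for illustration. It relies on the documented default behaviour of Encode::encode, which substitutes the encoding's replacement character (? for Latin-1) for anything it cannot map.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not taken from HTML::FormatPS: fold curly double
# quotes to the ASCII double quote, then encode to Latin-1 bytes.
sub to_latin1_lossy {
    my ($text) = @_;

    # Assumed set of "double quote code points": U+201C .. U+201F.
    $text =~ s/[\x{201C}-\x{201F}]/"/g;

    # With no CHECK argument, encode() replaces each unmappable character
    # with the substitution character, which is "?" for ISO-8859-1.
    return Encode::encode( 'iso-8859-1', $text );
}

binmode STDOUT;    # the result is a byte string, so write it unmodified
print to_latin1_lossy("\x{201C}quoted\x{201D} and a snowman \x{2603}"), "\n";
```

The trade-off is the one stated in the message: opening and closing quotes become indistinguishable, and any other character outside Latin-1 is lost rather than rendered as garbage.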

