This queue is for tickets about the XML-Atom-SimpleFeed CPAN distribution.

Report information
The Basics
Id: 19722
Status: resolved
Priority: 0/
Queue: XML-Atom-SimpleFeed

People
Owner: ARISTOTLE [...] cpan.org
Requestors: karjala [...] karjala.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Can't make it to work with international charsets
MIME-Version: 1.0
X-Mailer: MIME-tools 5.418 (Entity 5.418)
Content-Type: text/plain; charset="utf8"
Content-Disposition: inline
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
Content-Length: 232
Encoding is "us-ascii", and any international content I add appears garbled in my RSS reader. I think two things need to be changed to fix the situation: encoding="UTF-8", and not create &#231;-type entities out of non-ascii bytes.
X-Scanned-BY: AMaViS-ng at bestpractical
MIME-Version: 1.0
X-Y-GMX-Trusted: 0
X-Spam-Status: No, hits=-2.6 required=8.0 tests=BAYES_00,SPF_PASS
In-Reply-To: <rt-3.5.HEAD-2046-1149569781-1069.19722-4-0 [...] rt.cpan.org>
Content-Disposition: inline
X-Authenticated: #163624
Received-SPF: pass (x1.develooper.com: domain of pagaltzis [...] gmx.de designates 213.165.64.20 as permitted sender)
Content-Type: text/plain; charset="utf-8"
X-RT-Original-Encoding: us-ascii
Received: from localhost (localhost.localdomain [127.0.0.1]) by diesel.bestpractical.com (Postfix) with ESMTP id C68364D818F for <cpan-bug+xml-atom-simplefeed [...] diesel.bestpractical.com>; Tue, 6 Jun 2006 20:56:39 -0400 (EDT)
Received: from la.mx.develooper.com (x1.develooper.com [63.251.223.170]) by diesel.bestpractical.com (Postfix) with SMTP id 77F744D80B0 for <bug-XML-Atom-SimpleFeed [...] rt.cpan.org>; Tue, 6 Jun 2006 20:56:39 -0400 (EDT)
Received: (qmail 7846 invoked by alias); 7 Jun 2006 00:56:39 -0000
Received: from mail.gmx.net (HELO mail.gmx.net) (213.165.64.20) by la.mx.develooper.com (qpsmtpd/0.28) with SMTP; Tue, 06 Jun 2006 17:56:26 -0700
Received: (qmail invoked by alias); 07 Jun 2006 00:56:18 -0000
Received: from xdsl-213-196-243-94.netcologne.de (EHLO klangraum) [213.196.243.94] by mail.gmx.net (mp026) with SMTP; 07 Jun 2006 02:56:18 +0200
Delivered-To: cpan-bug+xml-atom-simplefeed [...] diesel.bestpractical.com
Subject: Re: [rt.cpan.org #19722] Can't make it to work with international charsets
User-Agent: Mutt/1.4.2.1i
Return-Path: <pagaltzis [...] gmx.de>
X-Spam-Check-BY: la.mx.develooper.com
X-Original-To: cpan-bug+xml-atom-simplefeed [...] diesel.bestpractical.com
Date: Wed, 7 Jun 2006 02:56:30 +0200
Message-Id: <20060607005630.GB6092 [...] klangraum>
To: bug-XML-Atom-SimpleFeed [...] rt.cpan.org
From: "A. Pagaltzis" <pagaltzis [...] gmx.de>
X-RT-Original-Encoding: utf-8
RT-Message-ID: <rt-3.5.HEAD-2040-1149641804-1857.19722-0-0 [...] rt.cpan.org>
Content-Length: 687
> Subject: XML-Atom-SimpleFeed > Date: Tue, 06 Jun 2006 04:14:32 +0300 > > Is there a way to change the encoding of a feed to UTF-8? I'm > asking because I have a Greek feed, which I fill with Unicode > data from a database, and your module creates a feed with > "us-ascii" and turns all the Greek letters into &#207; etc. > entities, which makes the RSS file unreadable when opening with > a text editor. > > If there is no way to change the encoding, could you please > change the default encoding of your module to UTF-8, as UTF-8 > is standard nowadays for XML? > > Thank you. > > - Alex > > P.S. Another nice addition might be to produce tidy XML code, > with newlines and tabs.
MIME-Version: 1.0
X-Mailer: MIME-tools 5.418 (Entity 5.418)
Content-Disposition: inline
Message-Id: <rt-3.5.HEAD-2040-1149642770-567.19722-0-0 [...] rt.cpan.org>
Content-Type: text/plain; charset="utf8"
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
X-RT-Original-Encoding: utf-8
Content-Length: 1394
On Tue Jun 06 00:56:21 2006, guest wrote:
> Encoding is "us-ascii", and any international content I add appears > garbled in my RSS reader.
Garbled? You mean the content displays incorrectly in an aggregator? If so, then that would be a serious problem and I'd like to ask for some sample code that reproduces the problem. Or are you referring to the same issue as in the other mail, i.e. the source is merely unreadable, though it gets decoded as it should by aggregators?
> I think two things need to be changed to fix the situation: > encoding="UTF-8", and not create &#231;-type entities out of non-ascii > bytes.
I didn't think about feeds with content in non-Latin-based scripts, that is true. They would be unreadable when viewed as source. My decision to output only "us-ascii" as encoding was based on the fact that many servers are misconfigured and will produce wrong headers; it is also harder to handle things perfectly correctly on the Perl side, making sure that non-ASCII / non-UTF8 strings are properly upgraded to Unicode. Restricting output to ASCII seemed like the easiest way to ensure minimum possible breakage. I guess I'll have to think of a way to make it easy to get encodings right... it's not as simple a question as it seems, because I want the module to be somewhat robust about encodings even in the face of people who don't know exactly what they are doing.
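The trade-off described above can be sketched in a few lines. This is an illustration in Python, not the module's Perl code. It simulates a mislabelled feed with a wrong XML declaration, which stands in for a wrong HTTP charset header; that substitution is an assumption of this example. The sample text and variable names are made up.

```python
import xml.etree.ElementTree as ET

TEXT = "αβγ"  # sample Greek text, chosen for this example

# Both documents carry a deliberately wrong encoding label (iso-8859-1).
WRONG_LABEL = b"<?xml version='1.0' encoding='iso-8859-1'?>"

# ASCII-only body: non-ASCII characters become numeric character references.
ascii_doc = (WRONG_LABEL + b"<title>"
             + TEXT.encode("ascii", "xmlcharrefreplace") + b"</title>")

# Raw UTF-8 body under the same wrong label.
utf8_doc = WRONG_LABEL + b"<title>" + TEXT.encode("utf-8") + b"</title>"

ascii_result = ET.fromstring(ascii_doc).text
utf8_result = ET.fromstring(utf8_doc).text

print("ASCII + references survives the wrong label:", ascii_result == TEXT)
print("raw UTF-8 survives the wrong label:", utf8_result == TEXT)
```

The ASCII-only document parses to the intended text whatever the label says, while the raw UTF-8 document is silently misread. The price is the unreadable source that the reporter complains about.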
X-Scanned-BY: AMaViS-ng at bestpractical
MIME-Version: 1.0
X-Spam-Status: No, hits=-2.6 required=8.0 tests=BAYES_00
In-Reply-To: <rt-3.5.HEAD-2040-1149642770-567.19722-6-0 [...] rt.cpan.org>
Received-SPF: neutral (x1.develooper.com: local policy)
References: <RT-Ticket-19722 [...] rt.cpan.org> <rt-3.5.HEAD-2040-1149642770-567.19722-6-0 [...] rt.cpan.org>
X-Virus-Checked: Checked
Content-Type: multipart/signed; protocol="application/x-pkcs7-signature"; micalg=sha1; boundary="------------ms060509090705020907000508"
Received: from localhost (localhost.localdomain [127.0.0.1]) by diesel.bestpractical.com (Postfix) with ESMTP id 26A514D81A2 for <cpan-bug+xml-atom-simplefeed [...] diesel.bestpractical.com>; Tue, 6 Jun 2006 22:01:52 -0400 (EDT)
Received: from la.mx.develooper.com (x1.develooper.com [63.251.223.170]) by diesel.bestpractical.com (Postfix) with SMTP id B69D04D809F for <bug-XML-Atom-SimpleFeed [...] rt.cpan.org>; Tue, 6 Jun 2006 22:01:51 -0400 (EDT)
Received: (qmail 29898 invoked by alias); 7 Jun 2006 02:01:50 -0000
Received: from fesscrpp2.tellas.gr (HELO fesscrpp2.tellas.gr) (62.169.194.3) by la.mx.develooper.com (qpsmtpd/0.28) with ESMTP; Tue, 06 Jun 2006 19:01:38 -0700
Received: from olive.karjala.org (84.254.12.128) by fesscrpp2.tellas.gr (7.0.028) id 43B54DD4007E6FA3 for bug-XML-Atom-SimpleFeed [...] rt.cpan.org; Wed, 7 Jun 2006 05:01:05 +0300
Received: from [192.168.1.12] (helo=[127.0.0.1]) by olive.karjala.org with esmtp (Exim 3.36 #1 (Debian)) id 1FnnLx-00087Y-00 for <bug-XML-Atom-SimpleFeed [...] rt.cpan.org>; Wed, 07 Jun 2006 05:01:01 +0300
Delivered-To: cpan-bug+xml-atom-simplefeed [...] diesel.bestpractical.com
Subject: Re: [rt.cpan.org #19722] Can't make it to work with international charsets
User-Agent: Thunderbird 1.5.0.4 (Windows/20060516)
Return-Path: <karjala [...] karjala.org>
X-Spam-Check-BY: la.mx.develooper.com
X-Original-To: cpan-bug+xml-atom-simplefeed [...] diesel.bestpractical.com
Date: Wed, 07 Jun 2006 05:00:26 +0300
Message-Id: <4486333A.9090506 [...] karjala.org>
To: bug-XML-Atom-SimpleFeed [...] rt.cpan.org
From: Alexander Karelas <karjala [...] karjala.org>
RT-Message-ID: <rt-3.5.HEAD-2046-1149645719-365.19722-0-0 [...] rt.cpan.org>
Content-Length: 0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-RT-Original-Encoding: utf-8
X-RT-Original-Encoding: utf-8
Content-Length: 1890
I don't know much about the problems that occur when the servers are misconfigured. But the way I found to solve this problem is to change the module to act as follows:
- say UTF-8 instead of us-ascii, and
- don't encode the characters from \x80 onwards.
And that's all. I don't know if that helps at all...
- Alex
Aristotle Pagaltzis via RT wrote:
> <URL: http://rt.cpan.org/Ticket/Display.html?id=19722 > > > On Tue Jun 06 00:56:21 2006, guest wrote: >
>> Encoding is "us-ascii", and any international content I add appears >> garbled in my RSS reader. >>
> > Garbled? You mean the content displays incorrectly in an aggregator? If > so, then that would be a serious problem and I'd like to ask for some > sample code that reproduces the problem. Or are you referring to the > same issue as in the other mail, i.e. the source is merely unreadable, > though it gets decoded as it should by aggregators? > >
>> I think two things need to be changed to fix the situation: >> encoding="UTF-8", and not create &#231;-type entities out of non-ascii >> bytes. >>
> > I didn't think about feeds with content in non-Latin-based scripts, that > is true. They would be unreadable when viewed as source. > > My decision to output only "us-ascii" as encoding was based on the fact > that many servers are misconfigured and will produce wrong headers; it > is also harder to handle things perfectly correctly on the Perl side, > making sure that non-ASCII / non-UTF8 strings are properly upgraded to > Unicode. Restricting output to ASCII seemed like the easiest way to > ensure minimum possible breakage. > > I guess I'll have to think of a way to make it easy to get encodings > right... it's not as simple a question as it seems, because I want the > module to be somewhat robust about encodings even in the face of people > who don't know exactly what they are doing. > >
Content-Description: S/MIME Cryptographic Signature
content-type: application/x-pkcs7-signature; name="smime.p7s"
content-disposition: attachment; filename="smime.p7s"
Content-Transfer-Encoding: base64
Content-Length: 3245
MIME-Version: 1.0
In-Reply-To: <rt-3.5.HEAD-2046-1149645719-365.19722-0-0 [...] rt.cpan.org>
X-Mailer: MIME-tools 5.418 (Entity 5.418)
Content-Disposition: inline
Message-Id: <rt-3.5.HEAD-2038-1149647754-1723.19722-0-0 [...] rt.cpan.org>
References: <RT-Ticket-19722 [...] rt.cpan.org> <rt-3.5.HEAD-2040-1149642770-567.19722-6-0 [...] rt.cpan.org> <4486333A.9090506 [...] karjala.org> <rt-3.5.HEAD-2046-1149645719-365.19722-0-0 [...] rt.cpan.org>
Content-Type: text/plain; charset="utf8"
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
X-RT-Original-Encoding: utf-8
Content-Length: 761
On Tue Jun 06 22:01:59 2006, KARJALA wrote:
> I don't know if that helps at all...
I already know that much. And your suggestion will work in the simple case that all input is consistently encoded and that you output the string correctly (seems to be the case in your code). But I'm not sure it's enough to deal with more complex scenarios robustly: how do I react if the caller gives me several non-Unicode strings in different encodings? How do I *detect* it? Just encoding everything and capping output at "us-ascii" at least ensures that the feed will always be well-formed XML no matter what. I will have to think about this. It's clearly going to be an issue for others too; people publishing in Asian scripts f.ex. will very likely need to use UTF-16. Hrmf.
X-Scanned-BY: AMaViS-ng at bestpractical
MIME-Version: 1.0
X-Spam-Status: No, hits=-2.6 required=8.0 tests=BAYES_00
In-Reply-To: <rt-3.5.HEAD-2038-1149647754-1723.19722-6-0 [...] rt.cpan.org>
Received-SPF: neutral (x1.develooper.com: local policy)
References: <RT-Ticket-19722 [...] rt.cpan.org> <rt-3.5.HEAD-2040-1149642770-567.19722-6-0 [...] rt.cpan.org> <4486333A.9090506 [...] karjala.org> <rt-3.5.HEAD-2046-1149645719-365.19722-6-0 [...] rt.cpan.org> <rt-3.5.HEAD-2038-1149647754-1723.19722-6-0 [...] rt.cpan.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
X-RT-Original-Encoding: utf-8
Received: from localhost (localhost.localdomain [127.0.0.1]) by diesel.bestpractical.com (Postfix) with ESMTP id A63304D80B3 for <cpan-bug+xml-atom-simplefeed [...] diesel.bestpractical.com>; Tue, 6 Jun 2006 23:16:14 -0400 (EDT)
Received: from la.mx.develooper.com (x1.develooper.com [63.251.223.170]) by diesel.bestpractical.com (Postfix) with SMTP id 17AFD4D809F for <bug-XML-Atom-SimpleFeed [...] rt.cpan.org>; Tue, 6 Jun 2006 23:16:13 -0400 (EDT)
Received: (qmail 24423 invoked by alias); 7 Jun 2006 03:16:13 -0000
Received: from fesscrpp1.tellas.gr (HELO fesscrpp1.tellas.gr) (62.169.194.2) by la.mx.develooper.com (qpsmtpd/0.28) with ESMTP; Tue, 06 Jun 2006 20:16:02 -0700
Received: from olive.karjala.org (84.254.12.128) by fesscrpp1.tellas.gr (7.0.028) id 43B53EDA0088EC69 for bug-XML-Atom-SimpleFeed [...] rt.cpan.org; Wed, 7 Jun 2006 06:15:32 +0300
Received: from [192.168.1.12] (helo=[127.0.0.1]) by olive.karjala.org with esmtp (Exim 3.36 #1 (Debian)) id 1FnoW0-00009V-00 for <bug-XML-Atom-SimpleFeed [...] rt.cpan.org>; Wed, 07 Jun 2006 06:15:28 +0300
Delivered-To: cpan-bug+xml-atom-simplefeed [...] diesel.bestpractical.com
Subject: Re: [rt.cpan.org #19722] Can't make it to work with international charsets
User-Agent: Thunderbird 1.5.0.4 (Windows/20060516)
Return-Path: <karjala [...] karjala.org>
X-Spam-Check-BY: la.mx.develooper.com
X-Original-To: cpan-bug+xml-atom-simplefeed [...] diesel.bestpractical.com
Date: Wed, 07 Jun 2006 06:14:58 +0300
Message-Id: <448644B2.8030600 [...] karjala.org>
To: bug-XML-Atom-SimpleFeed [...] rt.cpan.org
Content-Transfer-Encoding: 7bit
From: Alexander Karelas <karjala [...] karjala.org>
X-RT-Original-Encoding: utf-8
RT-Message-ID: <rt-3.5.HEAD-2028-1149650180-1927.19722-0-0 [...] rt.cpan.org>
Content-Length: 436
I don't know if it's possible to detect the encoding. Maybe you could ask the user to provide during object creation: (1) an optional global encoding for the feed (which defaults to utf-8), and (2) an optional encoding for each feed item (which defaults to the global encoding), and then have one of the ready modules in CPAN translate the items' texts from encoding #2 to encoding #1. That's the only solution I can come up with.
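A rough sketch of the scheme proposed above, written in Python for illustration only. The function name `transcode_item` and its parameters are hypothetical and are not part of XML::Atom::SimpleFeed or of any CPAN module.

```python
from typing import Optional


def transcode_item(raw: bytes,
                   item_encoding: Optional[str] = None,
                   feed_encoding: str = "utf-8") -> bytes:
    """Convert one item's octets from its own encoding to the feed encoding.

    Hypothetical helper: the per-item encoding falls back to the feed-wide
    encoding, which itself defaults to UTF-8, as the proposal suggests.
    """
    source = item_encoding or feed_encoding
    text = raw.decode(source)          # UnicodeDecodeError on malformed input
    return text.encode(feed_encoding)  # UnicodeEncodeError if not representable


# An item stored as ISO-8859-7 (Greek) ends up as UTF-8 in the feed.
greek_item = "αβγ".encode("iso-8859-7")
feed_bytes = transcode_item(greek_item, item_encoding="iso-8859-7")
print(feed_bytes.decode("utf-8"))
```

The scheme still depends on the caller declaring each encoding correctly; nothing here detects a wrong declaration.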
MIME-Version: 1.0
X-Mailer: MIME-tools 5.418 (Entity 5.418)
Content-Disposition: inline
Message-Id: <rt-3.6.HEAD-2076-1157730373-587.19722-0-0 [...] rt.cpan.org>
Content-Type: text/plain; charset="utf8"
Content-Transfer-Encoding: binary
From: JROCKWAY [...] cpan.org
X-RT-Original-Encoding: utf-8
X-RT-Original-Encoding: utf-8
Content-Length: 1091
Unless you're working on this right now, I think I'll have a patch for this problem soon. What your module is doing is encoding the individual octets of utf-8 characters as entities. You can look at http://blog.jrock.us/feeds/article/%E9%9B%BB%E8%BB%8A%E7%94%B7/xml for an example of this. All the Japanese comes out garbled because the entities represent individual octets of the multi-byte character sequence. My solution to this is to just set the encoding to utf-8 and dump the raw octets that perl uses internally (utf-8). If you want, I could add a charset config option and try to have Encode do a charset conversion (and throw an exception if it's not possible to represent the content in memory in that charset). Regards, Jonathan Rockway On Tue Jun 06 00:56:21 2006, guest wrote:
> Encoding is "us-ascii", and any international content I add appears > garbled in my RSS reader. > > I think two things need to be changed to fix the situation: > encoding="UTF-8", and not create &#231;-type entities out of non-ascii > bytes.
-- Jonathan Rockway <jrockway@cpan.org>
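The symptom described in this message can be reproduced outside Perl. The following Python sketch is an illustration, not the module's code, and the two helper names are made up. It contrasts one numeric character reference per character with one per octet, using the three characters from the percent-encoded URL above.

```python
def ncr_per_character(text: str) -> str:
    """Escape decoded text: one reference per non-ASCII character."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


def ncr_per_octet(octets: bytes) -> str:
    """Escape undecoded UTF-8 octets: one reference per non-ASCII byte."""
    return "".join(chr(b) if b < 128 else f"&#{b};" for b in octets)


title = "電車男"  # the characters behind %E9%9B%BB%E8%BB%8A%E7%94%B7

correct = ncr_per_character(title)
garbled = ncr_per_octet(title.encode("utf-8"))

print(correct)  # three references, one per character
print(garbled)  # nine references, one per UTF-8 octet
```

A feed reader resolves each of the nine per-octet references to a separate Latin-1-range character, which is why the Japanese text appears garbled. Decoding the octets to text before escaping produces the three correct references.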
MIME-Version: 1.0
X-Y-GMX-Trusted: 0
X-Spam-Status: No, hits=-2.6 required=8.0 tests=BAYES_00,SPF_PASS
In-Reply-To: <rt-3.6.HEAD-2076-1157730373-587.19722-5-0 [...] rt.cpan.org>
Content-Disposition: inline
X-Authenticated: #163624
Received-SPF: pass (x1.develooper.com: domain of pagaltzis [...] gmx.de designates 213.165.64.20 as permitted sender)
References: <RT-Ticket-19722 [...] rt.cpan.org> <rt-3.6.HEAD-2076-1157730373-587.19722-5-0 [...] rt.cpan.org>
Content-Type: text/plain; charset=utf-8
X-RT-Original-Encoding: utf-8
Received: from la.mx.develooper.com (ss1.fabel.dk [63.251.223.179]) by diesel.bestpractical.com (Postfix) with SMTP id 2E15D4D809E for <bug-XML-Atom-SimpleFeed [...] rt.cpan.org>; Fri, 8 Sep 2006 18:55:12 -0400 (EDT)
Received: (qmail 28891 invoked by alias); 8 Sep 2006 22:55:11 -0000
Received: from mail.gmx.net (HELO mail.gmx.net) (213.165.64.20) by la.mx.develooper.com (qpsmtpd/0.28) with SMTP; Fri, 08 Sep 2006 15:55:08 -0700
Received: (qmail invoked by alias); 08 Sep 2006 22:55:02 -0000
Received: from xdsl-84-44-230-66.netcologne.de (EHLO klangraum) [84.44.230.66] by mail.gmx.net (mp017) with SMTP; 09 Sep 2006 00:55:02 +0200
Delivered-To: cpan-bug+xml-atom-simplefeed [...] diesel.bestpractical.com
Subject: Re: [rt.cpan.org #19722] Can't make it to work with international charsets
User-Agent: Mutt/1.4.2.1i
Return-Path: <pagaltzis [...] gmx.de>
X-Spam-Check-BY: la.mx.develooper.com
X-Original-To: bug-XML-Atom-SimpleFeed [...] rt.cpan.org
Date: Sat, 9 Sep 2006 00:55:24 +0200
Message-Id: <20060908225524.GB17294 [...] klangraum>
To: Jonathan Rockway via RT <bug-XML-Atom-SimpleFeed [...] rt.cpan.org>
Content-Transfer-Encoding: 8bit
From: "A. Pagaltzis" <pagaltzis [...] gmx.de>
X-RT-Original-Encoding: utf-8
RT-Message-ID: <rt-3.6.HEAD-30554-1157756117-818.19722-0-0 [...] rt.cpan.org>
Content-Length: 1604
Hi Jonathan, * Jonathan Rockway via RT <bug-XML-Atom-SimpleFeed@rt.cpan.org> [2006-09-08 17:50]:
> All the Japanese comes out garbled because the entities > represent individual octets of the multi-byte character > sequence. > > My solution to this is to just set the encoding to utf-8 and > dump the raw octets that perl uses internally (utf-8).
I suggest you mark the string as UTF-8 prior to passing it in. Then X::A::SF will do the right thing without any further adjustments.
> If you want, I could add a charset config option and try to > have Encode do a charset conversion (and throw an exception if > it's not possible to represent the content in memory in that > charset).
I have something like that already planned, but I’m still thinking about it. I consciously chose to limit the output to US-ASCII because then it doesn’t matter what the caller does with the string: it will never be double-encoded in any form. But obviously, for people whose language is sufficiently far outside the US-ASCII charset, the result is an unreadable entity forest, so I will need to provide some way to specify an encoding. An issue with that is that with the current internal design, such an option could only be set by the constructor. I am wondering whether to enshrine this limitation in the API or to change the design. I know the encodings problem seems very simple to solve, but there’s more to it than you’d think. I want to make the API as transparent as possible WRT encodings, and that’s going to take some thinking. Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>
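The point about double encoding can be shown with a short sketch. This is Python for illustration only. The function `misdecode_and_reencode` is a made-up stand-in for a downstream layer that wrongly treats its input as Latin-1 text and encodes it again as UTF-8, one common way double encoding happens.

```python
def misdecode_and_reencode(octets: bytes) -> bytes:
    """Simulate a layer that wrongly treats octets as Latin-1, then emits UTF-8."""
    return octets.decode("latin-1").encode("utf-8")


text = "αβγ"  # sample text, chosen for this example

# ASCII-only output with numeric character references.
ascii_out = text.encode("ascii", "xmlcharrefreplace")
# Raw UTF-8 output.
utf8_out = text.encode("utf-8")

print(misdecode_and_reencode(ascii_out) == ascii_out)  # unchanged
print(misdecode_and_reencode(utf8_out) == utf8_out)    # corrupted
```

ASCII octets are identical in Latin-1 and UTF-8, so the extra pass leaves them untouched. The raw UTF-8 octets are expanded into a different, wrong byte sequence.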
MIME-Version: 1.0
In-Reply-To: <rt-3.6.HEAD-30554-1157756117-818.19722-0-0 [...] rt.cpan.org>
X-Mailer: MIME-tools 5.418 (Entity 5.418)
Content-Disposition: inline
References: <RT-Ticket-19722 [...] rt.cpan.org> <rt-3.6.HEAD-2076-1157730373-587.19722-5-0 [...] rt.cpan.org> <20060908225524.GB17294 [...] klangraum> <rt-3.6.HEAD-30554-1157756117-818.19722-0-0 [...] rt.cpan.org>
Message-Id: <rt-3.6.HEAD-454-1157989678-1758.19722-0-0 [...] rt.cpan.org>
Content-Type: text/plain; charset="utf8"
Content-Transfer-Encoding: binary
From: JROCKWAY [...] cpan.org
X-RT-Original-Encoding: utf-8
X-RT-Original-Encoding: utf-8
Content-Length: 606
You are absolutely right -- your module works fine. I must not be setting the utf8 flag somewhere -- OpenBSD has no concept of locales in its C library, so I have to do everything myself. Grumble, grumble. :)
> * Jonathan Rockway via RT <bug-XML-Atom-SimpleFeed@rt.cpan.org> [2006- > 09-08 17:50]:
> > All the Japanese comes out garbled because the entities > > represent individual octets of the multi-byte character > > sequence. > > > > My solution to this is to just set the encoding to utf-8 and > > dump the raw octets that perl uses internally (utf-8).
-- Jonathan Rockway <jrockway@cpan.org>
MIME-Version: 1.0
X-Mailer: MIME-tools 5.427 (Entity 5.427)
Charset: utf8
Content-Type: multipart/mixed; boundary="----------=_1245773102-13950-1172"
Message-ID: <rt-3.6.HEAD-13950-1245773102-615.19722-0-0 [...] rt.cpan.org>
X-RT-Original-Encoding: utf-8
Content-Length: 0
Content-Disposition: inline
Content-Type: text/plain
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
Content-Length: 1237
Attached documentation patch clarifies the findings about encoding; I reused some wording from above. Therein I also advise to delegate output transformation to specialised tools. Hopefully this is enough to close this bug. I wish to add my proverbial mustard to some other topics from this thread.
> how do I react if > the caller gives me several non-Unicode strings in different encodings?
You cannot outsmart people who are so terminally confused. No one else tries to. The documentation should make explicit that this module accepts Text strings (in perlunitut jargon) and DTRT with them. If someone wants to ignore that, then just let him: garbage in, garbage out.
> How do I *detect* it?
»Oh, that way madness lies; let me shun that.« The best heuristic unsurprisingly comes from NSUniversalDetector <http://www.mozilla.org/projects/intl/detectorsrc.html>, <http://search.cpan.org/dist/Encode-Detect/>, but IMO this has no place in X::A::SF.
> people publishing in Asian scripts f.ex. will very likely > need to use UTF-16.
No, my observation is that on the web, national encodings are the rule: GB18030 (often misdeclared as GB2312), Shift-JIS, Big5... Indic scripts standardised on UTF-8 due to their late-comer status.
MIME-Version: 1.0
X-Mailer: MIME-tools 5.427 (Entity 5.427)
Content-Type: multipart/mixed; boundary="----------=_1245773102-13950-1171"
Charset: utf8
Content-Length: 0
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: iso-8859-1
Content-Length: 0
Content-Type: text/x-patch; name="0001-RT-19722-docs-about-how-encoding-is-handled.patch"
Content-Disposition: inline; filename="0001-RT-19722-docs-about-how-encoding-is-handled.patch"
Content-Transfer-Encoding: binary
Content-Length: 1887
From 0861c3e55de3a07829c793560554c655b8ea6b82 Mon Sep 17 00:00:00 2001
From: =?utf-8?q?Lars=20D=C9=AA=E1=B4=87=E1=B4=84=E1=B4=8B=E1=B4=8F=E1=B4=A1=20=E8=BF=AA=E6=8B=89=E6=96=AF?= <daxim@cpan.org>
Date: Tue, 23 Jun 2009 16:46:45 +0200
Subject: [PATCH] RT #19722: docs about how encoding is handled

---
 lib/XML/Atom/SimpleFeed.pm | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/lib/XML/Atom/SimpleFeed.pm b/lib/XML/Atom/SimpleFeed.pm
index d401148..2c81def 100644
--- a/lib/XML/Atom/SimpleFeed.pm
+++ b/lib/XML/Atom/SimpleFeed.pm
@@ -700,6 +700,8 @@ The C<source> element is not and may never be supported.
 Nothing is done to ensure that text constructs with type C<xhtml> and entry contents using either that or an XML media type are well-formed. So far, this is by design. You should strongly consider using an XML writer if you want to include content with such types in your feed.
+The XML representation of the feed is encoded in C<us-ascii> only, characters outside this repertoire are encoded as decimal numeric character references, e.g. C<&#12345;>. This makes output files robust against misconfigured webservers that produce wrong headers. As this module does not depend on an external XML writer, but uses a minimal serialiser internally, it also helps reduce its complexity. Encoding should not matter; feed consuming software will just do the right thing. But sometimes it is convenient to be able to read the XML source without the confusing entities. In that case, filter it through an external tool for pretty-printing, e.g. C<xmllint --format --encode utf-8>, or programmatically through an XML library, e.g. L<XML::LibXML::Document/"setEncoding">.
+
 If you find bugs or you have feature requests, please report them to L<mailto:bug-xml-atom-simplefeed@rt.cpan.org>, or through the web interface at L<http://rt.cpan.org>.
-- 
1.6.3
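The patch recommends re-encoding the ASCII output with an external tool or an XML library. As an illustration of that idea, the sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in for `xmllint` or XML::LibXML; it is not what the patch itself prescribes. The one-element document is made up, and a real Atom feed with namespaces would need more care on output.

```python
import xml.etree.ElementTree as ET

# ASCII-only source with numeric character references, as the module emits.
ascii_source = b"<title>&#945;&#946;&#947;</title>"

root = ET.fromstring(ascii_source)

# Re-serialise as UTF-8: the references become literal characters.
utf8_output = ET.tostring(root, encoding="utf-8")

# The default serialisation is us-ascii, which brings the references back.
ascii_output = ET.tostring(root)

print(utf8_output.decode("utf-8"))
print(ascii_output.decode("ascii"))
```

Both serialisations carry the same text; only the readability of the source differs.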
MIME-Version: 1.0
In-Reply-To: <rt-3.6.HEAD-13950-1245773102-615.19722-0-0 [...] rt.cpan.org>
X-Mailer: MIME-tools 5.427 (Entity 5.427)
Content-Disposition: inline
Charset: utf8
References: <rt-3.6.HEAD-13950-1245773102-615.19722-0-0 [...] rt.cpan.org>
Content-Type: text/plain
Message-ID: <rt-3.6.HEAD-13950-1245780586-712.19722-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
Content-Length: 3046
> Attached documentation patch clarifies the findings about > encoding, I reused some wording from above. Therein I also > advise to delegate output transformation to specialised tools. > Hopefully this is enough to close this bug.
Thanks. I’m not sure whether I want to apply this, though. Since the time when this ticket was filed, I have learned quite a few things and changed my position on others. The plan is now to make the charset configurable. I have long wanted to refactor the internals, as they currently produce fragments of XML as you call methods, which are glued together in the end. This approach makes the internals very inflexible. As part of this, I have accepted the reality that there are no good XML emitter modules on CPAN, and taken it upon myself to write one, whose API closely follows HTML::Tiny. Once that is done, SimpleFeed is due for a complete (though incremental) overhaul. And at that time, this ticket will finally be fully addressed.
> > how do I react if the caller gives me several non-Unicode > > strings in different encodings?
> > You cannot outsmart people who are so terminally confused. No > one else tries to. The documentation should make explicit that > this module accepts Text strings (in perlunitut jargon) and > DTRT with them. If someone wants to ignore that, then just let > him: garbage in, garbage out.
Yes. I understand that now.
> > How do I *detect* it?
> > »Oh, that way madness lies; let me shun that.« The best > heuristic unsurprisingly comes from NSUniversalDetector > <http://www.mozilla.org/projects/intl/detectorsrc.html>, > <http://search.cpan.org/dist/Encode-Detect/>, but IMO this has > no place in X::A::SF.
Oh no. I was not trying to actually do something useful with those strings; I was just wondering if it was possible to detect such an error and throw an exception or something. But I have since learned that strings in Perl are completely untyped – that there isn’t even a distinction between text strings and octet strings (like a naïve understanding of the UTF8 flag suggests). So it is in fact entirely impossible to determine the semantics of a string by examining the string. Hence all I can do is as you say: document that the module expects text strings for input and produces an octet sequence as output.
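A small illustration of why inspection cannot settle the question. The sketch is in Python, which unlike Perl separates `bytes` from `str` at the type level, so it only demonstrates the underlying ambiguity of an octet sequence, not Perl's string model. The variable names are made up.

```python
octets = "α".encode("utf-8")  # two octets: 0xCE 0xB1

# The same two octets decode without error under several encodings,
# each time yielding a different but perfectly valid string.
as_utf8 = octets.decode("utf-8")
as_latin1 = octets.decode("latin-1")
as_greek = octets.decode("iso-8859-7")

for label, value in [("utf-8", as_utf8),
                     ("latin-1", as_latin1),
                     ("iso-8859-7", as_greek)]:
    print(label, repr(value), len(value))
```

None of the three decodings fails, so nothing in the octets themselves says which one the caller meant. Heuristic detectors can only guess from statistics, which is why documenting the expected input is the workable option.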
> > people publishing in Asian scripts f.ex. will very likely > > need to use UTF-16.
> > No, my observation is that on the web, national encodings are > the rule: GB18030 (often misdeclared as GB2312), Shift-JIS, > Big5... Indic scripts standardised on UTF-8 due to their > late-comer status.
Aha. Well, I was not making an observation, really. The fact is that XML parsers are not required to support any of those national encodings (not even Latin-1, I think); but they are required to support the various UTF variants, so in that sense UTF-16 is the conservative option. Anyway, this doesn’t actually matter for any of the points at hand. What matters is that I finally have a plan for how I want to proceed with the module.
MIME-Version: 1.0
X-Mailer: MIME-tools 5.504 (Entity 5.504)
Content-Disposition: inline
X-RT-Interface: Web
Content-Type: text/plain; charset="utf-8"
Message-ID: <rt-4.0.18-18087-1442925693-61.19722-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
X-RT-Encrypt: 0
X-RT-Sign: 0
Content-Length: 154
It’s y’all’s lucky day: this is now *finally* fixed. Please find release 0.9000 on your local CPAN mirror once it’s there. It only took 10 years!

