This queue is for tickets about the PPI CPAN distribution.

Report information
The Basics
Id: 15353
Status: resolved
Priority: 0/
Queue: PPI

People
Owner: Nobody in particular
Requestors: chris+rt [...] chrisdolan.net
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 1.103
Fixed in: (no value)



Subject: Failure on Unicode byte order mark

PPI parsing fails if a .pm file starts with the Unicode byte-order mark (BOM -- http://www.unicode.org/faq/utf_bom.html#BOM).

Attached is a simplified Japanese UTF-8 module that uses Locale::Maketext. That file has a BOM that looks like 0xefbbbf, namely the UTF-8 BOM. Note: I gzipped the attachment to prevent RT and/or browsers from mangling the BOM.

If you try to parse that document as follows, you get an error message:

    perl -MPPI::Document -e 'PPI::Document->new("ja.pm")||print"$PPI::Document::errstr\n"'

    Error at line 1, character 0

Perl 5.8.6 handles this file just fine.

-- Chris
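
To check whether a given file is affected, something like the following could be used. It is only a sketch using core Perl, not anything from PPI; the helper names detect_bom and file_bom are made up for illustration. It compares the leading bytes of a file against the well-known BOM signatures.

```perl
use strict;
use warnings;

# Byte signatures for common Unicode BOMs. Longer signatures come first
# because the UTF-32LE BOM begins with the same two bytes as UTF-16LE.
my @BOMS = (
    [ 'UTF-32BE' => "\x00\x00\xFE\xFF" ],
    [ 'UTF-32LE' => "\xFF\xFE\x00\x00" ],
    [ 'UTF-8'    => "\xEF\xBB\xBF"     ],
    [ 'UTF-16BE' => "\xFE\xFF"         ],
    [ 'UTF-16LE' => "\xFF\xFE"         ],
);

# Hypothetical helper: return the encoding name implied by a leading BOM,
# or undef if the byte string does not start with one.
sub detect_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($name, $sig) = @$entry;
        return $name if index($bytes, $sig) == 0;
    }
    return;
}

# Hypothetical helper: read the first four bytes of a file in raw mode
# and classify them with detect_bom().
sub file_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    read $fh, my $head, 4;
    close $fh;
    return detect_bom($head // '');
}
```

For the attached file, file_bom("ja.pm") would be expected to report UTF-8, since the file starts with 0xEF 0xBB 0xBF.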

Attachment: ja.pm.gz (application/x-gzip, 197 bytes)

Date: Fri, 28 Oct 2005 16:49:20 +1000
From: Adam Kennedy <adam [...] phase-n.com>
Subject: Re: [cpan #15353] Failure on Unicode byte order mark

PPI does not support Unicode, only the non-English characters from the Latin-1 character set.

Adam K

[adam@phase-n.com - Fri Oct 28 02:49:47 2005]:
> PPI does not support unicode, only the non-English characters from the
> latin-1 characterset.

Thanks for the clarification, Adam. I've been thinking about this for a couple of days. How about a new token class called PPI::Token::BOM which is a subclass of ::Whitespace? The document would start with its initial state set to ::BOM instead of ::Whitespace. If no BOM was present, it would go on parsing as usual, switching the type to ::Whitespace. In the first version it could accept the UTF-8 BOM and choke on other BOMs.

My reasoning behind this is that most Unicode Perl is only Unicode because the strings contain Unicode. With the exception of the BOM, most UTF-8 documents are PPI-friendly because they use only ASCII outside of strings.

If you think this is a good idea, I'd be happy to write a first-draft patch and test. I've read the code of ::Whitespace, so I do understand the magnitude of this proposed change.

-- Chris
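
As a rough illustration of the proposed state change (a toy sketch only, not PPI's tokenizer and not the eventual patch; toy_tokenize and its token labels are invented for this example), the idea is to look for a BOM once at the very start, emit it as its own token if it is the UTF-8 one, reject other BOMs, and then fall through to the normal starting state:

```perl
use strict;
use warnings;

my $UTF8_BOM = "\xEF\xBB\xBF";

# BOMs for encodings this sketch refuses to handle (UTF-32 and UTF-16).
my @OTHER_BOMS = ("\x00\x00\xFE\xFF", "\xFF\xFE\x00\x00", "\xFE\xFF", "\xFF\xFE");

# Hypothetical toy tokenizer: returns a list of [class, content] pairs.
sub toy_tokenize {
    my ($source) = @_;
    my @tokens;

    # Initial "BOM" state: runs exactly once, before anything else.
    for my $sig (@OTHER_BOMS) {
        die "Unsupported BOM\n" if index($source, $sig) == 0;
    }
    if (index($source, $UTF8_BOM) == 0) {
        push @tokens, [ 'BOM', $UTF8_BOM ];
        $source = substr($source, length $UTF8_BOM);
    }

    # From here on, behave as if the document had started in the usual
    # state. Stand-in for real tokenizing: split the rest into runs of
    # whitespace and non-whitespace.
    while ($source =~ /\G(\s+|\S+)/g) {
        my $chunk = $1;
        my $class = $chunk =~ /^\s/ ? 'Whitespace' : 'Other';
        push @tokens, [ $class, $chunk ];
    }
    return @tokens;
}
```

A document without a BOM produces exactly the same tokens as before, which is the point of starting in a BOM state and immediately handing over to the normal one.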

Date: Tue, 01 Nov 2005 03:31:07 +1100
From: Adam Kennedy <adam [...] phase-n.com>
Subject: Re: [cpan #15353] Failure on Unicode byte order mark

The main problem here is that there's not much point in supporting one particular character from Unicode if we don't support a more complete subset... or is there?

I'm afraid some of the specifics of the Unicode issues escape me, but that's my main issue... what's the point of just adding BOM?

Adam K

From: cdolan [...] cpan.org

[adam@phase-n.com - Mon Oct 31 11:32:11 2005]:
> The main problem here is that there's not much point in supporting one
> particular character from unicode if we don't support a more complete
> subset... or is there?
>
> I'm afraid some of the specifics of the unicode issues escape me, but
> that's my main issue... what's the point of just adding BOM?
>
> Adam K

Hi Adam,

BOM support would make Locale::Maketext-based modules parseable. Those contain many L10N strings, but minimal Perl. The .pm file needs to be non-Latin-1 to support the strings, and many editors add the BOM automatically.

Another class of Unicode docs that are nearly PPI-parseable is those with Unicode in the POD but only ASCII in the code, for example when the author's name is not representable in ASCII.

Looking at PPI::Token::Pod, PPI::Token::Quote::* and PPI::Token::_QuoteEngine*, it looks like they are already as Unicode-friendly as Perl is, since they only scan for special characters instead of validating every character. So, in the simple case of a UTF-8 document that uses the ASCII subset for all code, BOM support is the sole limiting factor for PPI.

Note that you may not see many UTF-8 docs with BOMs on CPAN because localization is usually relegated to the application, not the libraries. So if there is a lack of BOM errors for PPI on CPAN, that may be a selection bias.

Thanks,
-- Chris

Date: Tue, 01 Nov 2005 04:24:38 +1100
From: Adam Kennedy <adam [...] phase-n.com>
Subject: Re: [cpan #15353] Failure on Unicode byte order mark

Not really, I haven't run the tinderbox in a while, but I purged I10N errors from the tinderbox process.

And yeah, a German guy pointed out that for Latin-1 support it only needs to be supported in POD, comments and the quote engine for strings.

He wrote up the Latin-1 unit test scripts.

If you think that the BOM stuff is the only thing stopping the majority of Unicode, then go ahead and try for a patch to it.

If you want, I can add you to the developer list for the parseperl repository and you can just work it up in a branch on the live module?

Adam K

From: cdolan [...] cpan.org

[adam@phase-n.com - Mon Oct 31 12:25:35 2005]:
> Not really, I haven't run the tinderbox in a while, but I purged I10N
> errors from the tinderbox process.
>
> And yeah, a German guy pointed out that for latin-1 support it only need
> to be supported in POD, comments and the quote engine for strings.
>
> He wrote up the latin-1 unit test scripts.
>
> If you think that the BOM stuff is the only thing stopping the majority
> of Unicode, then go ahead and try for a patch to it.
>
> If you want I can add you to the developer list for the parseperl
> repository and you can just work it up in a branch on the live module?
>
> Adam K

Sounds good to me. For reference, I'm usually chris @ chrisdolan.net. I make no predictions on an ETA for the patch, but I'll try to work on it soon.

Thanks!
-- Chris

Date: Tue, 01 Nov 2005 04:46:31 +1100
From: Adam Kennedy <adam [...] phase-n.com>
Subject: Re: [cpan #15353] Failure on Unicode byte order mark

Timeline is fine. If you contain it in a branch and work at your own pace, however long it takes is totally fine by me.

What is your SourceForge account, and I'll add it to CVS permissions?

Adam K

From: cdolan [...] cpan.org

[adam@phase-n.com - Mon Oct 31 12:47:32 2005]:
> Timeline is fine, if you contain it in a branch and work at your own
> pace, however long it takes is totally fine by me.
>
> What is your SourceForge account, and I'll add it to CVS permissions?
>
> Adam K

I'm chrisdolan @ SF.

-- Chris

Date: Tue, 01 Nov 2005 05:08:51 +1100
From: Adam Kennedy <adam [...] phase-n.com>
Subject: Re: [cpan #15353] Failure on Unicode byte order mark

OK, added. Go for your life.

Adam K

From: cdolan [...] cpan.org

I couldn't stop thinking about this, so I implemented it. I committed it on a CVS branch called "Branch_unicode_support".

I tested my patch using Perl::Critic and it now successfully parses basic UTF-8 files that were failing before. So, if this patch reaches mainline PPI, I consider this bug closed.

Note that my patch unexpectedly caused one test to pass: UTF-8 characters in the middle of barewords. UTF-8 characters at the beginning of barewords still fail. I think that's because \w is already Unicode-friendly in PPI::Token::Word.

-- Chris
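
For anyone wondering what "\w is already Unicode-friendly" means in practice, here is a small core-Perl demonstration. It only shows how Perl's \w treats decoded characters versus raw UTF-8 bytes; it says nothing definite about how PPI's tokenizer uses \w, and the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three UTF-8 bytes for HIRAGANA LETTER A (U+3042), followed by ASCII "bc".
my $bytes = "\xE3\x81\x82bc";

# The same text as a decoded character string (three characters).
my $chars = decode('UTF-8', $bytes);

# Decoded text: U+3042 is a letter, so the whole string is a single \w run.
my $chars_match = $chars =~ /^\w+$/ ? 1 : 0;

# Raw bytes: under Perl's default matching rules (no locale, no /u),
# bytes above 0x7F in a byte string are not word characters.
my $bytes_match = $bytes =~ /^\w+$/ ? 1 : 0;

print "decoded: $chars_match, raw bytes: $bytes_match\n";
```

So whether a non-ASCII bareword is seen as word characters depends on whether the source has been decoded before matching.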

Hi Adam,

Just a reminder that the BOM code is still in the CVS branch mentioned below. In light of the Unicode improvements that A.Tang has been pushing, I think the BOM code has more relevance.

Best wishes,
Chris

[CLOTHO - Mon Oct 31 15:14:03 2005]:
> I couldn't stop thinking about this, so I implemented it. I committed
> it on a CVS branch called "Branch_unicode_support".
>
> I tested my patch using Perl::Critic and it now succeeds to parse basic
> UTF-8 files that were failing before. So, if this patch reaches
> mainline PPI, I consider this bug closed.
>
> Note that my patch unexpectedly caused one test to pass: UTF-8
> characters in the middle of barewords. UTF-8 characters at the
> beginning of barewords still fails. I think that's because \w is
> already Unicode-friendly in PPI::Token::Words
>
> -- Chris

Date: Tue, 13 Dec 2005 23:09:28 +1100
From: Adam Kennedy <adam [...] phase-n.com>
Subject: Re: [cpan #15353] Failure on Unicode byte order mark

According to Audrey (formerly Autrijus, as of 1 week ago) the code didn't work... so she added her Unicode stuff to the main branch rather than to the branch.

We might want to talk in Freenode #perl6 about this? Any comments?

Adam K
Confirming this case appears to be resolved.

