
This queue is for tickets about the Unicode-CaseFold CPAN distribution.

Report information
The Basics
Id: 77122
Status: rejected
Priority: 0/
Queue: Unicode-CaseFold

People
Owner: Nobody in particular
Requestors: RSAVAGE [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 0.02
Fixed in: (no value)



Subject: Output of fc kills Encode::decode
Hi

I'm processing subcountry names in Estonia, from:
http://en.wikipedia.org/wiki/ISO_3166-2:EE

I got to that page from the list of all countries:
http://en.wikipedia.org/wiki/ISO_3166-2

Code:

    for my $element (@$table)
    {
        $i++;

        $self -> log(debug => "code: $$element{code}");
        $self -> log(debug => "name: $$element{name}");
        $self -> log(debug => "decode: " . decode('utf8', $$element{name}));
        $self -> log(debug => "decode fc: " . decode('utf8', fc $$element{name}));

        $sth -> execute($country_id, $$element{code}, decode('utf8', fc $$element{name}), decode('utf8', $$element{name}), $i);
    }

Output:

    debug: code: EE-37.
    debug: name: Harjumaa.
    debug: decode: Harjumaa.
    debug: decode fc: harjumaa.
    debug: code: EE-39.
    debug: name: Hiiumaa.
    debug: decode: Hiiumaa.
    debug: decode fc: hiiumaa.
    debug: code: EE-44.
    debug: name: Ida-Virumaa.
    debug: decode: Ida-Virumaa.
    debug: decode fc: ida-virumaa.
    debug: code: EE-49.
    debug: name: Jõgevamaa.
    debug: decode: Jõgevamaa.
    Cannot decode string with wide characters at /home/ron/perl5/perlbrew/perls/perl-5.14.2/lib/5.14.2/x86_64-linux-thread-multi/Encode.pm line 176.

So, the call to fc returns something unacceptable to decode, when the name is Jõgevamaa.

I rigged the code to skip Estonia, and the code works in all other countries and their subcountries.

I then rigged the code to skip Jõgevamaa, and the next place it dies is:

    debug: code: EE-65.
    debug: name: Põlvamaa.
    debug: decode: Põlvamaa.
    Cannot decode string with wide characters at /home/ron/perl5/perlbrew/perls/perl-5.14.2/lib/5.14.2/x86_64-linux-thread-multi/Encode.pm line 176.

I.e. the names corresponding to the codes EE-51, EE-57 and EE-59 are all handled ok.

I rigged it to skip Põlvamaa, and the next place it dies is:

    debug: code: EE-86.
    debug: name: Võrumaa.
    debug: decode: Võrumaa.
    Cannot decode string with wide characters at /home/ron/perl5/perlbrew/perls/perl-5.14.2/lib/5.14.2/x86_64-linux-thread-multi/Encode.pm line 176.

So, each problem is 'o' with a tilde above it.

When I rigged the code to skip these 3 cases, everything worked.

This is Debian 6, 64 bit.
Perl V 5.14.2.
Encode V 2.44.
Unicode::CaseFold V 0.02.
Unicode::Normalize V 1.14.

Installing Perl V 5.15.9... Versions of Encode, Unicode::CaseFold, Unicode::Normalize are the same. Same problem :-(.

Cheers
Ron
On Thu May 10 23:58:19 2012, RSAVAGE wrote:
> $self -> log(debug => "decode: " . decode('utf8', $$element{name}));
> $self -> log(debug => "decode fc: " . decode('utf8', fc $$element{name}));

This isn't a bug in Unicode::CaseFold, except possibly the lack of a better error message (I will see what perl 5.16 does, and try to imitate it). In any case, decode('utf8', fc $bytes) is invalid. You should be writing fc decode('utf8', $bytes) instead, as fc works on character strings, not byte strings.
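
The ordering described in this reply can be sketched as follows. This is a minimal illustration, not code from the ticket: it assumes Perl 5.16 or later and uses the core fc (enabled with `use feature 'fc'`) as a stand-in for Unicode::CaseFold's fc, and the variable names are hypothetical. It also shows one plausible reason why only the names containing 'õ' fail: in UTF-8, 'õ' is the byte pair C3 B5, and when those bytes are case-folded as if they were characters, B5 (MICRO SIGN) folds to U+03BC (GREEK SMALL LETTER MU), a wide character that decode then refuses.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # core fc (Perl 5.16+), assumed stand-in for Unicode::CaseFold::fc
use Encode qw(decode encode);

# UTF-8 encoded bytes for "Jõgevamaa"; 'õ' (U+00F5) is the byte pair C3 B5.
my $bytes = "J\xC3\xB5gevamaa";

# Correct order: decode bytes into a character string, then case-fold it.
my $chars  = decode('UTF-8', $bytes);
my $folded = fc($chars);

# Wrong order: folding the raw bytes treats \xB5 as MICRO SIGN,
# whose case fold is U+03BC, so the result contains a wide character.
my $misfolded = fc($bytes);

# Decoding the misfolded string dies; the error is captured rather than propagated.
# A copy is passed because decode may modify its argument in place.
my $error;
eval { decode('UTF-8', my $copy = $misfolded); 1 } or $error = $@;

# Encode again only at the output boundary.
print encode('UTF-8', $folded), "\n";
print "wrong order failed: $error" if $error;
```

Applied to the loop in the original report, this would mean calling fc on decode('utf8', $$element{name}) rather than decoding the output of fc.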
From ron [...] savage.net.au Sun May 13 19:54:23 2012
From: Ron Savage <ron [...] savage.net.au>
Hi Andrew

On 14/05/12 01:48, Andrew Rodland via RT wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=77122>
>
> On Thu May 10 23:58:19 2012, RSAVAGE wrote:
>> $self -> log(debug => "decode: " . decode('utf8', $$element{name}));
>> $self -> log(debug => "decode fc: " . decode('utf8', fc $$element{name}));
>
> This isn't a bug in Unicode::CaseFold, except possibly the lack of a
> better error message (I will see what perl 5.16 does, and try to imitate
> it). In any case, decode('utf8', fc $bytes) is invalid. You should be
> writing fc decode('utf8', $bytes) instead, as fc works on character
> strings, not byte strings.

OK. Thanx for the reply.

-- 
Ron Savage
http://savage.net.au/
Ph: 0421 920 622

