This queue is for tickets about the Text-Unidecode CPAN distribution.

Report information

The Basics
Id: 87119
Status: open
Priority: Low/Low

People
Owner: sburke [...] cpan.org
Requestors: harisekhon [...] gmail.com
Cc:
AdminCc:

BugTracker
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



From harisekhon@gmail.com Sun Jul 21 08:54:47 2013
From: Hari Sekhon <harisekhon@gmail.com>
To: bug-Text-Unidecode@rt.cpan.org
Date: Sun, 21 Jul 2013 13:53:54 +0100
Subject: Characters converting to "a" character instead of representative character or empty string otherwise
Message-ID: <CAMc9hMYTXGvVWD+sUvwnXed0H6hHS7SEm1dTPswTkEkiiF3QuA@mail.gmail.com>
content-type: text/plain; charset="utf-8"
X-RT-Original-Encoding: iso-8859-1

Hi Sean,

Some annoying characters seem to convert to the character "a" instead
of the correct character, or just nothing if they aren't representable
in ASCII. For example:

”\”

which appears on a web page as "\" but gets converted to a\a. In this
case, double quote, backslash, double quote is representable in ASCII.

I find this happening a lot with space dash space copied from websites
as well.

Thanks

Hari Sekhon

In-Reply-To: <CAMc9hMYTXGvVWD+sUvwnXed0H6hHS7SEm1dTPswTkEkiiF3QuA@mail.gmail.com>
Message-ID: <rt-4.0.13-11555-1374412664-109.87119-0-0@rt.cpan.org>
Content-Type: text/plain; charset="utf-8"

> Some annoying characters seem to convert to the character "a" instead
> of the correct character or just nothing if they aren't representable
> in ASCII. For example:
>
> ”\”
>
> which appears on a web page as "\"

Thank you for your bug report! But hm, I can't reproduce the error. Can you give me a short Perl program that demonstrates the problem? I'm suspecting this is a problem to do with encodings.

From harisekhon@gmail.com Sun Jul 21 09:33:13 2013
From: Hari Sekhon <harisekhon@gmail.com>
To: bug-Text-Unidecode@rt.cpan.org
Date: Sun, 21 Jul 2013 14:32:22 +0100
Subject: Re: [rt.cpan.org #87119] Characters converting to "a" character instead of representative character or empty string otherwise
Message-ID: <CAMc9hMYngCLoJ=hjBV7j-YJuynvEaNZs7yXzCG568jDn8UYjvw@mail.gmail.com>
In-Reply-To: <rt-4.0.13-11555-1374412664-843.87119-6-0@rt.cpan.org>
RT-Message-ID: <rt-4.0.13-8279-1374413593-1812.87119-0-0@rt.cpan.org>

Hi Sean,

See attached unidecode_example.pl, where I copy/pasted the string
straight into a variable and called unidecode on the variable.

Thanks

Hari

On 21 July 2013 14:17, Sean M. Burke via RT <bug-Text-Unidecode@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=87119 >
>
>> Some annoying characters seem to convert to the character "a" instead
>> of the correct character or just nothing if they aren't representable
>> in ASCII. For example:
>>
>> ”\”
>>
>> which appears on a web page as "\"
>
> Thank you for your bug report!
> But hm, I can't reproduce the error. Can you give me a short Perl program that demonstrates the problem?
> I'm suspecting this is a problem to do with encodings.
>

Content-Type: application/octet-stream; name="unidecode_example.pl"
Content-Disposition: attachment; filename="unidecode_example.pl"

Message body is not shown because sender requested not to inline it.

In-Reply-To: <rt-4.0.13-8279-1374413593-1812.87119-0-0@rt.cpan.org>
Message-ID: <rt-4.0.16-8128-1375361317-1830.87119-0-0@rt.cpan.org>
Content-Type: text/plain; charset="utf-8"

Ah, this is a thing where the string looks like utf8 to you but is flat bytes to Perl. Add this line to your code:

    print "It is ", length($str), " characters long.\n";

And it'll say

    It is 33 characters long.

But use this program:

    #!/usr/bin/perl
    use Text::Unidecode;

    # copied from a website where it appears as value="\"@timestamp\":\""
    my $str = 'value=”\”@timestamp\”:\”"';

    utf8::decode($str);
    binmode(STDOUT, ":utf8") || die "WHUT $!";

    # Read perldoc:
    #   perlunitut, perluniintro, perlrun, bytes, perlunicode perluni
    # where there's explanations of perl -CL and other fun stuff
    # that might, or might not, be more DWIM than having to
    # call utf8::decode as above.

    print 'string as it appears on website : value="\"@timestamp\":\""' . "\n";
    print "raw string as copy/pasted in Mac terminal: $str\n";
    print "It is ", length($str), " characters long.\n";
    print "string returned by unidecode() : " . unidecode($str) . "\n";

And that works, and it says:

    It is 25 characters long.
    string returned by unidecode() : value="\"@timestamp\":\""

The "a"s were coming from the fact that the byte values for the ” you have are e2 80 9d. Without the decode, Perl treats each byte as its own character. Now, 80 and 9d are C1 control characters (U+0080 and U+009D), not printable text, so each of them becomes an empty string, but e2 is "â" ...which Unidecode turns into "a", and that's why it looks like Unidecode is turning a ” character into an a character.

BTW, in mystery cases like this, I often throw in a thing like this to make sure that what I consider characters and what Perl considers characters are syncing up, or not:

    foreach my $char (split '', $str) {
        printf "\tChar %0x : \"%s\" => u:\"%s\"\n", ord($char), $char, unidecode($char);
    }

Am I making sense? I often explain things poorly and can't tell. And "perldoc utf8" sometimes leaves me more confused than before I read it! I often just go thru the various functions and call one or the other until I get whichever one does the job... and then I see that its documentation *now* (in 20/20 hindsight) makes perfect sense.

OH UNICODE!
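
The byte-versus-character mismatch described above can be checked in isolation. What follows is only a minimal sketch, assuming nothing beyond core Perl: it does not load Text::Unidecode, it writes the quote's UTF-8 bytes as explicit escapes so the result does not depend on how the file is saved, and the helper name code_points is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The UTF-8 encoding of U+201D (right double quotation mark), written as
# explicit byte escapes so the example does not depend on the file's encoding.
my $bytes = "\xe2\x80\x9d";

# Hypothetical helper: list the code points Perl sees in a string, in hex.
sub code_points {
    my ($s) = @_;
    return join ' ', map { sprintf '%x', ord } split //, $s;
}

# Undecoded, Perl counts each byte as its own character.
my $before = code_points($bytes);

# utf8::decode works in place and returns true if the bytes were valid UTF-8.
my $chars = $bytes;
my $ok    = utf8::decode($chars);
my $after = code_points($chars);

printf "undecoded: %d character(s): %s\n", length($bytes), $before;
printf "decoded  : %d character(s): %s\n", length($chars), $after;
```

Undecoded, the string is three characters (e2 80 9d), and the first of them is the "â" that ends up as "a". After utf8::decode it is the single character 201d.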

