This queue is for tickets about the Text-Unidecode CPAN distribution.

Report information
The Basics
Id:
87119
Status:
open
Priority:
Low/Low

People
Owner:
sburke [...] cpan.org
Requestors:
harisekhon [...] gmail.com
Cc:
AdminCc:

BugTracker
Severity:
(no value)
Broken in:
(no value)
Fixed in:
(no value)



Subject: Characters converting to "a" character instead of representative character or empty string otherwise
Date: Sun, 21 Jul 2013 13:53:54 +0100
To: bug-Text-Unidecode@rt.cpan.org
From: Hari Sekhon <harisekhon@gmail.com>
Hi Sean,

Some annoying characters seem to convert to the character "a" instead of the correct character, or just nothing if they aren't representable in ASCII. For example:

?~@~]\?~@~]

which appears on a web page as "\" but gets converted to a\a. In this case, double quote backslash double quote is representable in ASCII. I find this happening a lot with space dash space copied from websites as well.

Thanks

Hari Sekhon
> Some annoying characters seem to convert to the character "a" instead
> of the correct character or just nothing if they aren't representable
> in ASCII. For example:
>
> ?~@~]\?~@~]
>
> which appears on a web page as "\"

Thank you for your bug report! But hm, I can't reproduce the error. Can you give me a short Perl program that demonstrates the problem? I'm suspecting this is a problem to do with encodings.
Subject: Re: [rt.cpan.org #87119] Characters converting to "a" character instead of representative character or empty string otherwise
Date: Sun, 21 Jul 2013 14:32:22 +0100
To: bug-Text-Unidecode@rt.cpan.org
From: Hari Sekhon <harisekhon@gmail.com>
Hi Sean,

See attached unidecode_example.pl, where I copy/pasted the string straight into a variable and called unidecode on the variable.

Thanks

Hari

On 21 July 2013 14:17, Sean M. Burke via RT <bug-Text-Unidecode@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=87119 >
>
>> Some annoying characters seem to convert to the character "a" instead
>> of the correct character or just nothing if they aren't representable
>> in ASCII. For example:
>>
>> ?~@~]\?~@~]
>>
>> which appears on a web page as "\"
>
> Thank you for your bug report!
> But hm, I can't reproduce the error. Can you give me a short Perl program that demonstrates the problem?
> I'm suspecting this is a problem to do with encodings.

Message body is not shown because sender requested not to inline it.

Ah, this is a thing where the string looks like utf8 to you but is flat bytes to Perl. Add this line to your code:

    print "It is ", length($str), " characters long.\n";

And it'll say

    It is 33 characters long.

But use this program:

    #!/usr/bin/perl
    use Text::Unidecode;

    # copied from a website where it appears as value="\"@timestamp\":\""
    my $str = 'value=”\”@timestamp\”:\”"';
    utf8::decode($str);
    binmode(STDOUT, ":utf8") || die "WHUT $!";
    # Read perldoc:
    #   perlunitut, perluniintro, perlrun, bytes, perlunicode perluni
    # where there's explanations of perl -CL and other fun stuff
    # that might, or might not, be more DWIM than having to
    # call utf8::decode as above.

    print 'string as it appears on website : value="\"@timestamp\":\""' . "\n";
    print "raw string as copy/pasted in Mac terminal: $str\n";
    print "It is ", length($str), " characters long.\n";
    print "string returned by unidecode() : " . unidecode($str) . "\n";

And that works, and it says:

    It is 25 characters long.
    string returned by unidecode() : value="\"@timestamp\":\""

The "a"s were coming from the fact that the byte values for the ” you have are e2 80 9d. Without decoding, Perl treats each byte as its own character. Now, 80 and 9d on their own are just control characters in Unicode, so each of them comes out as an empty string, but e2 is "â" ...which Unidecode turns into "a", and that's why it looks like Unidecode is turning a ” character into an a character.

BTW, in mystery cases like this, I often throw in a thing like this to make sure that what I consider characters and what Perl considers characters are syncing up, or not:

    foreach my $char (split '', $str) {
        printf "\tChar %0x : \"%s\" => u:\"%s\"\n", ord($char), $char, unidecode($char);
    }

Am I making sense? I often explain things poorly and can't tell. And "perldoc utf8" sometimes leaves me more confused than before I read it! I often just go thru the various functions and call one or the other until I get whichever one does the job... and then I see that its documentation *now* (in 20/20 hindsight) makes perfect sense. OH UNICODE!
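
For anyone who wants to see the byte/character mismatch without installing anything, here is a minimal sketch that uses only the core Encode module. It is an illustration added for this ticket, not the attached unidecode_example.pl: the variable names are made up, and the quote mark is written as hex escapes so the script itself stays pure ASCII.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The three UTF-8 bytes of U+201D (RIGHT DOUBLE QUOTATION MARK),
# written as escapes so this file contains no non-ASCII characters.
my $bytes = "\xE2\x80\x9D";

# Undecoded, Perl sees three separate one-byte characters.
my @raw = map { sprintf "U+%04X", ord($_) } split //, $bytes;

# Decoded from UTF-8, the same bytes become one character.
my $chars   = decode('UTF-8', $bytes);
my @decoded = map { sprintf "U+%04X", ord($_) } split //, $chars;

printf "undecoded: length %d (%s)\n", length($bytes), join(' ', @raw);
printf "decoded:   length %d (%s)\n", length($chars), join(' ', @decoded);
```

The undecoded string reports three characters (U+00E2, U+0080, U+009D), and U+00E2 is the "â" that ends up as "a". The decoded string reports a single U+201D, which is what unidecode() needs to see in order to return a plain double quote.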

