This queue is for tickets about the JSON CPAN distribution.

Report information

The Basics
Id: 86244
Status: open
Priority: Low/Low
Queue:

People
Owner: Nobody in particular
Requestors: adolf.szabo [...] gmail.com
Cc:
AdminCc:

BugTracker
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: utf8 flag wrong
Date: Tue, 18 Jun 2013 22:24:57 +0200
To: bug-JSON@rt.cpan.org
From: Adolf Szabo <adolf.szabo@gmail.com>
Hi,

My problem is that JSON->new()->decode($str) always sets the utf8 flag to ON for each string in the hash, no matter what I specify (ascii, latin1, utf8(0) or utf8(1)). This is not only an annoyance, but I think a bug too. Let me give you an example:

Here is a sample JSON file, with $h->{TITL} containing the string őa. We will focus on the second character, the ASCII 'a', for now:

aszabo@mepc:/tmp$ hexdump -C test.txt
00000000  7b 22 54 49 54 4c 22 3a  22 c5 91 61 22 7d 0a     |{"TITL":"..a"}.|
0000000f
aszabo@mepc:/tmp$ cat a.pl
use strict;
use warnings;
use Encode;
use JSON;

local $/=undef;
my $str=<STDIN>;

my $h=JSON->new()->utf8(1)->decode($str);
#my $h=JSON->new()->utf8(0)->decode($str);
my $c=substr($h->{TITL},1,1);
printf("%s [%d], utf8 flag is %s\n",$c,ord($c),Encode::is_utf8($c)?'ON':'OFF');

exit;
aszabo@mepc:/tmp$ cat test.txt | perl a.pl
a [97], utf8 flag is ON

This is as expected so far. Now I enable the utf8(0) line instead, and repeat:

aszabo@mepc:/tmp$ cat test.txt | perl a.pl
� [145], utf8 flag is ON

This is wrong: the utf8 flag is set to ON, yet $h->{TITL} is not in Perl's internal encoding format, since the second character should be 'a', not the second byte of the first character. This utf8 flag causes problems later on when I use regexps on the strings of the hash, etc.

Please let me know what you think.

Thx, Adolf


Makamaka Hannyaharamitu (via RT) replied:

This is not a bug.

First, because you set utf8(0), your input data is regarded as bytes.
"\x{c5}\x{91}\x{61}" => the character you dump is \x{91}
The result is expected.

Second, you shouldn't look at the UTF8 flag. The Unicode handling in JSON (JSON::XS/PP) depends on Perl itself. The second result consists of Latin-1 characters even though the UTF8 flag is on.

Please see
http://perldoc.perl.org/perlunifaq.html#What-is-%22the-UTF8-flag%22%3f

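The difference between the two settings is easier to see by printing code points than by inspecting the UTF8 flag. The following is a minimal sketch, not the reporter's script: it assumes the core module JSON::PP behaves like the JSON front end for this purpose, and `codepoints` is a hypothetical helper introduced only for illustration.

```perl
use strict;
use warnings;
use JSON::PP;    # assumption: stands in for the JSON front end

# The bytes from the hexdump: {"TITL":"<c5 91>a"}
my $bytes = qq({"TITL":"\x{c5}\x{91}a"});

# Hypothetical helper: code points of a string as hex, space separated.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf '%x', ord($_) } split //, $s;
}

# utf8(1): the input is treated as UTF-8 encoded bytes and decoded to characters.
my $as_text = JSON::PP->new->utf8(1)->decode($bytes)->{TITL};

# utf8(0): the input is treated as already-decoded characters, one per byte here.
my $as_chars = JSON::PP->new->utf8(0)->decode($bytes)->{TITL};

print "utf8(1): ", codepoints($as_text),  "\n";    # 151 61
print "utf8(0): ", codepoints($as_chars), "\n";    # c5 91 61
```

With utf8(1) the second character is 'a' (0x61); with utf8(0) it is 0x91, which matches the 145 reported above.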
Subject: Re: [rt.cpan.org #86244] utf8 flag wrong
Date: Thu, 20 Jun 2013 05:56:18 +0200
To: bug-JSON@rt.cpan.org
From: Adolf Szabo <adolf.szabo@gmail.com>
Usually I do not mess with Perl's internals, unless I face a problem. Here is my specific problem:

The character in question is the Polish ą  (\xC4 \x85). When this is the last character of a string and I execute

$str=~s/\s+\z//;

nothing is removed (as expected). But after decoding with the JSON lib, the second byte of the char is removed, resulting in a broken UTF-8 char:

my $h=JSON->new()->decode($s);
$h->{TITL}=~s/\s+\z//;

aszabo@mepc:/tmp$ hexdump -C test.txt
00000000  7b 22 54 49 54 4c 22 3a  22 c4 85 22 7d 0a        |{"TITL":".."}.|


Please explain what I did wrong, then.

Thx
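The symptom can be reproduced without any JSON module. The sketch below assumes a Perl running with default regexp semantics (no `use feature 'unicode_strings'`, no locale, no /a or /u modifier); `strips_trailing_85` is a hypothetical helper. It shows that the same two characters behave differently under `\s` depending on whether the string carries the UTF8 flag.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if s/\s+\z// shortens a copy of $s, else 0.
sub strips_trailing_85 {
    my ($s) = @_;
    (my $copy = $s) =~ s/\s+\z//;
    return length($copy) < length($s) ? 1 : 0;
}

my $bytes    = "\x{c4}\x{85}";    # two characters, both below 0x100, flag off
my $upgraded = $bytes;
utf8::upgrade($upgraded);         # same two characters, UTF8 flag now on

print "flag off: ", strips_trailing_85($bytes),    "\n";    # 0
print "flag on:  ", strips_trailing_85($upgraded), "\n";    # 1
```

With the flag on, Perl applies Unicode rules, under which U+0085 (NEL) counts as whitespace, so the trailing character is removed.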




Makamaka Hannyaharamitu (via RT) replied:

I got your point (U+0085 matches \s).

I said that utf8(0) makes the decoder expect bytes, but that was a mistake. As the documentation says, utf8(0) expects Unicode characters.

http://search.cpan.org/~makamaka/JSON-2.59/lib/JSON.pm#utf8

So the resolution is to set utf8(1). Does that answer your question?

Subject: Re: [rt.cpan.org #86244] utf8 flag wrong
Date: Thu, 20 Jun 2013 09:58:27 +0200
To: bug-JSON@rt.cpan.org
From: Adolf Szabo <adolf.szabo@gmail.com>
Yes, U+0085 does indeed appear to be a whitespace character. From Perl 5.14 on I can use the /a modifier to make it work:

$h->{TITL}=~s/\s+\z//a;

However, right now I'm stuck with 5.8.8 and a bunch of legacy code that was designed before UTF-8 became widespread. I also tried using utf8(1) as you suggest, but then for each string in the hash I need to call

$h->{TITL}=Encode::encode_utf8($h->{TITL})

to let the rest of the code work, or I get 'Wide character in ...' warnings everywhere.

So my question is why the JSON lib does not provide a way to get strings back in the plain old way, something like

$h=JSON->new()->latin1(1)->decode($str);

that would return strings in $h as one-byte==one-char

Thx, Adolf
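The per-string Encode::encode_utf8 call described above can be applied to a whole decoded structure in one pass. This is only a sketch of that workaround: `to_utf8_bytes` is a hypothetical helper, not part of the JSON distribution, and JSON::PP is assumed here to behave like the JSON front end.

```perl
use strict;
use warnings;
use Encode ();
use JSON::PP;    # assumption: stands in for the JSON front end

# Hypothetical helper: return a copy of a decoded structure in which every
# string (hash keys included) is replaced by its UTF-8 byte encoding.
# Caveat: plain numbers come back as strings.
sub to_utf8_bytes {
    my ($data) = @_;
    if (ref $data eq 'HASH') {
        my %out;
        $out{ Encode::encode_utf8($_) } = to_utf8_bytes($data->{$_})
            for keys %$data;
        return \%out;
    }
    if (ref $data eq 'ARRAY') {
        return [ map { to_utf8_bytes($_) } @$data ];
    }
    # Leave undef (JSON null) and other references (JSON booleans) alone.
    return $data if !defined $data || ref $data;
    return Encode::encode_utf8($data);
}

my $json_bytes = qq({"TITL":"\x{c4}\x{85}"});    # UTF-8 bytes of U+0105
my $h = to_utf8_bytes(JSON::PP->new->utf8(1)->decode($json_bytes));

(my $title = $h->{TITL}) =~ s/\s+\z//;
my $after = join ' ', map { sprintf '%x', ord($_) } split //, $title;
print "$after\n";    # c4 85: both bytes survive the trim
```

The strings returned this way are byte strings again (one byte per character), so legacy code that expects bytes sees the same data it would have read from the file.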







