
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 87267
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: MARKF [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in:
  • 2.40
  • 2.41
  • 2.42
  • 2.43
  • 2.44
  • 2.45
  • 2.46
  • 2.47
  • 2.48
  • 2.49
  • 2.50
  • 2.51
Fixed in: 2.54



Subject: decode_utf8 doesn't do the same as decode("utf8")
The decode_utf8 doesn't do the same as decode("utf8",...) for all inputs despite the documentation explicitly saying that

    $string = decode_utf8($octets [, CHECK]);
        Equivalent to "$string = decode("utf8", $octets [, CHECK])".

It acts differently when $octets has the UTF-8 flag turned on. decode("utf8",...) treats each character in the string as a byte. decode_utf8 simply returns the string unaltered.

Failing test suite attached.
Subject: decode_utf_bug.t
#!/usr/bin/env perl

use strict;
use warnings;

use Encode;
use Test::More tests => 4;

# decode_utf8(...) and decode('utf8',...) are MEANT TO BE THE SAME
# from the perldoc for Encode:
#
#   $string = decode_utf8($octets [, CHECK]);
#       Equivalent to "$string = decode("utf8", $octets [, CHECK])".

#######
# decode_utf8($bytes)
#######

{
    my $bytes = "test:\x{ee}\x{80}\x{80}";
    my $chars = Encode::decode_utf8($bytes);
    is($chars, "test:\x{e000}", "decode_utf8 without utf-8 flag");
}

{
    my $bytes = "test:\x{ee}\x{80}\x{80}";

    # do something that makes the utf-8 flag turn on without
    # altering the contents of the string
    $bytes .= "\x{2603}";
    chop $bytes;

    my $chars = Encode::decode_utf8($bytes);
    is($chars, "test:\x{e000}", "decode_utf8 with utf-8 flag");
}

#######
# decode("utf8",$bytes)
#######

{
    my $bytes = "test:\x{ee}\x{80}\x{80}";
    my $chars = Encode::decode("utf-8",$bytes);
    is($chars, "test:\x{e000}", "decode('utf8',...) without utf-8 flag");
}

{
    my $bytes = "test:\x{ee}\x{80}\x{80}";

    # do something that makes the utf-8 flag turn on without
    # altering the contents of the string
    $bytes .= "\x{2603}";
    chop $bytes;

    my $chars = Encode::decode("utf-8",$bytes);
    is($chars, "test:\x{e000}", "decode('utf8',...) with utf-8 flag");
}
It's because decode_utf8($bytes) does nothing if $bytes has utf8 flag turned on. And while the document says "equivalent", it does not say "identical". Encode.pm defines decode_utf8 as follows:

sub decode_utf8($;$) {
    my ( $octets, $check ) = @_;
    return $octets if is_utf8($octets);
    return undef unless defined $octets;
    $octets .= '' if ref $octets;
    $check   ||= 0;
    $utf8enc ||= find_encoding('utf8');
    my $string = $utf8enc->decode( $octets, $check );
    $_[0] = $octets if $check and !ref $check and !( $check & LEAVE_SRC() );
    return $string;
}

Dan the Encode Maintainer

(In reply to MARKF's message of Wed Jul 24 15:03:37 2013.)
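The early return on is_utf8 described above can be observed with a short script. This is only a sketch, not part of the ticket: it assumes utf8::upgrade is an acceptable way to turn the UTF-8 flag on without changing the characters (the attached test uses an append-and-chop trick instead), and the variable names are illustrative. Whether the last line reports agreement depends on the installed Encode version; the ticket lists 2.40 through 2.51 as affected.

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8);

# UTF-8 octets for U+E000; a plain literal like this has the UTF-8 flag off.
my $octets = "test:\xee\x80\x80";

# Copy holding the same characters, but stored with the UTF-8 flag on.
my $flagged = $octets;
utf8::upgrade($flagged);

# decode() treats each character of the flagged string as one byte.
my $from_decode = decode( 'utf8', $flagged );

# decode_utf8() may return the input unaltered on affected versions.
my $from_decode_utf8 = decode_utf8($flagged);

printf "flag on input:       %s\n", utf8::is_utf8($flagged) ? 'on' : 'off';
printf "input strings equal: %s\n", $octets eq $flagged ? 'yes' : 'no';
printf "decode length:       %d\n", length $from_decode;
printf "decode_utf8 length:  %d\n", length $from_decode_utf8;
printf "results agree:       %s\n", $from_decode eq $from_decode_utf8 ? 'yes' : 'no';
```

Comparing lengths rather than printing the decoded strings avoids "Wide character" warnings on an output handle without an encoding layer.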
From: victor [...] vsespb.ru
IMHO it's not "equivalent", nor "identical". Maybe "similar", but the difference should be described in the documentation.

Also, encode_utf8 actually acts like encode("utf-8"), while it is described as "Equivalent" too.

(In reply to DANKOGAI's message of Thu Jul 25 07:37:24 2013.)
From: victor [...] vsespb.ru
btw the following example prints different results, depending on $ARGV[0]

===============
use Encode;
use Devel::Peek;
use utf8;

my ($x, undef) = split(' ', decode("UTF-8", "X \xc2\xc6"));

my $s = "\xc2\xb5";

die unless $x eq 'X';
if (1 == $ARGV[0]) {
    $s .= $x;
} else {
    $s .= 'X';
}

Dump decode_utf8("$s");
Dump decode("UTF-8", "$s");
__END__

With ARGV[0] == 1

SV = PV(0x20f87f8) at 0x2013aa0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x201b140 "\303\202\302\265X"\0 [UTF8 "\x{c2}\x{b5}X"]
  CUR = 5
  LEN = 8
SV = PV(0x20f87d8) at 0x2013aa0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x2008400 "\302\265X"\0 [UTF8 "\x{b5}X"]
  CUR = 3
  LEN = 8

With ARGV[0] == 2

SV = PV(0x11e67f8) at 0x1101aa0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x10f7380 "\302\265X"\0 [UTF8 "\x{b5}X"]
  CUR = 3
  LEN = 8
SV = PV(0x1283c68) at 0x1101aa0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x10f6400 "\302\265X"\0 [UTF8 "\x{b5}X"]
  CUR = 3
  LEN = 8

So the missing documentation can cause hidden errors in some circumstances.

(In reply to vsespb's message of Fri Aug 16 21:24:06 2013.)
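The reason the two runs above differ is that appending a UTF-8-flagged string to a byte string turns the flag on for the result, even though the characters compare equal. A minimal sketch of just that step follows; it uses utf8::upgrade as a stand-in for the flagged $x that decode() produced in the example, and the variable names are illustrative rather than taken from the ticket.

```perl
use strict;
use warnings;

# Two octets (the UTF-8 encoding of U+00B5), UTF-8 flag off.
my $plain = "\xc2\xb5";

# A flagged "X", standing in for the $x returned by decode() above.
my $x = "X";
utf8::upgrade($x);

# Concatenating with a flagged string yields a flagged result.
my $joined_flagged = $plain . $x;

# Concatenating with an unflagged literal leaves the flag off.
my $joined_plain = $plain . "X";

printf "with flagged X:   flag %s\n", utf8::is_utf8($joined_flagged) ? 'on' : 'off';
printf "with unflagged X: flag %s\n", utf8::is_utf8($joined_plain)   ? 'on' : 'off';
printf "strings equal:    %s\n", $joined_flagged eq $joined_plain ? 'yes' : 'no';
```

Since the two strings are equal under eq, nothing at the Perl level distinguishes them, which is why a function that branches on the flag can produce the hidden errors mentioned above.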
From: victor [...] vsespb.ru
Equivalent (and Identical) ticket: https://rt.cpan.org/Public/Bug/Display.html?id=61671

(In reply to vsespb's message of Fri Aug 16 21:37:05 2013.)
+1 to get this check eliminated.

Pull request open here: https://github.com/dankogai/p5-encode/pull/11

(In reply to vsespb's message of Fri Aug 16 13:24:06 2013.)
From: victor [...] vsespb.ru
Or, alternative pull request - just document current behaviour: https://github.com/dankogai/p5-encode/pull/10

(In reply to MIYAGAWA's message of Mon Aug 26 06:34:45 2013.)
I have merged

https://github.com/dankogai/p5-encode/pull/11
https://github.com/dankogai/p5-encode/pull/10

Dan the Maintainer Thereof

(In reply to vsespb's message of Tue Aug 27 06:05:45 2013.)
From: victor [...] vsespb.ru
Why did you merge both??? They contradict each other!!

(In reply to DANKOGAI's message of Thu Aug 29 18:52:11 2013.)
Fixed with release of 2.54 (which reverted some of the documentation changes)

