
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 34259
Status: open
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: MSCHILLI [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 2.24
Fixed in: (no value)



Subject: Utf8 flag on after decoding 100% ASCII data

Hi Dan,

thanks for Encode, it's a great module! My colleague Richard Russo has found a case where Encode decodes 100% ASCII data and subsequently sets the utf8 flag:

    my $string = "191501885";

    my $id = decode_utf8( $string );
    print "$id " , Encode::is_utf8($id), "\n";

    $id = decode ( "utf8", $string );
    print "$id " , Encode::is_utf8($id), "\n";

yields

    191501885 1
    191501885 1

while according to the documentation, strings that are 100% ascii shouldn't have the utf8 flag on after they're utf8-decoded. Note that the string contains a 100% ASCII string and not a number.

Would be great if you could take a look -- thanks!

-- Mike

I consider the behavior natural. Consider the case below.

    while(<>){
        my $utf8 = decode_utf8($_);
        # ....
    }

The subsequent code must be written conditionally if decode_utf8 conditionally sets the flag.

Dan the Encode Maintainer

On Wed Mar 19 17:29:36 2008, MSCHILLI wrote:
> Hi Dan,
>
> thanks for Encode, it's a great module! My colleague Richard Russo has
> found a case where Encode decodes 100% ASCII data and subsequently sets
> the utf8 flag:
>
>     my $string = "191501885";
>
>     my $id = decode_utf8( $string );
>     print "$id " , Encode::is_utf8($id), "\n";
>
>     $id = decode ( "utf8", $string );
>     print "$id " , Encode::is_utf8($id), "\n";
>
> yields
>
>     191501885 1
>     191501885 1
>
> while according to the documentation, strings that are 100% ascii
> shouldn't have the utf8 flag on after they're utf8-decoded. Note that
> the string contains a 100% ASCII string and not a number.
>
> Would be great if you could take a look -- thanks!
>
> -- Mike

> I consider the behavior natural. Consider the case below.
>
>     while(<>){
>         my $utf8 = decode_utf8($_);
>         # ....
>     }
>
> The subsequent code must be written conditionally if decode_utf8
> conditionally sets the flag.

Sorry to be a pain, but this is complete garbage! The Encode documentation even goes into great detail to explain this. Look at the "The UTF8 flag" section:

http://search.cpan.org/~dankogai/Encode-2.25/Encode.pm

---
Goal #1: Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.

Goal #2: Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.

Goal #3: Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.

...

# When you decode, the resulting UTF8 flag is on unless you can unambiguously represent data. Here is the definition of dis-ambiguity.

After $utf8 = decode('foo', $octet);,

    When $octet is...                 The UTF8 flag in $utf8 is
    ---------------------------------------------
    In ASCII only (or EBCDIC only)    OFF
    ..

As you see, there is one exception, In ASCII. That way you can assume Goal #1. And with Encode Goal #2 is assumed but you still have to be careful in such cases mentioned in CAVEAT paragraphs.
---

So this bug is actually a bug. The data is ASCII only, so the UTF8 flag should be OFF after the decode.

And fixing this bug should NOT mean changing the documentation. Leaving the flag on slows down the case where there is ASCII only data.

> I consider the behavior natural. Consider the case below.
>
>     while(<>){
>         my $utf8 = decode_utf8($_);
>         # ....
>     }
>
> The subsequent code must be written conditionally if decode_utf8
> conditionally sets the flag.

I might add that this is not true either: there should be no need for conditional code at all. The whole point is that programs don't need to look at the UTF8 flag.

If you have a string that's pure ASCII and has the UTF8 flag OFF, then you can at any time join it with a string that has the UTF8 flag ON, and it will be promoted just fine.

    my $asciistr = "hello";                    # UTF8 flag OFF
    my $utf8octets1 = "hello";                 # UTF8 flag OFF
    my $utf8octets2 = "\342\230\272";          # UTF8 flag OFF

    my $perlstr1 = decode_utf8($utf8octets1);  # UTF8 flag OFF
    my $perlstr2 = decode_utf8($utf8octets2);  # UTF8 flag ON

    my $perlstr3 = "\x{263a}";                 # UTF8 flag ON

    my $result1 = $perlstr1 . $perlstr2;       # UTF8 flag ON
    my $result2 = $perlstr1 . $asciistr;       # UTF8 flag OFF

All works just fine.

The point is that if you work with data, even if it's incoming utf-8 data that you decode_utf8() to create a "perl string", then if that data was only ASCII data, it's a perl string with the UTF8 flag OFF and you get all the "fast" performance of octets.

Only if you use non-ASCII chars do you actually pay the performance cost of perl strings with the UTF8 flag being on.
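
For readers who want to check the "no conditional code needed" argument themselves, here is a minimal sketch. It assumes only the core Encode module; the variable names are made up for illustration. It deliberately does not assert the flag state of the decoded ASCII string, since that is exactly what this ticket disputes. It only checks that comparison, length and concatenation behave the same either way.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $octets  = "hello";               # plain ASCII bytes
my $decoded = decode_utf8($octets);  # flag state depends on the Encode version

# String comparison and length work on characters, not on the internal flag.
my $same_text   = $decoded eq $octets;
my $same_length = length($decoded) == length($octets);

# Joining with a string that needs the flag upgrades the result as a whole.
my $smiley = decode_utf8("\342\230\272");   # U+263A, a single character
my $joined = $decoded . $smiley;

printf "same text: %s, same length: %s, joined length: %d\n",
    ($same_text   ? "yes" : "no"),
    ($same_length ? "yes" : "no"),
    length($joined);
```

Whether the flag on `$decoded` is on or off, the program text stays the same, which is the point being made above.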

That one is tough to cope with because in encode (whatever -> utf8), the transcoder is set so that it only complains about the first byte that is malformed, while decode (utf8 -> whatever) complains about the whole unicode. The problem is that the transcoder is shared with other encodings, so fixing this may break other encodings.

I'll leave this ticket open till I come up with something better.

Dan the Encode Maintainer

On Wed May 14 23:45:53 2008, ROBM wrote:
> I might add that this is not true either: there should be no need for
> conditional code at all. The whole point is that programs don't need to
> look at the UTF8 flag.
>
> If you have a string that's pure ASCII and has the UTF8 flag OFF, then
> you can at any time join it with a string that has the UTF8 flag ON,
> and it will be promoted just fine.
>
>     my $asciistr = "hello";                    # UTF8 flag OFF
>     my $utf8octets1 = "hello";                 # UTF8 flag OFF
>     my $utf8octets2 = "\342\230\272";          # UTF8 flag OFF
>
>     my $perlstr1 = decode_utf8($utf8octets1);  # UTF8 flag OFF
>     my $perlstr2 = decode_utf8($utf8octets2);  # UTF8 flag ON
>
>     my $perlstr3 = "\x{263a}";                 # UTF8 flag ON
>
>     my $result1 = $perlstr1 . $perlstr2;       # UTF8 flag ON
>     my $result2 = $perlstr1 . $asciistr;       # UTF8 flag OFF
>
> All works just fine.
>
> The point is that if you work with data, even if it's incoming utf-8
> data that you decode_utf8() to create a "perl string", then if that data
> was only ASCII data, it's a perl string with the UTF8 flag OFF and you
> get all the "fast" performance of octets.
>
> Only if you use non-ASCII chars do you actually pay the performance cost
> of perl strings with the UTF8 flag being on.

On Tue Jul 01 16:09:09 2008, DANKOGAI wrote:
> That one is tough to cope with because in encode (whatever -> utf8),
> transcoder is set so that it only complains the first byte that is
> malformed while decode (utf8 -> whatever) complains the whole unicode.
> The problem is that the transcoder is shared with other encodings so
> fixing this may break other encodings.

I would have thought the solution is to keep some "found_non_ascii" (default 0) kind of flag in the transcoder when converting whatever -> utf8.

If during the conversion you find a non-ascii output char (e.g. codepoint >= 0x80), you set the flag.

At the end of the conversion, you set the perl utf-8 flag on the string to on/off based on the "found_non_ascii" flag?

Of course, I don't know the code, so I might be speaking rubbish...

Rob

Document added in 2.27. See also #41163.

Dan the Encode Maintainer

On Wed Jan 21 17:22:03 2009, DANKOGAI wrote:
> Document added in 2.27. See also #41163.
>
> Dan the Encode Maintainer

This hasn't been resolved at all. Doing a diff -ru between 2.26 and 2.27 shows nothing changed in the documentation about this problem. In fact there's no mention of bug 34259 anywhere.

Worse, the documentation for Encode still clearly states that decoding ASCII only data will return a perl string with the utf-8 flag OFF. Read this section:

http://search.cpan.org/~dankogai/Encode/Encode.pm#The_UTF8_flag

But when you test it, it clearly still doesn't do what it says:

    $ perl -le 'use Encode; print $Encode::VERSION; print Encode::is_utf8(decode_utf8("blah"));'
    2.27
    1
From: bryce2 [...] obviously.com

I just spent hours on this. As of Perl 5.10.1, this bug is still present:

    perl -le 'use Encode; print $Encode::VERSION; print Encode::is_utf8(decode_utf8("blah"));'
    2.23
    1

Looks like you just forgot to "use utf8". For compatibility's sake, Perl takes all scripts written in ISO-8859-1 unless you say "use utf8".

perldoc perluniintro for details

Dan the Encode Maintainer

On Thu Jul 07 01:15:48 2011, DANKOGAI wrote:
> Looks like you just forgot to "use utf8". For compatibility's sake,
> Perl takes all scripts written in ISO-8859-1 unless you say "use utf8".

Sorry, this doesn't make any sense in this context. "use utf8" is irrelevant if your program uses plain ASCII strings, as the snippets of code presented in this bug all do.

The problem is that Encode isn't behaving according to its documentation.

There are three possible solutions to this:

1) Change all encodings to keep track of whether or not a code point above U+007F has been decoded and SvUTF8_(?on|off) accordingly
2) Change Encode::decode() to scan decoded strings for code points above U+007F and SvUTF8_off if no code points are above U+007F
3) Change documentation

1 or 2 isn't beneficial to me since most of my data contain Basic Latin and Latin-1 Supplement characters (Swedish), with occasional Miscellaneous Symbols and General Punctuation.

The question is whether it's worth the overhead; even English texts make more and more use of General Punctuation.

--
chansen
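
For illustration, option 2 can be approximated today in user code without touching Encode itself. The sketch below is an assumption-laden workaround, not a proposed patch: `decode_downgrading_ascii` is a hypothetical helper name, and it relies only on `Encode::decode` plus the core `utf8::downgrade` and `utf8::is_utf8` functions. It also pays exactly the extra scan whose overhead is questioned above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper (not part of Encode): decode, then drop the
# internal UTF8 flag when the result has no code point above U+007F.
sub decode_downgrading_ascii {
    my ($encoding, $octets) = @_;
    my $string = decode($encoding, $octets);

    # The regex scan is the cost of option 2: one extra pass per string.
    utf8::downgrade($string) if $string !~ /[^\x00-\x7F]/;

    return $string;
}

my $id     = decode_downgrading_ascii("UTF-8", "191501885");
my $smiley = decode_downgrading_ascii("UTF-8", "\xE2\x98\xBA");

printf "ASCII input:     flag %s\n", (utf8::is_utf8($id)     ? "on" : "off");
printf "non-ASCII input: flag %s\n", (utf8::is_utf8($smiley) ? "on" : "off");
```

The string contents are identical with or without the wrapper; only the internal representation differs.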
RT-Send-CC: chansen [...] cpan.org
On Sat Nov 12 17:32:33 2011, CHANSEN wrote:
> There are three possible solutions to this:
> 1) Change all encodings to keep track of whether or not a code point
> above U+007F has been decoded and SvUTF8_(?on|off) accordingly
> 2) Change Encode::decode() to scan decoded strings for code points
> above U+007F and SvUTF8_off if no code points are above U+007F
> 3) Change documentation

I would recommend changing the documentation and possibly removing the whole section on the UTF8 flag altogether.

The UTF8 flag is internal to Perl, or at least it is meant to be. All this discussion about it has led to much misunderstanding and chagrin over the years. The way it’s supposed to be is: A string is a string is a string. Previously the max char was 255. Now it’s higher.

> 1 or 2 isn't beneficial to me since most of my data contain Basic
> Latin and Latin-1 Supplement characters (Swedish), with occasional
> Miscellaneous Symbols and General Punctuation.
>
> The question is if it's worth the overhead, even English texts makes
> more and more use of General Punctuation.
>
> --
> chansen

Hm. Not sure here.

Modules like MIME::Base64 and Digest::SHA (newer versions) die with an error if they see a string with the utf8 bit set. (That looks correct, as those functions are defined only for bytes, not for characters.)

People obviously need to control the utf8 bit.

On Sun Nov 13 06:32:47 2011, SPROUT wrote:
> I would recommend changing the documentation and possibly removing the
> whole section on the UTF8 flag altogether.
>
> The UTF8 flag is internal to Perl, or at least it is meant to be. All
> this discussion about it has led to much misunderstanding and chagrin
> over the years. The way it’s supposed to be is: A string is a string
> is a string. Previously the max char was 255. Now it’s higher.
From: victor [...] vsespb.ru

> Modules like MIME::Base64 and Digest::SHA (newer versions) die with
> error if see a string with utf8 bit set.

Ignore this, this is just wrong.

On Thu Jan 03 19:28:20 2013, vsespb wrote:
> Hm. Not sure here.
>
> Modules like MIME::Base64 and Digest::SHA (newer versions) die with
> error if see a string with utf8 bit set. (that looks correct as those
> functions are defined only for bytes, not for characters).
>
> People obviously need to control utf8 bit.
From: victor [...] vsespb.ru
On Thu May 15 07:31:57 2008, ROBM wrote:
> ---
> Goal #1:
>
> Old byte-oriented programs should not spontaneously break on the old
> byte-oriented data they used to work on.

> As you see, there is one exception, In ASCII. That way you can assume
> Goal #1. And with Encode Goal #2 is assumed but you still have to be
> careful in such cases mentioned in CAVEAT paragraphs.

> So this bug is actually a bug. The data is ASCII only, so the UTF8
> should be OFF after the decode.
>
> And fixing this bug should NOT be changing the documentation. This slows
> down the case where there is ASCII only data.

I think the point about Goal #1 is invalid here. ASCII data can get the utf-8 flag, for example, when splitting a non-ASCII string (with the flag on) into ASCII and non-ASCII parts. The ASCII part will have the utf8 bit on.

Also, "old byte-oriented" programs never deal with decode() or with any Unicode data, so they are not affected.

So, IMHO the documentation about the ASCII flag behaviour should be dropped (however, it would be better to add a notice to CAVEATS).
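
The splitting case described above can be tried with a few lines of core Perl. This is only a sketch with made-up variable names; it uses `split` and the built-in `utf8::is_utf8`, and the pieces are expected to inherit the flag from the string they were cut from even though they contain only ASCII.

```perl
use strict;
use warnings;

# A string holding U+263A cannot be stored as bytes, so its UTF8 flag is on.
my $mixed = "abc\x{263A}def";

# Both pieces are pure ASCII, yet they come from a flagged string.
my ($left, $right) = split /\x{263A}/, $mixed;

for my $part ($left, $right) {
    printf "%s: ASCII only, UTF8 flag %s\n",
        $part, (utf8::is_utf8($part) ? "on" : "off");
}
```

So ASCII-only strings with the flag on already occur in ordinary programs, independently of what decode() does.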

