
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 48018
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: lubo.rintel [...] gooddata.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 2.23
Fixed in: (no value)



Subject: Encode and iconv (etc.) disagree on what's valid UTF-8
This input seems to be correctly marked as invalid by both Encode and iconv:

    [lkundrak@trurl ~]$ perl -MEncode -e '$u = "\xef\xbf"; print $u; decode ("UTF-8", $u, 1);'|iconv -f utf8 -t iso8859-1
    utf8 "\xEF" does not map to Unicode at /usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm line 162.
    iconv: incomplete character or shift sequence at end of buffer
    [lkundrak@trurl ~]$

Most tools I've encountered won't accept the 0xEF 0xBF 0xBD sequence either, though not being an expert on the topic I can't really say who's wrong here. See:

    [lkundrak@trurl ~]$ perl -MEncode -e '$u = "\xef\xbf\xbd"; print $u; decode ("UTF-8", $u, 1);'|iconv -f utf8 -t iso8859-1
    iconv: illegal input sequence at position 0

iconv complains about something that decode() accepts happily.
On Mon Jul 20 07:37:05 2009, lkundrak wrote:
> This input seems to be correctly marked as invalid, both by Encode as
> well as iconv:
>
> [lkundrak@trurl ~]$ perl -MEncode -e '$u = "\xef\xbf"; print $u; decode
> ("UTF-8", $u, 1);'|iconv -f utf8 -t iso8859-1
> utf8 "\xEF" does not map to Unicode at
> /usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm line 162.
> iconv: incomplete character or shift sequence at end of buffer
> [lkundrak@trurl ~]$
>
> Most tools I've encountered won't accept 0xEF 0xBF 0xBD sequence either,
> though not being an expert on the topic I can't really say who's wrong
> here. See:
That's U+FFFD (REPLACEMENT CHARACTER) encoded in UTF-8.
> [lkundrak@trurl ~]$ perl -MEncode -e '$u = "\xef\xbf\xbd"; print $u;
> decode ("UTF-8", $u, 1);'|iconv -f utf8 -t iso8859-1
> iconv: illegal input sequence at position 0
>
> Iconv complains about somthing that decode() accepts happily.
It is a valid Unicode character that has no mapping in iso8859-1. So both Encode and iconv are behaving okay.

Dan the Encode Maintainer
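
The distinction drawn in this reply, between bytes that fail to decode as UTF-8 and a decoded character that the target charset cannot represent, can be sketched outside Perl. The following is a Python analogue, not the Encode API: the `classify` helper and its return strings are made up for illustration, and strict `bytes.decode` is only assumed to be roughly comparable to `decode("UTF-8", $u, 1)`.

```python
# Python analogue of the two one-liners in the report (not the Perl Encode API).
# classify() and its messages are hypothetical names chosen for this sketch.

def classify(data: bytes) -> str:
    """Report which stage, if any, rejects the given bytes."""
    try:
        # Strict UTF-8 decoding; raises on malformed or truncated input.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid UTF-8"
    try:
        # The conversion step that `iconv -t iso8859-1` has to perform.
        text.encode("iso8859-1")
    except UnicodeEncodeError:
        return "valid UTF-8, not representable in ISO-8859-1"
    return "valid UTF-8, representable in ISO-8859-1"


truncated = b"\xef\xbf"      # first two bytes of a three-byte sequence
complete = b"\xef\xbf\xbd"   # U+FFFD (REPLACEMENT CHARACTER) in UTF-8

print(classify(truncated))
print(classify(complete))
print(classify("\u00e9".encode("utf-8")))  # U+00E9 exists in ISO-8859-1
```

In this sketch the truncated input fails at the decoding stage, while the complete sequence decodes and is only rejected when converted to ISO-8859-1, which mirrors the two different iconv messages in the report.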

