Subject: decode_utf8 is not equivalent to decode("utf8", ...)
Hi,

this is on Debian lenny, all packages and kernel stock distribution (dom0 of a Xen environment, which should not play any role for this problem).

uname -a gives:

    Linux fenrir 2.6.26-2-xen-686 #1 SMP Fri Sep 17 00:54:08 UTC 2010 i686 GNU/Linux

perl -v gives:

    This is perl, v5.10.0 built for i486-linux-gnu-thread-multi

From within the CPAN shell, we have installed the newest (as of the time of this writing) versions of the following modules: CGI, CPAN, CPAN::Test::Dummy::Perl5::Build, CPAN::Test::Dummy::Perl5::Make, CPAN::Test::Dummy::Perl5::Make::Zip, Encode, FCGI, Module::Signature, Perl, Test::Simple, YAML. The version of Encode is 2.40.

As of the time of this writing, the documentation for the Encode module states that

    $string = decode_utf8($octets [, CHECK]);

is equivalent to

    $string = decode("utf8", $octets [, CHECK]);

This is definitely not true. We have a complex web application written in Perl which has been running flawlessly for years and which used the first variant to decode URL parameters that are fed into the application by an HTTP POST request and are encoded in UTF-8. After upgrading the underlying Debian distribution (and thus Perl and the respective modules), the application failed, mangling German umlauts and other international characters. It took us more than two days of debugging before we hit upon the idea of replacing the first variant by the second one (illogical with respect to the documentation); from that moment on, the application ran without any flaws again.

Since this was very frustrating, we would like to prevent others from suffering the same problem, whose cause cannot be deduced by logical thinking alone, and are therefore filing this bug now.
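For what it's worth, a plausible mechanism for the divergence (an assumption on our part, not confirmed against the Encode 2.40 sources): decode_utf8() at that time appears to have returned its argument unchanged whenever the scalar already carried Perl's internal UTF8 flag, whereas decode("utf8", ...) always decodes the octets. The sketch below re-implements that shortcut in a hypothetical helper old_decode_utf8() purely for illustration, so the difference can be reproduced on any Perl regardless of the installed Encode version:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical re-implementation of the shortcut that decode_utf8()
# in Encode 2.40 seems to have contained: a scalar that already has
# the internal UTF8 flag set is returned unchanged instead of decoded.
sub old_decode_utf8 {
    my ($octets) = @_;
    return $octets if is_utf8($octets);    # the problematic shortcut
    return decode( "utf8", $octets );
}

# UTF-8 octets for the German umlaut "ä" (U+00E4): 0xC3 0xA4
my $octets = "\xC3\xA4";

# On plain octets both variants agree: one character.
my $a = old_decode_utf8($octets);
my $b = decode( "utf8", $octets );
printf "plain octets:   %s (len %d vs %d)\n",
    ( $a eq $b ? "same" : "DIFFERENT" ), length($a), length($b);

# But if the scalar already has the UTF8 flag set (which can happen
# to parameter data that passed through other code), they diverge:
my $flagged = $octets;
utf8::upgrade($flagged);    # set the UTF8 flag; logical bytes unchanged
my $c = old_decode_utf8($flagged);     # returned as-is: 2 characters
my $d = decode( "utf8", $flagged );    # decoded again:  1 character
printf "flagged scalar: %s (len %d vs %d)\n",
    ( $c eq $d ? "same" : "DIFFERENT" ), length($c), length($d);
```

With the shortcut in place, the flagged scalar comes back as the two raw characters "\xC3\xA4" instead of the single character "ä", which matches the mangled-umlaut symptom we observed.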
More precise description of the setup:

- Apache 2.2.9, no default charset configured.
- A web page containing a form; the whole page is encoded as UTF-8 and tagged as UTF-8 both by HTTP headers and by HTML meta tags; the form carries the attribute accept-charset="UTF-8".
- A Perl script, itself coded in UTF-8, with all files read and written in UTF-8 (including stdin and stdout), using the CGI module version 3.49 to receive and decode the parameters sent by the form submit / POST request.
- Browser: Firefox 3.6 (Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3) or MS IE8.

Problem: decode_utf8 does not decode the parameter names and values from the POST request correctly; decode("utf8", ...) does. Thus, decode_utf8 and decode("utf8", ...) are not equivalent as stated by the docs. Furthermore, for our application it makes no difference whether we use decode("utf8", ...) or decode("UTF-8", ...), which also seems to contradict the docs; but maybe "utf8" and "UTF-8" would give different results in another scenario or with other strings.

We are tagging the problem as important because it took a great deal of man-power and time to find the cause of our application's failure. No matter whether the bug is in the documentation or in the module, we think it has to be considered important, because this particular way of scripts messing up encodings is very hard to track down.
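Regarding "utf8" versus "UTF-8": as far as we understand (this is our reading of the Encode docs, not something we verified in the sources), "UTF-8" names the strict, standard-conformant codec, while "utf8" is Perl's lax internal variant that also accepts sequences forbidden by the standard, such as encoded UTF-16 surrogates. For well-formed input both agree, which would explain why our application sees no difference:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Well-formed input: "utf8" and "UTF-8" agree.
my $good = "\xC3\xA4";    # UTF-8 octets for "ä"
print decode( "utf8", $good ) eq decode( "UTF-8", $good )
    ? "well-formed: same\n"
    : "well-formed: DIFFERENT\n";

# The UTF-16 surrogate U+D800, encoded as the bytes ED A0 80, is
# forbidden by strict UTF-8 but accepted by Perl's lax "utf8":
my $surrogate = "\xED\xA0\x80";
my $lax    = decode( "utf8",  $surrogate );   # yields chr(0xD800)
my $strict = decode( "UTF-8", $surrogate );   # malformed: U+FFFD substitution
printf "lax first char:    U+%04X\n", ord($lax);
printf "strict first char: U+%04X\n", ord($strict);
```

So on well-formed POST data the two spellings behave identically, but on malformed input the lax "utf8" codec lets questionable sequences through where strict "UTF-8" substitutes the replacement character.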