This queue is for tickets about the DBD-mysql CPAN distribution.

Report information
The Basics
Id: 87428
Status: resolved
Priority: 0/
Queue: DBD-mysql

People
Owner: CAPTTOFU [...] cpan.org
Requestors: MLEHMANN [...] cpan.org
pali [...] cpan.org
Cc: DBOOK [...] cpan.org
AdminCc:

Bug Information
Severity: (no value)
Broken in: 4.023
Fixed in: 4.041_01



Subject: data corruption: DBD::mysql ignores the utf8-flag
perl knows two internal encodings for strings: plain octets and utf-8. which encoding is used is indicated by the so-called utf8 flag. some strings can be encoded in both formats, and some strings can be encoded only in utf-8 (when they contain character codes >255).

mysql (at least in the protocol) cannot handle any characters >255. it can handle utf-8, but utf-8 contains only byte values, i.e. <= 255.

unfortunately, DBD::mysql doesn't understand the internal perl string encoding, and sometimes corrupts data. here is an example string:

   my $str = "\xaf";

internally, this string can be encoded either as plain octets with the utf8 flag clear, or as a utf-8 string with the utf8 flag set.

when passed to mysql, e.g. to execute, DBD::mysql _ignores_ the utf8 flag, which corrupts the value, as the utf8 flag indicates how the in-memory bytes need to be interpreted, and mysql doesn't have this information anymore. for example, when $str is internally utf8-encoded, mysql instead receives the string "\xc2\xaf", which is rather different.

since the string acts identically on the Perl level regardless of the utf8 flag (and indeed compares identically to itself regardless of the flag value), this is hard-to-debug action at a distance, as two strings that are identical to perl (compare the same, print the same etc.) are passed as two different strings by DBD::mysql.

the obvious fix is to downgrade scalars before passing them to mysql. this has two effects: 1. it ensures the correct data is always passed, regardless of the internal encoding and 2. it can warn the user when character codes >255 are used, which mysql cannot handle (the user would have to encode them to utf-8 first for example).

the reason why this is rarely a big issue is that perl currently avoids upgrading the scalar in many cases, and downgrades them when it thinks performance can be helped (for example, different versions of perl encode constant strings differently depending on whether "use utf8" is in use).

still, it cost me a few hours of debugging today, because I hit exactly that case, and couldn't believe that DBD::mysql hasn't been updated since the string model changed in 5.6 :/
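The effect described above can be reproduced without a database. The following is a minimal sketch using only core Perl; the helper name raw_bytes is made up for illustration and is not part of DBI or DBD::mysql. It holds the same one-character string once with the utf8 flag clear and once with it set, then shows that Perl treats the two as equal while their internal buffers differ, which is what a driver would see if it read the buffer without consulting the flag.

```perl
use strict;
use warnings;

my $octets   = "\xaf";       # one character, code 0xAF, utf8 flag clear
my $upgraded = "\xaf";
utf8::upgrade($upgraded);    # same character, now stored internally as UTF-8

# Perl-level view: the two strings are indistinguishable.
my $same_to_perl = ($octets eq $upgraded) ? 1 : 0;

# Hypothetical helper: returns the internal buffer as hex octets,
# i.e. what code sees when it reads the buffer and ignores the flag.
sub raw_bytes {
    my ($s) = @_;                              # work on a copy
    utf8::encode($s) if utf8::is_utf8($s);     # expose the internal UTF-8 octets
    return join ' ', map { sprintf '%02x', ord($_) } split //, $s;
}

printf "eq:         %d\n", $same_to_perl;
printf "flag clear: %s\n", raw_bytes($octets);
printf "flag set:   %s\n", raw_bytes($upgraded);
```

The two strings compare equal, yet the second buffer holds the two octets c2 af instead of the single octet af.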
a slight addendum: this is a bug in DBD::mysql and not in DBI, as some databases can handle data with character codes >255 (usually unicode), so it is up to the database driver to correctly encode the data for the database.
On Tue Jul 30 06:45:01 2013, MLEHMANN wrote:
> the obvious fix is to downgrade scalars before passing them to mysql.
> this has two effects: 1. it ensures the correct data is always passed,
> regardless of the internal encoding and 2. it can warn the user when
> character codes >255 are used, which mysql cannot handle (the user
> would have to encode them to utf-8 first for example).

This would break code which works with perl character strings and stores them in mysql (with the SET NAMES UTF8 option). You can argue that such code should be written with the DBI option mysql_enable_utf8=1 (and DBI/DBD should skip downgrading strings), but there would be the same problem with binary data: binary data should be downgraded, and DBI cannot distinguish binary data (for BLOB columns etc.) from character data (VARCHAR).
From: Marc Lehmann <schmorp [...] schmorp.de>
Date: Wed, 2 Apr 2014 20:05:06 +0200
On Wed, Apr 02, 2014 at 01:32:19AM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> On Tue Jul 30 06:45:01 2013, MLEHMANN wrote:
> > the obvious fix is to downgrade scalars before passing them to mysql.
> > this has two effects: 1. it ensures the correct data is always passed,
> > regardless of the internal encoding and 2. it can warn the user when
> > character codes >255 are used, which mysql cannot handle (the user
> > would have to encode them to utf-8 first for example).
>
> This would break code which works with perl character strings and stores it in mysql (with SET NAMES UTF8 option).

That is incorrect: such code would work fine as well with downgraded strings (utf-8 is a byte-encoding).

If you mean code that doesn't use utf-8, but unicode strings, then it's still incorrect: such code currently suffers from the reverse problem, i.e. sometimes data would be passed as binary or latin1, sometimes as utf-8. The solution for that would be always upgrading.

No matter how you turn it, DBD::mysql is simply broken w.r.t. perl strings, because it doesn't let the user choose the format.

> You can argue that such code should be written with DBI option
> mysql_enable_utf8=1 (and DBI/DBD should skip downgrading strings), but

No, this option has nothing to do with it - set names utf8 works fine with binary data (unless DBD::mysql is even more buggy), as utf8 is binary data.

The problem is indeed as I reported - DBD::mysql wasn't updated to the new string model in perl 5.6, and currently randomly corrupts data.

Since this is apparently a hard to understand problem, and I don't quite know which part is unclear, let me assure you I will be happy to explain how the perl string model works, how utf-8 works and so on, but I need some clues on where the misunderstanding sits.

As a primer, try to distinguish between Perl and C - in perl, strings are simply lists of characters, and since perl 5.6, these characters can have codes > 255.

Internally, as an optimisation, perl has two different and incompatible representations, utf-8 encoded and byte-encoded. Both forms can hold unicode and binary data(!), the utf8 flag _only_ changes how the character codes are represented, it doesn't change their interpretation.

On the Perl level, the flag value is essentially random, as semantics are not supposed to change depending on the utf-8 flag, and it's not specified when and how this flag changes value, so on the Perl level, you cannot reliably affect this flag except by version-specific and undocumented hackery.

A similar problem exists for numbers: perl doesn't distinguish between numbers and strings, so mysql has to guess (or the user has to specify a type, which is possible with bind_param).

What DBD::mysql currently does is to take perl strings and randomly either encode them in utf-8 or byte encoding, regardless of what the encoding of the string really is.

Fixing this might break some code that currently depends on undocumented and version-specific perl behaviour, but it enables writing code that no longer depends on such hacks. Right now, it's impossible to reliably pass binary (or utf-8) data to mysql(!) - the rules can (and do) change in every perl version.
--
Marc Lehmann <schmorp@schmorp.de>
Deliantra, the free code+content MORPG: http://www.deliantra.net
The choice of a GNU generation
On Wed Apr 02 22:05:21 2014, schmorp@schmorp.de wrote:
> On Wed, Apr 02, 2014 at 01:32:19AM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > On Tue Jul 30 06:45:01 2013, MLEHMANN wrote:
> > > the obvious fix is to downgrade scalars before passing them to mysql.
> > > this has two effects: 1. it ensures the correct data is always passed,
> > > regardless of the internal encoding and 2. it can warn the user when
> > > character codes >255 are used, which mysql cannot handle (the user
> > > would have to encode them to utf-8 first for example).
> >
> > This would break code which works with perl character strings and
> > stores it in mysql (with SET NAMES UTF8 option).

[cut]

> If you mean code that doesn't use utf-8, but unicode strings, then it's
> still incorrect: such code currently suffers from the reverse problem,
> i.e. sometimes data would be passed as binary or latin1, sometimes as
> utf-8. The solution for that would be always upgrading.

Yes, I meant character strings (unicode strings). I said that it would break existing code, and this is correct. We have such code now; it works fine because downgraded unicode strings are rare and because we use it for Russian text (which cannot be downgraded). So I would consider it broken in rare cases. But your proposal will break it in _all_ cases.

> No matter how you turn it, DBD::mysql is simply broken w.r.t. perl
> strings, because it doesn't let the user chose the format.

I agree - it's broken on the API level. It should have a different API where users can specify which strings are binary and which are character strings.

> > You can argue that such code should be written with DBI option
> > mysql_enable_utf8=1 (and DBI/DBD should skip downgrading strings),
> > but
>
> No, this option has nothing to do with it - set names utf8 works fine with
> binary data (unless DBD::mysql is even more buggy), as utf8 is binary data.

Yes, right. I think you misunderstand me - actually I meant that you _could_ suggest a solution that people should not use unicode character strings without mysql_enable_utf8=1 (and this would make your proposal for downgrading strings valid when mysql_enable_utf8=0), and I explained why this would not help either - that's because even in mysql_enable_utf8=1 mode there will be binary data for binary columns that should not be upgraded.

> know which part is unclear, let me assure you I will be happy to explain how
> the perl string model works, how utf-8 works and so on, but I need some clues

No, thank you, I think I already know how it works. Also, FYI, I am not the maintainer of this module.
On Tue Jul 30 06:45:01 2013, MLEHMANN wrote:
> the obvious fix is to downgrade scalars before passing them to mysql.
> this has two effects: 1. it ensures the correct data is always passed,
> regardless of the internal encoding and 2. it can warn the user when
> character codes >255 are used, which mysql cannot handle (the user
> would have to encode them to utf-8 first for example).

Probably I missed that part - "(the user would have to encode them to utf-8 first for example)" - that would work, but that would be too much code to encode each character string to utf8 before passing it to DBI, plus additional performance costs. Also, a function to encode a string could ensure the encoded string is returned in downgraded form, so there is nothing to fix in DBI - the user can implement and use such a function himself (and another one to ensure binary strings are downgraded).
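A minimal sketch of the two helper functions suggested above, assuming only core Perl and the Encode module. The names text_param and binary_param are hypothetical and belong to neither DBI nor DBD::mysql. The sketch only normalises the value handed to the driver; it makes no claim about what the driver does with it afterwards.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a Perl character string into UTF-8 octets
# and make sure the result is stored with the utf8 flag clear.
sub text_param {
    my ($chars) = @_;
    my $octets = encode('UTF-8', $chars);
    utf8::downgrade($octets);    # no-op if the flag is already clear
    return $octets;
}

# Hypothetical helper: make sure binary data is stored with the flag clear.
# utf8::downgrade dies if the value contains a character code above 255,
# which also catches character strings passed in by mistake.
sub binary_param {
    my ($bytes) = @_;
    utf8::downgrade($bytes);
    return $bytes;
}

my $blob = "\xaf";
utf8::upgrade($blob);                  # simulate an accidental upgrade
my $safe_blob = binary_param($blob);   # one octet, flag clear again
my $safe_text = text_param("\xaf");    # two octets: c2 af
```

Values prepared this way would then be passed as bind values, for example $sth->execute($safe_text, $safe_blob).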
From: Marc Lehmann <schmorp [...] schmorp.de>
Date: Sat, 5 Apr 2014 21:53:08 +0200
On Fri, Apr 04, 2014 at 07:18:15AM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> Yes, I meant character strings (unicode strings). I said that it would break existing code, and this is correct.

It's not, and nothing you say indicates otherwise.

> We have such code now, it works fine

It works fine by accident only. It might not work with older or newer perl versions, because it relies on undocumented behaviour inside the perl interpreter which can and does change in different versions.

> because downgraded unicode strings are rare

They are rare because you are lucky - but what happens when you hit that rare case? Does your code that works fine still work fine in these rare cases?

> and because we use it for Russian text (which cannot be downgraded).

Russian text can easily be downgraded, for example when it's encoded in utf-8, as required by mysql.

> So I would consider it broken in rare cases.

The key is that the code in question already is broken, even if you are lucky and it works except in rare cases.

> But your proposal will break it in _all_ cases.

Not sure, but possible. The key, again, is that the change would allow one to fix broken code such as yours. Right now, the best you can achieve is code that happens to work "most of the time".

So your proposal is to keep a bug that makes it impossible to write correct and working code, because it makes already broken code fail deterministically.

I would say that's a ridiculous proposal. Why would anybody want guaranteed brokenness?

Even you admit that your code already *is* broken. And so is my own code. And there is no way to fix either until DBD::mysql is fixed.

I can try various workarounds such as utf8::downgrade or upgrade, but that doesn't fix the code, it only makes it work with my current perl binary.

> > No matter how you turn it, DBD::mysql is simply broken w.r.t. perl
> > strings, because it doesn't let the user chose the format.
>
> I agree - it's broken on API level. It should have different API where users can specify where is binary string and where is character string.

Either that, or it should simply offer the same API as mysql, namely use the same encoding as the underlying C lib, just as basically any other library does on the planet (compare Compress::Zlib for example, which doesn't have this bug, and also doesn't require extra specification of whether something is a text string or not).

I think whoever implemented this utf-8 stuff in DBD::mysql was simply confused - utf-8 strings aren't unicode strings.

Fortunately, this is not a situation that created a backwards compatibility problem, because the behaviour isn't deterministic, but effectively random.

> > binary data (unless DBD::mysql is even more buggy), as utf8 is binary
>
> Yes, right. I think you misunderstand me - actually I meant that you
> _could_ suggest a solution that people should not use unicode character
> strings without mysql_enable_utf8=1 (and this will make your proposal for
> downgrading strings valid when mysql_enable_utf8=0), and I explained why
> this would not help either - that's because even in mysql_enable_utf8=1
> mode there will be binary data for binary columns that should not be
> upgraded.

The documentation of mysql_enable_utf8 says "turning on this flag tells MySQL that incoming data should be treated as UTF-8". I don't know what the option actually does (apparently, it doesn't treat anything as utf-8 with this flag, right?), but as documented, yes, it's quite obvious that you can't pass in generic binary data anymore. (In fact, I suspect that when you pass in utf-8 data as expected, it will be double-encoded, which would introduce pretty obvious data corruption.)

Of course, this option is marked as experimental (in my copy at least), so one shouldn't be surprised if a bug is found and fixed.

In any case, I don't see what mysql_enable_utf8 has to do with anything, it's clearly a useless option unless all your data is unicode (or utf-8?), and even has the potential to corrupt data even more (what happens when I pass data to a binary column and retrieve it, will it double- or even triple-encode the data in some cases? As the documentation stands, it seems that is the case).
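To make the term concrete, here is a minimal sketch of what double-encoding means at the byte level, using only the Encode module. It is not a test of DBD::mysql and does not show what mysql_enable_utf8 really does; it only illustrates the kind of corruption suspected above. The helper name hex_octets is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\xaf";                     # one character, code 0xAF
my $once  = encode('UTF-8', $char);     # correct UTF-8 octets
my $twice = encode('UTF-8', $once);     # the same octets encoded a second time

# Hypothetical helper: render an octet string as space-separated hex.
sub hex_octets {
    my ($s) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $s;
}

# Decoding once does not get back to the original character:
# it yields the two characters 0xC2 0xAF instead of the single 0xAF.
my $round_trip = decode('UTF-8', $twice);

printf "encoded once:  %s\n", hex_octets($once);
printf "encoded twice: %s\n", hex_octets($twice);
printf "round trip has %d character(s)\n", length $round_trip;
```

Encoded once, the character becomes c2 af; encoded twice it becomes c3 82 c2 af, and a single decode no longer restores the original value.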
> > know which part is unclear, let me assure you I will be happy to explain how
> > the perl string model works, how utf-8 works and so on, but I need some clues
>
> No, thank you, I think I already know how it works.

It looks to me as if you keep confusing unicode and utf-8 strings. They are different in Perl.

> Also FYI I am not maintainer of this module.

I know, but the maintainer of this module could be confused by your wrong comments, so it's good to clear up the situation.

Summary: your code is broken, and so is mine. You might not understand it yet, but you are suffering from this very bug, just in reverse. If this bug were fixed, we both could fix our code.
> > and because we use it for Russian text (which cannot be downgraded).
>
> Russian text can easily be downgraded, for example when it's encoded in
> utf-8, as required by mysql.

As I said, I meant unicode character strings. By "downgraded" I mean utf8::downgrade. Russian text cannot be utf8::downgrade'd, because Cyrillic letters have character codes above 255. So perl character strings with Russian letters always have the UTF-8 flag on.
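The two meanings of "downgraded" used in this exchange can be told apart with a short sketch, assuming only core Perl and the Encode module. The sample word and the variable names are illustrative. A character string containing Cyrillic letters cannot be utf8::downgrade'd, while the same text encoded to UTF-8 octets can.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "privet" in Cyrillic, written as a Perl character string (codes > 255).
my $text = "\x{43f}\x{440}\x{438}\x{432}\x{435}\x{442}";

# Character string: utf8::downgrade fails; the second argument (FAIL_OK)
# makes it return false instead of dying.
my $copy = $text;
my $chars_downgradable = utf8::downgrade($copy, 1) ? 1 : 0;

# UTF-8 encoded octets: every element is <= 255, so downgrading succeeds.
my $octets = encode('UTF-8', $text);
my $octets_downgradable = utf8::downgrade($octets, 1) ? 1 : 0;

printf "character string downgradable: %d\n", $chars_downgradable;
printf "utf-8 octets downgradable:     %d\n", $octets_downgradable;
printf "characters: %d, octets: %d\n", length($text), length($octets);
```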
From: Marc Lehmann <schmorp [...] schmorp.de>
Date: Sat, 5 Apr 2014 22:04:25 +0200
On Fri, Apr 04, 2014 at 07:31:56AM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > the obvious fix is to downgrade scalars before passing them to mysql.
> > this has two effects: 1. it ensures the correct data is always passed,
> > regardless of the internal encoding and 2. it can warn the user when
> > character codes >255 are used, which mysql cannot handle (the user
> > would have to encode them to utf-8 first for example).
>
> Probably I missed that part - "(the user would have to encode them to
> utf-8 first for example)" - that would work, but that would be too much
> code to encode each character string

Any evidence for that claim? I don't think there is.

> additional performance costs.

The data needs to be transformed either inside or outside DBD::mysql, and somehow the encoding must be specified anyways, so this would not incur any additional performance costs (but see below).

The only additional costs are the code that makes the program correct, which is a required component, not something that could be optimised away.

> Also, a function to encode a string could ensure the encoded string is returned in
> downgraded form, so there is nothing to fix in DBI

I am not sure I understand that, but a DBD::mysql that force-accepts only utf-8 with one option, and otherwise just passes through strings unchanged would work fine for me (I would simply disable the option and use utf-8 for text, and would never run into a problem).

An option to allow and return unicode strings for everything "non-numerical" would probably be of more use overall, as many databases are non-binary and then it would make it convenient to use unicode strings in perl where mysql expects utf-8, and vice versa.

(The numericalness can already be specified, and has to be, as DBD::mysql also cannot guess, so one cannot write correct code without specifying it.)

> the user can implement and use such a function himself (and another one to
> ensure binary strings are downgraded).

AFAIK, there is no way to do that in Perl. The only way to do that reliably would be in XS code inside the module that uses it, which means it *has* to be in DBD::mysql (the user cannot implement this on her own).

You are probably thinking of utf8::upgrade/downgrade or the like, but these obviously cannot be used to implement this. Their only use is to work around broken libraries such as DBD::mysql while keeping your fingers crossed that the next version of perl might not break your fix.
Date: Sat, 5 Apr 2014 22:08:36 +0200
To: Victor Efimov via RT <bug-DBD-mysql [...] rt.cpan.org>
From: Marc Lehmann <schmorp [...] schmorp.de>
On Sat, Apr 05, 2014 at 04:01:33PM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > utf-8, as required by mysql. > >
> > As I told, I meant unicode character strings. By "downgraded" I mean "utf8::downgrade". So Russian text cannot be utf8::downgrade'd, because all characters are above 255. So perl character strings with Russian letters are always with UTF-8 flag on.
Thanks for the clarification, I understand now what you meant to convey. In your original mail, you didn't say what you were referring to.
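The point about utf8::downgrade can be checked with a few lines of Perl. This is only a minimal sketch using the core utf8:: functions; the variable names are illustrative and not taken from the thread. It shows that a string whose characters all fit into one byte can be moved between both internal representations, while a string with characters above 255 (such as Cyrillic text) can only exist with the UTF8 flag on.

```perl
use strict;
use warnings;

my $latin = "\xfc";              # U+00FC, fits into a single byte
my $cyr   = "\x{43f}\x{440}";    # two Cyrillic letters, both above 255

utf8::upgrade($latin);           # internal storage is now UTF-8, flag on

# The second argument (FAIL_OK) makes downgrade return false instead of dying.
my $latin_ok = utf8::downgrade($latin, 1);
my $cyr_ok   = utf8::downgrade($cyr, 1);

printf "latin downgraded:    %s (flag now %s)\n",
    $latin_ok ? "yes" : "no", utf8::is_utf8($latin) ? "on" : "off";
printf "cyrillic downgraded: %s (flag still %s)\n",
    $cyr_ok ? "yes" : "no", utf8::is_utf8($cyr) ? "on" : "off";
```

The flag therefore only says how the characters are stored, not whether the string is "text" or "binary".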
Message-ID: <rt-4.0.18-23112-1396729408-1273.87428-0-0 [...] rt.cpan.org>
On Sat Apr 05 23:53:21 2014, schmorp@schmorp.de wrote:
> The documentation of mysql_enable_utf8 says "turning on this flag tells MySQL that incoming data should be treated as UTF-8".
>
> I don't know what the option does (apparently, it doesn't treat anything as utf-8 with this flag, right?), but as documented, yes, it's quite obvious that you can't pass in generic binary data anymore.
>
> (In fact, I suspect when you pass in utf-8 data as expected, it will be double-encoded, which would introduce pretty obvious data corruption).
Let's look again at what the docs say:

===
When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string
===

That's correct: you get perl character strings when reading data from mysql (except for binary columns).

===
Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect.
===

That means:

1) It just issues the "SET NAMES utf8" command. That's all. Nothing more.

2) It tells MySQL (the mysql server daemon process, not the DBD::mysql library) that the data is in UTF-8.

If we talk about things on the MySQL daemon side, there are no "character strings", "binary strings" etc., and no confusion between perl character strings with the utf8 flag and data encoded in utf-8 (usually without the flag). So "UTF-8" here means just what it means in the MySQL documentation. It's implemented via the "SET NAMES utf8" command (see (1)).
Date: Sat, 5 Apr 2014 22:49:56 +0200
To: Victor Efimov via RT <bug-DBD-mysql [...] rt.cpan.org>
From: Marc Lehmann <schmorp [...] schmorp.de>
On Sat, Apr 05, 2014 at 04:23:29PM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string > === > > that's correct. you get perl character strings, when reading data from mysql. (except binary columns)
What does "necessary" mean? If it means that the utf-8 flag is turned on if the mysql string contains characters > 255, it would be correct. This could be done if mysql ensures that everything is utf-8 encoded, in which case blindly setting the utf-8 flag would work. I don't know enough about libmysqlclient and mysqld to know what really happens, but I wouldn't rely on this meaning something correct, given that DBD::mysql is *known* to have a broken implementation.
> Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect. > > 1) it just issues "SET NAMES utf8" command. That's all. Nothing more.
That's not what it says. It says if you turn on this flag after connect, then you need to issue the set names utf8 command.
> 2) It tells MySQL (mysql server daemon process, not DBD::mysql library), that data is in UTF-8.
Do you have evidence for this? The official mysql docs say this only indicates the encoding used for the sql statement, not the embedded data (which is usually interpolated, but does not have to be so). This also makes sense - numbers are typically passed as strings in the protocol, but still stay numbers (not utf-8 encoded data) when the statement is interpreted.
> If we talk about things on MySQL daemon side, there are no "character > strings" "binary strings" etc, no confusion between perl character > strings with utf8 flag and data encoded in utf-8 (usually without flag).
The MySQL daemon certainly distinguishes between character strings and binary! "char" and "binary" are data types and are treated differently in mysql. Binary strings compare differently than character strings, for example.

What it doesn't do is distinguish between unicode and non-unicode in the protocol, and that is exactly the problem - DBD::mysql either should not attempt to distinguish, or should have a _deterministic_ algorithm.

Right now, DBD::mysql sometimes utf-8 encodes data and sometimes does not, for the *same* strings on the Perl level. This is simply a bug - no matter what *we* think DBD::mysql _should_ do, it doesn't do it _right now_, because there is no deterministic way to influence it from the Perl level.

As I have pointed out before, and as you chose to ignore: if you disagree, tell me a deterministic way to get binary data in mysql, which works in previous, current, and future versions (as long as perl works as documented).

That your program (and now also my program) happens to work with the version of perl we employ is meaningless. I want a way that works correctly, even in future versions of Perl.

Also, having to downgrade or upgrade every string before passing it to mysql is clearly something you don't want to do, but is currently necessary as a bug workaround.

Again, you are suffering from the same bug right now, you just don't realise it yet. All the drawbacks of the workarounds you think have to be employed for a fix already have to be employed. If DBD::mysql were fixed instead, most of these hacks wouldn't be required.
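The non-determinism described here can be made visible without a database. The following is a rough sketch, not DBD::mysql code: the helper internal_bytes_hex is hypothetical and merely reconstructs, at the Perl level, the octets an XS driver would see if it copied a scalar's internal buffer without looking at the UTF8 flag.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of the octets found in the scalar's
# internal buffer, i.e. what a driver sees when it ignores the UTF8 flag.
sub internal_bytes_hex {
    my ($s) = @_;                                 # the copy keeps the flag
    utf8::encode($s) if utf8::is_utf8($s);        # upgraded: buffer is UTF-8
    return unpack "H*", $s;
}

my $plain    = "\xaf";
my $upgraded = "\xaf";
utf8::upgrade($upgraded);                         # same string, other storage

my $same      = $plain eq $upgraded ? 1 : 0;
my $hex_plain = internal_bytes_hex($plain);
my $hex_up    = internal_bytes_hex($upgraded);

print "equal in Perl:    ", ($same ? "yes" : "no"), "\n";   # yes
print "buffer, plain:    $hex_plain\n";                     # af
print "buffer, upgraded: $hex_up\n";                        # c2af
```

Both scalars are the same one-character string to every Perl operator, yet the buffers differ, which is why a driver that reads the buffer directly sends different data for identical strings.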
> so "UTF-8" here means just what it means in MySQL documentation. It's implemented via "SET NAMES utf8" command (see (1))
"just" is a weasel word. As we have just seen, mysql documentation disagrees with you, so it apparently isn't that simple :)
Date: Sat, 5 Apr 2014 22:53:36 +0200
To: Victor Efimov via RT <bug-DBD-mysql [...] rt.cpan.org>
From: Marc Lehmann <schmorp [...] schmorp.de>
> This also makes sense - numbers are typically passed as strings in > protocol, but still stay numbers (not utf-8 encoded data) when the > statement is interpreted.
What I forgot to mention, btw., is that, while the protocol distinguishes between text (MYSQL_TYPE_STRING) and binary (MYSQL_TYPE_BLOB), this doesn't apply if values are interpolated, which is still, afaik, the default way DBD::mysql operates.
Message-ID: <rt-4.0.18-26822-1396737422-224.87428-0-0 [...] rt.cpan.org>
On Sun Apr 06 00:50:13 2014, schmorp@schmorp.de wrote:
> As I have pointed out before, and as you chose to ignore: if you disagree, tell me a deterministic way to get binary data in mysql, which works in previous, current, and future versions (as long as perl works as documented).
>
> That your program (and now also my program) happens to work with the version of perl we employ is meaningless. I want a way that works correctly, even in future versions of Perl.
1) A new flag (let's say "mysql_enable_unicode") which turns on a new API. Without that flag everything works the old way (let's call it the "old DBI API").

2) When sending data to DBI:
- scalars are treated as character strings, and thus utf8::upgrade'd before processing by the old DBI API.
- a new exported function "binary()". binary($scalar) will return a blessed object which contains a reference to the scalar. When this object is sent to DBI, DBI will detect the object and the scalar will be utf8::downgrade'd before processing by the old DBI API.

3) When reading data from DBI, like now with the mysql_enable_utf8 flag:
- "SET NAMES utf8" is issued.
- data retrieved from a textual column type (char, varchar, etc) is returned as a character string.
- data from a binary column is returned as a binary string.
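To make the sending half of this proposal concrete, here is a minimal pure-Perl sketch. Everything in it is an assumption for illustration: the class name My::Binary and the function wire_octets are hypothetical, only binary() is named in the proposal, and none of it exists in DBI or DBD::mysql. It models the octets that would reach the server: plain scalars are sent as the UTF-8 encoding of their characters, while scalars wrapped with binary() are sent one octet per character.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical marker class; the proposal only says "blessed object".
package My::Binary {
    sub new { my ($class, $ref) = @_; return bless { ref => $ref }, $class }
}

# binary($scalar): wrap a reference to the caller's scalar.
sub binary { return My::Binary->new(\$_[0]) }

# Hypothetical driver-side step: the octets to put on the wire.
sub wire_octets {
    my ($param) = @_;
    if (blessed($param) && $param->isa('My::Binary')) {
        my $copy = ${ $param->{ref} };
        utf8::downgrade($copy);        # dies if a character is above 255
        return $copy;
    }
    my $copy = $param;
    utf8::encode($copy);               # characters -> UTF-8 octets
    return $copy;
}

my $plain    = "\xaf";
my $upgraded = "\xaf";
utf8::upgrade($upgraded);

my @text = map { unpack "H*", wire_octets($_) }         $plain, $upgraded;
my @blob = map { unpack "H*", wire_octets(binary($_)) } $plain, $upgraded;

print "text: @text\n";   # c2af c2af
print "blob: @blob\n";   # af af
```

In this sketch the result depends only on the characters in the string and on whether the caller marked it as binary, not on the internal representation, which is the determinism asked for earlier in the thread.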
Message-ID: <rt-4.0.18-30628-1443742093-489.87428-0-0 [...] rt.cpan.org>
This is still a problem. For example, Spreadsheet::ParseExcel tends to return strings which are not utf8 upgraded, so passing them directly to DBD::mysql with mysql_enable_utf8 enabled results in collation conflicts (Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) ...). utf8::upgrade on every string being passed "solves" the issue, but this shouldn't be needed.
Date: Mon, 5 Oct 2015 22:40:29 -0400
To: bug-DBD-mysql [...] rt.cpan.org
From: Patrick Galbraith <patg [...] patg.net>
Thank you for the report! I will look at the driver and see what is needed to make this not require having to upgrade every string explicitly.
> On Oct 1, 2015, at 7:28 PM, Dan Book via RT <bug-DBD-mysql@rt.cpan.org> wrote: > > Queue: DBD-mysql > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=87428 > > > This is still a problem. For example, Spreadsheet::ParseExcel tends to return strings which are not utf8 upgraded, so passing them directly to DBD::mysql with mysql_enable_utf8 enabled results in collation conflicts (Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) ...). utf8::upgrade on every string being passed "solves" the issue, but this shouldn't be needed.
Message-ID: <rt-4.0.18-28094-1444099747-727.87428-0-0 [...] rt.cpan.org>
This is an old bug and I'd like to fix it. I'm not a collation expert, so I will need to look at the other drivers to see what they do about this. Sorry for the ticket rot. On Mon Oct 05 22:40:53 2015, patg@patg.net wrote:
> thank you for the report! I will look at the driver and see what is needed to make this not require having to upgrade every string explicitly.
>
> > On Oct 1, 2015, at 7:28 PM, Dan Book via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> >
> > Queue: DBD-mysql
> > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=87428 >
> >
> > This is still a problem. For example, Spreadsheet::ParseExcel tends to return strings which are not utf8 upgraded, so passing them directly to DBD::mysql with mysql_enable_utf8 enabled results in collation conflicts (Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) ...). utf8::upgrade on every string being passed "solves" the issue, but this shouldn't be needed.
Message-ID: <rt-4.0.18-28296-1477149338-1228.87428-0-0 [...] rt.cpan.org>
A fix for UTF-8 support in DBD::mysql is in my pull request: https://github.com/perl5-dbi/DBD-mysql/pull/67 I would like it if more people affected by UTF-8 bugs in DBD::mysql could test my changes...
CC: pali [...] cpan.org, DBOOK [...] cpan.org
Date: Sun, 30 Oct 2016 21:49:03 +0100
To: Pali via RT <bug-DBD-mysql [...] rt.cpan.org>
From: Marc Lehmann <schmorp [...] schmorp.de>
On Sat, Oct 22, 2016 at 11:15:40AM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=87428 > > > Fix for UTF-8 support in DBD::mysql is in my pull request: https://github.com/perl5-dbi/DBD-mysql/pull/67 > I would like if more people affected by UTF-8 bugs in DBD::mysql could test my changes...
Thanks for looking into this - I only had a cursory look at the patch, and it seems it is wrong in the "other" direction now:

+ else if (is_binary && SvUTF8(ph->value))
+ warn("UTF-8 encoded binary field %d", i);

The UTF8 flag on a scalar does NOT mean the scalar is UTF-8 encoded - the scalars "\xfc" (no utf8 flag) and "\xc3\xbc" (with utf8 flag) are the same string, and in binary both encode the octet 0xfc. Emitting a warning is wrong here, and the message is wrong as well (scalars have no encoding information on the Perl level).

The patch thus requires the same workarounds needed for utf-8 for binary data now - that's the "wrong in the other direction".

Basically, when utf-8 encoded data is wanted, then SvPVutf8 is the correct function, while SvPVbyte is the right function for binary data - the patch only gets the utf-8 case right (with some optimisations).

I can't see whether this is intended or not - calling str_is_nonascii on an utf-8 encoded scalar doesn't seem to make much sense to me (binary data is 8 bit wide, not 7 bit). On the other hand, this seems to be in the patch multiple times.

I don't have time to try it out, and maybe I am overlooking something - again, this is just a quick scan of the patch really. However, the only way to succeed, IMHO, is to get rid of the idea of detecting or guessing the encoding from perl scalars - the UTF8 flag _never_ indicates that the string data is utf-8 encoded.
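The difference between the two accessors named above can be approximated at the Perl level. This is only an illustrative stand-in, not the XS code under review: as_utf8_octets and as_byte_octets are hypothetical helpers that mimic what SvPVutf8 and SvPVbyte hand back, using utf8::encode and utf8::downgrade on a copy.

```perl
use strict;
use warnings;

# Stand-in for SvPVutf8: the UTF-8 encoding of the string's characters.
sub as_utf8_octets { my ($s) = @_; utf8::encode($s); return $s }

# Stand-in for SvPVbyte: one octet per character; dies on characters > 255.
sub as_byte_octets { my ($s) = @_; utf8::downgrade($s); return $s }

my $plain    = "\xfc";
my $upgraded = "\xfc";
utf8::upgrade($upgraded);            # stored internally as c3 bc, flag on

my @utf8 = map { unpack "H*", as_utf8_octets($_) } $plain, $upgraded;
my @byte = map { unpack "H*", as_byte_octets($_) } $plain, $upgraded;

print "utf-8 wanted:  @utf8\n";      # c3bc c3bc
print "binary wanted: @byte\n";      # fc fc
```

Each accessor gives the same octets for both scalars, so the choice between text and binary has to come from the caller or the column type rather than from the flag.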
On Sun Oct 30 16:49:16 2016, schmorp@schmorp.de wrote:
> On Sat, Oct 22, 2016 at 11:15:40AM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > <URL: https://rt.cpan.org/Ticket/Display.html?id=87428 >
> >
> > Fix for UTF-8 support in DBD::mysql is in my pull request: https://github.com/perl5-dbi/DBD-mysql/pull/67
> > I would like if more people affected by UTF-8 bugs in DBD::mysql could test my changes...
>
> Thanks for looking into this - I only had a cursory look into the patch, and it seems it is wrong in the "other" direction now:
>
> + else if (is_binary && SvUTF8(ph->value))
> +   warn("UTF-8 encoded binary field %d", i);
>
> The UTF8 flag on a scalar does NOT mean the scalar is UTF-8 encoded - the scalars "\xfc" (no utf8 flag) and "\xc3\xbc" (with utf8 flag) are the same string, and in binary both encode the octet 0xfc. Emitting a warning is wrong here, and the message is wrong as well (scalars have no encoding information on the Perl level).

The UTF8 flag tells whether the internal representation of the PV in a scalar is stored as utf8 or not. I was thinking that "\xc3\xbc" with the utf8 flag is not meant to be binary anymore, as it is internally stored as utf8. If you produce binary data whose internal representation is utf8, then I think there is some problem...

> The patch thus requires the same workarounds needed for utf-8 for binary data now - that's the "wrong in the other direction".

I will think about it... But the pack/unpack/vec/... functions also work on the string "\xc3\xbc" with the utf8 flag in the same way as on "\xfc" without the utf8 flag...

> Basically, when utf-8 encoded data is wanted, then SvPVutf8 is the correct function, while SvPVbyte is the right function for binary data - the patch only gets the utf-8 case right (with some optimisations).
>
> I can't see whether this is intended or not - calling str_is_nonascii on an utf-8 encoded scalar doesn't seem to make much sense to me (binary data is 8 bit wide, not 7 bit). On the other hand, this seems to be in the patch multiple times.

This is just an optimization. SvPV returns a data buffer that is either utf8 encoded or byte (latin1) encoded, depending on the SvUTF8 flag. But plain ASCII data is the same in both encodings, so SvPVbyte and SvPVutf8 return exactly the same data in that case. Checking str_is_nonascii is just an optimization to decide whether SvPVutf8 really needs to be called...

> Don't have time to try it out, and maybe I am overlooking something - again, this is just a quick scan of the patch really. However, the only way to succeed, IMHO, is to get the idea of detecting or guessing encoding from perl scalars - the UTF8 flag _never_ indicates that the string data is utf-8 encoded.

For the char* value retrieved by an SvPV() call, the UTF8 flag really does indicate whether that char* value is utf8 encoded or not. But you are right that it does not tell whether a perl scalar, as accessed by pure perl functions, is utf8 encoded or is a native perl string. All such guessing is the wrong way. The driver should get either a binary scalar or a string scalar.
From: Marc Lehmann <schmorp [...] schmorp.de>
To: Pali via RT <bug-DBD-mysql [...] rt.cpan.org>
Date: Mon, 31 Oct 2016 00:20:32 +0100
Subject: Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
On Sun, Oct 30, 2016 at 06:26:38PM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > wrong here, and the message is wrong as well (scalars have no encoding
> > information on the Perl level).
>
> UTF8 flag tells if internal representation of PV in scalar is stored in utf8 or not. I was thinking that "\xc3\xbc" with utf8 flag is not mean to be binary anymore as it is internally stored as utf8. If you produce binary data which have internal representation in utf8 then I think there is some problem...

The problem is mysql not correctly interpreting that flag, and your patch doesn't make it better, because it fails (similarly to the original DBD::mysql) to implement the flag as defined by perl itself. It might be a problem (I don't think it is), but that's how perl currently works, and as long as DBD::mysql doesn't handle it as intended, it will be buggy.

> > The patch thus requires the same workarounds needed for utf-8 for binary
> > data now - that's the "wrong in the other direction".
>
> I will think about it... But pack/unpack/vec/... functions works also on string "\xc3\xbc" with utf8 flag same as on "\xfc" without utf8 flag...

The "but" is weird, because your patch doesn't do that, unlike pack/unpack (at least they got it right after I fixed them). The key here is to understand that your patch doesn't work on these two strings the same way, even though it should.

> > I can't see whether this is intended or not - calling str_is_nonascii on an
> > utf-8 encoded scalar doesn't seem to make much sense to me (binary data is
> > 8 bit wide, not 7 bit). On the other hand, this seems to be in the patch
> > multiple times.
>
> This is just optimization. SvPV returns data buffer in utf8 encoded or byte (latin1) encoded based on SvUTF8 flag. But plain ASCII data are same in both those encodings, so both functions SvPVbyte and SvPVutf8 returns exactly same data in that case. Checking str_is_nonascii is just optimization if SvPVutf8 is really needed to call...

Are you really telling me that issuing a warning is some kind of optimisation? Because that's what the patch does after testing str_is_nonascii. That doesn't look like an optimisation to me, in fact, it is a bug :)

> > Don't have time to try it out, and maybe I am overlooking something -
> > again, this is just a quick scan of the patch really. However, the only
> > way to succeed, IMHO, is to get the idea of detecting or guessing encoding
> > from perl scalars - the UTF8 flag _never_ indicates that the string data
> > is utf-8 encoded.
>
> For char* value retrieved by SvPV() call, UTF8 flag really indicates if that char* value is utf8 encoded or not.

Unfortunately no - the UTF8 flag merely indicates how the perl codepoints are stored, it doesn't say anything about whether the char * is utf8 encoded or not. In general, utf-8 encoded SVs do _not_ have the UTF8 flag set (but they might).

> But you are right that it does not tell if perl scalar accessed by pure perl functions are utf8 encoded or are nativelly in perl.

Maybe you mean the right thing, but the patch is wrong and your explanations are as well.

The UTF8 flag business in perl is really messy, and I wish it wasn't called "UTF8", but it really doesn't tell you anything about character encoding or whether the scalar is text or binary, it only tells you how the codepoints are stored (namely either as plain octets or in a format similar to utf-8 encoding, without being utf-8).

> All such guessing is wrong way. Driver should get either binary scalar or string scalar.

Exactly, the driver should handle binary and text correctly - your patch seems to go a long way towards handling text correctly. It would just be nice if it wouldn't break the binary case even more :)

Greetings,
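What "handling binary and text correctly" means can be sketched in pure Perl. This is an illustration under assumptions, not the driver's code: octets_for_binary and octets_for_text are hypothetical names, and utf8::downgrade and utf8::encode are used here as Perl-level analogues of the XS macros SvPVbyte and SvPVutf8 mentioned in the thread. The point is that the result depends only on whether the column is binary or text, never on the scalar's internal flag.

```perl
use strict;
use warnings;

# Binary parameter: the characters themselves are the octets.
sub octets_for_binary {
    my ($value) = @_;                       # work on a copy
    utf8::downgrade($value, 1)
        or die "binary data contains a character above 255\n";
    return $value;
}

# Text parameter on a UTF-8 connection: encode the characters as UTF-8.
sub octets_for_text {
    my ($value) = @_;                       # work on a copy
    utf8::encode($value);
    return $value;
}

my $plain   = "\xfc";                       # utf8 flag off
my $flagged = "\xfc";
utf8::upgrade($flagged);                    # utf8 flag on, same string

for my $value ($plain, $flagged) {
    printf "binary=%s text=%s\n",
        unpack("H*", octets_for_binary($value)),
        unpack("H*", octets_for_text($value));
}
```

Both loop iterations produce the same pair of hex dumps, which is the behaviour asked for above: two scalars that are equal in Perl are sent as equal octets.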
On Sun Oct 30 19:20:48 2016, schmorp@schmorp.de wrote:
> On Sun, Oct 30, 2016 at 06:26:38PM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > > wrong here, and the message is wrong as well (scalars have no encoding
> > > information on the Perl level).
> >
> > UTF8 flag tells if internal representation of PV in scalar is stored in utf8 or not. I was thinking that "\xc3\xbc" with utf8 flag is not mean to be binary anymore as it is internally stored as utf8. If you produce binary data which have internal representation in utf8 then I think there is some problem...
>
> The problem is mysql not correctly interpreting that flag, and your patch doesn't make it better because it fails (similarly to the original DBD::mysql) to implement the flag as defined by perl itself. It might be a problem (I don't think it is), but that's how perl currently works, and as long as DBD::mysql doesn't handle it as intended, it will be buggy.
>
> > > The patch thus requires the same workarounds needed for utf-8 for binary
> > > data now - that's the "wrong in the other direction".
> >
> > I will think about it... But pack/unpack/vec/... functions works also on string "\xc3\xbc" with utf8 flag same as on "\xfc" without utf8 flag...
>
> The "but" is weird, because your patch doesn't do that, unlike pack/unpack (at least they got it right after I fixed them). The key here is to understand that your patch doesn't work on these two strings the same way, even though it should.

Yes, the driver should work in the same way as those functions. You are right, and all those warnings are really wrong... I will try to fix the code. Thank you for the first review!

> > > I can't see whether this is intended or not - calling str_is_nonascii on an
> > > utf-8 encoded scalar doesn't seem to make much sense to me (binary data is
> > > 8 bit wide, not 7 bit). On the other hand, this seems to be in the patch
> > > multiple times.
> >
> > This is just optimization. SvPV returns data buffer in utf8 encoded or byte (latin1) encoded based on SvUTF8 flag. But plain ASCII data are same in both those encodings, so both functions SvPVbyte and SvPVutf8 returns exactly same data in that case. Checking str_is_nonascii is just optimization if SvPVutf8 is really needed to call...
>
> Are you really telling me that issuing a warning is some kind of optimisation? Because that's what the patch does after testing str_is_nonascii. That doesn't look like an optimisation to me, in fact, it is a bug :)

With that description I meant this code pattern:

   valbuf= SvPV(ph->value, vallen);
   if (enable_utf8 && !is_binary && !SvUTF8(ph->value) && str_is_nonascii(valbuf, vallen)) {
       SV *tmp = sv_2mortal(newSVpvn(valbuf, vallen));
       valbuf = SvPVutf8(tmp, vallen);
   }

About the warning, yes... the code is wrong.

> > > Don't have time to try it out, and maybe I am overlooking something -
> > > again, this is just a quick scan of the patch really. However, the only
> > > way to succeed, IMHO, is to get the idea of detecting or guessing encoding
> > > from perl scalars - the UTF8 flag _never_ indicates that the string data
> > > is utf-8 encoded.
> >
> > For char* value retrieved by SvPV() call, UTF8 flag really indicates if that char* value is utf8 encoded or not.
>
> Unfortunately no - the UTF8 flag merely indicates how the perl codepoints are stored, it doesn't say anything about whether the char * is utf8 encoded or not. In general, utf-8 encoded SVs do _not_ have the UTF8 flag set (but they might).

When does a utf8 encoded SV not have the UTF8 flag set? Do you have an example? I really thought that the UTF8 status flag indicates that the char* returned by SvPV() is utf8 encoded.

Also, perlapi says:

   SvUTF8  Returns a U32 value indicating whether the SV contains UTF-8 encoded data. Call this after SvPV() in case any call to string overloading updates the internal flag.

I understood this to mean that the UTF8 status flag indicates whether the SvPV() buffer is utf8 encoded or not.

> > But you are right that it does not tell if perl scalar accessed by pure perl functions are utf8 encoded or are nativelly in perl.
>
> Maybe you mean the right thing, but the patch is wrong and your explanations are as well.
>
> The UTF8 flag business in perl is really messy, and I wish it wasn't called "UTF8", but it really doesn't tell you anything about character encoding or whether the scalar is text or binary, it only tells you how the codepoints are stored (namely either as plain octets or in a format similar to utf-8 encoding, without being utf-8).
>
> > All such guessing is wrong way. Driver should get either binary scalar or string scalar.
>
> Exactly, the driver should handle binary and text correctly - your patch seems to go a long way towards handling text correctly. It would just be nice if it wouldn't break the binary case even more :)
>
> Greetings,
From: Marc Lehmann <schmorp [...] schmorp.de>
To: Pali via RT <bug-DBD-mysql [...] rt.cpan.org>
Date: Fri, 4 Nov 2016 10:11:21 +0100
Subject: Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Sorry for the delay, I am quite busy.

On Sun, Oct 30, 2016 at 07:45:44PM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > Are you really telling me that issuing a warning is some kind of
> > optimisation? Because that's what the patch does after testing
> > str_is_nonascii. That doesn't look like an optimisation to me, in fact, it
> > is a bug :)
>
> With that description I mean code pattern:

Somewhat off-topic: most modules simply use SvPVutf8/SvPVbyte, without making a copy, so the optimisation should not normally be necessary. This normally also works, as perl itself makes a temporary copy in those cases where the scalar is not mutable, and presumably knows better, so the optimisation is probably a deoptimisation in practice, as perl does not have to scan the string in general.

It is, however, correct, so you might stay with this approach if you have a reason to do it differently than other parts of perl.

> > not. In general, utf-8 encoded SVs do _not_ have the UTF8 flag set (but
> > they might).
>
> When utf8 encoded SV do not have the UTF8 flag set? Do you have example? I really thought that UTF8 status flag indicate that char* returned by SvPV() is utf8 encoded.

You see, that's the problem with the flag - it simply doesn't mean anything like "the scalar is utf-8 encoded". First of all, perl's "UTF8" encoding isn't the same as unicode's utf-8 encoding, and second, it really only is a way of representing code points > 255 in a multibyte way.

This scalar is utf-8 encoded, as a matter of fact. It is also binary data, as utf-8 data is always binary:

   my $sv = "\xc3\xbc";

But it might or might not have the utf8 flag set (this depends on the perl version and other factors). Likewise, this scalar is utf-8 encoded:

   utf8::encode $sv;

But it does not have the utf8 flag set.

That's why it is so dangerous to use "utf8 encoded" to talk about these things, as it's never clear whether the actual data is meant or perl's utf-8 like internal encoding.

In my experience, it is much safer to just say upgraded or downgraded, as then it's much harder to subconsciously fall into this trap.

> Also in perlapi is written:
>
> SvUTF8 Returns a U32 value indicating whether the SV contains UTF-8 encoded data. Call this after SvPV() in case any call to string overloading updates the internal flag.
>
> Which I understood that UTF8 status flag indicates if SvPV() buffer is utf8 encoded or not.

Yeah, it's not. It's really a horrible, horrible mess. It means that the character codes inside the scalar use perl's extended multibyte encoding, confusingly called utf8, but it doesn't mean the SV contains utf-8 encoded data AT ALL. And the best thing is, you know this, but let yourself get confused by the bad documentation.

In case I am not clear enough (it's hard to be clear with all this confusing documentation), a string with character code 200 ("chr 200") can have this flag set or not, but in no case is *the scalar* utf-8 encoded. It is just that if the utf-8 flag is set, the character codes use an encoding very similar to utf-8 (for example, chr 0x200000 results in invalid utf-8 in memory, but is representable in perl's encoding).

So basically, what I am saying is that it isn't useful to talk about these utf8 flags in perl as if they indicated utf-8 encoding of the actual data in some way. Even people who know this regularly confuse themselves, and in my experience, you get bugs this way.

Otherwise, it's great to hear that you clearly know your business around utf-8 and the patch is going to be fixed.

Now the big question is how to proceed in general, as by all appearances, DBD::mysql is unmaintained and the maintainers no longer respond to mail.
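The three cases mentioned above (a utf8::encode'd scalar, chr 200, and chr 0x200000) can be inspected with a short script. This is a sketch for illustration only; the variable names are invented, and the `no warnings 'utf8'` block is a precaution in case a given perl version warns about the non-Unicode code point.

```perl
use strict;
use warnings;

# Real UTF-8 *data*: after utf8::encode the scalar holds two characters
# (the octets C3 BC) and the internal utf8 flag is off.
my $encoded = "\xfc";
utf8::encode($encoded);
printf "encoded: length=%d flag=%d\n",
    length($encoded), utf8::is_utf8($encoded) ? 1 : 0;

# chr 200 is the same one-character string with or without the flag.
my $c200    = chr(200);
my $c200_up = $c200;
utf8::upgrade($c200_up);
printf "chr 200: equal=%d flags=%d/%d\n",
    $c200 eq $c200_up ? 1 : 0,
    utf8::is_utf8($c200)    ? 1 : 0,
    utf8::is_utf8($c200_up) ? 1 : 0;

# A code point beyond U+10FFFF is a valid perl character, but its
# in-memory form is not valid UTF-8 in the Unicode sense.
my $big = do { no warnings 'utf8'; chr(0x200000) };
printf "chr 0x200000: length=%d flag=%d\n",
    length($big), utf8::is_utf8($big) ? 1 : 0;
```

Talking about "upgraded" and "downgraded" scalars, as suggested above, matches what utf8::upgrade and utf8::downgrade actually change: the storage format, not the string.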
On Fri Nov 04 05:11:35 2016, schmorp@schmorp.de wrote:
> Sorry for the delay, I am quite busy.
>
> On Sun, Oct 30, 2016 at 07:45:44PM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > > Are you really telling me that issuing a warning is some kind of
> > > optimisation? Because that's what the patch does after testing
> > > str_is_nonascii. That doesn't look like an optimisation to me, in fact, it
> > > is a bug :)
> >
> > With that description I mean code pattern:
>
> Somewhat off-topic: most modules simply use SvPVutf8/SvPVbyte, without making a copy, so the optimisation should not normally be necessary.

It is not only an optimisation; it is also there because SvPVbyte() croaks on "wide" characters. I do not want to introduce croaks, so DBD::mysql shows a warning instead. If a "wide" character cannot be downgraded to Latin1, then its UTF-8 representation is used. print behaves in exactly the same way when it is passed a wide character without a :utf8 layer.

> This normally also works, as perl itself makes a temporary copy in those cases where the scalar is not mutable, and presumably knows better, so the optimisation is probably a deoptimisation in practice, as perl does not have to scan the string in general.
>
> It is, however, correct, so you might stay with this approach if you have a reason to do it differently than other parts of perl.

I think that not crashing existing code is a really good reason.

> > > not. In general, utf-8 encoded SVs do _not_ have the UTF8 flag set (but
> > > they might).
> >
> > When utf8 encoded SV do not have the UTF8 flag set? Do you have example? I really thought that UTF8 status flag indicate that char* returned by SvPV() is utf8 encoded.
>
> You see, that's the problem with the flag - it simply doesn't mean anything like "the scalar is utf-8 encoded". First of all, perl's "UTF8" encoding isn't the same as unicode's utf-8 encoding, and second, it really only is a way of representing code points > 255 in a multibyte way.
>
> This scalar is utf-8 encoded, as matter of fact. It is also binary data, as utf-8 data is always binary:
>
> my $sv = "\xc3\xbc";
>
> But it might or might not have the utf8 flag set (this depends on the perl version and other factors). Likewise, this scalar is utf-8 encoded:
>
> utf8::encode $sv;
>
> But it does not have the utf8 flag set.
>
> That's why it is so dangerous to use "utf8 encoded" to talk about these things, as it's never clear whether the actual data is meant or perls utf-8 like internal encoding.
>
> In my experience, it is much safer to just say upgraded or downgraded, as then it's much harder to subconsciously fall into this trap.

Now I understand what you mean by your definition of "utf8 encoded". Basically, a string scalar in perl contains a sequence of numbers, where each number represents exactly one character. And perl has two different internal representations of strings (latin1 and extended utf8, or ebcdic and the special utfebcdic), and pure perl code does not see any difference between them. With "utf8 encoded" you mean that the "numbers" represent a utf8 sequence of octets, right?

I used the term "utf8 encoded" for the case when the macro SvPV() returns a C char* which is "utf8 encoded" (not UTF-8, but perl's extended utf8). This is different! And if SvUTF8() returns true, then the previous SvPV() call returned a C char* which is "utf8 encoded" - the char* contains a string in perl's extended utf8.

SvUTF8 is a sufficient condition, but not a necessary one. As you pointed out, utf8::encode($sv) unsets the SvUTF8 flag, but SvPV() still returns a char* in perl's extended utf8 encoding.

> > Also in perlapi is written:
> >
> > SvUTF8 Returns a U32 value indicating whether the SV contains UTF-8 encoded data. Call this after SvPV() in case any call to string overloading updates the internal flag.
> >
> > Which I understood that UTF8 status flag indicates if SvPV() buffer is utf8 encoded or not.
>
> Yeah, it's not. It's really a horrible, horrible mess. It means that the character codes inside the scalar use perls extended multibyte encoding, confusingly called utf8, but it doesn't mean the SV contains utf-8 encoded data AT ALL. And the best thing is, you know this, but let yourself get confused by the bad documentation.
>
> In case I am not clear enough (it's hard to be clear with all these confusing documentation), a string with character code 200 ("chr 200") can have this flag set or not, but in no case is *the scalar* utf-8 encoded. Just that if the utf-8 flag is set, it means the character codes use an encoding very similar to utf-8 (for example, chr 0x200000 results in invalid utf-8 in memory, but is representable in perls encoding).
>
> So basically, what I am saying is that it isn't useful to talk about these utf8 flags in perl as if they indicated utf-8 encoding of the actual data in some way. Even people who know this regularly confuse themselves, and in my experience, you get bugs this way.
>
> Otherwise, it's great to hear that you clearly know your business around utf-8 and the patch is going to be fixed.

As perl scalars contain a sequence of numbers, we can talk about "wide characters" and wide strings (a wide character is one > 0xFF). I hope this is not confusing. And for a C char* we can talk about (perl's extended) utf8 encoding. SvUTF8 can be used only in the context of that char* data (not in a pure perl context).

Now I'm waiting for the new DBD::mysql release, because it changes some code around parameter parsing (which causes conflicts with my patch), and after that I will rebase and publish a new version of the utf8 patches...

> Now the big question is how to proceed in general, as by all appearances, DBD::mysql is unmaintained and the maintainers do no longer respond to mail.

DBD::mysql is still maintained. New versions are released periodically, see: https://metacpan.org/pod/DBD::mysql

The latest version is from Oct 20, 2016. Security fixes (like the one for CVE-2016-1246) are also delivered... I do not see any problem; the maintainers respond to email and also to pull requests on github.

PS: you do not need to CC me in this RT. I'm automatically CCed by RT, so your explicit CC just means that I get your emails twice :-)
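The fallback described in this message (use the byte form when possible, otherwise warn and use the UTF-8 representation, as print does without a :utf8 layer) might look like this in pure Perl. This is only a sketch of the described behaviour, not the code from the pull request: the function name octets_or_utf8 and the warning text are invented here, and utf8::downgrade with a true second argument is used as a non-croaking analogue of SvPVbyte.

```perl
use strict;
use warnings;

# Return the octets to send for a value that should be byte data.
# Falls back to the UTF-8 form, with a warning, instead of croaking.
sub octets_or_utf8 {
    my ($value) = @_;                          # work on a copy
    return $value if utf8::downgrade($value, 1);
    warn "Wide character in parameter, sending its UTF-8 form\n";
    utf8::encode($value);
    return $value;
}

# Collect warnings so the example can report how many were raised.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $narrow = unpack "H*", octets_or_utf8("\xfc");      # fits in one octet
my $wide   = unpack "H*", octets_or_utf8("\x{263a}");  # needs the fallback

print "narrow=$narrow wide=$wide warnings=", scalar(@warnings), "\n";
```

The design trade-off is the one argued in the thread: croaking is stricter, while warning and falling back keeps existing callers running.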
Pull request is updated: https://github.com/perl5-dbi/DBD-mysql/pull/67 Now it should handle wide characters correctly. Marc Lehmann, can you look at it?
UTF-8 and Unicode fixes are now in DBD::mysql devel version 4.041_01. Please test.
Reopening, fix was reverted in 4.043.