This queue is for tickets about the URI CPAN distribution.

Report information
The Basics
Id: 86064
Status: open
Priority: 0/
Queue: URI

People
Owner: Nobody in particular
Requestors: gwilliams [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: 1.60



The URI->as_iri method seems to produce both character strings and byte sequences depending on the input of punycode URIs. This makes dealing with the output difficult when trying to sensibly combine it with other strings. It seems to me that the difference depends on whether the decoded punycode value only contains codepoints that can be represented in latin-1.

The attached test script shows the decoding of two punycode URIs:

http://www.hestebedgård.dk/
http://✪df.ws/

Using Devel::Peek, it can be seen that "hestebedgård" is represented as a byte sequence with U+00e5 being represented as the single byte 0xE5 with the SV lacking the UTF8 flag. On the other hand, "✪df" is represented as a UTF8-flagged character string with the first character correctly encoded as \x{272a}.

I believe the attached patch solves this problem, but I'm not sure if it might break any other cases, or if there's a better way of forcing the decoded unicode string to have the UTF8 flag.
Subject: iri_encoding.diff
--- URI/_punycode.pm.orig 2013-06-09 10:14:14.000000000 +0400
+++ URI/_punycode.pm.new 2013-06-11 11:49:19.000000000 +0400
@@ -86,7 +86,11 @@
         warn join " ", map sprintf('%04x', $_), @output if $DEBUG;
         $i++;
     }
-    return join '', map chr, @output;
+    my $uri = join '', map chr, @output;
+    use Encode;
+    my $octets = encode('UTF-8', $uri, Encode::FB_CROAK);
+    $uri = decode('UTF-8', $octets, Encode::FB_CROAK);
+    return $uri;
 }
 
 sub encode_punycode {
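
For comparison, here is a minimal sketch (not part of the attached patch) of what an encode/decode round trip does to a string whose code points are all below 0x100, next to a plain utf8::upgrade. The variable names are illustrative only, and LEAVE_SRC is added here so that the input string survives the call; the patch itself does not need it because it overwrites $uri afterwards.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A string built one chr() per code point, all of them below 0x100
# ("g", U+00E5, "r", "d").
my $decoded = join '', map chr, (0x67, 0xE5, 0x72, 0x64);

# With a CHECK argument Encode may consume its input, so keep the source intact.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Round trip: characters -> UTF-8 octets -> characters.
my $octets    = encode('UTF-8', $decoded, $check);
my $roundtrip = decode('UTF-8', $octets,  $check);

# Alternative: change only the internal representation of a copy.
my $upgraded = $decoded;
utf8::upgrade($upgraded);

printf "original:  utf8 flag %s\n", utf8::is_utf8($decoded)   ? 'on' : 'off';
printf "roundtrip: utf8 flag %s\n", utf8::is_utf8($roundtrip) ? 'on' : 'off';
printf "upgraded:  utf8 flag %s\n", utf8::is_utf8($upgraded)  ? 'on' : 'off';
print  "all three hold the same characters\n"
    if $roundtrip eq $decoded && $upgraded eq $decoded;
```

Both approaches should leave the characters unchanged; they differ only in how much work is done to flip the internal flag.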
Subject: iri_encoding.pl
#!/usr/bin/perl

use strict;
use warnings;
use Devel::Peek;
use URI;

my $latin1 = URI->new('http://www.xn--hestebedgrd-58a.dk/')->as_iri;
my $utf8 = URI->new('http://xn--df-oiy.ws/')->as_iri;

Dump($latin1);
Dump($utf8);
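
The difference the script reports can be reproduced without the URI module. The sketch below is an illustration using only core Perl, under the assumption that the decoder builds its result with `join '', map chr, ...` as the original line in the patch suggests. The code points are taken from the two hostnames in the report; the variable names are made up for the example.

```perl
use strict;
use warnings;

# "g", U+00E5, "r", "d": every code point fits in latin-1.
my @latin1_range = (0x67, 0xE5, 0x72, 0x64);
# U+272A, "d", "f": the first code point is above 0xFF.
my @wide_range   = (0x272A, 0x64, 0x66);

my $gaard = join '', map chr, @latin1_range;
my $star  = join '', map chr, @wide_range;

for my $pair ([gaard => $gaard], [star => $star]) {
    my ($name, $str) = @$pair;
    printf "%-5s chars=%d utf8_flag=%s\n",
        $name, length($str), utf8::is_utf8($str) ? 'on' : 'off';
}
```

The first string ends up without the UTF8 flag and the second with it, even though both are built the same way. That matches the byte-sequence versus character-string split seen in the Devel::Peek output.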
On Tue Jun 11 16:16:19 2013, GWILLIAMS wrote:
> The URI->as_iri method seems to produce both character strings and
> byte sequences depending on the input of punycode URIs. This makes
> dealing with the output difficult when trying to sensibly combine
> it with other strings.

There should not really be a semantic difference between utf8::upgraded and utf8::downgraded strings. If you have problems combining the result with other strings, there is something else that's not quite right. The simplest way to upgrade is to just call:

  utf8::upgrade($iri);

I don't really think $url->as_iri should change.  At least I would like to see a stronger argument before we do.
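
To illustrate the point about semantics, here is a small sketch using only core Perl (the variable names are hypothetical). It compares a downgraded string with an upgraded copy at the Perl level.

```perl
use strict;
use warnings;

my $down = "hestebedg\x{e5}rd";   # all code points < 0x100, no UTF8 flag
my $up   = $down;
utf8::upgrade($up);               # same characters, UTF-8 internal storage

print "equal\n"       if $up eq $down;
print "same length\n" if length($up) == length($down);
printf "flags: down=%s up=%s\n",
    (utf8::is_utf8($down) ? 'on' : 'off'),
    (utf8::is_utf8($up)   ? 'on' : 'off');

# Concatenation with a wide-character string upgrades as needed.
my $joined = $down . "\x{272a}";
print "joined length: ", length($joined), "\n";
```

At this level the two strings are interchangeable. Differences only show up in code that inspects the internal buffer directly.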
On Wed Jun 12 15:25:30 2013, GAAS wrote:
> There should not really be a semantic difference between
> utf8::upgraded or utf8::downgraded strings. If you have problems
> combining the result with other strings there is something else
> that's not quite right. The simplest way to upgrade is to just call:
>
>   utf8::upgrade($iri);
>
> I don't really think $url->as_iri should change. At least I would
> like to see a stronger argument before we do.

That's a fair point. The problem may be more complex than I thought. I believe the problem I'm facing now (related to a bug-report I received for RDF::Trine) is that the string ends up being passed to a system library via XS that expects UTF8 encoded data, and has trouble with the latin-1.

Moreover, the punycode spec as well as the documentation for as_iri talk explicitly about unicode strings, so I'm not sure why the appropriate place to make the utf8::upgrade call wouldn't be in the as_iri implementation.

Thoughts?

thanks,
.greg
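
One way to sidestep the internal representation at such a boundary is to encode explicitly before handing the data over. The sketch below is an illustration with core modules only; it does not reflect how RDF::Trine or any particular XS wrapper actually behaves, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $down = "g\x{e5}rd";     # downgraded: U+00E5 is stored as the byte 0xE5
my $up   = $down;
utf8::upgrade($up);         # upgraded: U+00E5 is stored as 0xC3 0xA5

# Explicit encoding yields the same UTF-8 octets for both strings.
my $bytes_from_down = encode('UTF-8', $down);
my $bytes_from_up   = encode('UTF-8', $up);

print join(' ', map { sprintf('%02x', ord($_)) } split //, $bytes_from_down), "\n";
print "identical octets\n" if $bytes_from_down eq $bytes_from_up;
```

The result is a byte string with a defined encoding, so a consumer that expects UTF-8 receives the same input whether or not the original string carried the UTF8 flag.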
Subject: [rt.cpan.org #86064] utf8::upgraded input produce utf8::downgraded output
Date: Thu, 25 Dec 2014 00:51:54 +0100
From: Jonas Smedegaard <dr [...] jones.dk>

Hi Gisle,

Comparing this bug report with https://github.com/kasei/perl-iri/issues/2 (understanding far better now than when I followed along a year ago), it occurs to me that this conversation does not make clear that the URI module downgrades already utf8::upgraded strings.

Perhaps that is the "stronger argument" that you sought back then?

This demonstrates the downgrade (based on the above IRI conversation):

  use URI;
  use Devel::Peek;

  my $value = "http://www.hestebedg\x{e5}rd.dk/#frag";
  utf8::upgrade($value);
  print STDERR "Raw value: ";
  Dump($value);

  my $uri = URI->new($value);
  print STDERR "URI as_iri: ";
  Dump($uri->as_iri);

Regards,

 - Jonas

P.S. "Hestebedgård" is a farm turned into a museum, located on the island of Orø where I live. I hit bugs in RDF::Trine when challenging myself to learn RDF by semantically modelling public facilities on my island - leading e.g. to http://data.biks.dk/hours/ ...in case you are curious and do not grok Scandinavian language (as your name and interest in non-ASCII characters indicates).

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private


