 

This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 80503
Status: resolved
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: ambrus [...] math.bme.hu
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 3.44
Fixed in: (no value)



From: Zsbán Ambrus <ambrus [...] math.bme.hu>
To: bug-XML-Twig [...] rt.cpan.org
Date: Tue, 30 Oct 2012 22:37:29 +0100
Subject: Newlines in attribute values

Hello,

According to the specs, a newline character in an attribute value must be escaped with an entity, otherwise an XML reader will normalize it to a space, but XML::Twig's writer does not seem to know about this.

Let me tell the story in detail. I was trying to edit XML files, actually project configuration files of MS Visual Studio, with Twig. This XML had an escaped CRLF inside an attribute value, something like "foo&#13;&#10;bar". This attribute was in an element I didn't change in my editing. When I tried to use the modified XML, I got an error. It turns out that XML::Twig wrote out the attribute with the CRLF unescaped, and the XML reader in MS Visual Studio read it as a single space.

After some inquiry, perlmonks told me that the behavior of the XML reader is correct. It turns out that the XML 1.0 standard states that if a reader finds an unescaped CR, LF, CRLF, or HT in an attribute value, it must normalize it to a space. You can find a reference for this behavior at "http://stackoverflow.com/questions/260436/preserving-attribute-whitespace-in-xslt".

It turns out that the reader part of XML::Twig behaves correctly: it too reads an unescaped newline in an attribute as a space, but the writer part fails to escape newlines. This means that when you read an escaped newline from an attribute and then write it out, the value changes, so I believe this is a bug in XML::Twig.

Here's a simple example showing the bug.

$ perl -we 'use XML::Twig; my $ct= qq(<m><n p="q&#x0a;r"/><s t="u\nv"/></m>); my $tw = XML::Twig->new; $tw->parse($ct); $tw->flush; print $/;'
<m><n p="q
r"/><s t="u v"/></m>
$

For this simple example, I'm using perl v5.16.1 on amd64-linux, XML::Twig v3.41, XML::Parser v2.41, Encode v2.44, all vanilla; with libexpat 2.0.1-7+squeeze1 from the debian package.

Ambrus

----
Configuration:

perl: 5.016001
OS: linux - x86_64-linux

required
  XML::Parser               : 2.41
Can't exec "xmlwf": No such file or directory at t/zz_dump_config.t line 34.
Use of uninitialized value $xmlwf_v in pattern match (m//) at t/zz_dump_config.t line 35.
Missing argument in sprintf at t/zz_dump_config.t line 114.
  expat                     : <no version information found>

Strongly Recommended
  Scalar::Util              : 1.25 (for improved memory management)
  Encode                    : 2.44 (for encoding conversions)

Modules providing additional features
  XML::XPathEngine          : 0.13 (to use XML::Twig::XPath)
  XML::XPath                : <not available> (to use XML::Twig::XPath if Tree::XPathEngine not available)
  LWP                       : 6.04 (for the parseurl method)
  HTML::TreeBuilder         : 5.02 (to use parse_html and parsefile_html)
  HTML::Entities::Numbered  : <not available> (to allow parsing of HTML containing named entities)
  HTML::Tidy                : 1.54 (to use parse_html and parsefile_html with the use_tidy option)
  HTML::Entities            : 3.69 (for the html_encode filter)
  Tie::IxHash               : <not available> (for the keep_atts_order option)
  Text::Wrap                : 2009.0305 (to use the "wrapped" option for pretty_print)

Modules used only by the auto tests
t/zz_dump_config.t .................. 1/1
  Test                      : 1.25_02
  Test::Pod                 : <not available>
  XML::Simple               : <not available>
  XML::Handler::YAWriter    : <not available>
  XML::SAX::Writer          : <not available>
  XML::Filter::BufferText   : <not available>
  IO::Scalar                : <not available>
  IO::CaptureOutput         : <not available>
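
The normalization rule described in this report can be sketched in plain Perl. This is only an illustration under stated assumptions: `normalize_att_literal` is a hypothetical helper, not part of XML::Twig, XML::Parser, or expat, and it models only CDATA-type attributes, numeric character references, and the five predefined entities (no DTD-declared entities).

```perl
use strict;
use warnings;

# Hypothetical helper (not an XML::Twig or XML::Parser function): a simplified
# model of how a conforming reader turns the text between the quotes of an
# attribute into the attribute's value.
sub normalize_att_literal {
    my ($literal) = @_;

    # End-of-line handling comes first: CRLF and lone CR become LF.
    $literal =~ s/\r\n?/\n/g;

    my %predefined = (amp => '&', lt => '<', gt => '>', quot => '"', apos => "'");
    my $value = '';

    # Walk the literal left to right. A character reference yields the
    # referenced character unchanged; a literal tab/LF/CR yields a space.
    while ($literal =~ /\G(?:&#x([0-9A-Fa-f]+);|&#([0-9]+);|&(\w+);|([\t\n\r])|(.))/gs) {
        if    (defined $1) { $value .= chr hex $1 }
        elsif (defined $2) { $value .= chr $2 }
        elsif (defined $3) { $value .= $predefined{$3} // "&$3;" }
        elsif (defined $4) { $value .= ' ' }
        else               { $value .= $5 }
    }
    return $value;
}

# The two attributes from the example in the report, with newlines made visible.
for my $literal ("q&#x0a;r", "u\nv") {
    (my $shown = normalize_att_literal($literal)) =~ s/\n/\\n/g;
    print "$shown\n";    # prints q\nr, then u v
}
```

The escaped form keeps its newline while the literal newline collapses to a space, which is why writing the first value back out unescaped changes it on the next read.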
On Tue Oct 30 17:37:48 2012, ambrus@math.bme.hu wrote:
> According to the specs, a newline character in an attribute value must
> be escaped with an entity otherwise an xml reader will normalize it to
> a space, but XML::Twig's writer does not seem to know about this.

Hi,

Sorry, I saw the bug report, thought about it, and... forgot to answer it.

First the workaround: if you create the twig using the keep_encoding option, then you get what you want. Be aware of the (potential) problems with keep_encoding though: all the character data you get, whether in attributes or elements, becomes unescaped and is output as is, so if you add data, you have to escape it yourself.

A better fix is not possible, because XML::Parser normally reports the data after resolving the entities, so by the time it gets to XML::Twig the numerical entity is lost. For example:

perl -MXML::Parser -E'XML::Parser->new( Handlers => { Start => sub { my( $t, $tag, %att)= @_; say $att{p}; } })->parse( q{<d p="q&#x0a;r"/>})'

outputs this:

q
r

XML::Twig with the keep_encoding option has to resort to getting the original string from XML::Parser and re-parsing it.

Does this help?

__
mirod

From: Zsbán Ambrus <ambrus [...] math.bme.hu>
To: bug-XML-Twig [...] rt.cpan.org
Date: Tue, 13 Nov 2012 20:18:05 +0100
Subject: Re: [rt.cpan.org #80503] Newlines in attribute values

On 11/13/12, MIROD via RT <bug-XML-Twig@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=80503 >
>
> On Tue Oct 30 17:37:48 2012, ambrus@math.bme.hu wrote:
>
>> According to the specs, a newline character in an attribute value must
>> be escaped with an entity otherwise an xml reader will normalize it to
>> a space, but XML::Twig's writer does not seem to know about this.
>
> A better fix is not possible, because XML::Parser normally reports the
> data after resolving the entities, so by the time it gets to XML::Twig
> the numerical entity is lost.

Hello mirod.

Thanks for your reply, but I think you may have misunderstood my report. It's true that keep_encoding could be used as a workaround, but I think a better fix _is_ possible.

Currently, when you have a newline in an attribute value, XML::Twig will output it as a literal newline.

$ perl -we 'use XML::Twig; $tw = XML::Twig->new; $tw->set_root(XML::Twig::Elt->new("m", {"n" => "p\nq"})); $tw->flush; print $/;'
<m n="p
q"/>
$

This is simply wrong, because the literal newline in the attribute value does not represent a newline; it represents a space instead. This is what the XML standard says, and this is how libexpat and other XML readers read the above output. Even XML::Twig reads this output that way, with a space in the attribute value.

In an attribute value, XML::Twig should always escape not only quotation marks and ampersands, but also newlines, because the XML syntax says they must be escaped. So if I run the above code with a hypothetical future version of XML::Twig, the output should be

<m n="p&#10;q"/>

because that's the only way to correctly represent the given attribute value in the output.

This would complicate the XML::Twig code because it means attribute values must be escaped in a different way from pcdata, but I still think such a fix is necessary.

Ambrus
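
The attribute-specific escaping requested here can be sketched in a few lines of core Perl. This is a minimal illustration, not XML::Twig's actual code or any fix that was shipped: `escape_att_value` is a hypothetical name, and the choice of decimal character references (`&#10;` rather than `&#x0a;`) simply follows the output proposed in the message above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Twig): escape an attribute value so
# that a conforming XML reader recovers exactly the same string.
sub escape_att_value {
    my ($value) = @_;
    my %named = ('&' => '&amp;', '<' => '&lt;', '"' => '&quot;');

    # Markup characters get named entities; tab, LF and CR get numeric
    # character references so they survive attribute-value normalization.
    $value =~ s{([&<"\t\n\r])}{ $named{$1} // sprintf('&#%d;', ord $1) }ge;
    return $value;
}

print '<m n="', escape_att_value("p\nq"), '"/>', "\n";    # prints <m n="p&#10;q"/>
```

Element text would still go through a separate routine, since a literal newline in character data is not turned into a space by the reader; that split is the extra complexity the message mentions.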

