
This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 13509
Status: resolved
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: ddascalescu [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: (no value)



Date: Sat, 2 Jul 2005 09:52:53 +0200 (MEST)
From: ddascalescu [...] gmail.com
To: bug-XML-Twig [...] rt.cpan.org
Subject: Forced conversion of the &gt; base entity to '>'
I was using XML::Twig 2005/06/28 09:53:02 and I saw that when parsing and printing an XML, the &gt; base entity is always replaced in the output with '>'. Other base entities such as &lt; are not affected, and this lack of symmetry makes me wonder why this happens.

I tried correcting the problem by using keep_encoding => 1, but that has the unfortunate effect of not parsing UTF-8. As far as I know, there is no way to still parse UTF-8 and *not* convert &gt; to '>'.

Please see the patch. It's a dirty hack, I did not have the time to dig in the module and understand its logic, so it might be wrong.

Here is a test case:

    #! perl -w
    use strict;
    use XML::Twig;

    my $t= XML::Twig->new();
    $t->parse( '<start>3 &lt; pi &gt; 3</start>' );
    $t->print;

And here is the patch:

    --- D:\Perl\site\lib\XML\Twig318.pm 7/2/2005 00:27:14
    +++ D:\Perl\site\lib\XML\Twig318.orig 7/2/2005 00:21:16
    @@ -6810,6816 +6810,6816 @@
     }
     }
     else
    - { $string=~ s/([&<>])/$XML::Twig::base_ent{$1}/g unless( $keep_encoding || $elt->{asis});
    + { $string=~ s/([&<])/$XML::Twig::base_ent{$1}/g unless( $keep_encoding || $elt->{asis});
     $string=~ s{\Q]]>}{]]&gt;}g;
     }
     return $output_text_filter ? $output_text_filter->( $string) : $string;

Hope this helps,
Dan Dascalescu
Date: Sat, 02 Jul 2005 10:23:41 +0200
From: Michel Rodriguez <mirod [...] xmltwig.com>
To: bug-XML-Twig [...] rt.cpan.org
Subject: Re: [cpan #13509] Forced conversion of the &gt; base entity to '>'
RT-Send-Cc:
ddascalescu@gmail.com via RT wrote:
> This message about XML-Twig was sent to you by ddascalescu@gmail.com <ddascalescu@gmail.com> via rt.cpan.org
>
> Full context and any attached attachments can be found at:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=13509 >
>
> I was using XML::Twig 2005/06/28 09:53:02 and I saw that
> when parsing and printing an XML, the &gt; base entity is
> always replaced in the output with '>'. Other base entities
> such as &lt; are not affected, and this lack of symmetry
> makes me wonder why this happens.
The '>' does NOT need to be converted to &gt;, except when it follows ']]', so that it cannot be confused with the end of a CDATA section. That rule BTW does not make sense to me, but hey, I can't really argue with the XML spec.

Note also that the parser always passes '>' (and '<' and '&') to the layer above, whether the original text had '>' or &gt;.
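The escaping rule described above can be sketched in plain Perl. This is only an illustration modelled on the two substitutions visible in the patch, not XML::Twig's actual output code; the helper name `escape_text` and the local `%base_ent` table are assumptions made for the example.

```perl
use strict;
use warnings;

# Local stand-in for the module's entity table (assumption for this sketch).
my %base_ent = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );

# Hypothetical helper: escape only what XML requires in character data.
sub escape_text {
    my ($string) = @_;
    $string =~ s/([&<])/$base_ent{$1}/g;    # '&' and '<' are always escaped
    $string =~ s/\]\]>/]]&gt;/g;            # '>' only when it would form ']]>'
    return $string;
}

print escape_text('3 < pi > 3'),  "\n";
print escape_text('a ]]> b & c'), "\n";
```

A lone '>' passes through untouched, which is why the test case prints `3 &lt; pi > 3`.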
> I tried correcting the problem by using keep_encoding => 1,
> but that has the unfortunate effect of not parsing UTF-8.

??? Not parsing UTF-8? I don't understand here.

> Here is a test case:
>
> #! perl -w
> use strict;
> use XML::Twig;
>
> my $t= XML::Twig->new();
> $t->parse( '<start>3 &lt; pi &gt; 3</start>' );
> $t->print;
The output is valid XML, and for any XML parser

    <start>3 &lt; pi &gt; 3</start>

is equivalent to

    <start>3 &lt; pi > 3</start>

There are a few things I can do:

- leave things as is and add a FAQ about it (and something in the main docs for the module)
- provide an option that will also escape > into &gt;

I do not think that changing the default setting is possible at this stage; it would risk breaking backward compatibility. Plus, not escaping the > has the side effect that when people have to process the XML produced by XML::Twig, they need to use a proper XML parser, and not home-brewed, regexp-based pseudo-parsers. This seems to have helped quite a few users so far.

So, do you really need to escape the '>'?

--
Michel Rodriguez
Perl & XML
xmltwig.com
Date: Wed, 6 Jul 2005 01:49:58 +0200 (MEST)
From: ddascalescu [...] gmail.com
To: bug-XML-Twig [...] rt.cpan.org
Subject: Re: [cpan #13509] Forced conversion of the &gt; base entity to '>'
RT-Send-Cc:
> > I tried correcting the problem by using keep_encoding => 1,
> > but that has the unfortunate effect of not parsing UTF-8.
>
> ??? Not parsing UTF-8? I don't understand here.

Sorry for not being very clear on this. This code should illustrate my point (change keep_encoding to 0 and run again).

    my $t= XML::Twig->new(
        keep_encoding => 1
    );
    $t->parse( "<start>\xE3\x80\x82</start>" );
    print length $t->root->text == 1? "UTF-8 parsed in" : "Raw bytes parsed in";

> So, do you really need to escape the '>' ?

My scripts do various transformations on the XML and I need the output XML to be as close to the input as possible, so if the input contained '&gt;', so should the output. On the other hand, if the input contained '>' the output should not be '&gt;'.

So I'm actually looking for a way to tell XML::Twig not to transform the input XML more than necessary. That would include not converting single quotes around attribute values to double quotes, not touching spaces etc. But the "&gt;" is the most important issue.

Hope this helps,
Dan Dascalescu
Date: Wed, 06 Jul 2005 08:49:58 +0200
From: Michel Rodriguez <mirod [...] xmltwig.com>
To: bug-XML-Twig [...] rt.cpan.org
Subject: Re: [cpan #13509] Forced conversion of the &gt; base entity to '>'
RT-Send-Cc:
ddascalescu@gmail.com via RT wrote:
>>> I tried correcting the problem by using keep_encoding => 1,
>>> but that has the unfortunate effect of not parsing UTF-8.
>>
>> ??? Not parsing UTF-8? I don't understand here.
>
> Sorry for not being very clear on this. This code should illustrate
> my point (change keep_encoding to 0 and run again).
>
> my $t= XML::Twig->new(
>     keep_encoding => 1
> );
> $t->parse( "<start>\xE3\x80\x82</start>" );
> print length $t->root->text == 1? "UTF-8 parsed in" : "Raw bytes parsed in";
Indeed, the utf8 flag is not set on the data. This is because XML::Parser does not set it on the string returned by the original_string method. Usually keep_encoding is used to process documents in non-utf8 encodings, so no one seems to have noticed that.

You will have to set the flag yourself, by using an input filter:

    input_filter => sub { _utf8_on( $_[0]); return $_[0] },

See attached example.
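To see what the flag changes independently of any XML module, here is a minimal sketch using only the core Encode module. The `describe` helper is hypothetical and exists only for this illustration. It contrasts the raw bytes with a properly decoded copy and with a copy that merely has the flag switched on, as the input filter does.

```perl
use strict;
use warnings;
use Encode qw( decode is_utf8 _utf8_on );

# Hypothetical helper: report the length and the state of the utf8 flag.
sub describe {
    my ($string) = @_;
    return sprintf '%d char(s), %s', length($string),
        is_utf8($string) ? 'is utf8' : 'is NOT utf8';
}

my $bytes = "\xE3\x80\x82";    # the three UTF-8 bytes of U+3002

# Proper decoding: checks the bytes and returns a character string.
my $decoded = decode( 'UTF-8', $bytes );

# What the input filter does: switch the flag on, with no validation.
my $flagged = $bytes;
_utf8_on($flagged);

print 'raw bytes: ', describe($bytes),   "\n";
print 'decoded:   ', describe($decoded), "\n";
print 'flagged:   ', describe($flagged), "\n";
```

The raw bytes report three characters without the flag, while both other strings report a single flagged character. Note that `_utf8_on` is documented as internal to Encode and performs no validation, so it is only safe when the data is already known to be valid UTF-8.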
>> So, do you really need to escape the '>' ?
>
> My scripts do various transformations on the XML and I need the
> output XML to be as close to the input as possible, so if the
> input contained '&gt;', so should the output. On the other hand,
> if the input contained '>' the output should not be '&gt;'.
>
> So I'm actually looking for a way to tell XML::Twig not to
> transform the input XML more than necessary. That would include
> not converting single quotes around attribute values to double
> quotes, not touching spaces etc. But the "&gt;" is the most
> important issue.
That would be the keep_encoding mode. Otherwise you will have no way to distinguish between > and &gt;.

The fact that this is important to you still shows that there is a problem with your overall XML processing chain. All XML tools report > and &gt; as the same thing, unless you jump through hoops like XML::Twig does.

Does the fix I suggest fix your problem?

--
Michel Rodriguez
Perl & XML
xmltwig.com
Download parse_utf8
#!/usr/bin/perl -w
use strict;
use warnings;
use XML::Twig;
use Encode qw( _utf8_on is_utf8);

# keep_encoding leaves the text as raw bytes; the input filter turns the utf8 flag on
my $t= XML::Twig->new( keep_encoding => 1,
                       input_filter => sub { _utf8_on( $_[0]); return $_[0] },
                     );
$t->parse( "<start>\xE3\x80\x82</start>" );
test_utf( $t->root->text, '$t->root->text');
test_utf( $t->sprint, '$t->sprint');
test_utf( $t->root->sprint, '$t->root->sprint');

# same document through XML::Parser directly, to compare the two string methods
my $p= XML::Parser->new( Handlers => { Char => sub { test_utf( $_[0]->recognized_string, '$_[0]->recognized_string');
                                                     test_utf( $_[0]->original_string, '$_[0]->original_string');
                                                   }
                                     }
                       );
$p->parse( "<start>\xE3\x80\x82</start>" );

# report whether the utf8 flag is set on a string
sub test_utf
  { my( $string, $message)= @_;
    print "$message: ", ( is_utf8( $string) ? "is utf8" : "is NOT utf8"), "\n";
  }
Date: Fri, 8 Jul 2005 03:40:29 +0200 (MEST)
From: ddascalescu [...] gmail.com
To: bug-XML-Twig [...] rt.cpan.org
Subject: Re: [cpan #13509] Forced conversion of the &gt; base entity to '>'
RT-Send-Cc:
Thank you for the support, Michel. I should have been more general about the encoding issue:

> Indeed, the utf8 flag is not set on the data.

In general, if you set keep_encoding to 0, the encoding of the XML will be automatically recognized and the XML will be properly parsed.

> See attached example.

The example worked for UTF-8, but I should have presented the general case. The example won't work for, e.g., UTF-16:

    my $t= XML::Twig->new(
        keep_encoding => 1,
        input_filter => sub { _utf8_on($_[0]); return $_[0] },
    );
    my $string = '<?xml version="1.0" encoding="utf-16"?><start>text</start>';
    $string =~ s/(.)/$1\x00/g; # brutal conversion to UTF-16
    $t->parse($string);

> The fact that this is important to you still shows that there
> is a problem with your overall XML processing chain.

My applications generally take XML as input and output RTF. One tags the RTF for translation with TRADOS and the other one builds an RTF glossary out of a TMX file.

In the case of the glossary, the users need to be able to search for one of their strings in the glossary, including a string containing "&gt;": "Click File &gt;&gt; Exit to exit the application". In the case of the tagged RTF, the client wants the file back from translation without alterations to anything other than the text.

That being said, what I'm wondering is:

1) Is there a reason for XML::Twig not to simply preserve '&gt;' or '>' as they appeared in the input?

2) More importantly, is there a way to have the encoding in the XML automatically recognized (UTF-8, UTF-16 etc.) and still preserve '&gt;'s and '>'?

Thank you,
Dan Dascalescu
Date: Fri, 08 Jul 2005 06:20:59 +0200
From: Michel Rodriguez <mirod [...] xmltwig.com>
To: bug-XML-Twig [...] rt.cpan.org
Subject: Re: [cpan #13509] Forced conversion of the &gt; base entity to '>'
RT-Send-Cc:
ddascalescu@gmail.com via RT wrote:
> That being said, what I'm wondering is:
>
> 1) Is there a reason for XML::Twig not to simply preserve '&gt;'
> or '>' as they appeared in the input?

Yes. For a parser &gt; and > are the same, and they are usually reported the same way. XML::Twig has to use a non-standard method of XML::Parser to differentiate between the two.

> 2) More importantly, is there a way to have the encoding in
> the XML automatically recognized (UTF-8, UTF-16 etc.) and still
> preserve '&gt;'s and '>'

No.

Your problem comes from the interaction between XML and non-XML formats. You are asking an XML tool to do things XML tools are not designed to do. The keep_encoding option is actually slightly controversial, and seen as bordering on evil by some.

That said, I can still think of a few solutions:

- get the encoding, convert the document to UTF-8, use the keep_encoding option and the input_filter to output utf8, then convert back to the original encoding
- preparse the file and convert entities to a special tag like <my:ent name="gt"/>, process, and replace the entities on output

There are probably other ways.

--
Michel Rodriguez
Perl & XML
xmltwig.com
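The first suggestion, converting to UTF-8 and back, might look roughly like the sketch below. It uses only the core Encode module; the helper names `to_utf8_bytes` and `from_utf8_bytes` are hypothetical, the source encoding is passed in explicitly rather than detected, and the XML::Twig processing step is left as a comment. Adjusting the encoding declaration in the XML prolog and handling a byte order mark are also left out.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: raw bytes in a named encoding -> UTF-8 bytes.
sub to_utf8_bytes {
    my ( $raw, $encoding ) = @_;
    return encode( 'UTF-8', decode( $encoding, $raw ) );
}

# Hypothetical helper: UTF-8 bytes -> raw bytes in a named encoding.
sub from_utf8_bytes {
    my ( $utf8, $encoding ) = @_;
    return encode( $encoding, decode( 'UTF-8', $utf8 ) );
}

# A UTF-16LE document containing both '&gt;' and a non-ASCII character.
my $original = encode( 'UTF-16LE', "<start>2 &gt; 1 \x{3002}</start>" );

my $utf8 = to_utf8_bytes( $original, 'UTF-16LE' );

# ... process $utf8 here, e.g. with keep_encoding => 1 and the input filter ...

my $restored = from_utf8_bytes( $utf8, 'UTF-16LE' );

print $restored eq $original ? "round trip ok\n" : "round trip FAILED\n";
```

Because the transcoding never interprets markup, the literal `&gt;` survives in the UTF-8 bytes and comes back unchanged after the return trip.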
Date: Mon, 11 Jul 2005 04:39:32 +0200 (MEST)
From: ddascalescu [...] gmail.com
To: bug-XML-Twig [...] rt.cpan.org
Subject: Re: [cpan #13509] Forced conversion of the &gt; base entity to '>'
RT-Send-Cc:
> preparse the file and convert entities to a special tag
> like <my:ent name="gt"/>, process, and replace the entities
> on output

That won't work if you have entities in attributes:

    <element att="2 &gt; 1">

would become

    <element att="2 <my:ent name="gt"/> 1">

Looks like I have to mess with the encodings.

--
Dan Dascalescu
Localization Software Engineer
HighTech Passport Ltd.
Phone: (408)-453-6303 ext. 17
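The objection above is easy to reproduce with a naive pre-parse step. The sketch below is only an illustration: `entities_to_tags` is a hypothetical helper, and the regex is one plausible way to implement the suggested substitution, not code from XML::Twig.

```perl
use strict;
use warnings;

# Hypothetical pre-parse step: turn base entities into placeholder elements.
sub entities_to_tags {
    my ($xml) = @_;
    $xml =~ s{&(gt|lt|amp);}{<my:ent name="$1"/>}g;
    return $xml;
}

# Fine in element content ...
print entities_to_tags('<p>2 &gt; 1</p>'), "\n";

# ... but inside an attribute value the result is no longer well-formed.
print entities_to_tags('<element att="2 &gt; 1"/>'), "\n";
```

The second line of output has a '<' inside the attribute value, which an XML parser will reject.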

