This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 14964
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: jtalbot [...] proionta.gr
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 3.18
Fixed in: 3.22

Subject: Attributes of tags get entity-decoded (and even worse, wrongly) when parsed
Running Debian stable with Perl 5.8.4

I'm parsing this content from a string:
<a href="page.pl?id=10&sub=20">

When I print it as_HTML, I get
<a href="page.pl?id=10&sub;=20">

A semi-colon is mistakenly added after the word 'sub'. Running the Perl debugger shows that the problem is not in the printing stage, but in the parsing. I use HTML::TreeBuilder->new_from_content($string) to parse.

Here's my program:
---------------------------
#!/usr/bin/perl -w
use HTML::TreeBuilder;
my $page = '<a href="page.pl?id=10&sub=20">';
my $p = HTML::TreeBuilder->new_from_content( $page );
# [debug at this stage shows that $p contains a unicode character instead of '&sub']
print $p->as_HTML();
---------------------------

Until this is fixed, is there a way to disable entity-decoding when parsing?
I have attached a test case based on Test::More. From the comments:

# HTML::TreeBuilder invokes HTML::Entities::decode on the contents of
# HREF attributes. Some CGI-based sites use lang=en or such for
# internationalization. When this parameter is after an ampersand,
# the resulting &lang is decoded, breaking the link. "sub" is another
# popular one.

Thanks.

-- Rocco Caputo - http://poe.perl.org/
Attachment: support-html-treebuilder.perl (application/octet-stream, 662 bytes; body not shown because it is not plain text)
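
Since the attached test case is not shown, here is a small stand-alone sketch of the failure mode described in the comments above. It is not the attachment and it does not call HTML::TreeBuilder or HTML::Entities; the two-entry entity table and the helper names (decode_lenient, decode_strict, encode_named) are hypothetical, chosen only to model a decoder that accepts entity names without a trailing semicolon.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy entity table (hypothetical, two entries only). U+2282 is the
# character that the named entity "sub" stands for.
my %char_for = ( amp => '&', sub => "\x{2282}" );
my %name_for = reverse %char_for;

# Lenient decoding: the trailing semicolon is optional, so a bare
# "&sub" inside a query string is treated as an entity.
sub decode_lenient {
    my ($text) = @_;
    $text =~ s{&([A-Za-z][A-Za-z0-9]*)(;?)}
              {exists $char_for{$1} ? $char_for{$1} : "&$1$2"}ge;
    return $text;
}

# Strict decoding: only names terminated by a semicolon are decoded.
sub decode_strict {
    my ($text) = @_;
    $text =~ s{&([A-Za-z][A-Za-z0-9]*);}
              {exists $char_for{$1} ? $char_for{$1} : "&$1;"}ge;
    return $text;
}

# Re-encode the characters from the toy table as named entities,
# roughly what happens when the attribute is printed again.
sub encode_named {
    my ($text) = @_;
    $text =~ s{([&\x{2282}])}{'&' . $name_for{$1} . ';'}ge;
    return $text;
}

my $raw     = 'page.pl?id=10&sub=20';       # href as written in the report
my $escaped = 'page.pl?id=10&amp;sub=20';   # same href with the ampersand escaped

print encode_named( decode_lenient($raw) ),     "\n";   # page.pl?id=10&sub;=20
print encode_named( decode_strict($raw) ),      "\n";   # page.pl?id=10&amp;sub=20
print encode_named( decode_lenient($escaped) ), "\n";   # page.pl?id=10&amp;sub=20
```

In this model the lenient round trip reproduces the "&sub;=20" output from the report, while writing the ampersand as &amp; in the source keeps the link intact even under lenient decoding. Whether a given HTML-Tree or HTML-Parser release behaves like the lenient or the strict variant is not something this sketch establishes.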

Resolved as part of HTML-Tree 3.22, which will be released this weekend as part of the Chicago Hackathon.

