This queue is for tickets about the HTML-Tree CPAN distribution.

Report information

The Basics
Id: 14964
Status: resolved
Priority: Low/Low
Queue:

People
Owner: Nobody in particular
Requestors: jtalbot [...] proionta.gr
Cc:
AdminCc:

BugTracker
Severity: Critical
Broken in: 3.18
Fixed in: 3.22


Subject: Attributes of tags get entity-decoded (and even worse, wrongly) when parsed
Running Debian stable with Perl 5.8.4, I'm parsing this content from a string:

<a href="page.pl?id=10&sub=20">

When I print it with as_HTML, I get:

<a href="page.pl?id=10&sub;=20">

A semicolon is mistakenly added after the word 'sub'. Running the Perl debugger shows that the problem is not in the printing stage but in the parsing. I use HTML::TreeBuilder->new_from_content($string) to parse. Here's my program:

---------------------------
#!/usr/bin/perl -w
use HTML::TreeBuilder;
my $page = '<a href="page.pl?id=10&sub=20">';
my $p = HTML::TreeBuilder->new_from_content( $page );
# [debug at this stage shows that $p contains a unicode character instead of '&sub']
print $p->as_HTML();
---------------------------

Until this is fixed, is there a way to disable entity-decoding when parsing?
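A possible stopgap for the question above, offered as a minimal sketch rather than a confirmed fix: escape bare ampersands in the input before handing it to the parser, so that "&sub" can no longer be read as the start of an entity. The helper name escape_bare_ampersands is hypothetical and is not part of HTML::TreeBuilder or HTML::Entities. The sketch assumes that any "&" not already followed by a semicolon-terminated entity or character reference is meant literally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not a library call):
# rewrite every "&" that does not already begin a semicolon-terminated
# named entity or numeric character reference as "&amp;".
sub escape_bare_ampersands {
    my ($html) = @_;
    $html =~ s/&(?!(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)/&amp;/g;
    return $html;
}

my $page = '<a href="page.pl?id=10&sub=20">';
print escape_bare_ampersands($page), "\n";
# prints: <a href="page.pl?id=10&amp;sub=20">
```

The escaped string could then be passed to HTML::TreeBuilder->new_from_content in place of the original. The expectation is that the parser turns "&amp;" back into a literal "&", leaving the href intact, but this has not been checked against 3.18. One trade-off is that entities written without a trailing semicolon (for example "&copy" in body text) would also be escaped and so appear literally.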
I have attached a test case based on Test::More. From the comments:

# HTML::TreeBuilder invokes HTML::Entities::decode on the contents of
# HREF attributes. Some CGI-based sites use lang=en or such for
# internationalization. When this parameter is after an ampersand,
# the resulting &lang is decoded, breaking the link. "sub" is another
# popular one.

Thanks.

-- Rocco Caputo - http://poe.perl.org/


Resolved as part of HTML-Tree 3.22, which will be released this weekend as part of the Chicago Hackathon.

