
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 14964
Status: resolved
Priority: 0/
Queue: HTML-Tree

Owner: Nobody in particular
Requestors: jtalbot [...]

Bug Information
Severity: Critical
Broken in: 3.18
Fixed in: 3.22


Subject: Attributes of tags get entity-decoded (and even worse, wrongly) when parsed
Running Debian stable with Perl 5.8.4 I'm parsing this content from a string:

<a href="">

When I print it as_HTML, I get

<a href=";=20">

A semi-colon is mistakenly added after the word 'sub'. Running the Perl debugger shows that the problem is not in printing stage, but in the parsing. I use HTML::TreeBuilder->new_from_content($string) to parse.

Here's my program:

---------------------------
#!/usr/bin/perl -w
use HTML::TreeBuilder;
my $page = '<a href="">';
my $p = HTML::TreeBuilder->new_from_content( $page );
# [debug at this stage shows that $p contains a unicode character instead of '&sub']
print $p->as_HTML();
---------------------------

Until this is fixed, is there a way to disable entity-decoding when parsing?
I have attached a test case based on Test::More. From the comments:

# HTML::TreeBuilder invokes HTML::Entities::decode on the contents of
# HREF attributes. Some CGI-based sites use lang=en or such for
# internationalization. When this parameter is after an ampersand,
# the resulting &lang is decoded, breaking the link. "sub" is another
# popular one.

Thanks.

-- Rocco Caputo -
Download support-html-treebuilder.perl
application/octet-stream 662b
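
The ticket records no workaround for releases before 3.22. One possible stopgap is sketched below, assuming the decoding happens exactly as the comments above describe: rewrite every bare "&" as "&amp;" before the string reaches the parser, so no unterminated entity such as &lang or &sub is left to decode. The helper name escape_bare_ampersands and the example URL are hypothetical; neither comes from HTML-Tree or from this ticket, and this is not the fix that shipped in 3.22.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Tree): rewrite every "&" that does
# not begin a semicolon-terminated entity as "&amp;". Entities that are
# already well formed, such as "&amp;" or "&#20;", are left untouched.
sub escape_bare_ampersands {
    my ($html) = @_;
    $html =~ s/&(?!(?:#\d+|#[xX][0-9a-fA-F]+|\w+);)/&amp;/g;
    return $html;
}

# Hypothetical URL, chosen only because it contains "&lang" and "&sub".
my $page = '<a href="http://example.com/?id=1&lang=en&sub=20">link</a>';

print escape_bare_ampersands($page), "\n";
```

The escaped string would then be passed to HTML::TreeBuilder->new_from_content() in place of the raw one. Note that the substitution applies to the whole input, so a page with inline scripts containing "&" would need more careful handling.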

Resolved as part of HTML-Tree 3.22, which will be released this weekend as part of the Chicago Hackathon.
