This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id:
24947
Status:
resolved
Priority:
Low/Low
Queue:
HTML-Tree

People
Owner:
Jeff.Fearn [...] gmail.com
Requestors:
mark [...] blackmans.org
Cc:
AdminCc:

BugTracker
Severity:
(no value)
Broken in:
(no value)
Fixed in:
(no value)



Subject: can we get an option to HTML::TreeBuilder to not decode entities?
Date: Wed, 14 Feb 2007 16:41:40 +0000
To: bug-html-tree@rt.cpan.org
From: Mark Blackman <mark@blackmans.org>
Hi,

As far as I can tell, HTML::TreeBuilder will *always* decode HTML entities in the _content attribute if it's not being ignored and isn't CDATA:

    HTML::Entities::decode($text)
      unless $ignore_text || $is_cdata
       || $HTML::Tagset::isCDATA_Parent{$pos->{'_tag'}};

I've got a requirement to read HTML as written rather than decoded, so an option to *not* decode, such as $never_decode, might be appropriate.

As I believe the patch is trivial, I've not included it, but if it helps I'm happy to submit one.

If I've misread the docs and there is some way to suspend decoding for all text _content items, then I'd be grateful for a pointer.

Cheers,
Mark Blackman
From: mark@blackmans.org
On Wed Feb 14 12:09:16 2007, mark@blackmans.org wrote:
> Hi,
>
> As far as I can tell, HTML::TreeBuilder will *always* decode HTML
> entities in the _content attribute if it's not being ignored and
> isn't CDATA:
>
>     HTML::Entities::decode($text)
>       unless $ignore_text || $is_cdata
>        || $HTML::Tagset::isCDATA_Parent{$pos->{'_tag'}};
>
> I've got a requirement to read HTML as written rather than decoded,
> so an option to *not* decode, such as $never_decode, might be appropriate.
>
> As I believe the patch is trivial, I've not included it, but if it
> helps I'm happy to submit one.
>
> If I've misread the docs and there is some way to suspend decoding
> for all text _content items, then I'd be grateful for a pointer.
>
> Cheers,
> Mark Blackman
--- HTML-Tree-3.23/lib/HTML/TreeBuilder.pm 2006-11-12 17:13:46.000000000 +0000
+++ /opt/local/lib/perl5/site_perl/5.8.8/HTML/TreeBuilder.pm 2007-02-14 16:43:50.000000000 +0000
@@ -148,6 +148,7 @@
     $self->{'_element_class'} = 'HTML::Element';
     $self->{'_ignore_unknown'} = 1;
     $self->{'_ignore_text'} = 0;
+    $self->{'_never_decode'} = 1;
     $self->{'_warn'} = 0;
     $self->{'_no_space_compacting'}= 0;
     $self->{'_store_comments'} = 0;
@@ -194,6 +195,7 @@
 sub no_space_compacting { shift->_elem('_no_space_compacting', @_); }
 sub ignore_unknown { shift->_elem('_ignore_unknown', @_); }
 sub ignore_text { shift->_elem('_ignore_text', @_); }
+sub never_decode { shift->_elem('_never_decode', @_); }
 sub ignore_ignorable_whitespace { shift->_elem('_tighten', @_); }
 sub store_comments { shift->_elem('_store_comments', @_); }
 sub store_declarations { shift->_elem('_store_declarations', @_); }
@@ -985,12 +987,13 @@
     return unless length $text; # I guess that's always right
 
     my $ignore_text = $self->{'_ignore_text'};
+    my $never_decode = $self->{'_never_decode'};
     my $no_space_compacting = $self->{'_no_space_compacting'};
 
     my $pos = $self->{'_pos'} || $self;
 
     HTML::Entities::decode($text)
-      unless $ignore_text || $is_cdata
+      unless $ignore_text || $is_cdata || $never_decode
        || $HTML::Tagset::isCDATA_Parent{$pos->{'_tag'}};
 
     #my($indent, $nugget);
A patch has been applied which adds the option ignore_entities; when enabled, it prevents HTML::Entities::decode from being run over text content. It defaults to off, maintaining the current behaviour.

Code is now hosted at http://github.com/jfearn/HTML-Tree
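For readers who want to try the option, here is a minimal sketch of the default decoding behaviour next to the opt-out. It is an illustration under stated assumptions, not the code from the patch. The accessor name in particular is an assumption: this ticket calls it ignore_entities, while the documentation for released HTML::TreeBuilder versions lists no_expand_entities as far as I know, so the sketch probes for either name rather than hard-coding one. The helper first_paragraph_text and the sample markup are hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input containing one entity.
my $html = '<p>Fish &amp; chips</p>';

# Parse a fragment and return the text of its first <p>.
# Pass a true $keep_entities to ask the parser not to decode entities.
sub first_paragraph_text {
    my ($source, $keep_entities) = @_;

    # Plain new() so that options can be set before any parsing happens.
    my $tree = HTML::TreeBuilder->new;

    if ($keep_entities) {
        # Assumption: the accessor name depends on the installed release,
        # so look for either candidate instead of hard-coding one.
        my ($setter) = grep { $tree->can($_) }
                       qw(ignore_entities no_expand_entities);
        die "this HTML::TreeBuilder cannot skip entity decoding\n"
            unless $setter;
        $tree->$setter(1);
    }

    $tree->parse($source);
    $tree->eof;

    my $p    = $tree->look_down(_tag => 'p');
    my $text = $p->as_text;
    $tree->delete;    # release the tree once the text has been copied out
    return $text;
}

my $decoded = first_paragraph_text($html, 0);
my $raw     = first_paragraph_text($html, 1);

print "default: $decoded\n";
print "raw:     $raw\n";
```

With the option left off, the text should come back decoded ("Fish & chips"); with it on, the entity should survive as written in the source ("Fish &amp; chips"). Setting the option before calling parse matters, because decoding happens while the text is being added to the tree.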
Subject: 4.0 released
Hi,

HTML::Tree version 4.0 has been released, which includes a fix for this issue.

Cheers,
Jeff.

