
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 83758
Status: rejected
Priority: 0/
Queue: HTML-Tree

Owner: Nobody in particular
Requestors: kamelkev [...]

Bug Information
Severity: Important
Broken in: 4.2
Fixed in: (no value)

Subject: HTML-Tree improperly tagging strings as UTF8
text/plain 385b
Hi,

I pass the module an ASCII string via the parse method. I then call "as_HTML" and receive a string flagged as UTF8 that contains no non-ASCII characters. This is very counterintuitive - I would expect the output string to be encoded the same way as the input string, especially when the output content is identical to the input content.

thanks,
Kevin Kamel
MailerMailer LLC
text/x-perl 383b
#!/usr/bin/perl -w

use strict;
use Data::Dumper;
use HTML::TreeBuilder;

my $badstring = '<html><head></head><body><span>Text: &#x641;</span></body></html>';

my $parser = HTML::TreeBuilder->new();
$parser->store_comments(1);
$parser->parse($badstring);

my $string = $parser->as_HTML(undef," ",{});
print $string . "\n";

if (utf8::is_utf8($string)) {
    print "I AM BROKEN!\n";
}
text/plain 445b
You are giving too much weight to the utf8 flag. It is an internal implementation detail of how Perl 5 stores strings. Your input included a non-ASCII character as an entity reference. During parsing, that reference was converted to the actual character, and to hold it Perl had to store the string internally as UTF-8. When as_HTML re-encodes the string, it still has the utf8 flag set, but that makes no difference to the meaning of the string.
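The point about the flag can be checked without HTML::TreeBuilder at all. The following is a minimal sketch, not taken from the ticket: the helper name upgraded_copy and the sample string are illustrative assumptions. It uses only the core utf8:: functions and the core Encode module to show that two strings with the same characters compare equal whether or not the flag is set.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of HTML::Tree): returns a copy of $str
# whose internal storage has been switched to UTF-8. Only the internal
# representation changes; the characters stay the same.
sub upgraded_copy {
    my ($str) = @_;
    my $copy = $str;
    utf8::upgrade($copy);
    return $copy;
}

# Pure ASCII sample, similar in shape to the as_HTML output in the report.
my $plain    = '<span>Text: &#x641;</span>';
my $upgraded = upgraded_copy($plain);

printf "flag on original: %s\n", utf8::is_utf8($plain)    ? 'yes' : 'no';
printf "flag on copy:     %s\n", utf8::is_utf8($upgraded) ? 'yes' : 'no';
printf "strings equal:    %s\n", $plain eq $upgraded      ? 'yes' : 'no';

# If downstream code really needs a byte string, encode explicitly
# instead of inspecting the flag.
my $bytes = Encode::encode('UTF-8', $upgraded);
printf "flag on bytes:    %s\n", utf8::is_utf8($bytes)    ? 'yes' : 'no';
```

Code that needs bytes (for example, before writing to a socket or file) would normally encode explicitly as in the last step, rather than branch on utf8::is_utf8.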
