This queue is for tickets about the HTML-HTML5-Parser CPAN distribution.

Report information
The Basics
Id: 118913
Status: new
Priority: 0/
Queue: HTML-HTML5-Parser

People
Owner: perl [...] toby.ink
Requestors: d.koroliov [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: A bug in the Perl module
Date: Wed, 23 Nov 2016 10:38:12 +0200
To: bug-HTML-HTML5-Parser [...] rt.cpan.org
From: Dmitry Korolyov <d.koroliov [...] gmail.com>
Good day.

First, I would like to thank you for this useful and indispensable module. I think I have found a bug in it, though I'm not sure whether it is really a bug or my own mistake.

The module seems to handle HTML entities incorrectly, or at least one entity: &nbsp;. When I parse a string (whether it comes from a file or directly from a variable), the module converts &nbsp; to the character itself, but in ISO-8859-1 encoding, which the module then handles as UTF-8. As a result, the parsed string contains 'Â ' instead of the nbsp character.

Here is an example script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::HTML5::Parser qw();

    my $raw_str = '<!doctype html>
    <html>
    <head>
    <meta charset="utf-8">
    <title>a bug report</title>
    </head>
    <body>
    <div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;error conditions</div>
    </body>
    </html>';

    my $parsed_str = HTML::HTML5::Parser->new->parse_string($raw_str, {encoding => 'utf-8'});
    open (my $fh, '>:encoding(UTF-8)', 'bug-html5-parser.html');
    print $fh $parsed_str;

--
I have Perl v5.20.2 on an Ubuntu 15.04 distro.

Thank you,
D.Koroliov.
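The 'Â ' symptom described above is what appears whenever the UTF-8 bytes of U+00A0 (C2 A0) are read back as ISO-8859-1. The following is a minimal sketch of that effect only, using the core Encode module. It does not use HTML::HTML5::Parser, and it does not establish where in the reported script the misreading happens. The helper name misread_utf8_as_latin1 is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not part of any module): return the text a reader
# sees when the UTF-8 bytes of $text are misread as ISO-8859-1.
sub misread_utf8_as_latin1 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);     # U+00A0 becomes the bytes C2 A0
    return decode('ISO-8859-1', $bytes);    # each byte becomes one character
}

my $nbsp    = "\x{A0}";                     # NO-BREAK SPACE
my $garbled = misread_utf8_as_latin1($nbsp);

# Print code points so the result is unambiguous in any terminal.
print join(' ', map { sprintf 'U+%04X', ord } split //, $garbled), "\n";
```

The result is two characters, U+00C2 ('Â') followed by U+00A0, which matches the 'Â ' seen in the output file. One character turning into two like this usually points to text being encoded as UTF-8 twice, or to bytes being decoded with the wrong charset, somewhere between parsing and writing.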

