
This queue is for tickets about the HTML-HTML5-Parser CPAN distribution.

Report information
The Basics
Id: 96399
Status: new
Priority: 0/
Queue: HTML-HTML5-Parser

People
Owner: Nobody in particular
Requestors: vincent [...] vinc17.net
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.301
Fixed in: (no value)



Subject: UTF-8 character confuses the parser
Bug I've reported on https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946

Consider the following HTML file:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>title</title>
</head>
<body>
<p>↓</p>
</body>
</html>

On this file, the following script

#!/usr/bin/env perl
use strict;
use HTML::HTML5::Parser;
use utf8; # for the characters in the script.
use open ':encoding(UTF-8)'; # for the file arguments.
binmode STDIN, ':encoding(UTF-8)'; # for stdin.
binmode STDOUT, ':encoding(UTF-8)'; # for stdout.

@ARGV == 1 or die "Usage: $0 <file.html>\n";
my $parser = HTML::HTML5::Parser->new;
my $doc = $parser->parse_file($ARGV[0]);
print "Charset: '", $parser->charset($doc), "'\n";
print $doc->toString();

outputs:

Charset: ''
<?xml version="1.0" encoding="windows-1252"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>

If I replace the ↓ (U+2193 DOWNWARDS ARROW) by é (U+00E9 LATIN SMALL LETTER E WITH ACUTE), then the encoding is correctly detected.
From: vincent [...] vinc17.net
As a consequence of this bug, html2xhtml doesn't work at all when applied on a file. No problems when the HTML document is provided in the standard input, though.

For instance, with test.html as:

<!DOCTYPE html>
<html><body><p>Test €</p></body></html>

I get:

$ html2xhtml test.html
<?xml version="1.0" encoding="windows-1252"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>

$ html2xhtml < test.html
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test €</p>
</body></html>

and with test.html as:

<!DOCTYPE html>
<html><body><p>Test é</p></body></html>

$ html2xhtml test.html
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test �</p>
</body></html>

$ html2xhtml < test.html
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test é</p>
</body></html>

parse_file is used in the former test (like in my original bug report), and parse_string is used in the latter test. Thus it seems that it's parse_file that is broken.
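Since the report points at parse_file while parse_string behaves correctly, one possible workaround until a fix lands is to read and decode the file yourself and hand the result to parse_string. The following is only a sketch under assumptions: slurp_utf8 and parse_utf8_file are hypothetical helper names (not part of HTML::HTML5::Parser), the input is assumed to be UTF-8, and it is assumed that parse_string accepts an already decoded character string in the same way it handled the standard-input case above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::HTML5::Parser): read a whole
# file and decode it as UTF-8, returning a Perl character string.
sub slurp_utf8 {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!\n";
    local $/;               # undef record separator: read the whole file
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Hypothetical wrapper: bypass parse_file and go through parse_string,
# which is the code path the report found to work.
sub parse_utf8_file {
    my ($path) = @_;
    require HTML::HTML5::Parser;    # loaded lazily, only when parsing
    my $parser = HTML::HTML5::Parser->new;
    return $parser->parse_string(slurp_utf8($path));
}

# Run as a script: workaround.pl <file.html>
if (!caller() && @ARGV == 1) {
    my $doc = parse_utf8_file($ARGV[0]);
    # toString on an XML::LibXML document returns bytes in the
    # document's encoding, so STDOUT is left without an encoding layer.
    print $doc->toString();
}
```

This does not address the underlying parse_file problem; it only avoids that code path, and it drops any encoding sniffing by assuming UTF-8 up front.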

