
This queue is for tickets about the HTML-HTML5-Parser CPAN distribution.

Report information
The Basics
Id: 99730
Status: open
Priority: 0/
Queue: HTML-HTML5-Parser

People
Owner: Nobody in particular
Requestors: dr [...] jones.dk
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Fwd: Bug#750946: libhtml-html5-parser-perl: UTF-8 character confuses the parser
Date: Wed, 22 Oct 2014 16:28:57 +0200
To: bug-html-html5-parser [...] rt.cpan.org
From: Jonas Smedegaard <dr [...] jones.dk>
Hi,

Someone in Debian ran into the issue below, that seems like a bug in
your perl module:

Forwarded message from Vincent Lefevre (2014-06-08 21:03:03):

> Package: libhtml-html5-parser-perl
> Version: 0.301-1
> Severity: important
>
> (with possible data loss as a consequence)
>
> Consider the following HTML file:
>
> <?xml version="1.0" encoding="utf-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> <title>title</title>
> </head>
> <body>
> <p>↓</p>
> </body>
> </html>
>
> On this file, the following script
>
> #!/usr/bin/env perl
>
> use strict;
> use HTML::HTML5::Parser;
>
> use utf8; # for the characters in the script.
> use open ':encoding(UTF-8)'; # for the file arguments.
> binmode STDIN, ':encoding(UTF-8)'; # for stdin.
> binmode STDOUT, ':encoding(UTF-8)'; # for stdout.
>
> @ARGV == 1 or die "Usage: $0 <file.html>\n";
>
> my $parser = HTML::HTML5::Parser->new;
> my $doc = $parser->parse_file($ARGV[0]);
> print "Charset: '", $parser->charset($doc), "'\n";
> print $doc->toString();
>
> outputs:
>
> Charset: ''
> <?xml version="1.0" encoding="windows-1252"?>
> <html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>
>
> If I replace the ↓ (U+2193 DOWNWARDS ARROW) by é (U+00E9 LATIN SMALL
> LETTER E WITH ACUTE), then I get:
>
> Charset: 'utf-8'
> <?xml version="1.0" encoding="utf-8"?>
> <!--?xml version="1.0" encoding="utf-8"?-->
> <html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head>
> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
> <title>title</title>
> </head>
> <body>
> <p>�</p>
>
>
> </body></html>
>
> which is also incorrect, but at least the charset is correct.
>
> -- System Information:
> Debian Release: jessie/sid
>   APT prefers unstable
>   APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
> Architecture: amd64 (x86_64)
> Foreign Architectures: i386
>
> Kernel: Linux 3.11-2-amd64 (SMP w/2 CPU cores)
> Locale: LANG=POSIX, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
> Shell: /bin/sh linked to /bin/dash
>
> Versions of packages libhtml-html5-parser-perl depends on:
> ii  libhtml-html5-entities-perl         0.003-2
> ii  libio-html-perl                     1.00-1
> ii  libtry-tiny-perl                    0.22-1
> ii  liburi-perl                         1.60-1
> ii  libxml-libxml-perl                  2.0116+dfsg-1
> ii  perl                                5.18.2-4
> ii  perl-modules [libhttp-tiny-perl]    5.18.2-4
>
> libhtml-html5-parser-perl recommends no packages.
>
> Versions of packages libhtml-html5-parser-perl suggests:
> pn  libxml-libxml-devel-setlinenumber-perl  <none>
>
> -- no debconf information
>
> _______________________________________________
> pkg-perl-maintainers mailing list
> pkg-perl-maintainers@lists.alioth.debian.org
> http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-perl-maintainers

--
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private

From: vincent [...] vinc17.net
Note that I already reported the bug:

  https://rt.cpan.org/Public/Bug/Display.html?id=96399

which now has additional details (and the Debian bug was already
forwarded to this bug).
Subject: Re: [rt.cpan.org #99730] Fwd: Bug#750946: libhtml-html5-parser-perl: UTF-8 character confuses the parser
Date: Thu, 23 Oct 2014 13:08:02 +0200
To: bug-HTML-HTML5-Parser [...] rt.cpan.org
From: Jonas Smedegaard <dr [...] jones.dk>
Quoting vincent@vinc17.net via RT (2014-10-23 09:45:24)

> <URL: https://rt.cpan.org/Ticket/Display.html?id=99730 >
>
> Note that I already reported the bug:
>
> https://rt.cpan.org/Public/Bug/Display.html?id=96399
>
> which now has additional details (and the Debian bug was already
> forwarded to this bug).

Oh, silly me - I thought I'd double-checked that, but evidently not :-P

Sorry Toby for the noise,

 - Jonas

--
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private


