This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 17901
Status: rejected
Priority: 0/
Queue: HTML-Parser

Owner: Nobody in particular
Requestors: ralphbolton [...]

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: 3.50

Subject: HTML::Entities misses at least one Unicode (high bit) Character
I think I've found a problem which causes HTML::Entities to miss an entity when encoding (both numeric and normal). I've attached a TGZ that includes a small snippet of malformed UTF8 and a small test that demonstrates the problem.

Here's how I'd show it:

% tar xvf missedentity.tgz
% ./ > out
% vi out

The "out" file will contain:

Einar [Aacute]gú Frið

Of course, the [Aacute] should have been encoded. I know this is easy to say, and very annoying, but given this entity is missing, how many others may also be missing?

My system details:
Redhat Fedora 4
Perl 5.8.6
HTML::Parser 3.50
HTML::Entities 1.32
Subject: missedentity.tgz

The file you are reading is Latin-1, not UTF-8. If you change your open() statement to reflect this, the result is as expected.
--- 2006-03-21 12:46:24.000000000 +0100
+++ 2006-03-21 12:46:40.000000000 +0100
@@ -5,7 +5,7 @@
 use strict;
 use warnings;
 
-unless(open(FILE,"<:utf8","dodgytext"))
+unless(open(FILE,"<:encoding(latin1)","dodgytext"))
 {
     die "Could not open file: $!\n";
 }
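To illustrate why the input layer matters, here is a minimal sketch that reads Latin-1 bytes through an `:encoding(latin1)` layer and then passes the decoded text to `encode_entities`. The helper name `encode_latin1_bytes` and the sample bytes are hypothetical stand-ins chosen for this illustration; they are not the contents of the attached `dodgytext` file.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical helper: decode a Latin-1 byte string via an I/O layer,
# then entity-encode the resulting characters.
sub encode_latin1_bytes {
    my ($bytes) = @_;

    # An in-memory filehandle stands in for the file on disk; the layer
    # is the same one the patch uses.
    open(my $fh, "<:encoding(latin1)", \$bytes)
        or die "Could not open in-memory file: $!\n";
    my $text = do { local $/; <$fh> };   # slurp the whole input
    close($fh);

    return encode_entities($text);
}

# Illustrative sample: 0xC1 = A-acute, 0xFA = u-acute, 0xF0 = eth in Latin-1.
print encode_latin1_bytes("Einar \xC1g\xFA Fri\xF0"), "\n";
# prints: Einar &Aacute;g&uacute; Fri&eth;
```

With the layer matching the file's real encoding, every high-bit character reaches `encode_entities` as the intended character, so each one is replaced by its named entity.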
