This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 77108
Status: rejected
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: v.virvilis [...] biovista.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



CC: Miltiadis Koutsokeras <m.koutsokeras [...] biovista.com>
Subject: [HTML::Entities] BUG: decoding valid UTF-8 when decodes multiple entities
Date: Thu, 10 May 2012 16:30:02 +0300
To: bug-HTML-Parser [...] rt.cpan.org
From: Vassilis Virvilis <v.virvilis [...] biovista.com>
Hi,

I am running Debian unstable with html-parser 3.69.

When HTML::Entities decode_entities encounters the valid UTF-8 character CF87 (Greek chi), it leaves it unchanged, as it should (input-correct).

When the input file contains both an HTML entity &#x1d4ae; and CF87 (input-bug), it correctly transforms the HTML entity, but it also transforms CF87 to C38FC287, which is wrong.

You can run the examples with:

  $>./bug_html_decode_entities.pl < input-correct > output-correct
  $>./bug_html_decode_entities.pl < input-bug > output-bug

Hope that helps.

Best regards
--
__________________________________
Vassilis Virvilis Ph.D.
Head of IT
Biovista Inc.

US Offices
2421 Ivy Road
Charlottesville, VA 22903
USA
T: +1.434.971.1141
F: +1.434.971.1144

European Offices
34 Rodopoleos Street
Ellinikon, Athens 16777
GREECE
T: +30.210.9629848
F: +30.210.9647606

www.biovista.com

Biovista is a privately held biotechnology company that finds novel uses for existing drugs, and profiles their side effects using their mechanism of action. Biovista develops its own pipeline of drugs in CNS, oncology, auto-immune and rare diseases. Biovista is collaborating with biopharmaceutical companies on indication expansion and de-risking of their portfolios and with the FDA on adverse event prediction.
Download input-correct
text/plain 10b

Message body is not shown because sender requested not to inline it.

Message body is not shown because sender requested not to inline it.

Download input-bug
text/plain 41b

Message body is not shown because sender requested not to inline it.

The string passed to decode_entities() needs to be decoded first to become a proper Unicode string.  When you read it directly from a file, it is still UTF-8 encoded bytes.

  $ perl -MHTML::Entities -MEncode -le 'print encode_utf8(decode_entities(decode_utf8("\xCF\x87&#x1d4ae;")))'
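
To make the difference concrete, here is a minimal sketch that runs decode_entities() on the same bytes twice, once as raw bytes and once after decode_utf8(), and prints the UTF-8 encoding of each result in hex. The helper utf8_hex and the variable names are made up for this illustration; the byte values are the ones from the report. The hex comments assume decode_entities() treats an undecoded byte string as Latin-1 characters once a numeric entity above U+00FF forces the string to be upgraded, which is consistent with the C38FC287 seen in the report.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);
use HTML::Entities qw(decode_entities);

# Hypothetical helper: show a character string's UTF-8 encoding as uppercase hex.
sub utf8_hex {
    my ($chars) = @_;
    return uc(unpack('H*', encode_utf8($chars)));
}

# Raw bytes as read from a file without any I/O layer:
# CF 87 (UTF-8 for U+03C7, Greek chi) followed by a numeric entity.
my $bytes = "\xCF\x87&#x1d4ae;";

# Without decoding first, the two bytes are taken as the two
# characters U+00CF and U+0087.
my $undecoded = decode_entities($bytes);

# Decoding first turns CF 87 into the single character U+03C7.
my $decoded = decode_entities(decode_utf8($bytes));

print "without decode_utf8: ", utf8_hex($undecoded), "\n";  # C38FC287F09D92AE
print "with decode_utf8:    ", utf8_hex($decoded),   "\n";  # CF87F09D92AE
```

Called in non-void context as above, decode_entities() returns a decoded copy and leaves $bytes itself untouched, so both calls start from the same input.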

You can ask perl to do this on input/output automatically with the -CS option.  If you run:

  $ perl -CS bug_html_decode_entities.pl <input-bug.txt

I believe you will see the expected output (instead of the "bug").
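
If changing the command line is not convenient, roughly the same effect can be had inside a script by putting an :encoding(UTF-8) layer on the handles. The sketch below is not the attached bug_html_decode_entities.pl (its contents are not shown in this ticket); it uses in-memory file handles and made-up variable names so that it runs without any input file.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Stand-in for the contents of an input file (hypothetical data):
# UTF-8 bytes for U+03C7 followed by a numeric entity and a newline.
my $file_bytes = "\xCF\x87&#x1d4ae;\n";

# Reading through an :encoding(UTF-8) layer yields character strings.
open(my $in, '<:encoding(UTF-8)', \$file_bytes)
    or die "cannot open input: $!";

# Writing through the same layer turns characters back into UTF-8 bytes.
my $out_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$out_bytes)
    or die "cannot open output: $!";

while (my $line = <$in>) {
    print {$out} decode_entities($line);
}
close($in);
close($out);

# Hex dump of what was written: chi is preserved, entity is expanded.
print uc(unpack('H*', $out_bytes)), "\n";  # CF87F09D92AE0A
```

For a filter that reads STDIN and writes STDOUT, the corresponding calls would be binmode(STDIN, ':encoding(UTF-8)') and binmode(STDOUT, ':encoding(UTF-8)') before the read loop.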
Subject: Re: [rt.cpan.org #77108] [HTML::Entities] BUG: decoding valid UTF-8 when decodes multiple entities
Date: Mon, 14 May 2012 10:19:03 +0300
To: bug-HTML-Parser [...] rt.cpan.org
From: Vassilis Virvilis <v.virvilis [...] biovista.com>
On 13/05/2012 03:24 PM, Gisle_Aas via RT wrote:
> $ perl -CS bug_html_decode_entities.pl<input-bug.txt
>
> I believe you will see the expected output (instead of the "bug").
>

I can confirm that this works. I wasn't aware of the -CS family of options.

Thank you very much for the insight.

Vassilis