This queue is for tickets about the HTML-Parser CPAN distribution.

Report information

The Basics
Id: 7785
Status: resolved
Priority: Low/Low
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: jgmyers [...] proofpoint.com
Cc:
AdminCc:

BugTracker
Severity: Normal
Broken in: 3.36
Fixed in: (no value)


Subject: Warning messages when parsing questionable entities
When parsing the text "&#56256;&#56453;" one gets warnings:

UTF-16 surrogate 0xdbc0 at [...]
UTF-16 surrogate 0xdc85 at [...]

There are two issues here. One, while this encoding is highly questionable, it would be good if it could be interpreted the same as a single reference to U+100085. Two, when there is an unpaired surrogate or an illegal character (such as "") there should be no warning. Such junk should probably all be interpreted as U+FFFD, "REPLACEMENT CHARACTER".
From: jgmyers@proofpoint.com
Proposed fix
diff -u HTML-Parser-3.36-utf8/util.c HTML-Parser-3.36-work/util.c
--- HTML-Parser-3.36-utf8/util.c  2004-09-27 19:01:40.000000000 -0700
+++ HTML-Parser-3.36-work/util.c  2004-11-01 14:15:38.000000000 -0800
@@ -76,6 +76,7 @@
 #ifdef UNICODE_ENTITIES
     char buf[UTF8_MAXLEN];
     int repl_utf8;
+    int high_surrogate = 0;
 #else
     char buf[1];
 #endif
@@ -133,7 +134,30 @@
                     repl_utf8 = 0;
                 }
                 else {
-                    char *tmp = uvuni_to_utf8(buf, num);
+                    char *tmp;
+                    if ((num & 0xFFFFFC00) == 0xDC00) {
+                        if (high_surrogate != 0) {
+                            t -= 3; /* Back up past 0xFFFD */
+                            num = ((high_surrogate - 0xD800) << 10) +
+                                (num - 0xDC00) + 0x10000;
+                        } else {
+                            num = 0xFFFD;
+                        }
+                    }
+
+                    if ((num & 0xFFFFFC00) == 0xD800) {
+                        high_surrogate = num;
+                        num = 0xFFFD;
+                    }
+                    else {
+                        high_surrogate = 0;
+                    }
+
+                    if ((num >= 0xFDD0 && num <= 0xFDEF) ||
+                        ((num & 0xFFFE) == 0xFFFE)) {
+                        num = 0xFFFD;
+                    }
+                    tmp = uvuni_to_utf8(buf, num);
                     repl = buf;
                     repl_len = tmp - buf;
                     repl_utf8 = 1;
@@ -160,6 +184,9 @@
 #endif
                 }
             }
+#ifdef UNICODE_ENTITIES
+            high_surrogate = 0;
+#endif
         }
 
         if (repl) {
@@ -169,6 +196,10 @@
             t--; /* '&' already copied, undo it */
 
 #ifdef UNICODE_ENTITIES
+            if (*s != '&') {
+                high_surrogate = 0;
+            }
+
             if (!SvUTF8(sv) && repl_utf8) {
                 STRLEN len = t - SvPVX(sv);
                 if (len) {
Only in HTML-Parser-3.36-work/: util.c~
Why did you make the characters matching (num >= 0xFDD0 && num <= 0xFDEF) get replaced? Making ((num & 0xFFFE) == 0xFFFE) illegal is wrong, as it matches 0x10FFFF and similar. Perl itself has the same bug.
For now I've applied this modification of your patch.
Index: util.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/util.c,v
retrieving revision 2.18
retrieving revision 2.19
diff -u -p -u -r2.18 -r2.19
--- util.c  14 Sep 2004 13:47:16 -0000  2.18
+++ util.c  8 Nov 2004 12:54:57 -0000  2.19
@@ -1,4 +1,4 @@
-/* $Id: util.c,v 2.18 2004/09/14 13:47:16 gisle Exp $
+/* $Id: util.c,v 2.19 2004/11/08 12:54:57 gisle Exp $
  *
  * Copyright 1999-2001, Gisle Aas.
  *
@@ -76,6 +76,7 @@ decode_entities(pTHX_ SV* sv, HV* entity
 #ifdef UNICODE_ENTITIES
     char buf[UTF8_MAXLEN];
     int repl_utf8;
+    int high_surrogate = 0;
 #else
     char buf[1];
 #endif
@@ -138,7 +139,30 @@ decode_entities(pTHX_ SV* sv, HV* entity
                     repl_utf8 = 0;
                 }
                 else {
-                    char *tmp = uvuni_to_utf8(buf, num);
+                    char *tmp;
+                    if ((num & 0xFFFFFC00) == 0xDC00) { /* low-surrogate */
+                        if (high_surrogate != 0) {
+                            t -= 3; /* Back up past 0xFFFD */
+                            num = ((high_surrogate - 0xD800) << 10) +
+                                (num - 0xDC00) + 0x10000;
+                            high_surrogate = 0;
+                        } else {
+                            num = 0xFFFD;
+                        }
+                    }
+                    else if ((num & 0xFFFFFC00) == 0xD800) { /* high-surrogate */
+                        high_surrogate = num;
+                        num = 0xFFFD;
+                    }
+                    else {
+                        high_surrogate = 0;
+                        /* otherwise invalid? */
+                        if (num == 0xFFFE || num == 0xFFFF || num > 0x1F0000) {
+                            num = 0xFFFD;
+                        }
+                    }
+
+                    tmp = uvuni_to_utf8(buf, num);
                     repl = buf;
                     repl_len = tmp - buf;
                     repl_utf8 = 1;
@@ -165,6 +189,9 @@ decode_entities(pTHX_ SV* sv, HV* entity
 #endif
                 }
             }
+#ifdef UNICODE_ENTITIES
+            high_surrogate = 0;
+#endif
         }
 
         if (repl) {
@@ -174,6 +201,10 @@ decode_entities(pTHX_ SV* sv, HV* entity
             t--; /* '&' already copied, undo it */
 
 #ifdef UNICODE_ENTITIES
+            if (*s != '&') {
+                high_surrogate = 0;
+            }
+
             if (!SvUTF8(sv) && repl_utf8) {
                 STRLEN len = t - SvPVX(sv);
                 if (len) {
Index: t/uentities.t
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/t/uentities.t,v
retrieving revision 1.6
retrieving revision 1.7
diff -u -p -u -r1.6 -r1.7
--- t/uentities.t  3 Oct 2003 14:50:08 -0000  1.6
+++ t/uentities.t  8 Nov 2004 12:55:06 -0000  1.7
@@ -14,7 +14,7 @@ unless (&HTML::Entities::UNICODE_SUPPORT
     exit;
 }
 
-print "1..10\n";
+print "1..13\n";
 
 print "not " unless decode_entities("&euro") eq "\x{20AC}";
 print "ok 1\n";
@@ -25,18 +25,18 @@ print "ok 2\n";
 print "not " unless decode_entities("&#500000") eq chr(500000);
 print "ok 3\n";
 
-{
-    no warnings 'utf8'; # These are illegal unicode chars
-    print "not " unless decode_entities("&#xFFFF") eq "\x{FFFF}";
-    print "ok 4\n";
-
-    print "not " unless decode_entities("&#x10FFFF") eq chr(0x10FFFF);
-    print "ok 5\n";
+print "not " unless decode_entities("&#xFFFF") eq "\x{FFFD}";
+print "ok 4\n";
 
-    print "not " unless decode_entities("&#XFFFFFFFF") eq chr(0xFFFF_FFFF);
-    print "ok 6\n";
+{
+    no warnings 'utf8'; # workaround for perl bug
+    print "not " unless decode_entities("&#x10FFFF") eq chr(0x10FFFF);
+    print "ok 5\n";
 }
 
+print "not " unless decode_entities("&#XFFFFFFFF") eq chr(0xFFFD);
+print "ok 6\n";
+
 print "not " unless decode_entities("&#0") eq "\0" &&
                     decode_entities("&#0;") eq "\0" &&
                     decode_entities("&#x0") eq "\0" &&
@@ -77,3 +77,11 @@ print "not " if $err;
 print "ok 10\n";
 
 
+print "not " unless decode_entities("&#56256;&#56453;") eq chr(0x100085);
+print "ok 11\n";
+
+print "not " unless decode_entities("&#56256;&#56453;") eq chr(0x100085);
+print "ok 12\n";
+
+print "not " unless decode_entities("&#56256") eq chr(0xFFFD);
+print "ok 13\n";
[GAAS - Mon Nov 8 08:04:12 2004]:
> Why did you make chars (num >= 0xFDD0 && num <= 0xFDEF) replaced?
>
> Making ((num & 0xFFFE) == 0xFFFE)) illegal is wrong as it
> matches 0x10FFFF and similar. Perl itself has the same bug.
The Unicode book I had was for Unicode 3.0. It looks like Unicode 3.1 does make all of these noncharacters, so I guess Perl is right after all. I'll modify the patch to match.

