
Report information
Id: 17962
Status: resolved
Left: 0 min
Priority: 0/0
Queue: HTML-Parser

Owner: Nobody
Requestors: LGODDARD <LGODDARD [...] cpan.org>
Cc:
AdminCc:

Severity: Critical
Broken in: 3.19
Fixed in: (no value)




History
#   Fri Mar 03 11:55:58 2006 LGODDARD - Ticket created  
Subject: Mis-represents data.
[text/plain 621b]
Please see the Data::Dumper output of an HTML::TokeParser token below:
compare $VAR1->[1]->{href} and $VAR1->[3] (the fourth element). The
latter is correct.

This is for the latest binary for Win32 ActivePerl - which is an old
version, I admit. No VC++ here, so I can't say if this is really a
current bug or not.

$VAR1 = [
  'a',
  {
    'href' => '/index.php?currpage=2&days=1&jobtype=0&keywords=PERL〈=en&orderby=4&task=JobSearch&xc=0'
  },
  [
    'href'
  ],
  '<a href="/index.php?currpage=2&days=1&jobtype=0&keywords=PERL&lang=en&orderby=4&task=JobSearch&xc=0">'
];


#   Sun Mar 12 17:34:10 2006 GAAS - Correspondence added  
[text/plain 158b]
Can't tell if there is anything wrong without a test case that includes
the HTML that you parse. Please provide a minimal program that
demonstrates the bug.
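
A minimal program of the kind requested here might look like the sketch below. It is not the reporter's attached example (which is not shown on this page); the HTML string is a shortened, hypothetical version of the link from the report. It relies only on HTML::TokeParser's get_tag, which returns [$tag, \%attr, \@attrseq, $text] for a start tag, matching the shape of the dump above. Whether the decoded href differs from the raw text depends on the installed HTML::Parser version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use Data::Dumper;

# The decoded attribute may contain a non-Latin-1 character.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical input: a shortened version of the link from the report.
my $html = '<a href="/index.php?currpage=2&lang=en&task=JobSearch">next</a>';

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

# get_tag returns [$tag, \%attr, \@attrseq, $text] for a start tag.
my $token = $parser->get_tag('a');

print Dumper($token);
print "decoded href: $token->[1]{href}\n";   # entity-decoded attribute value
print "raw tag text: $token->[3]\n";         # original source text of the tag
```

Comparing the two printed lines shows directly whether "&lang" in the query string was treated as an entity.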

#   Sun Mar 12 17:34:11 2006 RT_System - Status changed from 'new' to 'open'  
#   Mon Mar 13 06:09:15 2006 guest - Correspondence added  
From: lgoddard[...]cpan.org
[text/plain 332b]
On Sun Mar 12 17:34:10 2006, GAAS wrote:
> Can't tell if there is anything wrong without a test case that
> includes the HTML that you parse. Please provide a minimal program
> that demonstrates the bug.

I've attached a full example with perl code, raw HTML data, and the
URI of the (dynamic) data source.

Hope that helps.
lee

[text/plain 22.7k]
Message body not shown because it is too large or is not plain text.
#   Tue Mar 21 07:23:04 2006 GAAS - Correspondence added  
[text/plain 716b]
The reason "&lang" is expanded is that it's an official HTML entity
name; see http://www.w3.org/TR/REC-html40/sgml/entities.html#h-24.3.1

Browsers have traditionally expanded entities even when the trailing
";" is missing, but there seems to be an exception for the non-Latin-1
entities out there. I tested this piece of HTML in Firefox and Konqueror:

<html>
<body>
<a href="foo?a=1&eth=1&times=3&lang=4&Gamma=5&lang;=6">foo
&lang;&lang=</a>
</body>
</html>

and they both expand "&eth", "&times" and "&lang;" into the
corresponding characters but leave "&lang" and "&Gamma" alone. Strangely
enough, Firefox expands "&lang" outside of the attribute, so it actually
plays by even more rules.

HTML is such a mess!
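
One way a page author can sidestep this ambiguity is to write the ampersands in the attribute value as "&amp;", so that no query parameter name can be mistaken for an entity. The sketch below is only an illustration of that idea, not something proposed in this ticket: it uses encode_entities from HTML::Entities on a shortened, hypothetical version of the URL from the report, then parses the resulting tag to check that the original URL comes back.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use HTML::TokeParser;

# Hypothetical URL: a shortened version of the link from the report.
my $url = '/index.php?currpage=2&lang=en&task=JobSearch';

# In scalar context encode_entities returns an escaped copy;
# each "&" becomes "&amp;".
my $escaped = encode_entities($url);
my $html    = qq{<a href="$escaped">next</a>};

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";
my $token = $parser->get_tag('a');

print "escaped: $escaped\n";
print "parsed:  $token->[1]{href}\n";
print $token->[1]{href} eq $url ? "round trip ok\n" : "round trip failed\n";
```

Because "&amp;" is a properly terminated entity, the decoded attribute is "&lang=en" rather than an expanded character, regardless of how unterminated entities are handled.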
#   Tue Mar 21 09:38:42 2006 cologne[...]leegoddard.net - Correspondence added  
Subject: Re: [rt.cpan.org #17962] Mis-represents data.
Date: Tue, 21 Mar 2006 15:38:06 +0100
To: bug-HTML-Parser[...]rt.cpan.org
From: Lee Goddard <lee[...]leegoddard.net>
[text/plain 1.3k]
Gisle_Aas via RT wrote:

><URL: http://rt.cpan.org/Ticket/Display.html?id=17962 >
>
>The reason "&lang" is expanded is that it's an official HTML entity
>name; see http://www.w3.org/TR/REC-html40/sgml/entities.html#h-24.3.1
>
>Browsers have traditionally expanded entities even when the trailing
>";" is missing, but there seems to be an exception for the non-Latin-1
>entities out there. I tested this piece of HTML in Firefox and Konqueror:
>
> <html>
> <body>
> <a href="foo?a=1&eth=1&times=3&lang=4&Gamma=5&lang;=6">foo
>&lang;&lang=</a>
> </body>
> </html>
>
>and they both expand "&eth", "&times" and "&lang;" into the
>corresponding characters but leave "&lang" and "&Gamma" alone. Strangely
>enough, Firefox expands "&lang" outside of the attribute, so it actually
>plays by even more rules.
>
>HTML is such a mess!
>
HTML: it's getting better all the time (couldn't get much worse), to
coin a phrase...
If only everyone would agree with the standard. I don't have the energy
to track down the URI spec today, but logically (HTML/logic: ha!): the
semi-colon in &lang; above ought to be URI-encoded, right? Otherwise it
might be interpreted as a new-style delimiter, just as the ampersand was
the old-style delimiter. What should happen when those two appear
together, I dunno.

Ho hum.

Any thoughts how you might deal with the mess? My vote is to not look
for entities in URIs...

Cheers
lee


#   Wed Mar 22 04:26:09 2006 GAAS - Status changed from 'open' to 'resolved'