This queue is for tickets about the HTML-Parser CPAN distribution.

Report information

The Basics
Id: 27522
Status: resolved
Priority: Low/Low
Queue:

People
Owner: Nobody in particular
Requestors: ivacklin [...] cs.helsinki.fi
Cc:
AdminCc:

BugTracker
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: HTML::HeadParser doesn't grok some broken xhtml
Date: Sun, 10 Jun 2007 16:58:34 +0300
To: bug-HTML-Parser@rt.cpan.org
From: T Ilmari Vacklin <ivacklin@cs.helsinki.fi>
See <http://code-libre.org>. The XHTML has an initial bogus <option>, which is probably why HTML::HeadParser fails to extract any headers.
This also occurs with variations on the <title> tag, such as:

    <head>
    <title>
    some title</title>
    </head>

"some title" is essentially ignored. I discovered this using WWW::Mechanize:

    use WWW::Mechanize;
    my $mech = new WWW::Mechanize();
    $mech->get('http://www.umm.edu/patiented/articles/what_other_drugs_used_parkinsons_disease_000051_8.htm');
    print $mech->title, "\n";

The expected result is to print "Parkinson's disease", but nothing is printed at all.

Cheers,
Dave
On Wed Nov 05 16:57:07 2008, DIBERRI wrote:
> This also occurs with variations on the <title> tag, such as:
>
> <head>
> <title>
> some title</title>
> </head>
>
> "some title" is essentially ignored.
The problem here was that HTML::HeadParser did not ignore the Unicode BOM in decoded form. I have committed a change that will fix this (in 3.58).
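
For anyone stuck on a release older than 3.58, one possible workaround, going only by the explanation above, is to strip a leading U+FEFF from the already-decoded text before handing it to the parser. This is a minimal sketch and not the committed fix: `title_from_decoded_html` is a hypothetical helper name, and the sample document is made up for illustration.

```perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical helper (not part of HTML::Parser): takes HTML that has
# already been decoded to Perl characters and returns the <title> text.
sub title_from_decoded_html {
    my ($html) = @_;

    # Assumption: the only problem is a decoded BOM (U+FEFF) at the very
    # start of the string, so remove it before parsing.
    $html =~ s/\A\x{FEFF}//;

    my $parser = HTML::HeadParser->new;
    $parser->parse($html);

    # HTML::HeadParser exposes what it found through header().
    return $parser->header('Title');
}

# Made-up sample: a decoded BOM followed by a small document.
my $doc = "\x{FEFF}<html><head><title>some title</title></head><body></body></html>";
my $title = title_from_decoded_html($doc);
print defined $title ? $title : '(no title found)', "\n";
```

With 3.58 or later the substitution should be unnecessary, and it does nothing if no BOM is present.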

