
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 99936
Status: open
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: porton [...] narod.ru
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Wrong parsing HTML
Date: Fri, 31 Oct 2014 19:00:35 +0200
To: bug-html-tree [...] rt.cpan.org
From: Victor Porton <porton [...] narod.ru>
File test2.html:
[[[
<html>
<head>
<title>Test</title>
</head>
<body>
<form>
<link></link>
<input name="x" />
</form>
</body>
</html>
]]]

[[[
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new();
$tree->parse_file("test2.html");
print $tree->as_HTML, "\n";
]]]

Result:
[[[
<html><head><title>Test</title><link /></head><body><form></form><input name="x" /></body></html>
]]]

It closes the <form> tag in the wrong place, which puts the <input> outside of the form. The <link> tag also ends up in the wrong place.

The example is based on (stripped-down) real HTML code from a third-party site. We need to make it work. Yes, the <link> tag is misplaced in the source document, but we need it to work anyway.

I will attempt to fix this error in HTML::TreeBuilder but may need your help.

--
Victor Porton - http://portonvictor.org
Subject: Re: [rt.cpan.org #99936] AutoReply: Wrong parsing HTML
Date: Fri, 31 Oct 2014 19:50:10 +0200
To: "bug-HTML-Tree [...] rt.cpan.org" <bug-html-tree [...] rt.cpan.org>
From: Victor Porton <porton [...] narod.ru>
Oh, this is a duplicate of Bug #83641.

Bug #83641 says: "Given that this document is very invalid, I'm not sure whether this should be considered a bug or not, but it seemed worth reporting."

For our company, however, it is important to fix this bug: we use third-party HTML documents that are invalid, and we cannot make them valid. So we need it to work even with invalid HTML files.

--
Victor Porton - http://portonvictor.org
It should be sufficient to do:

    $HTML::Tagset::isHeadOrBodyElement{link} = 1;
    $HTML::Tagset::isHeadElement{link} = undef;

after loading HTML::Tree but before parsing. If the HTML has other head-only tags in the body, you can do the same for them.

This is messing with global variables, so it'll affect the whole program. You can use 'local' to limit the scope.
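
A minimal sketch of the 'local' variant mentioned above, using the document from the original report. The helper name parse_lenient is hypothetical (it is not part of HTML::Tree), and the sketch feeds the HTML in as a string via parse_content rather than reading test2.html, so that it is self-contained. Whether the resulting tree keeps <input> inside <form> depends on the installed HTML::Tree version, so no expected output is shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Tagset;
use HTML::TreeBuilder;

# Hypothetical helper, not part of HTML::Tree: parse an HTML string while
# treating <link> as allowed in either <head> or <body>.
sub parse_lenient {
    my ($html) = @_;

    # 'local' saves the current values of these two hash elements and
    # restores them automatically when this sub returns, so the rest of
    # the program keeps the default HTML::Tagset tables.
    local $HTML::Tagset::isHeadOrBodyElement{link} = 1;
    local $HTML::Tagset::isHeadElement{link}       = undef;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);    # parses the string and finishes the tree
    return $tree;
}

# The test document from the original report, inlined as a string.
my $tree = parse_lenient(<<'END_HTML');
<html>
<head>
<title>Test</title>
</head>
<body>
<form>
<link></link>
<input name="x" />
</form>
</body>
</html>
END_HTML

print $tree->as_HTML, "\n";
$tree->delete;    # free the tree explicitly, as the HTML::Tree docs recommend
```

The overrides only need to be in effect while the parser is running, which is why they sit inside the helper: the returned tree can be inspected or serialized after the tables have been restored.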

