Subject: Parsing documents with head-only tags in the body
Date: Tue, 26 Feb 2013 6:45:54 -0800
To: bug-html-tree [...]
From: Joe Seaton <js-bugtraq [...]>
Hello,

While working with HTML::ParseTree I recently discovered a particularly unpleasant HTML document that failed to parse as I would have liked due to the presence of a <link> tag in the body. Given that this document is very invalid, I'm not sure whether this should be considered a bug or not, but it seemed worth reporting.

A minimal document is as follows:

    <html><head><title>Title</title></head><body>
    <form>
    <p>Before</p>
    <link>
    <div>After</div>
    </form>
    <span>Outside</span>
    </body>
    </html>

This results in the following parse tree:

    <html> @0
      <head> @0.0
        <title> @0.0.0
          "Title"
        <link /> @0.0.1
      <body> @0.1
        <form> @0.1.0
          <p> @
            "Before"
        <div> @0.1.1
          "After"
        <span> @0.1.2
          "Outside"

Notably the div following the link tag is considered a child of the body, rather than the form. For my purposes I care about the contents of the form and nothing else, so I would prefer this div to be contained in the form still.

The relevant part of the trace is:

    Proposing a new LINK under html/body/form.
     * head element LINK found inside BODY!
     (Attaching link under head)
     (Current lineage of pos: LINK under html.)
    Proposing a new text node (\x0a ) under html/head.
     (Attaching text node (\x0a ) under head).
    Proposing a new DIV under html/head.
     * body-element DIV minimizes HEAD, makes implicit BODY.
     (Attaching div under body)

This seems to be due to line 679 (in v5.03):

    $self->{'_pos'} = $self->{'_head'} || die "Where'd my head go?";

The code to reproduce this is fairly trivial:

    use HTML::TreeBuilder;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($ARGV[0]);
    $tree->dump;

Disabling implicit tags causes the document to be parsed as follows, preserving the location of the following div at the expense of having an extraneous link tag.

    <html> @0 (IMPLICIT)
      <html> @0.0
        <head> @0.0.0
          <title> @
            "Title"
        <body> @0.0.1
          <form> @
            <p> @
              "Before"
            <link /> @
            <div> @
              "After"
          <span> @
            "Outside"

I hope this is of some interest to you all.

many thanks,
Joe

