
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 83641
Status: new
Priority: 0/
Queue: HTML-Tree

Owner: Nobody in particular
Requestors: js-bugtraq [...]

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)

Subject: Parsing documents with head-only tags in the body
Date: Tue, 26 Feb 2013 6:45:54 -0800
To: bug-html-tree [...]
From: Joe Seaton <js-bugtraq [...]>
Hello,

While working with HTML::TreeBuilder I recently discovered a particularly unpleasant HTML document that failed to parse as I would have liked, due to the presence of a <link> tag in the body. Given that this document is very invalid, I'm not sure whether this should be considered a bug or not, but it seemed worth reporting.

A minimal document is as follows:

    <html><head><title>Title</title></head><body>
    <form>
    <p>Before</p>
    <link>
    <div>After</div>
    </form>
    <span>Outside</span>
    </body>
    </html>

This results in the following parse tree:

    <html> @0
      <head> @0.0
        <title> @0.0.0
          "Title"
        <link /> @0.0.1
      <body> @0.1
        <form> @0.1.0
          <p> @
            "Before"
        <div> @0.1.1
          "After"
        <span> @0.1.2
          "Outside"

Notably, the div following the link tag is considered a child of the body rather than of the form. For my purposes I care about the contents of the form and nothing else, so I would prefer this div to remain inside the form.

The relevant part of the trace is:

    Proposing a new LINK under html/body/form.
     * head element LINK found inside BODY!
      (Attaching link under head)
    (Current lineage of pos: LINK under html.)
    Proposing a new text node (\x0a ) under html/head.
    (Attaching text node (\x0a ) under head).
    Proposing a new DIV under html/head.
     * body-element DIV minimizes HEAD, makes implicit BODY.
      (Attaching div under body)

This seems to be due to line 679 (in v5.03), which moves the current insertion position to the head and never restores it to the form:

    $self->{'_pos'} = $self->{'_head'} || die "Where'd my head go?";

The code to reproduce this is fairly trivial:

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($ARGV[0]);
    $tree->dump;

Disabling implicit tags causes the document to be parsed as follows, preserving the location of the following div at the expense of leaving an extraneous link tag in the form:

    <html> @0 (IMPLICIT)
      <html> @0.0
        <head> @0.0.0
          <title> @
            "Title"
        <body> @0.0.1
          <form> @
            <p> @
              "Before"
            <link /> @
            <div> @
              "After"
          <span> @
            "Outside"

I hope this is of some interest to you all.

Many thanks,
Joe
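
For anyone triaging this, the reproduction above can be made self-contained by inlining the minimal document and comparing the two modes directly. This is only a sketch: the helper name div_inside_form is hypothetical (it is not part of HTML::TreeBuilder), and it assumes that implicit_tags(0) is the setting meant by "disabling implicit tags" in the report.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The minimal document from the report, inlined so no input file is needed.
my $html = <<'HTML';
<html><head><title>Title</title></head><body>
<form>
<p>Before</p>
<link>
<div>After</div>
</form>
<span>Outside</span>
</body>
</html>
HTML

# Hypothetical helper (not part of HTML::TreeBuilder): parse the document
# with implicit tags on or off and report whether the <div> ended up
# somewhere beneath the <form>.
sub div_inside_form {
    my ($implicit_tags) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->implicit_tags($implicit_tags);
    $tree->parse($html);
    $tree->eof;

    my $form   = $tree->look_down(_tag => 'form');
    my $inside = ($form && $form->look_down(_tag => 'div')) ? 1 : 0;
    $tree->delete;    # release the tree explicitly
    return $inside;
}

for my $mode (1, 0) {
    printf "implicit_tags=%d: div inside form? %s\n",
        $mode, div_inside_form($mode) ? 'yes' : 'no';
}
```

Going by the dumps in the report, v5.03 would be expected to answer "no" with implicit tags enabled and "yes" with them disabled; other versions may behave differently.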
