This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 46040
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: dean.karres [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: missing </p> tags
Date: Wed, 13 May 2009 10:14:54 -0500
To: bug-HTML-Tree [...] rt.cpan.org
From: Dean Karres <dean.karres [...] gmail.com>
Hi,

I am running HTML-Tree-3.23 on a RHEL 5.3 server. I am using the Template Toolkit but that happens later in the process. I have an html file:

##########################################
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>ITG</title>
<link href="index.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="itg-about">
<p>The primary mission of ITG is to provide state-of-the-art imaging facilities for researchers at the Institute for Advanced Science and Technology . This service mission is accomplished through two facilities: the Microscopy Suite and the Visualization Laboratory.</p>
<p>A secondary mission of the ITG is to develop advanced imaging technologies with an emphasis on projects in remote instrument control and scientific visualization.</p>
</div>
<div class="itg-column-1">
<div id="itg-iotw">
[% PERL %] print `/old-www/www/exhibits/iotw/new-iotw.cgi`; [% END %]
</div>
<div id="itg-forum">
[% PERL %] print `/old-www/www/publications/forums/last-Forum.cgi`; [% END %]
</div>
</div>
<div class="itg-column-2">
<div id="itg-announcement">
[% PERL %] print `/old-www/www/publications/announcements/announcements.cgi`; [% END %]
</div>
<div id="itg-news">
[% PERL %] print `/old-www/www/publications/news/new-News.cgi`; [% END %]
</div>
</div>
</body>
</html>
##########################################

I have a script that reads this file and harvests the <BODY> text:

#########################################
#!/usr/bin/perl -w

use strict;

select(STDOUT); $|++;

use HTML::TreeBuilder;

my $stdinFile = "";
my $tree = HTML::TreeBuilder->new;
$tree->p_strict(1);
$tree->warn(1);
$tree->implicit_tags(1);
$tree->store_comments(1);

my $body = "";
my $tmp = "";

if ($#ARGV < 0) {
    $ARGV[0] = "/www/www/Index.html";
}

if ($ARGV[0] !~ /\.(htm|html|shm|shtml)(#.*)?$/) {
    die "Malformed query string: \"$#ARGV\"\n"
}

die "Not a file\n" if (!-f $ARGV[0] || -z $ARGV[0]);

$tree->parse_file("$ARGV[0]");

#
# harvest the first H1 tag and any sub-H2 tags
#
eval {
    $body = $tree->look_down('_tag', 'body');
};
die __LINE__ . ": " . $@ if $@;
die "$ARGV[0] is missing a BODY tag\n" if (! $body);

$tmp = $body->as_HTML;
$tmp =~ s/<body>//i;
$tmp =~ s/<\/body>//i;

print STDOUT $tmp;

$tree->delete();

exit(0);
#########################################

The result of running the script on the html is:

#########################################
<div id="itg-about"><p>The primary mission of the ITG is to provide state-of-the-art imaging facilities for researchers at the Institute for Advanced Science and Technology. This service mission is accomplished through two facilities: the Microscopy Suite and the Visualization Laboratory.<p>A secondary mission of the ITG is to develop advanced imaging technologies with an emphasis on projects in remote instrument control and scientific visualization.</div><div class="itg-column-1"><div id="itg-iotw"> [% PERL %] print `/old-www/www/exhibits/iotw/new-iotw.cgi`; [% END %] </div><div id="itg-forum"> [% PERL %] print `/old-www/www/publications/forums/last-Forum.cgi`; [% END %] </div></div><div class="itg-column-2"><div id="itg-announcement"> [% PERL %] print `/old-www/www/publications/announcements/announcements.cgi`; [% END %] </div><div id="itg-news"> [% PERL %] print `/old-www/www/publications/news/new-News.cgi`; [% END %] </div></div>
#########################################

You may note that not quite half-way in is the string: "Laboratory.<p>A secondary". The "</p>" tag is missing in the result. I may have misconfigured the script but I thought:

    $tree->p_strict(1);
    $tree->implicit_tags(1);

would do the trick. What am I missing?

--
Dean Karres
Subject: Re: [rt.cpan.org #46040] AutoReply: missing </p> tags
Date: Wed, 13 May 2009 14:11:57 -0500
To: bug-HTML-Tree [...] rt.cpan.org
From: Dean Karres <dean.karres [...] gmail.com>
Sigh, never mind. Why do I find solutions after I submit bug reports...

The answer is in the as_HTML method. Several closing tags are optional by default. Giving as_HTML an empty set of optional end tags clears this issue right up.

Sorry for the noise.
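For readers landing on this ticket, the fix described above can be sketched roughly as follows. This is a minimal illustration, not the requestor's actual change: the inline `$html` string is a made-up stand-in for the real file, and it assumes the documented `as_HTML($entities, $indent_char, \%optional_end_tags)` signature of HTML::Element, where an empty hash reference as the third argument means no end tag may be omitted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input standing in for the reporter's file:
# two paragraphs inside a div.
my $html = '<html><body><div id="itg-about">'
         . '<p>First paragraph.</p><p>Second paragraph.</p>'
         . '</div></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

my $body = $tree->look_down('_tag', 'body')
    or die "missing BODY element\n";

# Default call: uses the module's built-in set of optional end tags,
# which in the version from this report left out </p>.
my $default = $body->as_HTML;

# Third argument: hash ref of tags whose end tags may be omitted.
# An empty hash ref means every end tag is written out.
my $strict = $body->as_HTML(undef, undef, {});

print "default: $default\n";
print "strict:  $strict\n";

$tree->delete;
```

In the script from the original report, the equivalent change would be replacing `$body->as_HTML` with `$body->as_HTML(undef, undef, {})`; the `p_strict` and `implicit_tags` settings affect parsing, not how the tree is serialized.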
Resolved per requestor.

