This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 46040
Status: resolved
Priority: 0
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: dean.karres [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



From: Dean Karres <dean.karres [...] gmail.com>
To: bug-HTML-Tree [...] rt.cpan.org
Date: Wed, 13 May 2009 10:14:54 -0500
Subject: missing </p> tags

Hi,

I am running HTML-Tree-3.23 on a RHEL 5.3 server. I am using the Template Toolkit but that happens later in the process.

I have an html file:

##########################################
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>ITG</title>
<link href="index.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="itg-about">
<p>The primary mission of ITG is to provide state-of-the-art imaging facilities for researchers at the Institute for Advanced Science and Technology . This service mission is accomplished through two facilities: the Microscopy Suite and the Visualization Laboratory.</p>
<p>A secondary mission of the ITG is to develop advanced imaging technologies with an emphasis on projects in remote instrument control and scientific visualization.</p>
</div>
<div class="itg-column-1">
<div id="itg-iotw">
[% PERL %]
print `/old-www/www/exhibits/iotw/new-iotw.cgi`;
[% END %]
</div>
<div id="itg-forum">
[% PERL %]
print `/old-www/www/publications/forums/last-Forum.cgi`;
[% END %]
</div>
</div>
<div class="itg-column-2">
<div id="itg-announcement">
[% PERL %]
print `/old-www/www/publications/announcements/announcements.cgi`;
[% END %]
</div>
<div id="itg-news">
[% PERL %]
print `/old-www/www/publications/news/new-News.cgi`;
[% END %]
</div>
</div>
</body>
</html>
##########################################

I have a script that reads this file and harvests the <BODY> text:

#########################################
#!/usr/bin/perl -w

use strict;

select(STDOUT); $|++;

use HTML::TreeBuilder;

my $stdinFile = "";
my $tree = HTML::TreeBuilder->new;
$tree->p_strict(1);
$tree->warn(1);
$tree->implicit_tags(1);
$tree->store_comments(1);

my $body = "";
my $tmp = "";

if ($#ARGV < 0) {
    $ARGV[0] = "/www/www/Index.html";
}

if ($ARGV[0] !~ /\.(htm|html|shm|shtml)(#.*)?$/) {
    die "Malformed query string: \"$#ARGV\"\n"
}

die "Not a file\n" if (!-f $ARGV[0] || -z $ARGV[0]);

$tree->parse_file("$ARGV[0]");

#
# harvest the first H1 tag and any sub-H2 tags
#
eval {
    $body = $tree->look_down('_tag', 'body');
};
die __LINE__ . ": " . $@ if $@;
die "$ARGV[0] is missing a BODY tag\n" if (! $body);

$tmp = $body->as_HTML;
$tmp =~ s/<body>//i;
$tmp =~ s/<\/body>//i;

print STDOUT $tmp;

$tree->delete();

exit(0);
#########################################

The result of running the script on the html is:

#########################################
<div id="itg-about"><p>The primary mission of the ITG is to provide state-of-the-art imaging facilities for researchers at the Institute for Advanced Science and Technology. This service mission is accomplished through two facilities: the Microscopy Suite and the Visualization Laboratory.<p>A secondary mission of the ITG is to develop advanced imaging technologies with an emphasis on projects in remote instrument control and scientific visualization.</div><div class="itg-column-1"><div id="itg-iotw"> [% PERL %] print `/old-www/www/exhibits/iotw/new-iotw.cgi`; [% END %] </div><div id="itg-forum"> [% PERL %] print `/old-www/www/publications/forums/last-Forum.cgi`; [% END %] </div></div><div class="itg-column-2"><div id="itg-announcement"> [% PERL %] print `/old-www/www/publications/announcements/announcements.cgi`; [% END %] </div><div id="itg-news"> [% PERL %] print `/old-www/www/publications/news/new-News.cgi`; [% END %] </div></div>
#########################################

You may note that not quite half-way in is the string: "Laboratory.<p>A secondary". The "</p>" tag is missing in the result.

I may have misconfigured the script but I thought:

    $tree->p_strict(1);
    $tree->implicit_tags(1);

would do the trick. What am I missing?

--
Dean Karres

From: Dean Karres <dean.karres [...] gmail.com>
To: bug-HTML-Tree [...] rt.cpan.org
Date: Wed, 13 May 2009 14:11:57 -0500
Subject: Re: [rt.cpan.org #46040] AutoReply: missing </p> tags

Sigh, never mind. Why do I find solutions after I submit bug reports...

The answer is in the as_HTML method: several closing tags are optional by default, so they are left out of the output. Giving as_HTML an empty set of optional end tags clears this issue right up.

Sorry for the noise.
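
For anyone landing on this ticket later, here is a minimal sketch of the fix described above. It relies on the third argument of as_HTML, a hash reference naming the tags whose end tags may be omitted; per the HTML::Element documentation, passing an empty hash ref means no end tag is optional for that dump. The helper name body_html_with_end_tags and the sample input are made up for illustration and are not from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of the original report): parse an HTML
# string and return the serialized <body> element with every end tag
# written out explicitly.
sub body_html_with_end_tags {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    my $body = $tree->look_down('_tag', 'body')
        or die "input is missing a BODY tag\n";

    # as_HTML($entities, $indent_char, \%optional_end_tags)
    # undef, undef keeps the defaults for entity encoding and indentation;
    # the empty hash ref says that no end tag may be omitted.
    my $out = $body->as_HTML(undef, undef, {});

    $tree->delete;
    return $out;
}

# Sample input chosen for illustration only.
print body_html_with_end_tags(
    '<html><body><div><p>First paragraph.</p>'
  . '<p>Second paragraph.</p></div></body></html>'
);
```

In the script from the report, the equivalent change would be to replace `$body->as_HTML` with `$body->as_HTML(undef, undef, {})`. The p_strict and implicit_tags settings influence how the tree is built during parsing, which is likely why they had no visible effect on the serialized output.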
Resolved per requestor.

