Skip Menu |
 

This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 68097
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: cjm [...] cpan.org
Requestors: spudsoup [...] cpan.org
Cc: Jeff.Fearn [...] gmail.com
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 4.2
Fixed in: 4.903-TRIAL



CC: Jeff.Fearn [...] gmail.com
Subject: Patch for HTML::Tree to add new_from_url() (Using LWP)
MIME-Version: 1.0
X-Mailer: MIME-tools 5.427 (Entity 5.427)
X-RT-Original-Encoding: utf-8
Content-Type: multipart/mixed; boundary="----------=_1305060339-18806-236"
Content-Length: 0
Content-Type: text/plain; charset="UTF-8"
Content-Disposition: inline
Content-Transfer-Encoding: binary
Content-Length: 190
Download (untitled) / with headers
text/plain 190b
As discussed, I have written an enhancement to add a new_from_url() static constructor that used LWP to fetch content from a url. -- David Pottage (AKA: SPUDSOUP) david@electric-spoon.com
Subject: new_from_url.patch
MIME-Version: 1.0
Content-Type: text/x-diff; name="new_from_url.patch"
X-Mailer: MIME-tools 5.427 (Entity 5.427)
Content-Disposition: inline; filename="new_from_url.patch"
Content-Transfer-Encoding: binary
Content-Length: 2973
Download new_from_url.patch
text/x-diff 2.9k
diff --git a/Makefile.PL b/Makefile.PL index afe58c2..83c20e8 100644 --- a/Makefile.PL +++ b/Makefile.PL @@ -15,6 +15,7 @@ WriteMakefile 'Test::Exception' => 0, 'HTML::Tagset' => '3.02', 'HTML::Parser' => '3.46', + 'LWP::UserAgent' => 0, 'Module::Build' => 0 } ) diff --git a/lib/HTML/Tree.pm b/lib/HTML/Tree.pm index 31b5fde..d05f90f 100644 --- a/lib/HTML/Tree.pm +++ b/lib/HTML/Tree.pm @@ -55,6 +55,12 @@ sub new_from_content { goto &HTML::TreeBuilder::new_from_content; } +sub new_from_url { + shift; + unshift @_, 'HTML::TreeBuilder'; + goto &HTML::TreeBuilder::new_from_url; +} + 1; __END__ diff --git a/lib/HTML/TreeBuilder.pm b/lib/HTML/TreeBuilder.pm index 3e8b1ff..1000ce3 100644 --- a/lib/HTML/TreeBuilder.pm +++ b/lib/HTML/TreeBuilder.pm @@ -52,6 +52,7 @@ BEGIN { #--------------------------------------------------------------------------- +use LWP::UserAgent; use HTML::Entities (); use HTML::Tagset 3.02 (); @@ -109,6 +110,19 @@ sub new_from_content { # from any number of scalars return $new; } +sub new_from_url { # should accept anything that LWP does. + my $class = shift; + Carp::croak("new_from_url takes only one argument") + unless @_ == 1; + Carp::croak("new_from_url is a class method only") + if ref $class; + my $new = $class->new(); + + my $fetch_result = LWP::UserAgent->new->get( $_[0] ); + $new->parse( $fetch_result->content ); + return $new; +} + # TODO: document more fully? sub parse_content { # from any number of scalars my $tree = shift; diff --git a/t/construct_tree.t b/t/construct_tree.t index 6b1faed..6a23b6d 100644 --- a/t/construct_tree.t +++ b/t/construct_tree.t @@ -1,9 +1,9 @@ -#!/usr/bin/perl -T +#!/usr/bin/perl use warnings; use strict; -use Test::More tests => ( 3 + 7 * 8 ); +use Test::More tests => ( 3 + 7 * 10 ); #initial tests + number of tests in test_new_obj() * number of times called @@ -69,6 +69,29 @@ is( $HTMLPart1 . $HTMLPart2, $HTML, "split \$HTML correctly" ); test_new_obj( $parse_content_obj, "new(); parse_content Scalar" ); } +# URL tests +{ + SKIP: { + eval { + require URI::file; + require LWP::UserAgent; + }; + skip "URI::file or LWP::UserAgent not installed", 2 if $@; + + my $file_url = URI->new("file:".$TestInput); + + { + my $file_obj = HTML::Tree->new_from_url($file_url->as_string); + test_new_obj( $file_obj, "new_from_url Scalar" ); + } + + { + my $file_obj = HTML::Tree->new_from_url($file_url); + test_new_obj( $file_obj, "new_from_url Object" ); + } + } +} + # Scalar REF Tests { my $content_obj = HTML::Tree->new_from_content($HTML);
MIME-Version: 1.0
X-Mailer: MIME-tools 5.427 (Entity 5.427)
Content-Disposition: inline
Content-Type: text/plain; charset="UTF-8"
Message-ID: <rt-3.8.HEAD-18809-1305061204-1110.68097-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
Content-Length: 184
Download (untitled) / with headers
text/plain 184b
I have just spotted a bug in my patch. The skip line in the new tests I added to t/construct_tree.t should be skipping 14 tests not 2 if URI::file or LWP::UserAgent are not installed.
MIME-Version: 1.0
In-Reply-To: <rt-3.8.HEAD-18809-1305061204-1110.68097-0-0 [...] rt.cpan.org>
X-Mailer: MIME-tools 5.427 (Entity 5.427)
Content-Disposition: inline
References: <rt-3.8.HEAD-18809-1305061204-1110.68097-0-0 [...] rt.cpan.org>
Content-Type: text/plain; charset="UTF-8"
Message-ID: <rt-3.8.HEAD-18807-1307442127-98.68097-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
Content-Length: 459
Download (untitled) / with headers
text/plain 459b
On Wed May 11 07:00:04 2011, SPUDSOUP wrote: Show quoted text
> I have just spotted a bug in my patch. > > The skip line in the new tests I added to t/construct_tree.t should be > skipping 14 tests not 2 if URI::file or LWP::UserAgent are not installed.
Hi I applied your patch, but I removed the skip as the test will fail if LWP::UserAgent isn't installed since it's used in TreeBuilder. https://github.com/jfearn/HTML-Tree/commit/5785d7fc8c33e7c43b8d3b678ed4c1f38beeab86
MIME-Version: 1.0
X-Mailer: MIME-tools 5.427 (Entity 5.427)
Content-Disposition: inline
Content-Type: text/plain; charset="UTF-8"
Message-ID: <rt-3.8.HEAD-3357-1338606769-1271.68097-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
Content-Length: 113
Download (untitled) / with headers
text/plain 113b
I added the skip back, because I made LWP an optional requirement. It's not loaded until new_from_url is called.
MIME-Version: 1.0
X-Mailer: MIME-tools 5.427 (Entity 5.427)
Content-Disposition: inline
Content-Type: text/plain; charset="UTF-8"
Message-ID: <rt-3.8.HEAD-8038-1339028359-1195.68097-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
Content-Length: 297
Download (untitled) / with headers
text/plain 297b
I've made some changes to your patch, including checking for errors and non-HTML responses. Please take a look at https://metacpan.org/release/CJM/HTML-Tree-4.902-TRIAL/ and/or https://github.com/madsen/HTML-Tree/blob/master/lib/HTML/TreeBuilder.pm#L116 and see if you have any comments.
From david [...] chrestomanci.org Fri Jun 8 14: 43:45 2012
CC: spudsoup [...] cpan.org, Jeff.Fearn [...] gmail.com
MIME-Version: 1.0
X-Spam-Status: No, score=-6.9 tagged_above=-99.9 required=10 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_HI=-5] autolearn=ham
In-Reply-To: <rt-3.8.HEAD-8038-1339028360-1761.68097-6-0 [...] rt.cpan.org>
X-Spam-Flag: NO
X-Originating-Smarthost03-Ip: [82.70.68.182]
References: <RT-Ticket-68097 [...] rt.cpan.org> <rt-3.8.HEAD-8038-1339028360-1761.68097-6-0 [...] rt.cpan.org>
X-Virus-Scanned: Debian amavisd-new at bestpractical.com
Message-ID: <4FD247C8.8070301 [...] chrestomanci.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
X-RT-Original-Encoding: utf-8
X-Spam-Score: -6.9
Received: from localhost (localhost [127.0.0.1]) by hipster.bestpractical.com (Postfix) with ESMTP id 8A8D92403AE for <cpan-bug+HTML-Tree [...] hipster.bestpractical.com>; Fri, 8 Jun 2012 14:43:45 -0400 (EDT)
Received: from hipster.bestpractical.com ([127.0.0.1]) by localhost (hipster.bestpractical.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id qj1cR2hOMZWd for <cpan-bug+HTML-Tree [...] hipster.bestpractical.com>; Fri, 8 Jun 2012 14:43:43 -0400 (EDT)
Received: from la.mx.develooper.com (x1.develooper.com [207.171.7.70]) by hipster.bestpractical.com (Postfix) with SMTP id DDD0624038A for <bug-HTML-Tree [...] rt.cpan.org>; Fri, 8 Jun 2012 14:43:42 -0400 (EDT)
Received: (qmail 17187 invoked by uid 103); 8 Jun 2012 18:43:41 -0000
Received: from x16.dev (10.0.100.26) by x1.dev with QMQP; 8 Jun 2012 18:43:41 -0000
Received: from smarthost03.mail.zen.net.uk (HELO smarthost03.mail.zen.net.uk) (212.23.1.3) by 16.mx.develooper.com (qpsmtpd/0.80/v0.80-19-gf52d165) with ESMTP; Fri, 08 Jun 2012 11:43:38 -0700
Received: from [82.70.68.182] (helo=www.chrestomanci.org) by smarthost03.mail.zen.net.uk with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.72) (envelope-from <david [...] chrestomanci.org>) id 1Sd49m-0004dR-P1 for bug-HTML-Tree [...] rt.cpan.org; Fri, 08 Jun 2012 18:43:34 +0000
Received: by www.chrestomanci.org (Postfix, from userid 1013) id DEBAD42F4F; Fri, 8 Jun 2012 19:43:32 +0100 (BST)
Received: from [192.168.1.12] (miranda [192.168.1.12]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) (Authenticated sender: david) by www.chrestomanci.org (Postfix) with ESMTPSA id AE1F842F46; Fri, 8 Jun 2012 19:43:21 +0100 (BST)
Delivered-To: cpan-bug+HTML-Tree [...] hipster.bestpractical.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1
Subject: Re: [rt.cpan.org #68097] Patch for HTML::Tree to add new_from_url() (Using LWP)
Return-Path: <david [...] chrestomanci.org>
X-Spam-Check-BY: 16.mx.develooper.com
X-Original-To: cpan-bug+HTML-Tree [...] hipster.bestpractical.com
X-RT-Mail-Extension: html-tree
X-Old-Spam-Status: No, score=-6.9 required=4.0 tests=ALL_TRUSTED,BAYES_00 autolearn=unavailable version=3.3.1
Date: Fri, 08 Jun 2012 19:43:20 +0100
X-Old-Spam-Level:
X-Spam-Level:
To: bug-HTML-Tree [...] rt.cpan.org
Content-Transfer-Encoding: 7bit
From: David Pottage <david [...] chrestomanci.org>
RT-Message-ID: <rt-3.8.HEAD-16104-1339181026-1611.68097-0-0 [...] rt.cpan.org>
Content-Length: 838
Download (untitled) / with headers
text/plain 838b
On 07/06/12 01:19, Christopher J. Madsen via RT wrote: Show quoted text
> <URL: https://rt.cpan.org/Ticket/Display.html?id=68097> > > I've made some changes to your patch, including checking for errors and > non-HTML responses. Please take a look at > > https://metacpan.org/release/CJM/HTML-Tree-4.902-TRIAL/ > > and/or > > > https://github.com/madsen/HTML-Tree/blob/master/lib/HTML/TreeBuilder.pm#L116 > > and see if you have any comments.
The code changes look sensible. My concern is that there are no extra tests to cover the case where the remote server returns non html content. We should probably add some more tests were a local plain texts file is passed to the new_from_url() and we check that we get the correct error. I can write a patch to the tests if you like, or you can if you prefer. -- David Pottage spudsoup@cpan.org
MIME-Version: 1.0
In-Reply-To: <rt-3.8.HEAD-16104-1339181026-1611.68097-0-0 [...] rt.cpan.org>
X-Mailer: MIME-tools 5.427 (Entity 5.427)
Content-Disposition: inline
References: <RT-Ticket-68097 [...] rt.cpan.org> <rt-3.8.HEAD-8038-1339028360-1761.68097-6-0 [...] rt.cpan.org> <4FD247C8.8070301 [...] chrestomanci.org> <rt-3.8.HEAD-16104-1339181026-1611.68097-0-0 [...] rt.cpan.org>
Content-Type: text/plain; charset="UTF-8"
Message-ID: <rt-3.8.HEAD-26791-1339219103-1544.68097-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
Content-Length: 272
Download (untitled) / with headers
text/plain 272b
On Fri, Jun 8, 2012 1:43:46 PM, david@chrestomanci.org wrote: Show quoted text
> My concern is that there are no extra tests to cover the case where > the remote server returns non html content.
Good point. I've added some tests to https://metacpan.org/release/CJM/HTML-Tree-4.903-TRIAL/
From david [...] electric-spoon.com Sat Jun 9 04: 53:12 2012
CC: Jeff.Fearn [...] gmail.com
MIME-Version: 1.0
X-Spam-Status: No, score=-6.9 tagged_above=-99.9 required=10 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_HI=-5] autolearn=ham
In-Reply-To: <rt-3.8.HEAD-26791-1339219103-711.68097-6-0 [...] rt.cpan.org>
X-Spam-Flag: NO
References: <RT-Ticket-68097 [...] rt.cpan.org> <rt-3.8.HEAD-8038-1339028360-1761.68097-6-0 [...] rt.cpan.org> <4FD247C8.8070301 [...] chrestomanci.org> <rt-3.8.HEAD-16104-1339181026-1611.68097-6-0 [...] rt.cpan.org> <rt-3.8.HEAD-26791-1339219103-711.68097-6-0 [...] rt.cpan.org>
X-Virus-Scanned: Debian amavisd-new at bestpractical.com
Message-ID: <4FD30DF2.5010007 [...] electric-spoon.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
X-Originating-Smarthost04-Ip: [82.70.68.182]
X-RT-Original-Encoding: utf-8
X-Spam-Score: -6.9
Received: from localhost (localhost [127.0.0.1]) by hipster.bestpractical.com (Postfix) with ESMTP id 8A3492403CB for <cpan-bug+HTML-Tree [...] hipster.bestpractical.com>; Sat, 9 Jun 2012 04:53:12 -0400 (EDT)
Received: from hipster.bestpractical.com ([127.0.0.1]) by localhost (hipster.bestpractical.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id jR7I4vxzt0Hh for <cpan-bug+HTML-Tree [...] hipster.bestpractical.com>; Sat, 9 Jun 2012 04:53:10 -0400 (EDT)
Received: from la.mx.develooper.com (x1.develooper.com [207.171.7.70]) by hipster.bestpractical.com (Postfix) with SMTP id A49812403CA for <bug-HTML-Tree [...] rt.cpan.org>; Sat, 9 Jun 2012 04:53:10 -0400 (EDT)
Received: (qmail 15681 invoked by uid 103); 9 Jun 2012 08:53:09 -0000
Received: from x16.dev (10.0.100.26) by x1.dev with QMQP; 9 Jun 2012 08:53:09 -0000
Received: from smarthost04.mail.zen.net.uk (HELO smarthost04.mail.zen.net.uk) (212.23.1.4) by 16.mx.develooper.com (qpsmtpd/0.80/v0.80-19-gf52d165) with ESMTP; Sat, 09 Jun 2012 01:53:06 -0700
Received: from [82.70.68.182] (helo=www.chrestomanci.org) by smarthost04.mail.zen.net.uk with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.72) (envelope-from <david [...] electric-spoon.com>) id 1SdHPr-0006MG-9O for bug-HTML-Tree [...] rt.cpan.org; Sat, 09 Jun 2012 08:53:03 +0000
Received: by www.chrestomanci.org (Postfix, from userid 1013) id C506342F4D; Sat, 9 Jun 2012 09:53:01 +0100 (BST)
Received: from [192.168.1.5] (unknown [82.70.68.182]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) (Authenticated sender: david) by www.chrestomanci.org (Postfix) with ESMTPSA id E0B1942166; Sat, 9 Jun 2012 09:49:01 +0100 (BST)
Delivered-To: cpan-bug+HTML-Tree [...] hipster.bestpractical.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120428 Thunderbird/12.0.1
Subject: Re: [rt.cpan.org #68097] Patch for HTML::Tree to add new_from_url() (Using LWP)
Return-Path: <david [...] electric-spoon.com>
X-Spam-Check-BY: 16.mx.develooper.com
X-Original-To: cpan-bug+HTML-Tree [...] hipster.bestpractical.com
X-RT-Mail-Extension: html-tree
X-Old-Spam-Status: No, score=-6.9 required=4.0 tests=ALL_TRUSTED,BAYES_00 autolearn=unavailable version=3.3.1
Date: Sat, 09 Jun 2012 09:48:50 +0100
X-Old-Spam-Level:
X-Spam-Level:
To: bug-HTML-Tree [...] rt.cpan.org
Content-Transfer-Encoding: 7bit
From: David Pottage <david [...] electric-spoon.com>
RT-Message-ID: <rt-3.8.HEAD-3366-1339231993-1761.68097-0-0 [...] rt.cpan.org>
Content-Length: 446
Download (untitled) / with headers
text/plain 446b
On 09/06/2012 06:18, Christopher J. Madsen via RT wrote: Show quoted text
> <URL: https://rt.cpan.org/Ticket/Display.html?id=68097> > > On Fri, Jun 8, 2012 1:43:46 PM, david@chrestomanci.org wrote:
>> My concern is that there are no extra tests to cover the case where >> the remote server returns non html content.
> Good point. I've added some tests to > https://metacpan.org/release/CJM/HTML-Tree-4.903-TRIAL/
Looks good. -- David Pottage spudsoup@cpan.org


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

Please report any issues with rt.cpan.org to rt-cpan-admin@bestpractical.com.