Skip Menu |
 

This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 68097
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: cjm [...] cpan.org
Requestors: spudsoup [...] cpan.org
Cc: Jeff.Fearn [...] gmail.com
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 4.2
Fixed in: 4.903-TRIAL



CC: Jeff.Fearn [...] gmail.com
Subject: Patch for HTML::Tree to add new_from_url() (Using LWP)
Download (untitled) / with headers
text/plain 190b
As discussed, I have written an enhancement to add a new_from_url() static constructor that used LWP to fetch content from a url. -- David Pottage (AKA: SPUDSOUP) david@electric-spoon.com
Subject: new_from_url.patch
Download new_from_url.patch
text/x-diff 2.9k
diff --git a/Makefile.PL b/Makefile.PL index afe58c2..83c20e8 100644 --- a/Makefile.PL +++ b/Makefile.PL @@ -15,6 +15,7 @@ WriteMakefile 'Test::Exception' => 0, 'HTML::Tagset' => '3.02', 'HTML::Parser' => '3.46', + 'LWP::UserAgent' => 0, 'Module::Build' => 0 } ) diff --git a/lib/HTML/Tree.pm b/lib/HTML/Tree.pm index 31b5fde..d05f90f 100644 --- a/lib/HTML/Tree.pm +++ b/lib/HTML/Tree.pm @@ -55,6 +55,12 @@ sub new_from_content { goto &HTML::TreeBuilder::new_from_content; } +sub new_from_url { + shift; + unshift @_, 'HTML::TreeBuilder'; + goto &HTML::TreeBuilder::new_from_url; +} + 1; __END__ diff --git a/lib/HTML/TreeBuilder.pm b/lib/HTML/TreeBuilder.pm index 3e8b1ff..1000ce3 100644 --- a/lib/HTML/TreeBuilder.pm +++ b/lib/HTML/TreeBuilder.pm @@ -52,6 +52,7 @@ BEGIN { #--------------------------------------------------------------------------- +use LWP::UserAgent; use HTML::Entities (); use HTML::Tagset 3.02 (); @@ -109,6 +110,19 @@ sub new_from_content { # from any number of scalars return $new; } +sub new_from_url { # should accept anything that LWP does. + my $class = shift; + Carp::croak("new_from_url takes only one argument") + unless @_ == 1; + Carp::croak("new_from_url is a class method only") + if ref $class; + my $new = $class->new(); + + my $fetch_result = LWP::UserAgent->new->get( $_[0] ); + $new->parse( $fetch_result->content ); + return $new; +} + # TODO: document more fully? sub parse_content { # from any number of scalars my $tree = shift; diff --git a/t/construct_tree.t b/t/construct_tree.t index 6b1faed..6a23b6d 100644 --- a/t/construct_tree.t +++ b/t/construct_tree.t @@ -1,9 +1,9 @@ -#!/usr/bin/perl -T +#!/usr/bin/perl use warnings; use strict; -use Test::More tests => ( 3 + 7 * 8 ); +use Test::More tests => ( 3 + 7 * 10 ); #initial tests + number of tests in test_new_obj() * number of times called @@ -69,6 +69,29 @@ is( $HTMLPart1 . $HTMLPart2, $HTML, "split \$HTML correctly" ); test_new_obj( $parse_content_obj, "new(); parse_content Scalar" ); } +# URL tests +{ + SKIP: { + eval { + require URI::file; + require LWP::UserAgent; + }; + skip "URI::file or LWP::UserAgent not installed", 2 if $@; + + my $file_url = URI->new("file:".$TestInput); + + { + my $file_obj = HTML::Tree->new_from_url($file_url->as_string); + test_new_obj( $file_obj, "new_from_url Scalar" ); + } + + { + my $file_obj = HTML::Tree->new_from_url($file_url); + test_new_obj( $file_obj, "new_from_url Object" ); + } + } +} + # Scalar REF Tests { my $content_obj = HTML::Tree->new_from_content($HTML);
Download (untitled) / with headers
text/plain 184b
I have just spotted a bug in my patch. The skip line in the new tests I added to t/construct_tree.t should be skipping 14 tests not 2 if URI::file or LWP::UserAgent are not installed.
Download (untitled) / with headers
text/plain 459b
On Wed May 11 07:00:04 2011, SPUDSOUP wrote: Show quoted text
> I have just spotted a bug in my patch. > > The skip line in the new tests I added to t/construct_tree.t should be > skipping 14 tests not 2 if URI::file or LWP::UserAgent are not installed.
Hi I applied your patch, but I removed the skip as the test will fail if LWP::UserAgent isn't installed since it's used in TreeBuilder. https://github.com/jfearn/HTML-Tree/commit/5785d7fc8c33e7c43b8d3b678ed4c1f38beeab86
Download (untitled) / with headers
text/plain 113b
I added the skip back, because I made LWP an optional requirement. It's not loaded until new_from_url is called.
Download (untitled) / with headers
text/plain 297b
I've made some changes to your patch, including checking for errors and non-HTML responses. Please take a look at https://metacpan.org/release/CJM/HTML-Tree-4.902-TRIAL/ and/or https://github.com/madsen/HTML-Tree/blob/master/lib/HTML/TreeBuilder.pm#L116 and see if you have any comments.
CC: spudsoup [...] cpan.org, Jeff.Fearn [...] gmail.com
Subject: Re: [rt.cpan.org #68097] Patch for HTML::Tree to add new_from_url() (Using LWP)
Date: Fri, 08 Jun 2012 19:43:20 +0100
To: bug-HTML-Tree [...] rt.cpan.org
From: David Pottage <david [...] chrestomanci.org>
Download (untitled) / with headers
text/plain 838b
On 07/06/12 01:19, Christopher J. Madsen via RT wrote: Show quoted text
> <URL: https://rt.cpan.org/Ticket/Display.html?id=68097> > > I've made some changes to your patch, including checking for errors and > non-HTML responses. Please take a look at > > https://metacpan.org/release/CJM/HTML-Tree-4.902-TRIAL/ > > and/or > > > https://github.com/madsen/HTML-Tree/blob/master/lib/HTML/TreeBuilder.pm#L116 > > and see if you have any comments.
The code changes look sensible. My concern is that there are no extra tests to cover the case where the remote server returns non html content. We should probably add some more tests were a local plain texts file is passed to the new_from_url() and we check that we get the correct error. I can write a patch to the tests if you like, or you can if you prefer. -- David Pottage spudsoup@cpan.org
Download (untitled) / with headers
text/plain 272b
On Fri, Jun 8, 2012 1:43:46 PM, david@chrestomanci.org wrote: Show quoted text
> My concern is that there are no extra tests to cover the case where > the remote server returns non html content.
Good point. I've added some tests to https://metacpan.org/release/CJM/HTML-Tree-4.903-TRIAL/
CC: Jeff.Fearn [...] gmail.com
Subject: Re: [rt.cpan.org #68097] Patch for HTML::Tree to add new_from_url() (Using LWP)
Date: Sat, 09 Jun 2012 09:48:50 +0100
To: bug-HTML-Tree [...] rt.cpan.org
From: David Pottage <david [...] electric-spoon.com>
Download (untitled) / with headers
text/plain 446b
On 09/06/2012 06:18, Christopher J. Madsen via RT wrote: Show quoted text
> <URL: https://rt.cpan.org/Ticket/Display.html?id=68097> > > On Fri, Jun 8, 2012 1:43:46 PM, david@chrestomanci.org wrote:
>> My concern is that there are no extra tests to cover the case where >> the remote server returns non html content.
> Good point. I've added some tests to > https://metacpan.org/release/CJM/HTML-Tree-4.903-TRIAL/
Looks good. -- David Pottage spudsoup@cpan.org


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

Please report any issues with rt.cpan.org to rt-cpan-admin@bestpractical.com.