This queue is for tickets about the Web-Scraper CPAN distribution.

Report information
The Basics
Id: 29799
Status: open
Priority: 0/
Queue: Web-Scraper

People
Owner: Nobody in particular
Requestors: jmason [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.20
Fixed in: (no value)



Subject: <br> tag should create whitespace for TEXT type
hi!

quick report -- probably easiest if I demo it. This scraper:

    use URI;
    use Web::Scraper;
    my $s_show = scraper { process "span.tableListing-date", date => 'TEXT'; };
    my $starturl = "http://www.ticketmaster.ie/venue/198299";
    my $res = $s_show->scrape( URI->new($starturl));
    use Data::Dumper; die "JMD ".Dumper($res);

runs against a Ticketmaster page with this HTML:

    <span class="tableListing-date">Sat 06/10/07<br>20:00</span></td>

it should produce something like

    JMD $VAR1 = {
      'date' => 'Sat 06/10/07 20:00'
    };

(or maybe with a \n.) instead it produces

    JMD $VAR1 = {
      'date' => 'Sat 06/10/0720:00'
    };

note the missing whitespace in place of the <br>.

Web::Scraper is great fun btw, I'm amazed how easy this is ;)

Date: Fri, 5 Oct 2007 16:34:59 -0700
From: "Tatsuhiko Miyagawa" <miyagawa [...] gmail.com>
Thanks for the report. I think this is an HTML::Element issue, since the TEXT type just calls HTML::Element's as_text method. Could you file a report against that module?

On 10/5/07, via RT <bug-Web-Scraper@rt.cpan.org> wrote:
> Fri Oct 05 19:18:59 2007: Request 29799 was acted upon.
> Transaction: Ticket created by JMASON
>        Queue: Web-Scraper
>      Subject: <br> tag should create whitespace for TEXT type
>    Broken in: 0.20
>     Severity: Normal
>        Owner: Nobody
>   Requestors: JMASON@cpan.org
>       Status: new
>  Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=29799 >
>
> hi!
>
> quick report -- probably easiest if I demo it. This scraper:
>
> use URI;
> use Web::Scraper;
> my $s_show = scraper { process "span.tableListing-date", date => 'TEXT'; };
> my $starturl = "http://www.ticketmaster.ie/venue/198299";
> my $res = $s_show->scrape( URI->new($starturl));
> use Data::Dumper; die "JMD ".Dumper($res);
>
> runs against a Ticketmaster page with this HTML:
>
> <span class="tableListing-date">Sat 06/10/07<br>20:00</span></td>
>
> it should produce something like
>
> JMD $VAR1 = {
>   'date' => 'Sat 06/10/07 20:00'
> };
>
> (or maybe with a \n.) instead it produces
>
> JMD $VAR1 = {
>   'date' => 'Sat 06/10/0720:00'
> };
>
> note the missing whitespace in place of the <br>.
>
> Web::Scraper is great fun btw, I'm amazed how easy this is ;)
>
-- Tatsuhiko Miyagawa
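
The behaviour described in this reply can be checked without fetching the live page. The following is a minimal sketch, assuming HTML::TreeBuilder (which provides HTML::Element) is installed; the HTML string is copied from the report, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Markup taken from the report; <br> sits between two text nodes.
my $html = '<span class="tableListing-date">Sat 06/10/07<br>20:00</span>';

my $tree = HTML::TreeBuilder->new_from_content($html);
my $span = $tree->look_down(_tag => 'span', class => 'tableListing-date');

# as_text concatenates the text nodes under the element, so the
# <br> element itself contributes nothing to the result.
my $text = $span->as_text;
print "$text\n";    # prints: Sat 06/10/0720:00

$tree->delete;      # free the parse tree
```

If this prints the two values run together, the missing whitespace comes from as_text itself rather than from anything specific to the scraper block.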
hmm, judging by the response on that bug, as_text may not be an appropriate method to use --

'If I have a block of HTML 3, for example, that reads:

<xmp><br></xmp>

That <br> should not be converted, but a blind regexp engine would convert it. Beyond that, <br> is not the only element that would need this treatment. People expect the same with <hr> as well as <p>, <div>, <blockquote> and other block-level elements.

as_text was never intended to be used as a sanitization method nor a display method - the man page specifically states that it is the concatenation of text elements as the tree is descended. Changing that is a design decision and won't be considered until the major version is bumped up to 4.0, which is down the road quite a ways.'

I don't agree, but I can see his point to a degree. I guess some other way of rendering text blocks is necessary :(
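
One possible "other way" is to walk the parsed tree by hand instead of calling as_text. The sketch below is only an illustration under stated assumptions: text_with_breaks is a hypothetical helper, not part of Web::Scraper or HTML::Element, and it handles only <br>. The block-level elements mentioned in the quoted response would need similar handling. Because it works on the parsed tree rather than on the raw markup, it does not involve the blind regexp substitution the response warns about.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: like as_text, but emits a space for each <br>.
sub text_with_breaks {
    my ($node) = @_;
    return $node unless ref $node;        # plain string = text node
    return ' ' if $node->tag eq 'br';     # <br> element becomes a space
    # Any other element: concatenate whatever its children produce.
    return join '', map { text_with_breaks($_) } $node->content_list;
}

my $html = '<span class="tableListing-date">Sat 06/10/07<br>20:00</span>';
my $tree = HTML::TreeBuilder->new_from_content($html);
my $span = $tree->look_down(_tag => 'span', class => 'tableListing-date');

my $date = text_with_breaks($span);
print "$date\n";    # prints: Sat 06/10/07 20:00

$tree->delete;
```

Web::Scraper's process also accepts a code reference in place of 'TEXT' and passes it the matched element, so a helper along these lines could be used as date => sub { text_with_breaks($_[0]) } in the scraper from the original report.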

