This queue is for tickets about the Web-Scraper CPAN distribution.

Report information
The Basics
Id: 85443
Status: open
Priority: 0/
Queue: Web-Scraper

People
Owner: Nobody in particular
Requestors: ipluta [...] wp.pl
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: don't call decoded_content if content is already unicode encoded
When using URI or HTTP::Response object as an argument to scrape(), simply $stuff->content should be used as $html content, in place of unconditional $stuff->decoded_content if $stuff->content is already utf-8 encoded. "wide character" errors may follow, otherwise.

Here's a patch ($VERSION = '0.37'):

diff --git a/lib/Web/Scraper.pm b/lib/Web/Scraper.pm
index aca019c..7ad9b7f 100644
--- a/lib/Web/Scraper.pm
+++ b/lib/Web/Scraper.pm
@@ -64,7 +64,10 @@ sub scrape {
         return $self->scrape($res, $stuff->as_string);
     } elsif (blessed($stuff) && $stuff->isa('HTTP::Response')) {
         if ($stuff->is_success) {
-            $html = $stuff->decoded_content;
+            $html =
+                $stuff->content_charset =~ /utf\-8/i
+                ? $stuff->content
+                : $stuff->decoded_content;
         } else {
             croak "GET " . $stuff->request->uri . " failed: ", $stuff->status_line;
         }

From miyagawa [...] gmail.com Sun May 19 14:43:35 2013
On Sun, May 19, 2013 at 7:59 AM, Ireneusz Pluta via RT <bug-Web-Scraper@rt.cpan.org> wrote:
> When using URI or HTTP::Response object as an argument to scrape(), simply $stuff->content should be used as $html content, in place of unconditional $stuff->decoded_content if $stuff->content is already utf-8 encoded. "wide character" errors may follow, otherwise.
You might not understand what `decoded_content` does since if the content is utf-8 "encoded", decoding it is obviously the right thing to do.

If you have "Wide character" warnings (not errors I assume) elsewhere that sounds like more of an issue that has to be fixed there, not inside Web::Scraper like this.
-- Tatsuhiko Miyagawa
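
To make the bytes-versus-characters point above concrete, here is a minimal sketch that uses only the core Encode module. It does not call Web::Scraper or HTTP::Response; it assumes that `decoded_content` performs this kind of charset decoding for a textual response, which is how the HTTP::Message documentation describes it. The sample string and the variable names are illustrative, not taken from the ticket.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "składa się" as it would arrive off the wire: a string of UTF-8 octets.
my $bytes = "sk\xC5\x82ada si\xC4\x99";

# Decoding turns the octets into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "octets:     length %d\n", length $bytes;   # counts bytes: 12
printf "characters: length %d\n", length $chars;   # counts letters: 10

# Character-level operations only behave sensibly on the decoded string.
my ($word_from_bytes) = $bytes =~ /^(\w+)/;   # stops at the first non-ASCII octet
my ($word_from_chars) = $chars =~ /^(\w+)/;   # matches the whole first word

printf "first word from octets:     %d chars\n", length $word_from_bytes;
printf "first word from characters: %d chars\n", length $word_from_chars;

# Encoding again recovers the original octets, so nothing is lost by decoding.
my $roundtrip = encode('UTF-8', $chars);
print "roundtrip ok\n" if $roundtrip eq $bytes;
```

Working on the undecoded octets would hand the scraper a string in which every non-ASCII letter is split into two or more separate "characters", which is why decoding first is the safer default.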
From: ipluta [...] wp.pl
On Sun 19 May 2013, 14:43:35, miyagawa@gmail.com wrote:
> You might not understand what `decoded_content` does since if the
> content is utf-8 "encoded", decoding it is obviously the right thing
> to do.
>
> If you have "Wide character" warnings (not errors I assume) elsewhere
> that sounds like more of an issue that has to be fixed there, not
> inside Web::Scraper like this.
Tatsuhiko,

thanks for your response. It's true that my understanding of Perl's Unicode handling is somewhat behind where it should be :-).

Anyway, could you please take a look at the following paste of a session with your bin/scraper interactive utility, scraping a fragment of the Polish Perl Mongers site? Note the "Wide character" warning at the 'y' command:

$ scraper http://warszawa.pm.org/
scraper> process 'p', 'p', 'text';
scraper> y
Wide character in warn at /usr/local/perl/bin/scraper line 70.
---
p: 'Grupa Warszawa.pm składa się z osób zajmujących się zawodowo lub hobby’stycznie językiem Perl, dynamicznymi językami programowania oraz całym mnóstwem zagadnień mniej lub bardziej związanych ze społecznością języka Perl i open source. Jednak, żeby być szczerym, trzeba powiedzieć, iż całego czasu wolnego nie spędzamy rozwiązując zagadki programistyczne, o czym świadczą choćby nasze spotkania!'
scraper> c
#!/usr/local/perl-5.16.3/bin/perl
use strict;
use Web::Scraper;
use URI;

my $uri = URI->new("http://warszawa.pm.org/");
my $scraper = scraper {
    process 'p', 'p', 'text';
};
my $result = $scraper->scrape($uri);
scraper>
From miyagawa [...] gmail.com Sun May 19 16:13:46 2013
That's just a warning from trying to "warn" decoded Unicode strings to the terminal, and you can totally ignore it.
-- Tatsuhiko Miyagawa
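
For readers who would rather silence the warning than ignore it, the following is a minimal sketch of where it comes from and two common ways to avoid it. It is an illustration under stated assumptions, not a patch to bin/scraper: it uses `print` to in-memory filehandles instead of `warn` to the terminal so that the behaviour can be captured, and the string and handle names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A decoded character string, like the text Web::Scraper returns.
my $text = "sk\x{142}ada si\x{119}";

# Collect warnings instead of letting them reach the terminal.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# 1. A handle with no encoding layer: Perl has to guess, and warns.
open my $plain, '>', \my $plain_buf or die "open failed: $!";
print {$plain} $text;
close $plain;
my $warned_without_layer = scalar @warnings;

# 2. A handle with an explicit :encoding layer: no warning.
open my $layered, '>:encoding(UTF-8)', \my $layered_buf or die "open failed: $!";
print {$layered} $text;
close $layered;

# 3. Encoding by hand just before output: no warning either.
open my $manual, '>', \my $manual_buf or die "open failed: $!";
print {$manual} encode('UTF-8', $text);
close $manual;

printf "warnings without a layer: %d\n", $warned_without_layer;
printf "warnings in total:        %d\n", scalar @warnings;
```

In a script that writes to the terminal, the equivalent of option 2 is `binmode(STDOUT, ':encoding(UTF-8)')` (and the same for STDERR if `warn` is used), applied once near the top of the program. Either way the fix belongs at the output boundary, not in the code that decodes the response.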

