Subject: | GoogleBooks ISBN Scraper fail test (+solution) |
Date: | Fri, 21 Feb 2014 11:50:20 +0100 |
To: | bug-WWW-Scraper-ISBN-GoogleBooks_Driver@rt.cpan.org |
From: | Lyon Lemmens <lyon.lemmens@redlemon.nl> |
Dear reader,
I've been playing around with the excellent ISBN scrapers. However, I couldn't get the GoogleBooks driver to install because it failed its tests: the number of pages was not captured correctly.
With a bit of digging I found that Google Books redirected me to the Dutch site google.books.nl, which your code detected and adapted the language for correctly. For some reason, though, it didn't capture the length of the book. Looking at the source of the HTML page, I could not immediately see what was wrong.
However, I did notice a flag in the URL that tells you whether a redirect has taken place (redir_esc=y). Setting this flag to 'n' up front prevented the redirection completely.
This means that by setting this flag you would always go to the main site and wouldn't need to jump through the language hoops. That would probably simplify the code a bit.
Anyway, for now I made one change to the code:
124c124
< $data->{url} = $code->{'ISBN:'.$isbn}{info_url};
---
> $data->{url} = $code->{'ISBN:'.$isbn}{info_url} . '&redir_esc=n';
This makes it always use the main site, and all tests now pass.
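One caveat with the one-line change is that it assumes info_url already carries a query string, and it would add a second redir_esc if one were already present. A minimal sketch of a more defensive variant follows, assuming the URI module is available. The name force_no_redirect is a hypothetical helper, not part of the driver, and the example URLs are made up for illustration:

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of WWW::Scraper::ISBN::GoogleBooks_Driver):
# return the given URL with redir_esc forced to 'n', whether or not the
# URL already has a query string or an existing redir_esc parameter.
sub force_no_redirect {
    my ($url) = @_;
    my $uri   = URI->new($url);

    # query_form returns a flat list of key/value pairs in original order.
    my @pairs = $uri->query_form;
    my @kept;
    while ( my ( $key, $value ) = splice( @pairs, 0, 2 ) ) {
        # Drop any existing redir_esc so it is not duplicated.
        push @kept, $key, $value unless $key eq 'redir_esc';
    }

    # Write the query back with redir_esc=n appended at the end.
    $uri->query_form( @kept, redir_esc => 'n' );
    return $uri->as_string;
}

# Illustrative URLs only; the real info_url comes from the Google Books response.
print force_no_redirect('http://books.google.com/books?id=abc'), "\n";
print force_no_redirect('http://books.google.com/books?id=abc&redir_esc=y'), "\n";
```

In the driver, line 124 could then read something like `$data->{url} = force_no_redirect($code->{'ISBN:'.$isbn}{info_url});`, though I have only tested the simple string append shown in the diff above.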
--
Regards
Lyon Lemmens