
This queue is for tickets about the libwww-perl CPAN distribution.

Report information
The Basics
Id: 5974
Status: resolved
Priority: 0/
Queue: libwww-perl

People
Owner: Nobody in particular
Requestors: ville.skytta [...] iki.fi
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 5.77
Fixed in: (no value)

Attachments


Subject: robots.txt User-Agent substring match: inverted logic
The way I interpret the robots.txt spec, a robot should do the substring match by looking for the string it parsed from robots.txt, case-insensitively, within its own versionless user-agent string, not the other way around as LWP up to 5.78 seems to do. So, IMO a robot "FooBarBot" should match "User-Agent: Bar" in robots.txt, but a robot "Bar" should not match "User-Agent: FooBarBot". The "not-yet-deployed" draft states this slightly more clearly than the original spec; compare http://www.robotstxt.org/wc/norobots.html with http://www.robotstxt.org/wc/norobots-rfc.html (section 3.2.1). The included patch fixes this and adds some test cases (plus a trivial comment typo fix).
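The corrected matching direction can be sketched as follows (a Python illustration of the rule, not LWP's actual code; the function name applies_to is mine):

```python
def applies_to(me, ua_line):
    """True if a "User-Agent:" token from robots.txt applies to this robot.

    me      -- the robot's own versionless user-agent name, e.g. "FooBarBot"
    ua_line -- the token parsed from a "User-Agent:" line in robots.txt

    Mirrors the patched Perl, index(lc($me), lc($ua_line)) >= 0: the
    robots.txt token must occur, case-insensitively, inside the robot's
    own name, not the other way around.
    """
    return ua_line.lower() in me.lower()
```

With this rule, applies_to("FooBarBot", "Bar") holds while applies_to("Bar", "FooBarBot") does not, matching the interpretation above.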
Attachment: lwp-robot-substr.patch (text/x-patch)
Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.29
diff -a -u -r1.29 RobotRules.pm
--- lib/WWW/RobotRules.pm	6 Apr 2004 11:37:32 -0000	1.29
+++ lib/WWW/RobotRules.pm	7 Apr 2004 21:32:21 -0000
@@ -13,7 +13,7 @@
 sub new {
     my($class, $ua) = @_;
 
-    # This ugly hack is needed to ensure backwards compatability.
+    # This ugly hack is needed to ensure backwards compatibility.
     # The "WWW::RobotRules" class is now really abstract.
     $class = "WWW::RobotRules::InCore" if $class eq "WWW::RobotRules";
 
@@ -121,7 +121,7 @@
 	    # See whether my short-name is a substring of the
 	    # "User-Agent: ..." line that we were passed:
 
-	    if(index(lc($ua_line), lc($me)) >= 0) {
+	    if(index(lc($me), lc($ua_line)) >= 0) {
 	      LWP::Debug::debug("\"$ua_line\" applies to \"$me\"")
 		if defined &LWP::Debug::debug;
 	      return 1;

Index: t/robot/rules.t
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/t/robot/rules.t,v
retrieving revision 1.5
diff -a -u -r1.5 rules.t
--- t/robot/rules.t	7 Apr 2000 20:23:01 -0000	1.5
+++ t/robot/rules.t	7 Apr 2004 21:32:22 -0000
@@ -15,7 +15,7 @@
 use Carp;
 use strict;
 
-print "1..32\n"; # for Test::Harness
+print "1..38\n"; # for Test::Harness
 
 # We test a number of different /robots.txt files,
 #
@@ -133,6 +133,18 @@
    30 => "http://foo/"      => 1,
    31 => "http://foo/this"  => 1,
    32 => "http://bar/"      => 1,
+  ],
+
+  [$content4, "MomSpiderJr" =>   # should match "MomSpider"
+   33 => 'http://foo/private'      => 1,
+   34 => 'http://foo/also_private' => 1,
+   35 => 'http://foo/this/'        => 0,
+  ],
+
+  [$content4, "SvartEnk" =>   # should match "*"
+   36 => "http://foo/"         => 1,
+   37 => "http://foo/private/" => 0,
+   38 => "http://bar/"         => 1,
   ],
 
 # when adding tests, remember to increase
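For illustration, a minimal RobotRules-style checker that applies the corrected matching direction might look like this (a Python sketch; the parser, the function names, and the sample robots.txt content below are mine, and the sample only stands in for the real $content4, which is not shown here):

```python
def parse_robots(text):
    """Parse robots.txt into a list of (agent_tokens, disallowed_prefixes)."""
    records, agents, disallow = [], [], []
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()   # strip comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(':')
        field, value = field.strip().lower(), value.strip()
        if field == 'user-agent':
            if disallow:                       # a new record starts here
                records.append((agents, disallow))
                agents, disallow = [], []
            agents.append(value)
        elif field == 'disallow':
            disallow.append(value)
    if agents:
        records.append((agents, disallow))
    return records

def allowed(me, path, records):
    """Corrected direction: the robots.txt token must occur inside `me`."""
    for agents, disallow in records:
        if any(a == '*' or a.lower() in me.lower() for a in agents):
            return not any(d and path.startswith(d) for d in disallow)
    return True                                # no record applies
```

Under this rule a robot named "MomSpiderJr" falls under a "User-Agent: MomSpider" record, while a robot like "SvartEnk" matches nothing and falls through to the "*" record, which is the behaviour the new test cases exercise.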
From: ville.skytta [...] iki.fi
[SCOP - Wed Apr 7 17:44:26 2004]:

Hm, actually the comment above the 2nd hunk of the patch should probably also be fixed, to read:

    # See whether the "User-Agent: ..." line is a substring of
    # my short name.


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

Please report any issues with rt.cpan.org to rt-cpan-admin@bestpractical.com.