
This queue is for tickets about the Plucene CPAN distribution.

Report information
The Basics
Id: 12226
Status: open
Priority: 0/
Queue: Plucene

People
Owner: Nobody in particular
Requestors: mintywalker [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 1.21
Fixed in: (no value)



Subject: Bug fix for terms that are the single character 0
The single term 0 (the digit zero) causes problems when indexing.

To reproduce, index either of the following texts using a WhitespaceAnalyzer:

"a 0 is higher than . in ascii"
"a 0 causes problems with 0.0.0.0"

The attached file has one patch each for

lib/Plucene/Index/TermInfosWriter.pm
lib/Plucene/Index/SegmentTermEnum.pm

A unit test (t/regress-05.t) is also included that tests for this problem.

For more details see the thread titled "out-of-order term"
http://www.kasei.com/pipermail/plucene/2005-April/thread.html#345
===============================================================================
lib/Plucene/Index/TermInfosWriter.pm

131c131,132
< my $text = $term->text || "";
---
> my $text = $term->text;
> if (not defined($text)) { $text = ''; }

===============================================================================
lib/Plucene/Index/SegmentTermEnum.pm

136c136
< $self->{buffer} ||= " " x $length;
---
> if (not defined($self->{buffer})) { $self->{buffer} = " " x $length; }

===============================================================================
t/regress-05.t

#!/usr/bin/perl -w

=head1 NAME

regress-05.t

Check an index is created with the terms you expect.
Introduced for testing bugs in Plucene v 1.21 which had problems
dealing with a term that was the single character zero (0).

We create an index using various chunks of text, then test
that each term in the index matches what we are expecting.

=cut

use strict;
use warnings;

use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Index::Writer;
use Plucene::Analysis::WhitespaceAnalyzer;
use Plucene::Search::IndexSearcher;

use File::Temp qw(tempdir);
require Test::More;

$| = 0;

my $dir = tempdir(CLEANUP => 1);

my @strings = (
    'a simple test that should pass',
    'something lower than 0 in ascii is . (aka a period)',
    'a test with a 0 and 0.0.0.0 terms',
);

Test::More->import(tests => scalar(@strings));

foreach (@strings) { &test_build($_); }

sub test_build {
    my $string = shift;

    # Setup our index
    my $analyzer = Plucene::Analysis::WhitespaceAnalyzer->new();
    my $writer = Plucene::Index::Writer->new($dir, $analyzer, 1);
    my $doc = Plucene::Document->new;

    # Index the string and close the writer/index.
    $doc->add(Plucene::Document::Field->Text("content", $string));
    $writer->add_document($doc);
    $writer->optimize(); # This invalidates $writer
    undef $writer; # Forces $writer->DESTROY() to be called, merging segments

    # Read the index back in and compare each term
    my $searcher = Plucene::Search::IndexSearcher->new( $dir );
    my $enum = $searcher->reader->terms();

    # Build the expected term list: sorted, with duplicates removed
    my @all = sort split(/\s+/, $string);
    my @keys;
    for (my $i = 0; $i < scalar(@all); $i++) {
        if ( ($i > 0) and ($all[$i-1] eq $all[$i])) { next; }
        push(@keys, $all[$i]);
    }

    my ($pos, $success) = (0,1);
    while($enum->next) {
        if ($enum->term->text ne $keys[$pos++]) {
            $success = 0;
            last;
        }
    }

    if (not $success) {
        ok(0, "Term not matching expected result\n" .
              "Expecting term '" . $keys[$pos - 1] .
              "' but got '" . $enum->term->text .
              "'\nwhile testing the string '$string'");
    } elsif (scalar(@keys) != $pos) {
        ok(0, "Not enough terms in the index\n" .
              "Expecting " . scalar(@keys) . " but only found $pos\n" .
              "while testing the string '$string'");
    } else {
        ok(1);
    }
}

===============================================================================
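Both patched lines above replace a "||" default with an explicit defined() check. The following is a minimal sketch of why that matters, not code from Plucene: the helper names text_with_or and text_with_defined are hypothetical and exist only for this illustration. Perl treats the string "0" as false, so "||" silently swaps a legitimate term of "0" for the default value.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; they are not part of Plucene.

# Mirrors the original line: any false value, including "0", becomes "".
sub text_with_or {
    my ($text) = @_;
    return $text || "";
}

# Mirrors the patched line: only undef is replaced by "".
sub text_with_defined {
    my ($text) = @_;
    return defined($text) ? $text : "";
}

for my $term ("a", "0", "0.0.0.0", undef) {
    my $shown = defined($term) ? "'$term'" : "undef";
    printf "%-10s  || gives '%s', defined() gives '%s'\n",
        $shown, text_with_or($term), text_with_defined($term);
}
```

Only the strings "" and "0" are false in Perl, which would explain why a term such as 0.0.0.0 is unaffected by the "||" form while the single character 0 is lost. On Perl 5.10 and later the defined-or operator (// and //=) expresses the same check more compactly; it is not available on the Perl 5.8 series mentioned later in this ticket.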
[guest - Sun Apr 10 04:49:29 2005]:
> The single term 0 (the digit zero) causes problems when indexing.
> A unit test (t/regress-05.t) is also included that tests for this
> problem.
>
> For more details see the thread titled "out-of-order term"
> http://www.kasei.com/pipermail/plucene/2005-April/thread.html#345
I'm not really liking the test here - it seems a little too low level.
Can we not just have a test that indexes and searches, rather than
reading the index back in?

Thanks,

Tony
On Sun Jul 17 06:35:24 2005, TMTM wrote:
> I'm not really liking the test here - it seems a little too low level.
> Can we not just have a test that indexes and searches, rather than
> reading the index back in?
attached.
Download regress-05.t
text/x-perl 2.3k
#!/usr/bin/perl -w

=head1 NAME

regress-05.t

Check an index is created with the terms you expect.
Introduced for testing bugs in Plucene v 1.21 which had problems
dealing with a term that was the single character zero (0).

Also tests for a bug present up to 1.24 that causes numeric terms
to be incorrectly indexed.

We create an index using various chunks of text, then test
that we can search the index correctly for those terms.

=cut

use strict;
use warnings;

use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Index::Writer;
use Plucene::Analysis::WhitespaceAnalyzer;
use Plucene::Search::IndexSearcher;
use Plucene::QueryParser;

use File::Temp qw(tempdir);
require Test::More;

$| = 0;

my $dir = tempdir(CLEANUP => 1);

my @strings = (
    'a simple test that should pass',
    'something lower than 0 in ascii is . [aka a period]',
    'a test with a 0 and 0.0.0.0 terms',
);

Test::More->import(tests => scalar(@strings));

foreach (@strings) { &test_build($_); }

sub test_build {
    my $string = shift;

    # Setup our index
    my $analyzer = Plucene::Analysis::WhitespaceAnalyzer->new();
    my $writer = Plucene::Index::Writer->new($dir, $analyzer, 1);
    my $doc = Plucene::Document->new;

    # Index the string and close the writer/index.
    $doc->add(Plucene::Document::Field->Text("content", $string));
    $writer->add_document($doc);
    $writer->optimize(); # This invalidates $writer
    undef $writer; # Forces $writer->DESTROY() to be called, merging segments

    # Prepare to search on the index
    my $searcher = Plucene::Search::IndexSearcher->new( $dir );
    my $parser = Plucene::QueryParser->new({
        analyzer => Plucene::Analysis::WhitespaceAnalyzer->new(),
        default  => 'content'
    });

    # Split the indexed term into words and check each exists in
    # the index.
    my $hit = 0;
    my @terms = split(/\s+/, $string);
    my @missed;
    foreach my $term (@terms) {
        #print("-$term-\n");
        my $query = $parser->parse("content:$term");
        my $hits = $searcher->search($query);
        if ($hits->length() > 0) { $hit++; }
        else { push(@missed, $term); }
    }

    if ($hit == scalar(@terms)) {
        ok(1);
    } else {
        my $msg = "The following terms (minus the quotes) were either " .
                  "not indexed, or failed to be found when searched for:\n ";
        foreach my $missed (@missed) { $msg .= "'$missed',"; }
        chop($msg);
        $msg .= "\nwhile testing the string '$string'";
        ok(0, $msg);
    }
}
Download (untitled) / with headers
text/plain 743b
I have a similar problem with the WhitespaceAnalyzer when characters other than a-z or 0-9 are involved.

When using the default values for /usr/local/share/perl/5.8.4/Plucene/Analysis/WhitespaceTokenizer.pm

sub token_re { qr/\S+/ }

the indexing will fail with an error similar to:

Docs out of order (44 < 53) at /usr/local/share/perl/5.8.4/Plucene/Index/SegmentMerger.pm line 149.

But when changing the token_re function into:

sub token_re { qr/[a-z\d]+/ }

which will only allow a-z and 0-9, the indexing has no problems whatsoever (at least I don't get the above error message).

This is using Plucene 1.24 downloaded through CPAN using

perl -MCPAN -e 'install Plucene'

on a Debian box running a Linux 2.6 kernel and Perl 5.8.4.
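To make the difference between the two token_re patterns concrete, here is a minimal sketch that applies each regex to one of the test strings from this ticket. It assumes nothing about how Plucene itself calls token_re; the tokens helper is hypothetical and simply collects every match of the pattern in order.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not Plucene's tokenizer.
# Returns every substring of $text matched by $re, in order.
sub tokens {
    my ($re, $text) = @_;
    my @found = $text =~ /($re)/g;
    return @found;
}

my $text = 'a 0 causes problems with 0.0.0.0';

# Default pattern from the comment above: runs of non-whitespace.
my @whitespace = tokens(qr/\S+/, $text);

# Restricted pattern from the comment above: runs of a-z and digits only.
my @alnum = tokens(qr/[a-z\d]+/, $text);

print join('|', @whitespace), "\n";   # a|0|causes|problems|with|0.0.0.0
print join('|', @alnum), "\n";        # a|0|causes|problems|with|0|0|0|0
```

With the restricted pattern, punctuation never becomes part of a term, so 0.0.0.0 is split into four separate 0 terms. That changes what ends up in the index, so the absence of the error with that pattern may mean the problem is avoided rather than fixed.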
Subject: Re: [rt.cpan.org #12226] Bug fix for terms that are the single character 0
Date: Fri, 3 Mar 2006 07:26:56 +0000
To: Guest via RT <bug-plucene [...] rt.cpan.org>
From: Tony Bowden <tony [...] kasei.com>
On Thu, Mar 02, 2006 at 05:50:12PM -0500, Guest via RT wrote:
> When using the default values
> for /usr/local/share/perl/5.8.4/Plucene/Analysis/WhitespaceTokenizer.pm
> sub token_re { qr/\S+/ }
> the indexing will fail with an error similar to:
> Docs out of order (44 < 53)
Any chance of a test case for this?

Thanks,

Tony
Subject: Re: [rt.cpan.org #12226] Bug fix for terms that are the single character 0
Date: Fri, 3 Mar 2006 08:39:27 +0000
To: bug-plucene [...] rt.cpan.org
From: Minty <mintywalker [...] gmail.com>
not me, but I'll email the guy and see if he can help :)

On 3/3/06, Tony Bowden via RT <bug-plucene@rt.cpan.org> wrote:
> On Thu, Mar 02, 2006 at 05:50:12PM -0500, Guest via RT wrote:
> > When using the default values
> > for /usr/local/share/perl/5.8.4/Plucene/Analysis/WhitespaceTokenizer.pm
> > sub token_re { qr/\S+/ }
> > the indexing will fail with an error similar to:
> > Docs out of order (44 < 53)
>
> Any chance of a test case for this?
>
> Thanks,
>
> Tony
From: Apachez
I have emailed Minty a sample of data where the error occurs, along with the script I use to send data from the database (MySQL) into Plucene.

During my more aggressive tests to collect data for the sample I received another error, which might point in more detail to where the actual error can be located:

"
Docs out of order (44 < 49) at /usr/local/share/perl/5.8.4/Plucene/Index/SegmentMerger.pm line 149.
(in cleanup) Can't call method "seek" on an undefined value at /usr/local/share/perl/5.8.4/Plucene/Index/TermInfosWriter.pm line 146 during global destruction.
"

Kind Regards

Apachez

