Skip Menu |
 

This queue is for tickets about the Text-LevenshteinXS CPAN distribution.

Report information
The Basics
Id: 36685
Status: resolved
Priority: 0/
Queue: Text-LevenshteinXS

People
Owner: JGOLDBERG [...] cpan.org
Requestors: bkb [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 0.03
Fixed in: (no value)



Subject: It's not utf8 safe
Download (untitled) / with headers
text/plain 129b
The module isn't utf8 safe - it breaks on utf8 encoded strings and gives the byte distance rather than the character difference.
Download (untitled) / with headers
text/plain 133b
Without details this cannot be investigated further. All testing I have done shows it giving character distance for utf8 just fine.
Download (untitled) / with headers
text/plain 477b
Am Fr 07. Nov 2008, 18:50:17, JGOLDBERG schrieb: Show quoted text
> Without details this cannot be investigated further. All testing I > have done shows it giving > character distance for utf8 just fine.
We start to use your module with a lot of utf-8 data from a database and found some problems. We add some test cases that demonstrate the problems and patch the package, so it works in our environment. Thanks for your work, it gives us a 20 times speedup to the original used function.
Subject: Text-LevenshteinXS-0.03_utf8.diff
diff -rupN Text-LevenshteinXS-0.03/LevenshteinXS.pm Text-LevenshteinXS-0.03.patched/LevenshteinXS.pm --- Text-LevenshteinXS-0.03/LevenshteinXS.pm 2004-06-29 23:41:56.000000000 +0200 +++ Text-LevenshteinXS-0.03.patched/LevenshteinXS.pm 2010-03-10 15:19:31.000000000 +0100 @@ -19,7 +19,7 @@ our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all' our @EXPORT = qw( distance ); -our $VERSION = '0.03'; +our $VERSION = '0.04'; bootstrap Text::LevenshteinXS $VERSION; diff -rupN Text-LevenshteinXS-0.03/LevenshteinXS.xs Text-LevenshteinXS-0.03.patched/LevenshteinXS.xs --- Text-LevenshteinXS-0.03/LevenshteinXS.xs 2004-06-29 23:40:47.000000000 +0200 +++ Text-LevenshteinXS-0.03.patched/LevenshteinXS.xs 2010-03-10 16:05:58.000000000 +0100 @@ -12,22 +12,30 @@ #include <stdlib.h> #include <string.h> -int levenshtein_distance(char *s,char*t); +int levenshtein_distance(SV *s,SV *t); int minimum(int a,int b,int c); -int levenshtein_distance(char *s,char*t) +int levenshtein_distance(SV* s, SV* t) /*Compute levenshtein distance between s and t*/ { //Step 1 int k,i,j,n,m,cost,*d,distance; - if (strcmp(s,t) == 0) {return 0;} - n=strlen(s); - m=strlen(t); + + n=sv_len_utf8(s); + m=sv_len_utf8(t); + + U8* ps = SvPVX(s); + U8* pt = SvPVX(t); + + // optimisation for equal strings + if(m == n && memEQ(ps, pt, n)) { return 0; } + if(n==0) {return m;} if(m==0) {return n;} d=malloc((sizeof(int))*(m+1)*(n+1)); + m++; n++; //Step 2 @@ -39,8 +47,11 @@ int levenshtein_distance(char *s,char*t) for(i=1;i<n;i++) for(j=1;j<m;j++) { + U8* si = utf8_hop(ps, i-1); + U8* ti = utf8_hop(pt, j-1); + //Step 5 - if(s[i-1]==t[j-1]) + if(memEQ(si, ti, UTF8SKIP(si))) cost=0; else cost=1; @@ -67,8 +78,8 @@ MODULE = Text::LevenshteinXS PACKAGE = int distance(s,t) - char * s - char * t + SV * s + SV * t CODE: RETVAL = levenshtein_distance(s,t); OUTPUT: diff -rupN Text-LevenshteinXS-0.03/META.yml Text-LevenshteinXS-0.03.patched/META.yml --- Text-LevenshteinXS-0.03/META.yml 2004-06-29 23:43:35.000000000 +0200 +++ Text-LevenshteinXS-0.03.patched/META.yml 2010-03-10 15:20:28.000000000 +0100 @@ -1,7 +1,7 @@ # http://module-build.sourceforge.net/META-spec.html #XXXXXXX This is a prototype!!! It will change in the future!!! XXXXX# name: Text-LevenshteinXS -version: 0.03 +version: 0.04 version_from: LevenshteinXS.pm installdirs: site requires: diff -rupN Text-LevenshteinXS-0.03/test.pl Text-LevenshteinXS-0.03.patched/test.pl --- Text-LevenshteinXS-0.03/test.pl 2004-03-05 03:11:59.000000000 +0100 +++ Text-LevenshteinXS-0.03.patched/test.pl 2010-03-10 16:21:11.000000000 +0100 @@ -1,16 +1,41 @@ -use Test; -BEGIN { plan tests => 6 }; - -use Text::LevenshteinXS qw(distance); - -ok(1); -if (distance("foo","four") == 2) {ok(1)} else {ok(0)} -if (distance("foo","foo") == 0) {ok(1)} else {ok(0)} -if (distance("foo","") == 3) {ok(1)} else {ok(0)} -if (distance("four","foo") == 2) {ok(1)} else {ok(0)} -if (distance("foo","bar") == 3) {ok(1)} else {ok(0)} - - - - - +use Test::More qw/ no_plan /; + +use Text::LevenshteinXS qw(distance); + +use Encode; +use Unicode::Normalize; + +ok(1); + +# basic tests +is(distance("foo","four"), 2); +is(distance("foo","foo"), 0); +is(distance("foo",""), 3); +is(distance("four","foo"), 2); +is(distance("foo","bar"), 3); + +# more tests from the bugtracking system +is(distance('headache','heavy load'), 7); +is(distance('Distinction courses', 'Distinction Courses'), 1, 'upper/lower'); +is(distance("fool's handiwork bar", 'food handling bar'), 8); + + +# utf-8 tests +is(distance("äöü","äöü"), 0); +is(distance("strase","straße"), 1); +is(distance("strasse","straße"), 2); +is(distance("äöü","123"), 3); + +is(distance("","str"), 3, 'length check'); +is(distance("","straße"), 6, 'length check ß'); +is(distance("", decode("utf8", "stra\x{00DF}e")), 6, 'length check \x'); + +is(distance("äöü",decode("latin1", "äöü")), 0); +is(distance("123","abcde"), 5, 'foo'); + +my $umls = "äöü"; + +{ + use encoding "latin1"; + is(distance($umls, "äöü"), 0, 'encoding latin1'); +}
Subject: Re: [spam?] [rt.cpan.org #36685] It's not utf8 safe
Date: Wed, 10 Mar 2010 11:09:52 -0800
To: bug-Text-LevenshteinXS [...] rt.cpan.org
From: Josh Goldberg <josh [...] 3io.com>
Download (untitled) / with headers
text/plain 761b
Thanks for the patch! On Wed, Mar 10, 2010 at 8:53 AM, http://www.google.com/profiles/sven.schoradt via RT < bug-Text-LevenshteinXS@rt.cpan.org> wrote: Show quoted text
> Queue: Text-LevenshteinXS > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=36685 > > > Am Fr 07. Nov 2008, 18:50:17, JGOLDBERG schrieb:
> > Without details this cannot be investigated further. All testing I > > have done shows it giving > > character distance for utf8 just fine.
> > We start to use your module with a lot of utf-8 data from a database and > found some problems. > > We add some test cases that demonstrate the problems and patch the > package, so it works in our environment. > > Thanks for your work, it gives us a 20 times speedup to the original > used function. > >
From: Sven Schoradt
Download (untitled) / with headers
text/plain 296b
After a second lock to the source, I find some optimisation points that make the performance of the UTF8 version comparable to the original version. I attach a second patch for the original code (version 0.03). I do some performance analysis with the performance.pl script, thats in the patch.
Subject: Text-LevenshteinXS-0.03.utf8_optimized.diff
diff -rupN Text-LevenshteinXS-0.03.orig/LevenshteinXS.pm Text-LevenshteinXS-0.03/LevenshteinXS.pm --- Text-LevenshteinXS-0.03.orig/LevenshteinXS.pm 2004-06-29 23:41:56.000000000 +0200 +++ Text-LevenshteinXS-0.03/LevenshteinXS.pm 2010-03-17 21:48:04.000000000 +0100 @@ -19,7 +19,7 @@ our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all' our @EXPORT = qw( distance ); -our $VERSION = '0.03'; +our $VERSION = '0.04'; bootstrap Text::LevenshteinXS $VERSION; diff -rupN Text-LevenshteinXS-0.03.orig/LevenshteinXS.xs Text-LevenshteinXS-0.03/LevenshteinXS.xs --- Text-LevenshteinXS-0.03.orig/LevenshteinXS.xs 2004-06-29 23:40:47.000000000 +0200 +++ Text-LevenshteinXS-0.03/LevenshteinXS.xs 2010-03-17 21:48:04.000000000 +0100 @@ -12,22 +12,30 @@ #include <stdlib.h> #include <string.h> -int levenshtein_distance(char *s,char*t); +int levenshtein_distance(SV *s,SV *t); int minimum(int a,int b,int c); -int levenshtein_distance(char *s,char*t) +int levenshtein_distance(SV* s, SV* t) /*Compute levenshtein distance between s and t*/ { //Step 1 int k,i,j,n,m,cost,*d,distance; - if (strcmp(s,t) == 0) {return 0;} - n=strlen(s); - m=strlen(t); + + n=sv_len_utf8(s); + m=sv_len_utf8(t); + + U8* ps = SvPVX(s); + U8* pt = SvPVX(t); + + // optimisation for equal strings + if(m == n && memEQ(ps, pt, n)) { return 0; } + if(n==0) {return m;} if(m==0) {return n;} d=malloc((sizeof(int))*(m+1)*(n+1)); + m++; n++; //Step 2 @@ -35,31 +43,53 @@ int levenshtein_distance(char *s,char*t) d[k]=k; for(k=0;k<m;k++) d[k*n]=k; + + + U8* si = ps; + U8* ti = pt; + + int size; + //Step 3 and 4 for(i=1;i<n;i++) + { + size = UTF8SKIP(si); + + ti = pt; + for(j=1;j<m;j++) { - //Step 5 - if(s[i-1]==t[j-1]) + if(memEQ(si, ti, size)) cost=0; else cost=1; - //Step 6 + d[j*n+i]=minimum(d[(j-1)*n+i]+1,d[j*n+i-1]+1,d[(j-1)*n+i-1]+cost); + + ti = utf8_hop(ti, 1); } + + si = utf8_hop(si, 1); + } + distance=d[n*m-1]; + free(d); + return distance; } -int minimum(int a,int b,int c) /*Gets the minimum of three values*/ +int minimum(int a, int b, int c) { int min=a; + if(b<min) min=b; + if(c<min) min=c; + return min; } @@ -67,8 +97,8 @@ MODULE = Text::LevenshteinXS PACKAGE = int distance(s,t) - char * s - char * t + SV * s + SV * t CODE: RETVAL = levenshtein_distance(s,t); OUTPUT: diff -rupN Text-LevenshteinXS-0.03.orig/Makefile.old Text-LevenshteinXS-0.03/Makefile.old --- Text-LevenshteinXS-0.03.orig/Makefile.old 2010-03-17 21:41:20.000000000 +0100 +++ Text-LevenshteinXS-0.03/Makefile.old 2010-03-17 21:48:04.000000000 +0100 @@ -52,11 +52,11 @@ DIRFILESEP = / DFSEP = $(DIRFILESEP) NAME = Text::LevenshteinXS NAME_SYM = Text_LevenshteinXS -VERSION = 0.03 +VERSION = 0.04 VERSION_MACRO = VERSION -VERSION_SYM = 0_03 +VERSION_SYM = 0_04 DEFINE_VERSION = -D$(VERSION_MACRO)=\"$(VERSION)\" -XS_VERSION = 0.03 +XS_VERSION = 0.04 XS_VERSION_MACRO = XS_VERSION XS_DEFINE_VERSION = -D$(XS_VERSION_MACRO)=\"$(XS_VERSION)\" INST_ARCHLIB = blib/arch @@ -177,13 +177,10 @@ PERL_ARCHIVE = PERL_ARCHIVE_AFTER = -TO_INST_PM = LevenshteinXS.pm \ - performance.pl +TO_INST_PM = LevenshteinXS.pm PM_TO_BLIB = LevenshteinXS.pm \ - $(INST_LIB)/Text/LevenshteinXS.pm \ - performance.pl \ - $(INST_LIB)/Text/performance.pl + $(INST_LIB)/Text/LevenshteinXS.pm # --- MakeMaker platform_constants section: @@ -258,7 +255,7 @@ RCS_LABEL = rcs -Nv$(VERSION_SYM): -q DIST_CP = best DIST_DEFAULT = tardist DISTNAME = Text-LevenshteinXS -DISTVNAME = Text-LevenshteinXS-0.03 +DISTVNAME = Text-LevenshteinXS-0.04 # --- MakeMaker macro section: @@ -559,7 +556,7 @@ metafile : create_distdir $(NOECHO) $(ECHO) Generating META.yml $(NOECHO) $(ECHO) '--- #YAML:1.0' > META_new.yml $(NOECHO) $(ECHO) 'name: Text-LevenshteinXS' >> META_new.yml - $(NOECHO) $(ECHO) 'version: 0.03' >> META_new.yml + $(NOECHO) $(ECHO) 'version: 0.04' >> META_new.yml $(NOECHO) $(ECHO) 'abstract: ~' >> META_new.yml $(NOECHO) $(ECHO) 'license: ~' >> META_new.yml $(NOECHO) $(ECHO) 'author: ~' >> META_new.yml @@ -891,7 +888,7 @@ testdb_static :: pure_all $(MAP_TARGET) # --- MakeMaker ppd section: # Creates a PPD (Perl Package Description) for a binary distribution. ppd : - $(NOECHO) $(ECHO) '<SOFTPKG NAME="$(DISTNAME)" VERSION="0,03,0,0">' > $(DISTNAME).ppd + $(NOECHO) $(ECHO) '<SOFTPKG NAME="$(DISTNAME)" VERSION="0,04,0,0">' > $(DISTNAME).ppd $(NOECHO) $(ECHO) ' <TITLE>$(DISTNAME)</TITLE>' >> $(DISTNAME).ppd $(NOECHO) $(ECHO) ' <ABSTRACT></ABSTRACT>' >> $(DISTNAME).ppd $(NOECHO) $(ECHO) ' <AUTHOR></AUTHOR>' >> $(DISTNAME).ppd @@ -908,8 +905,7 @@ ppd : pm_to_blib : $(TO_INST_PM) $(NOECHO) $(ABSPERLRUN) -MExtUtils::Install -e 'pm_to_blib({@ARGV}, '\''$(INST_LIB)/auto'\'', '\''$(PM_FILTER)'\'')' -- \ - LevenshteinXS.pm $(INST_LIB)/Text/LevenshteinXS.pm \ - performance.pl $(INST_LIB)/Text/performance.pl + LevenshteinXS.pm $(INST_LIB)/Text/LevenshteinXS.pm $(NOECHO) $(TOUCH) pm_to_blib diff -rupN Text-LevenshteinXS-0.03.orig/META.yml Text-LevenshteinXS-0.03/META.yml --- Text-LevenshteinXS-0.03.orig/META.yml 2004-06-29 23:43:35.000000000 +0200 +++ Text-LevenshteinXS-0.03/META.yml 2010-03-17 21:48:04.000000000 +0100 @@ -1,7 +1,7 @@ # http://module-build.sourceforge.net/META-spec.html #XXXXXXX This is a prototype!!! It will change in the future!!! XXXXX# name: Text-LevenshteinXS -version: 0.03 +version: 0.04 version_from: LevenshteinXS.pm installdirs: site requires: diff -rupN Text-LevenshteinXS-0.03.orig/test.pl Text-LevenshteinXS-0.03/test.pl --- Text-LevenshteinXS-0.03.orig/test.pl 2004-03-05 03:11:59.000000000 +0100 +++ Text-LevenshteinXS-0.03/test.pl 2010-03-17 21:48:04.000000000 +0100 @@ -1,16 +1,41 @@ -use Test; -BEGIN { plan tests => 6 }; - -use Text::LevenshteinXS qw(distance); - -ok(1); -if (distance("foo","four") == 2) {ok(1)} else {ok(0)} -if (distance("foo","foo") == 0) {ok(1)} else {ok(0)} -if (distance("foo","") == 3) {ok(1)} else {ok(0)} -if (distance("four","foo") == 2) {ok(1)} else {ok(0)} -if (distance("foo","bar") == 3) {ok(1)} else {ok(0)} - - - - - +use Test::More qw/ no_plan /; + +use Text::LevenshteinXS qw(distance); + +use Encode; +use Unicode::Normalize; + +ok(1); + +# basic tests +is(distance("foo","four"), 2); +is(distance("foo","foo"), 0); +is(distance("foo",""), 3); +is(distance("four","foo"), 2); +is(distance("foo","bar"), 3); + +# more tests from the bugtracking system +is(distance('headache','heavy load'), 7); +is(distance('Distinction courses', 'Distinction Courses'), 1, 'upper/lower'); +is(distance("fool's handiwork bar", 'food handling bar'), 8); + + +# utf-8 tests +is(distance("äöü","äöü"), 0); +is(distance("strase","straße"), 1); +is(distance("strasse","straße"), 2); +is(distance("äöü","123"), 3); + +is(distance("","str"), 3, 'length check'); +is(distance("","straße"), 6, 'length check ß'); +is(distance("", decode("utf8", "stra\x{00DF}e")), 6, 'length check \x'); + +is(distance("äöü",decode("latin1", "äöü")), 0); +is(distance("123","abcde"), 5, 'foo'); + +my $umls = "äöü"; + +{ + use encoding "latin1"; + is(distance($umls, "äöü"), 0, 'encoding latin1'); +}
Download (untitled) / with headers
text/plain 451b
On Wed Mar 17 16:58:39 2010, http://www.google.com/profiles/sven.schoradt wrote: Show quoted text
> I attach a second patch for the original code (version 0.03). I do some > performance analysis with the performance.pl script, thats in the patch.
Patched version fails e.g. for: distance(pack('U2',252,65),pack('U2',252,66)); ... returns 0, but ought to return 1 due to the final substitution 'A'->'B' (chr(65)->chr(66)). My guess is a bug in the length checks.
applied sven's patch and uploaded version 0.04.


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

Please report any issues with rt.cpan.org to rt-cpan-admin@bestpractical.com.