|Subject:||bug in lesk normalization|
|X-Mailer:||MIME-tools 5.504 (Entity 5.504)|
|Message-ID:||<rt-4.0.13-31703-1372259783-1872.0-0-0 [...] rt.cpan.org>|
Lesk normalization has always been a little unstable (and can produce scores greater than 1). The following report is from Ryan Simmons and provides more details.

--------------------------------------------

The past few days I have been working with the Lesk normalization feature, which (as has been mentioned previously) doesn't always constrain the output to the upper bound of 1, as it should. I am not sure if I have found the problem, or merely another symptom of it ... I haven't had the chance to experiment that much, but I figured I would let you know what I found. I am not an expert at WordNet or Perl programming in general, so please point out my mistakes.

When Lesk is calculated, function scores are obtained for various relation pairs (from the lesk-relation.dat file). The default/example file has 88 pairs (also-also, also-attr, etc.). For each pair, the overlap score is calculated (and normalized if that option is activated), then added to the main score. So, the main score is the sum of the individual scores for each relation pair within the super gloss.

In the lesk.pm file, the score obtained by counting the overlaps for each relation pair is normalized according to the size of the glosses; however, these normalized numbers are still added together, so the main score can exceed 1.

For example: say you compare "dog#n#1" and "dog#n#1" with Lesk. To make my example a little simpler, I used the following lesk-relation.dat file instead of the default one:

    RelationFile
    also-also
    attr-attr
    caus-caus
    enta-enta
    example-example
    glosexample-glosexample
    glos-glos
    holo-holo
    hype-hype
    hypo-hypo
    mero-mero
    part-part
    pert-pert
    sim-sim
    syns-syns

The output for this is 5.15428512949297.

Next, I ran Lesk again using 15 separate relation.dat files, each containing only a single relation pair. So, "also-also", then "attr-attr", then "caus-caus", etc.
Here are the values:

    also-also               = 0
    attr-attr               = 0
    caus-caus               = 0
    enta-enta               = 0
    example-example         = 1
    glosexample-glosexample = 1
    glos-glos               = 1
    holo-holo               = 0.505190311418685
    hype-hype               = 0.470663265306122
    hypo-hypo               = 0.0584315527681661
    mero-mero               = 1
    part-part               = 0
    pert-pert               = 0
    sim-sim                 = 0
    syns-syns               = 0.12

They, predictably, add up to 5.15428512949297.

Now, this example isn't perfect (under the default relation.dat file, the score for dog#n#1 and dog#n#1 is 4.25107026707742), but I think it illustrates the issue. Since the [0,1] normalization only occurs at the relation pair level, the summed score can be greater than 1, as it is in this case of identity.

I am not sure what the best workaround for this will be (or even if I have the problem really nailed down ... my example might not be a representative one, and I haven't had the time to check a whole lot more under different or more variable conditions). But, so far as I can tell from looking at the output and the .pm file, this is where the problem is occurring.
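
To make the arithmetic in the report easy to check, here is a minimal sketch in Python (not Perl, and not taken from lesk.pm). It simply re-adds the fifteen per-pair values listed above and shows that a sum of scores that each lie in [0,1] is bounded only by the number of relation pairs. The names `pair_scores`, `summed_score` and `averaged_score` are hypothetical, and dividing by the number of pairs is just one illustrative way to bring the total back into [0,1]; it is an assumption for illustration, not the module's behavior or a proposed fix.

```python
import math

# Per-pair scores reported above for dog#n#1 vs. dog#n#1.
# Each value is already normalized to [0, 1] at the relation-pair level.
pair_scores = {
    "also-also": 0.0,
    "attr-attr": 0.0,
    "caus-caus": 0.0,
    "enta-enta": 0.0,
    "example-example": 1.0,
    "glosexample-glosexample": 1.0,
    "glos-glos": 1.0,
    "holo-holo": 0.505190311418685,
    "hype-hype": 0.470663265306122,
    "hypo-hypo": 0.0584315527681661,
    "mero-mero": 1.0,
    "part-part": 0.0,
    "pert-pert": 0.0,
    "sim-sim": 0.0,
    "syns-syns": 0.12,
}


def summed_score(scores):
    """Add the per-pair scores, as the report describes the main score."""
    return math.fsum(scores.values())


def averaged_score(scores):
    """Hypothetical alternative (an assumption, not lesk.pm behavior):
    divide the sum by the number of pairs so the result stays in [0, 1]."""
    return summed_score(scores) / len(scores)


total = summed_score(pair_scores)
print("sum of per-pair scores:", total)
print("upper bound of the sum:", len(pair_scores))
print("hypothetical averaged score:", averaged_score(pair_scores))
```

The sum agrees with the reported 5.15428512949297 up to floating-point rounding, and its only guaranteed upper bound is 15, the number of pairs in the reduced relation file.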