Can someone help me use Lingua::Stem::Snowball more efficiently?
I want to count the total number of times a word stem appears in a
hash. Here is a short example:

use strict;
use Lingua::Stem::Snowball;

my $idea  = 'books';
my %words = (
    'books'         => 5,
    'library'       => 6,
    'librarianship' => 5,
    'librarians'    => 3,
    'librarian'     => 3,
    'book'          => 3,
    'museums'       => 2
);

my $stemmer   = Lingua::Stem::Snowball->new( lang => 'en' );
my $idea_stem = $stemmer->stem( $idea );
print "$idea ($idea_stem)\n";

my $total = 0;
foreach my $word ( keys %words ) {
    my $word_stem = $stemmer->stem( $word );
    print "\t$word ($word_stem)\n";
    if ( $idea_stem eq $word_stem ) { $total += $words{ $word } }
}
print "$total\n";

In the end, the value of $total equals 8. That is, more or less, what
I expect, but how can I make the foreach loop more efficient? In
reality, my application fills %words with as many as 150,000 keys.
Moreover, $idea is really just a single element in an array of about
100 words. Doing the math, the if statement in my foreach loop will
get executed as many as 15,000,000 times. To make matters even worse, I
plan to run the whole program about 10,000 times. That is a whole lot
of processing just to count words!

Is there some way I could short-circuit the foreach loop? I saw
Lingua::Stem::Snowball's stem_in_place method, but to use it I must
pass it a reference to an array, which disassociates my keys from
their values.
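
One possibility, offered as a sketch rather than a tested solution: the
stem of each key does not depend on $idea, so the stems could be computed
once and the counts summed into a second hash keyed by stem. Each idea then
costs a single hash lookup instead of a pass over every key. The sub name
stem_totals and the @ideas array are made up for illustration, and the
sketch assumes the keys are already lowercase, since as far as I can tell
stem_in_place does not lowercase its input the way stem does.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# Sum the counts in %$words by stem, stemming every key exactly once.
# Returns a hash reference of stem => total count.
# (stem_totals is a hypothetical helper, not part of the module.)
sub stem_totals {
    my ( $stemmer, $words ) = @_;

    my @keys  = keys %$words;    # the original words, in a fixed order
    my @stems = @keys;           # a copy, because stem_in_place overwrites
    $stemmer->stem_in_place( \@stems );

    # $stems[$i] is the stem of $keys[$i], so counts stay attached by index.
    my %totals;
    for my $i ( 0 .. $#keys ) {
        $totals{ $stems[$i] } += $words->{ $keys[$i] };
    }
    return \%totals;
}

my %words = ( books => 5, library => 6, librarians => 3, book => 3, museums => 2 );
my @ideas = ( 'books', 'museum' );    # stand-in for the real list of ~100 words

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );
my $totals  = stem_totals( $stemmer, \%words );

foreach my $idea (@ideas) {
    my $idea_stem = $stemmer->stem($idea);
    my $total     = $totals->{$idea_stem} // 0;    # one lookup per idea
    print "$idea ($idea_stem)\t$total\n";
}
```

Keeping @keys and @stems as parallel arrays is what preserves the link
between each word and its count, so the array passed to stem_in_place never
needs to carry the values itself. With 150,000 keys and 100 ideas this would
mean roughly 150,000 stemming calls plus 100 lookups per run, and the totals
for any idea should agree with what the original loop computes.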

Second, is there a way I can make the stemming more aggressive? For
example, I was hoping the stem of library would equal the stems of
librarianship and librarian, but alas, they don't.

Any suggestions?

--
Eric Lease Morgan