In the use case at hand, only 250 titles known to be unique must be tracked,
and simplicity as well as transparency are desired. Besides, until you get
into huge sets with nonunique names and titles, variations in word order
that resolve to the same checksum might well indicate that it is the same
title.
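
For a list this small, the trade-off can be checked directly rather than argued about. What follows is only a minimal sketch, assuming the additive checksum from Eric's message quoted below; checksum_key and find_collisions are hypothetical helper names, and the sample records are made up for illustration:

```perl
use strict;
use warnings;

# Additive 32-bit checksum of the character codes, reduced modulo 65535,
# as in Eric's one-liner. checksum_key is a hypothetical helper name.
sub checksum_key {
    my ($author, $title) = @_;
    return unpack('%32C*', "$author$title") % 65535;
}

# Group records by key and return only the keys shared by two or more records.
# Each record is an array reference of the form [author, title].
sub find_collisions {
    my @records = @_;
    my %seen;
    for my $record (@records) {
        my ($author, $title) = @$record;
        push @{ $seen{ checksum_key($author, $title) } }, "$author / $title";
    }
    my %collisions;
    for my $key (keys %seen) {
        $collisions{$key} = $seen{$key} if @{ $seen{$key} } > 1;
    }
    return %collisions;
}

# Made-up sample data: the first two titles differ only in word order.
my @records = (
    ['Anonymous', 'The cat chases the dog'],
    ['Anonymous', 'The dog chases the cat'],
    ['Anonymous', 'A different title'],
);

my %collisions = find_collisions(@records);
for my $key (sort { $a <=> $b } keys %collisions) {
    print "$key: ", join(' | ', @{ $collisions{$key} }), "\n";
}
```

Run once over the real 250 records, a check like this would show whether any two of them actually share a key, and whether the ones that do are word-order variants of the same title.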
kyle
On Fri, May 28, 2010 at 9:26 AM, Alex Bronstein <[log in to unmask]> wrote:
> Hi Eric,
>
> That's not ideal. A simple additive checksum generates the same number if
> the letters in the string are moved. For example "The cat chases the dog"
> and "The dog chases the cat" would result in the same checksum.
>
> You'd be better off using md5(): http://perldoc.perl.org/Digest/MD5.html
>
> Something like:
> use Digest::MD5 qw(md5);
>
> # If you want a short integer (2 bytes: 0 - 65535)
> my ($integer) = unpack('S', md5($author . $title));
>
> # If you want a long integer (4 bytes: 0 - 4294967295)
> my ($integer) = unpack('L', md5($author . $title));
>
> That would give you uniqueness to within the capability of a short or long
> int. If you have few enough items in the list that you're willing to
> increase the odds of non-uniqueness in exchange for a smaller maximum
> number, you can use the % operator as in:
>
> # If you want an integer between 0 and 9999
> my ($integer) = unpack('S', md5($author . $title));
> $integer = $integer % 10000;
>
> Alex.
>
> Eric Lease Morgan wrote:
>
>>> Using Perl, how can I convert the author/title combination into some sort
>>> of integer, checksum, or unique value that is the same every time I run my
>>> script? I don't want to have to remember what was used before because I
>>> don't want to maintain a list of previously used keys. Should I use some
>>> form of the pack function? Should I sum the ASCII values of each character
>>> in the author/title combination?
>>>
>>>
>>
>>
>> Thank you for the prompt replies, and invariably I resolved my own
>> question. Using Perl's unpack function I can generate a checksum based on
>> the concatenation of the authors and titles:
>>
>> my $integer = unpack( "%32C*", "$author$title" ) % 65535;
>>
>> The result is a unique four-digit number that will be consistently
>> generated as my list of author/title combinations grows. At the same time,
>> my solution looks much like an incantation -- with magic. Perl-specific and
>> at a level of computing that is beyond my day-to-day understanding.
>>
>> TGIF
>>
>>
>>
>
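
For comparison, the MD5 approach Alex suggests can be sketched as follows. This is an illustration under stated assumptions, not anyone's production code: md5_key is a hypothetical helper name, the "\0" separator between author and title is an added assumption, and 'N' (big-endian 32-bit) is used in place of 'S' or 'L' so that the value does not depend on the machine's byte order.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);
use Encode qw(encode_utf8);

# md5_key is a hypothetical helper name. It reads the first 4 bytes of the
# 16-byte MD5 digest as an unsigned 32-bit big-endian integer ('N').
sub md5_key {
    my ($author, $title) = @_;
    # Assumption: a "\0" separator keeps ("ab", "c") and ("a", "bc") from
    # producing the same concatenated string. encode_utf8 turns the text
    # into bytes, since md5() rejects strings with wide characters.
    my $bytes = encode_utf8(join "\0", $author, $title);
    return unpack('N', md5($bytes));
}

my $k1 = md5_key('Anonymous', 'The cat chases the dog');
my $k2 = md5_key('Anonymous', 'The dog chases the cat');
print $k1 == $k2 ? "same key\n" : "different keys\n";
```

Unlike the additive checksum, a digest-based key depends on character order, so word-order variants are expected to land on different keys. Whether that is desirable depends on whether such variants should count as the same title.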
--
----------------------------------------------------------
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
[log in to unmask] / 503.999.9787