Thanks for the clarification, Sol. You're right that it depends on the
checksum algorithm (http://en.wikipedia.org/wiki/Checksum). I'm not sure
what algorithm Perl uses as part of unpack('%32C*'), but POSIX cksum
does use a CRC algorithm
(http://en.wikipedia.org/wiki/Cyclic_redundancy_check), which is
position-dependent.
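
One quick way to check is to run both example strings through each
approach. This is only a minimal sketch: it assumes, going by the
description of the % prefix in perldoc -f unpack, that '%32C*' sums the
byte values of the string, and it takes the first 4 bytes of the MD5
digest as in the earlier suggestion. The variable names are just for
illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $s1 = "The cat chases the dog";
my $s2 = "The dog chases the cat";

# '%32C*' asks unpack for a 32-bit checksum of the unsigned char values,
# i.e. (assumption, per perldoc -f unpack) a plain sum of the bytes.
my $sum1 = unpack('%32C*', $s1);
my $sum2 = unpack('%32C*', $s2);

# First 4 bytes of the MD5 digest read as a native unsigned 32-bit integer.
my ($md5_1) = unpack('L', md5($s1));
my ($md5_2) = unpack('L', md5($s2));

printf "byte sum: %d vs %d\n", $sum1, $sum2;
printf "md5 int:  %u vs %u\n", $md5_1, $md5_2;
```

If the two "byte sum" values print the same while the two "md5 int"
values differ, then the unpack checksum is order-insensitive and the
MD5-based integer is not.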

Sol Lederman wrote:
> Alex,
>
> Permuting the characters in a string does not produce the same checksum. If
> it did, that would make checksums really weak. I don't know of any checksum
> algorithm that produces the same checksum when you merely permute the
> characters.
>
> Here's an example on my iMac.
>
> echo "The cat chases the dog" > foo1
> echo "The dog chases the cat" > foo2
> cksum foo1
> 414128224 23 foo1
> cksum foo2
> 2453586855 23 foo2
>
> Sol
>
> On Fri, May 28, 2010 at 10:26 AM, Alex Bronstein wrote:
>
>
>> Hi Eric,
>>
>> That's not ideal. Checksums generate the same number if the letters in the
>> string are moved. For example "The cat chases the dog" and "The dog chases
>> the cat" would result in the same checksum.
>>
>> You'd be better off using md5(): http://perldoc.perl.org/Digest/MD5.html
>>
>> Something like:
>> use Digest::MD5 qw(md5);
>>
>> # If you want a short integer (2 bytes: 0 - 65535)
>> my ($integer) = unpack('S', md5($author . $title));
>>
>> # If you want a long integer (4 bytes: 0 - 4 billion)
>> my ($integer) = unpack('L', md5($author . $title));
>>
>> That would give you uniqueness to within the capability of a short or long
>> int. If you have few enough items in the list that you're willing to
>> increase the odds of non-uniqueness in exchange for a smaller maximum
>> number, you can use the % operator as in:
>>
>> # If you want an integer between 0 and 9999
>> my ($integer) = unpack('S', md5($author . $title));
>> $integer = $integer % 10000;
>>
>> Alex.
>>
>>
>> Eric Lease Morgan wrote:
>>
>>
>>>> Using Perl, how can I convert the author/title combination into some sort
>>>> of integer, checksum, or unique value that is the same every time I run my
>>>> script? I don't want to have to remember what was used before because I
>>>> don't want to maintain a list of previously used keys. Should I use some
>>>> form of the pack function? Should I sum the ASCII values of each character
>>>> in the author/title combination?
>>>>
>>> Thank you for the prompt replies, and invariably I resolved my own
>>> question. Using Perl's unpack function I can generate a checksum based on
>>> the concatenation of the authors and titles:
>>>
>>> my $integer = unpack( "%32C*", "$author$title" ) % 65535;
>>>
>>> The result is a unique four-digit number that will be consistently
>>> generated as my list of author/title combinations grows. At the same time,
>>> my solution looks much like an incantation -- with magic. Perl-specific and
>>> at a level of computing that is beyond my day-to-day understanding.
>>>
>>> TGIF