I've altered my previous function (https://gist.github.com/1468557) into
something that's pretty much a straight letter-substitution cipher. It
could be turned back into plaintext pretty easily by someone who really
wanted to (by using frequency analysis and other hints like single-letter
words), but I can't imagine anyone going to the trouble over finding aids.
:) This keeps words (and therefore word frequency/distribution) consistent,
even across changes in case. But if you really want it to index
realistically, it would need to be altered to leave common suffixes (-s,
-ies, -ed, -ing, etc.) alone (assuming the indexer uses some sort of
stemming algorithm).
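
For anyone who wants to experiment, here is a minimal Python sketch of the
idea described above. It is not the code in the gist; the names
(build_table, scramble, scramble_keep_suffixes), the seed value, and the
short suffix list are all assumptions made for illustration. The table is
built once, so the same word always maps to the same output, and uppercase
letters reuse the lowercase mapping.

```python
import random
import re
import string

VOWELS = "aeiou"
CONSONANTS = "".join(c for c in string.ascii_lowercase if c not in VOWELS)


def build_table(seed=0):
    """Build one fixed substitution table (the seed is an arbitrary assumption).

    Vowels map to vowels, consonants to consonants, digits to digits.
    Uppercase letters reuse the lowercase mapping, so case is retained.
    """
    rng = random.Random(seed)
    table = {}
    for group in (VOWELS, CONSONANTS, string.digits):
        shuffled = list(group)
        rng.shuffle(shuffled)
        for src, dst in zip(group, shuffled):
            table[src] = dst
            table[src.upper()] = dst.upper()
    return table


TABLE = build_table()


def scramble(text, table=TABLE):
    """Substitute every mapped character; leave everything else alone."""
    return "".join(table.get(ch, ch) for ch in text)


# Hypothetical suffix list: a stem of at least two letters plus a common ending.
SUFFIX = re.compile(r"^([A-Za-z]{2,}?)(ies|ing|ed|s)$", re.IGNORECASE)


def scramble_word(word, table=TABLE):
    """Scramble a single token, keeping a trailing common suffix readable."""
    match = SUFFIX.match(word)
    if match:
        return scramble(match.group(1), table) + match.group(2)
    return scramble(word, table)


def scramble_keep_suffixes(text, table=TABLE):
    """Scramble runs of letters word by word and digits one at a time."""
    return re.sub(r"[A-Za-z]+|[0-9]",
                  lambda m: scramble_word(m.group(0), table), text)
```

The suffix check is deliberately naive: it will also treat the end of a word
like "thing" as a suffix, so a real stemmer's rules would be a better guide
if the indexer's behavior matters.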
On Mon, Dec 12, 2011 at 12:06 PM, Brian Tingle wrote:
> On Mon, Dec 12, 2011 at 10:56 AM, Michael B. Klein wrote:
>
> > Here's a snippet that will completely randomize the contents of an
> > arbitrary string while retaining the general flow (vowels replaced with
> > vowels, consonants replaced with consonants (with case retained in both
> > instances), digits replaced with digits, and everything else left
> > alone).
> >
> > https://gist.github.com/1468557
>
>
> I like the way the output looks; but one problem with the random output is
> that the same word might come out to different values. The distribution of
> unique words would also be affected, not sure if that would
> impact relevance/searching/index size. Also, I was sort of hoping to be
> able to have some sort of browsing, so I'm looking for something that is
> like a pronounceable one-way hash. Maybe if I take the md5 of the
> word, use that as the seed for random, and then run
> your algorithm, then NASA would always "hash" to the same thing?
>
> Potential contributors of specimens would have to be okay with the fact
> that a determined person could recreate their original records. The goal
> is that an end user who might stumble across a random XTF tutorial
> installation would not mistake what they are seeing for a real collection
> description.
>
> Hopefully nothing transforms to a swear word; I guess that is a problem
> with Pig Latin as well...
>
> Thanks for the feedback and the suggestion. I'll play with this some
> tonight and see if setting the seed based on the input word works to get
> the same pseudo-random result; it seems like it should.
>
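
The md5-as-seed idea quoted above could look something like the following
Python sketch. This is only an illustration under stated assumptions: the
names pseudo_word and pseudo_text are made up, the word is lowercased before
hashing so that "NASA" and "nasa" share a seed (the original message does
not say whether case should matter), and the per-character replacement is a
stand-in for the gist's algorithm rather than a copy of it.

```python
import hashlib
import random
import re

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
DIGITS = "0123456789"


def pseudo_word(word):
    """Replace each character at random, seeded by the md5 of the word."""
    # Assumption: hash the lowercased word so case variants get the same seed.
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))
    out = []
    for ch in word:
        low = ch.lower()
        if low in VOWELS:
            new = rng.choice(VOWELS)
        elif low in CONSONANTS:
            new = rng.choice(CONSONANTS)
        elif ch in DIGITS:
            new = rng.choice(DIGITS)
        else:
            new = ch  # anything else is left alone
        out.append(new.upper() if ch.isupper() else new)
    return "".join(out)


def pseudo_text(text):
    """Apply pseudo_word to every run of ASCII letters and digits."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: pseudo_word(m.group(0)), text)
```

Because the seed depends only on the word, repeated runs give the same
output for the same input, at least within one Python version (the random
module's choice sequence is not guaranteed to match across versions). Two
different words can still land on the same output, so unique-word counts may
shrink slightly.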