It should be noted that data processed with this "Stochastic Pseudonymization"
technique still likely falls under the category of "PII".

I put in the other README that "It's essential to understand that even
pseudonymized data remains within the realm of personal data as per the
GDPR and many other regulations and laws. This categorization is because
such data can be linked back to an individual when complemented with
supplementary details".

Since you can take a known value, like an email address, run it through
the Stochastic Pseudonymization process, and get back a deterministic
result, you could *still* link the data back to its original identifier.
I think that's why it's important to have an "app secret" that you keep
as protected as possible.

I'm not sure there's anything you can do to solve that type of problem,
other than protecting the "app secret" that goes into the salt for the
hash.
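To make that concrete, here's a minimal sketch of what I mean by keyed,
truncated hashing. The function names, the 20-bit default, and the
placeholder secret are all illustrative, not taken from my gist:

```python
import hashlib
import hmac
import math

# Illustrative placeholder -- in practice this secret must live outside
# the project (env var, vault, etc.). Without it, an attacker can't
# recompute the hash of a known value (like an email address) to
# re-link a pseudonym to its owner.
APP_SECRET = b"keep-this-out-of-the-repo"


def pseudonymize(identifier: str, bits: int = 20) -> int:
    """Return a deterministic pseudonym truncated to `bits` bits.

    Truncation deliberately allows collisions (the birthday paradox),
    so no single patron can be definitively identified, while most
    pseudonyms remain unique and statistics stay accurate.
    """
    digest = hmac.new(APP_SECRET, identifier.encode("utf-8"),
                      hashlib.sha256).digest()
    # Keep only the top `bits` bits of the 256-bit digest.
    return int.from_bytes(digest, "big") >> (256 - bits)


def collision_probability(population: int, bits: int) -> float:
    """Birthday-paradox estimate of the chance that at least two
    patrons share a pseudonym when `population` IDs map into
    2**bits buckets."""
    n, d = population, 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2 * d))
```

Same input plus same secret always yields the same pseudonym, which is
exactly the linkability risk above; the truncation is what buys the
"modicum of deliberate uncertainty", and `collision_probability` is how
you'd size the bit count against your patron population.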

--Ray



On Fri, Sep 22, 2023 at 1:56 PM Ray Voelker <[log in to unmask]> wrote:

> I hereby dub this technique "Stochastic Pseudonymization" :-)
>
> Here's a quick implementation.
>
> https://colab.research.google.com/gist/rayvoelker/74278aa82ee95e3c6dbf0caa993f1ebe/stochastic_pseudonymization.ipynb
>
> I think the trick is of course picking a number of bits sufficient to
> cover the size of your patron population, while introducing a chance of
> collision into your data in order, as Steve put it, "...
> to leave a *modicum of deliberate uncertainty*, introduced by hashing
> collision, to keep it from being possible to prove that any one patron had
> some specific behavior in the past".
>
> --Ray
>
> On Fri, Sep 22, 2023 at 11:44 AM Hammer, Erich F <[log in to unmask]> wrote:
>
>> Ray,
>>
>> I liked your original pseudonymous idea and was thinking about trying
>> to re-write it in PowerShell, but looking into the "birthday paradox"
>> comment opens some very interesting possibilities.
>>
>> As I understand it, you can uniquely identify someone using just a
>> portion of a hash of their unique ID (e.g. email address).  Even better for
>> us in library land is that if you use just the right number of bits from
>> the hash, you can create a legal-level of plausible deniability while
>> maintaining statistically valid data.  For example, with the right
>> calculations, you can introduce enough of a chance of two patrons having
>> the same "anonymized" identifiers (calculated from a hash of their unique
>> ID) that a patron can't be definitively identified, but at the same time,
>> the vast majority of the identifiers will be unique and thus any statistics
>> will still be highly accurate.
>>
>> Good stuff!
>>
>> Thanks,
>>
>> Erich
>>
>>
>>
>> On Friday, September 22, 2023 at 09:43, Ray Voelker eloquently inscribed:
>>
>> > Hi code4lib folks .. and again ... happy Friday!!
>> >
>> > I just wanted to post an update to this. I wrote in to the Security Now!
>> > podcast (fantastic show by the way and fully worth listening to on a
>> > regular basis) about this notion, and it was made the main topic of show
>> > number 940!
>> >
>> > https://twit.tv/shows/security-now/episodes/940?autostart=false
>> >
>> > The discussion starts around the 1:36 mark.
>> >
>> > Here's what I wrote to Steve Gibson:
>> >
>> > In addition to being an avid listener to Security Now, I'm also a System
>> >> Administrator for a large public library system in Ohio. Libraries
>> >> often struggle with data—being especially sensitive around data related
>> >> to patrons and patron behavior in terms of borrowing, library program
>> >> attendance, reference questions, etc. The common practice is for
>> >> libraries to aggregate and then promptly destroy this data within a
>> >> short time frame—which is typically one month. However, administrators
>> >> and local government officials, who are often instrumental in
>> >> allocating library funding and guiding operational strategies,
>> >> frequently ask questions on a larger time scale than one month to
>> >> validate the library's significance and its operational strategies.
>> >> Disaggregation of this data to answer these types of questions is very
>> >> difficult and arguably impossible. This puts people like me, and many
>> >> others like me, in a tough spot in terms of storing and later using
>> >> sensitive data to provide the answers to these questions of pretty
>> >> serious consequence—like, what should we spend money on, or why we
>> >> should continue to exist.
>> >
>> > I’m sure you’re aware, but there are many interesting historical reasons
>> >> for this sensitivity, and organizations like the American Library
>> >> Association (ALA) and other international library associations have
>> >> even codified the protection of patron privacy into their codes of
>> >> ethics. For example, the ALA's Code of Ethics states: "We protect each
>> >> library user's right to privacy and confidentiality with respect to
>> >> information sought or received and resources consulted, borrowed,
>> >> acquired or transmitted." While I deeply respect and admire this
>> >> stance, it doesn't provide a solution for those of us wrestling with
>> >> the aforementioned existential questions.
>> >>
>> >
>> >> In this context, I'd be immensely grateful if you could share your
>> >> insights on the technique of "Pseudonymization"
>> >> (https://en.wikipedia.org/wiki/Pseudonymization) for
>> >> PII data. Additionally, I'd appreciate a brief review of a Python
>> >> module I'm developing, which aims to assist me (and potentially other
>> >> library professionals) in retaining crucial data for subsequent
>> >> analysis while ensuring data subject privacy.
>> >> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
>> >> Thank you once again, Steve, for your
>> >> invaluable contributions to the security community. I eagerly await
>> >> your feedback!
>> >>
>> >>
>> > I think an even better solution than plain Pseudonymization involves
>> > the Birthday Paradox. It's a direction I hadn't even thought of for this!
>> >
>> > --Ray
>> >
>> >> On Fri, Sep 15, 2023 at 2:43 PM Ray Voelker <[log in to unmask]> wrote:
>> >
>> >> Hi code4lib folks .. happy Friday!
>> >>
>> >> I started putting together a little Python utility for doing
>> >> Pseudonymization tasks
>> >> (https://en.wikipedia.org/wiki/Pseudonymization). The goal is to be
>> >> able to do more analysis on data related to circulation while securely
>> >> maintaining patron privacy.
>> >>
>> >> For a little bit of background I wanted something *like a hash* (but
>> >> more secure than a hash), for replacing select fields related to patron
>> >> records. I also wanted something that could possibly be reversed given
>> >> an encrypted private key that would be stored well outside of the scope
>> >> of the project. I'm thinking that if you wanted to geocode addresses
>> >> for example, you could temporarily decrypt each field needed for the
>> >> task, use the *pseudonymized* patron id as the identifier, and then
>> >> send your data off to the geocoder of your choice. Another example
>> >> would be to store a pseudonymized patron id as the identifier in things
>> >> like circulation data used for later analysis, or for transmitting to
>> >> trusted 3rd parties who may do analysis for you.
>> >>
>> >> I'm humbly asking for anyone with some background in using encryption to
>> >> review the code I have and maybe offer some comments / concerns /
>> >> suggestions / jokes about this.
>> >>
>> >> Thanks in advance!
>> >>
>> >> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
>> >>
>> >> --
>> >> Ray Voelker
>> >>
>> >
>> >
>>
>>
>>
>
> --
> Ray Voelker
> (937) 620-1830
>


-- 
Ray Voelker
(937) 620-1830