LISTSERV 16.5 - CODE4LIB Archives

I hereby dub this technique "Stochastic Pseudonymization" :-)

Here's a quick implementation.
https://colab.research.google.com/gist/rayvoelker/74278aa82ee95e3c6dbf0caa993f1ebe/stochastic_pseudonymization.ipynb
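
The notebook itself is not reproduced here. As a rough sketch of the idea only, and not the notebook's actual code, the identifier can be the top few bits of a hash of the patron ID. The function name `stochastic_pseudonym`, the use of HMAC-SHA-256 with a secret key, and the default of 20 bits are all assumptions made for illustration; a plain truncated hash would give the same collision behavior.

```python
import hashlib
import hmac


def stochastic_pseudonym(patron_id: str, secret_key: bytes, num_bits: int = 20) -> int:
    """Return the top `num_bits` bits of a keyed SHA-256 hash of `patron_id`.

    Hypothetical helper: the keyed hash (HMAC) is an assumption, chosen so
    that someone without the key cannot recompute identifiers from a list
    of known patron IDs.
    """
    if not 1 <= num_bits <= 256:
        raise ValueError("num_bits must be between 1 and 256")
    digest = hmac.new(secret_key, patron_id.encode("utf-8"), hashlib.sha256).digest()
    # Treat the 32-byte digest as one big integer and keep only the top bits.
    # Fewer bits means fewer possible identifiers, so more patrons collide.
    return int.from_bytes(digest, "big") >> (256 - num_bits)


# Example with made-up values: the same patron always maps to the same
# identifier, but several patrons may share it.
key = b"example-key-kept-outside-the-dataset"
print(stochastic_pseudonym("patron-0001", key, num_bits=20))
```

Because the identifier is deliberately short, it cannot be reversed to a single patron even by someone who holds the key.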

I think the trick, of course, is picking a number of bits that is large
enough to cover the size of your patron population, yet small enough to
introduce this chance of collision into your data in order, as Steve put
it, "... to leave a *modicum of deliberate uncertainty*, introduced by
hashing collision, to keep it from being possible to prove that any one
patron had some specific behavior in the past".
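
To get a feel for that trade-off, here is a small sketch that assumes the hash spreads patrons uniformly over 2**bits identifiers. The function names and the 500,000-patron population are hypothetical and only for illustration.

```python
import math


def share_probability(num_patrons: int, num_bits: int) -> float:
    """Chance that one given patron shares an identifier with at least one other.

    Equals 1 - (1 - 1/m)**(n - 1) with m = 2**num_bits, computed with
    log1p/expm1 to stay accurate when 1/m is tiny.
    """
    buckets = 2 ** num_bits
    return -math.expm1((num_patrons - 1) * math.log1p(-1.0 / buckets))


def expected_distinct_ids(num_patrons: int, num_bits: int) -> float:
    """Expected number of distinct identifiers in use: m * (1 - (1 - 1/m)**n)."""
    buckets = 2 ** num_bits
    return buckets * -math.expm1(num_patrons * math.log1p(-1.0 / buckets))


# Hypothetical population size; substitute your own patron count.
patrons = 500_000
for bits in (16, 20, 24, 28):
    p = share_probability(patrons, bits)
    distinct = expected_distinct_ids(patrons, bits)
    print(f"{bits} bits: P(a patron shares an ID) = {p:.4f}, "
          f"expected distinct IDs = {distinct:,.0f}")
```

Fewer bits push the sharing probability toward 1 (more deniability, less accurate counts); more bits push it toward 0 (more accurate counts, less deniability).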

--Ray

On Fri, Sep 22, 2023 at 11:44 AM Hammer, Erich F <[log in to unmask]> wrote:

> Ray,
>
> I liked your original pseudonymous idea and was thinking about trying to
> re-write it in PowerShell, but looking into the "birthday paradox" comment
> opens some very interesting possibilities.
>
> As I understand it, you can uniquely identify someone using just a portion
> of a hash of their unique ID (e.g. email address).  Even better for us in
> library land is that if you use just the right number of bits from the
> hash, you can create a legal-level of plausible deniability while
> maintaining statistically valid data.  For example, with the right
> calculations, you can introduce enough of a chance of two patrons having
> the same "anonymized" identifiers (calculated from a hash of their unique
> ID) that a patron can't be definitively identified, but at the same time,
> the vast majority of the identifiers will be unique and thus any statistics
> will still be highly accurate.
>
> Good stuff!
>
> Thanks,
>
> Erich
>
>
>
> On Friday, September 22, 2023 at 09:43, Ray Voelker eloquently inscribed:
>
> > Hi code4lib folks .. and again ... happy Friday!!
> >
> > I just wanted to post an update to this. I wrote in to the Security Now!
> > podcast (fantastic show by the way and fully worth listening to on a
> > regular basis) about this notion, and it was made the main topic of show
> > number 940!
> >
> > https://twit.tv/shows/security-now/episodes/940?autostart=false
> >
> > The discussion starts around the 1:36 mark.
> >
> > Here's what I wrote to Steve Gibson:
> >
> > In addition to being an avid listener to Security Now, I'm also a System
> >> Administrator for a large public library system in Ohio. Libraries
> >> often struggle with data—being especially sensitive around data related
> >> to patrons and patron behavior in terms of borrowing, library program
> >> attendance, reference questions, etc. The common practice is for
> >> libraries to aggregate and then promptly destroy this data within a
> >> short time frame—which is typically one month. However, administrators
> >> and local government officials, who are often instrumental in
> >> allocating library funding and guiding operational strategies,
> >> frequently ask questions on a larger time scale than one month to
> >> validate the library's significance and its operational strategies.
> >> Disaggregation of this data to answer these types of questions is very
> >> difficult and arguably impossible. This puts people like me, and many
> >> others like me, in a tough spot in terms of storing and later using
> >> sensitive data to provide the answers to these questions of pretty
> >> serious consequence—like, what should we spend money on, or why we
> >> should continue to exist.
> >
> > I’m sure you’re aware, but there are many interesting historical reasons
> >> for this sensitivity, and organizations like the American Library
> >> Association (ALA) and other international library associations have
> >> even codified the protection of patron privacy into their codes of
> >> ethics. For example, the ALA's Code of Ethics states: "We protect each
> >> library user's right to privacy and confidentiality with respect to
> >> information sought or received and resources consulted, borrowed,
> >> acquired or transmitted." While I deeply respect and admire this
> >> stance, it doesn't provide a solution for those of us wrestling with
> >> the aforementioned existential questions.
> >>
> >
> > In this context, I'd be immensely grateful if you could share your
> >> insights on the technique of "Pseudonymization"
> >> (https://en.wikipedia.org/wiki/Pseudonymization <https://t.co/gVKvpmzoxp>)
> >> for PII data. Additionally, I'd appreciate a brief review of a Python
> >> module I'm developing, which aims to assist me (and potentially other
> >> library professionals) in retaining crucial data for subsequent
> >> analysis while ensuring data subject privacy.
> >> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
> >> <https://t.co/aAapRKgElr> Thank you once again, Steve, for your
> >> invaluable contributions to the security community. I eagerly await
> >> your feedback!
> >>
> >>
> >  I think an even better solution than Pseudonymization involves the
> > Birthday Paradox. It's a direction I hadn't even thought of for this!
> >
> > --Ray
> >
> > On Fri, Sep 15, 2023 at 2:43 PM Ray Voelker <[log in to unmask]> wrote:
> >
> >> Hi code4lib folks .. happy Friday!
> >>
> >> I started putting together a little Python utility for doing
> >> Pseudonymization tasks
> >> (https://en.wikipedia.org/wiki/Pseudonymization). The goal is to be
> >> able to do more analysis on data related to circulation while securely
> >> maintaining patron privacy.
> >>
> >> For a little bit of background I wanted something *like a hash* (but
> >> more secure than a hash), for replacing select fields related to patron
> >> records. I also wanted something that could possibly be reversed given
> >> an encrypted private key that would be stored well outside of the scope
> >> of the project. I'm thinking that if you wanted to geocode addresses
> >> for example, you could temporarily decrypt each field needed for the
> >> task, use the *pseudonymized* patron id as the identifier, and then
> >> send your data off to the geocoder of your choice. Another example
> >> would be to store a pseudonymized patron id as the identifier in things
> >> like circulation data used for later analysis, or for transmitting to
> >> trusted 3rd parties who may do analysis for you.
> >>
> >> I'm humbly asking for anyone with some background in using encryption to
> >> review the code I have and maybe offer some comments / concerns /
> >> suggestions / jokes about this.
> >>
> >> Thanks in advance!
> >>
> >> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
> >>
> >> --
> >> Ray Voelker
> >>
> >
> >
>
>
>

-- 
Ray Voelker
(937) 620-1830