I hereby dub this technique "Stochastic Pseudonymization" :-)

Here's a quick implementation:
https://colab.research.google.com/gist/rayvoelker/74278aa82ee95e3c6dbf0caa993f1ebe/stochastic_pseudonymization.ipynb

I think the trick, of course, is picking a number of bits that sufficiently
covers the size of your patron population while still introducing this
chance of collision into your data, in order, as Steve put it, "... to
leave a *modicum of deliberate uncertainty*, introduced by hashing
collision, to keep it from being possible to prove that any one patron had
some specific behavior in the past".

--Ray

On Fri, Sep 22, 2023 at 11:44 AM Hammer, Erich F wrote:

> Ray,
>
> I liked your original pseudonymous idea and was thinking about trying to
> re-write it in PowerShell, but looking into the "birthday paradox" comment
> opens some very interesting possibilities.
>
> As I understand it, you can uniquely identify someone using just a portion
> of a hash of their unique ID (e.g. email address). Even better for us in
> library land is that if you use just the right number of bits from the
> hash, you can create a legal level of plausible deniability while
> maintaining statistically valid data. For example, with the right
> calculations, you can introduce enough of a chance of two patrons having
> the same "anonymized" identifiers (calculated from a hash of their unique
> ID) that a patron can't be definitively identified, but at the same time,
> the vast majority of the identifiers will be unique and thus any statistics
> will still be highly accurate.
>
> Good stuff!
>
> Thanks,
>
> Erich
>
> On Friday, September 22, 2023 at 09:43, Ray Voelker eloquently inscribed:
>
> > Hi code4lib folks .. and again ... happy Friday!!
> >
> > I just wanted to post an update to this. I wrote in to the Security Now!
> > podcast (fantastic show by the way and fully worth listening to on a
> > regular basis) about this notion, and it was made the main topic of show
> > number 940!
> >
> > https://twit.tv/shows/security-now/episodes/940?autostart=false
> >
> > The discussion starts around the 1:36 mark.
> >
> > Here's what I wrote to Steve Gibson:
> >
> >> In addition to being an avid listener to Security Now, I'm also a System
> >> Administrator for a large public library system in Ohio. Libraries
> >> often struggle with data, being especially sensitive around data related
> >> to patrons and patron behavior in terms of borrowing, library program
> >> attendance, reference questions, etc. The common practice is for
> >> libraries to aggregate and then promptly destroy this data within a
> >> short time frame, which is typically one month. However, administrators
> >> and local government officials, who are often instrumental in
> >> allocating library funding and guiding operational strategies,
> >> frequently ask questions on a larger time scale than one month to
> >> validate the library's significance and its operational strategies.
> >> Disaggregation of this data to answer these types of questions is very
> >> difficult and arguably impossible. This puts people like me, and many
> >> others like me, in a tough spot in terms of storing and later using
> >> sensitive data to provide the answers to these questions of pretty
> >> serious consequence, like what should we spend money on, or why we
> >> should continue to exist.
> >>
> >> I’m sure you’re aware, but there are many interesting historical reasons
> >> for this sensitivity, and organizations like the American Library
> >> Association (ALA) and other international library associations have
> >> even codified the protection of patron privacy into their codes of
> >> ethics.
> >> For example, the ALA's Code of Ethics states: "We protect each
> >> library user's right to privacy and confidentiality with respect to
> >> information sought or received and resources consulted, borrowed,
> >> acquired or transmitted." While I deeply respect and admire this
> >> stance, it doesn't provide a solution for those of us wrestling with
> >> the aforementioned existential questions.
> >>
> >> In this context, I'd be immensely grateful if you could share your
> >> insights on the technique of "Pseudonymization"
> >> (https://en.wikipedia.org/wiki/Pseudonymization <https://t.co/gVKvpmzoxp>)
> >> for PII data. Additionally, I'd appreciate a brief review of a Python
> >> module I'm developing, which aims to assist me (and potentially other
> >> library professionals) in retaining crucial data for subsequent
> >> analysis while ensuring data subject privacy.
> >> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
> >> <https://t.co/aAapRKgElr> Thank you once again, Steve, for your
> >> invaluable contributions to the security community. I eagerly await
> >> your feedback!
> >
> > I think the even better solution compared to Pseudonymization involves the
> > Birthday Paradox. It's a direction I hadn't even thought of for this!
> >
> > --Ray
> >
> > On Fri, Sep 15, 2023 at 2:43 PM Ray Voelker wrote:
> >
> >> Hi code4lib folks .. happy Friday!
> >>
> >> I started putting together a little Python utility for doing
> >> Pseudonymization tasks
> >> (https://en.wikipedia.org/wiki/Pseudonymization). The goal is to be
> >> able to do more analysis on data related to circulation while securely
> >> maintaining patron privacy.
> >>
> >> For a little bit of background I wanted something *like a hash* (but
> >> more secure than a hash), for replacing select fields related to patron
> >> records.
> >> I also wanted something that could possibly be reversed given
> >> an encrypted private key that would be stored well outside of the scope
> >> of the project. I'm thinking that if you wanted to geocode addresses
> >> for example, you could temporarily decrypt each field needed for the
> >> task, use the *pseudonymized* patron id as the identifier, and then
> >> send your data off to the geocoder of your choice. Another example
> >> would be to store a pseudonymized patron id as the identifier in things
> >> like circulation data used for later analysis, or for transmitting to
> >> trusted 3rd parties who may do analysis for you.
> >>
> >> I'm humbly asking for anyone with some background in using encryption to
> >> review the code I have and maybe offer some comments / concerns /
> >> suggestions / jokes about this.
> >>
> >> Thanks in advance!
> >>
> >> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
> >>
> >> --
> >> Ray Voelker

--
Ray Voelker
(937) 620-1830
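
For anyone who wants to play with the "pick the right number of bits"
idea from the top of this thread, here is a minimal sketch. It is not the
code from the linked notebook or gist. It assumes a keyed hash
(HMAC-SHA256 with a secret kept outside the dataset) and uniformly
distributed hash output, and the names truncated_id,
shared_id_probability and bits_for_target are made up for illustration.

```python
import hashlib
import hmac
import math


def truncated_id(patron_id: str, secret: bytes, bits: int) -> int:
    """Keep only the top `bits` bits of an HMAC-SHA256 of the patron id.

    Assumption: a secret key is used so identifiers cannot be rebuilt by
    simply hashing a list of known patron ids.
    """
    if not 1 <= bits <= 256:
        raise ValueError("bits must be between 1 and 256")
    digest = hmac.new(secret, patron_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def shared_id_probability(population: int, bits: int) -> float:
    """Chance that one given patron shares an identifier with at least one
    other patron, assuming identifiers are uniform over 2**bits values."""
    if population < 1:
        raise ValueError("population must be at least 1")
    buckets = 2 ** bits
    # 1 - (1 - 1/buckets) ** (population - 1), computed in a way that
    # stays accurate when 1/buckets is very small.
    return -math.expm1((population - 1) * math.log1p(-1.0 / buckets))


def bits_for_target(population: int, target: float) -> int:
    """Largest bit count (up to 64) whose shared-id probability is still
    at least `target`. More bits mean fewer collisions."""
    best = None
    for bits in range(1, 65):
        if shared_id_probability(population, bits) >= target:
            best = bits
        else:
            break
    if best is None:
        raise ValueError("target is not reachable for this population")
    return best


if __name__ == "__main__":
    population = 500_000          # hypothetical patron count
    bits = bits_for_target(population, 0.01)
    print("bits:", bits)
    print("shared-id chance:", shared_id_probability(population, bits))
    print("example id:", truncated_id("patron@example.org", b"demo-secret", bits))
```

Fewer bits give more deniability and noisier statistics; more bits give
the reverse. The 1% target and the 500,000 population in the demo are
placeholders, not recommendations.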