Ray,

I liked your original pseudonymization idea and was thinking about trying to rewrite it in PowerShell, but looking into the "birthday paradox" comment opens up some very interesting possibilities.

As I understand it, you can uniquely identify someone using just a portion of a hash of their unique ID (e.g. email address).  Even better for us in library land, if you keep just the right number of bits from the hash, you can create a legally useful level of plausible deniability while maintaining statistically valid data.  With the right calculations, you can introduce enough of a chance that two patrons share the same "anonymized" identifier (calculated from a hash of their unique ID) that no patron can be definitively identified from it, while the vast majority of identifiers remain unique, so any statistics will still be highly accurate.
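
To make that concrete, here is a minimal Python sketch of the idea (Python because Ray's utility is in Python). The function names, the choice of SHA-256, the strip/lowercase normalization, and the example figures of 24 bits and 500,000 patrons are all assumptions for illustration; none of them come from Ray's gist or from the podcast. The probability formulas are the usual birthday-paradox approximations and assume the hash output is uniformly distributed.

```python
import hashlib
import math


def truncated_id(unique_id: str, bits: int) -> int:
    """Return the top `bits` bits of the SHA-256 hash of a unique ID.

    The ID is stripped and lowercased first (an assumption, so that
    "A@B.org" and "a@b.org " map to the same identifier).
    """
    if not 1 <= bits <= 256:
        raise ValueError("bits must be between 1 and 256")
    digest = hashlib.sha256(unique_id.strip().lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def collision_probability(patrons: int, bits: int) -> float:
    """Approximate chance that at least two patrons share an identifier."""
    buckets = 2 ** bits
    return 1.0 - math.exp(-patrons * (patrons - 1) / (2 * buckets))


def shared_fraction(patrons: int, bits: int) -> float:
    """Approximate fraction of patrons whose identifier is shared with
    at least one other patron (i.e. who are not uniquely identified)."""
    buckets = 2 ** bits
    return 1.0 - (1.0 - 1.0 / buckets) ** (patrons - 1)


if __name__ == "__main__":
    # Hypothetical numbers: 500,000 patrons, 24-bit identifiers.
    print(truncated_id("patron@example.org", 24))
    print(collision_probability(500_000, 24))
    print(shared_fraction(500_000, 24))
```

With those hypothetical numbers, some collision somewhere is essentially certain, yet only about 3% of patrons would share an identifier with anyone else. Fewer bits means more deniability and less accuracy; more bits means the reverse. One caveat: an unkeyed hash of an email address can still be tested against a list of known addresses, so a real deployment would probably want a secret key mixed in.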

Good stuff!

Thanks,

Erich



On Friday, September 22, 2023 at 09:43, Ray Voelker eloquently inscribed:

> Hi code4lib folks .. and again ... happy Friday!!
> 
> I just wanted to post an update to this. I wrote in to the Security Now!
> podcast (fantastic show by the way and fully worth listening to on a
> regular basis) about this notion, and it was made the main topic of show
> number 940!
> 
> https://twit.tv/shows/security-now/episodes/940?autostart=false
> 
> The discussion starts around the 1:36 mark.
> 
> Here's what I wrote to Steve Gibson:
> 
> In addition to being an avid listener to Security Now, I'm also a System
>> Administrator for a large public library system in Ohio. Libraries
>> often struggle with data—being especially sensitive around data related
>> to patrons and patron behavior in terms of borrowing, library program
>> attendance, reference questions, etc. The common practice is for
>> libraries to aggregate and then promptly destroy this data within a
>> short time frame—which is typically one month. However, administrators
>> and local government officials, who are often instrumental in
>> allocating library funding and guiding operational strategies,
>> frequently ask questions on a larger time scale than one month to
>> validate the library's significance and its operational strategies.
>> Disaggregation of this data to answer these types of questions is very
>> difficult and arguably impossible. This puts people like me, and many
>> others like me, in a tough spot in terms of storing and later using
>> sensitive data to provide the answers to these questions of pretty
>> serious consequence—like, what should we spend money on, or why we
>> should continue to exist.
> 
> I’m sure you’re aware, but there are many interesting historical reasons
>> for this sensitivity, and organizations like the American Library
>> Association (ALA) and other international library associations have
>> even codified the protection of patron privacy into their codes of
>> ethics. For example, the ALA's Code of Ethics states: "We protect each
>> library user's right to privacy and confidentiality with respect to
>> information sought or received and resources consulted, borrowed,
>> acquired or transmitted." While I deeply respect and admire this
>> stance, it doesn't provide a solution for those of us wrestling with
>> the aforementioned existential questions.
>> 
> 
> In this context, I'd be immensely grateful if you could share your insights
>> on the technique of "Pseudonymization"
>> (https://en.wikipedia.org/wiki/Pseudonymization <https://t.co/gVKvpmzoxp>) for
>> PII data. Additionally, I'd appreciate a brief review of a Python
>> module I'm developing, which aims to assist me (and potentially other
>> library professionals) in retaining crucial data for subsequent
>> analysis while ensuring data subject privacy.
>> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
>> <https://t.co/aAapRKgElr> Thank you once again, Steve, for your
>> invaluable contributions to the security community. I eagerly await
>> your feedback!
>> 
>> 
> I think an even better solution than Pseudonymization involves the
> Birthday Paradox. It's a direction I hadn't even thought of for this!
> 
> --Ray
> 
> On Fri, Sep 15, 2023 at 2:43 PM Ray Voelker wrote:
> 
>> Hi code4lib folks .. happy Friday!
>> 
>> I started putting together a little Python utility for doing
>> Pseudonymization tasks
>> (https://en.wikipedia.org/wiki/Pseudonymization). The goal is to be
>> able to do more analysis on data related to circulation while securely
>> maintaining patron privacy.
>> 
>> For a little bit of background I wanted something *like a hash* (but
>> more secure than a hash), for replacing select fields related to patron
>> records. I also wanted something that could possibly be reversed given
>> an encrypted private key that would be stored well outside of the scope
>> of the project. I'm thinking that if you wanted to geocode addresses
>> for example, you could temporarily decrypt each field needed for the
>> task, use the *pseudonymized* patron id as the identifier, and then
>> send your data off to the geocoder of your choice. Another example
>> would be to store a pseudonymized patron id as the identifier in things
>> like circulation data used for later analysis, or for transmitting to
>> trusted 3rd parties who may do analysis for you.
>> 
>> I'm humbly asking for anyone with some background in using encryption to
>> review the code I have and maybe offer some comments / concerns /
>> suggestions / jokes about this.
>> 
>> Thanks in advance!
>> 
>> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
>> 
>> --
>> Ray Voelker
>> 
> 
>