Ray,

Love it!  

You might consider pointing out that in a population of 300,000 patrons, having ~11 collisions (i.e. 22 patrons) shouldn't have a significant effect on any statistical data you need to report.  It's a win-win.
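
For anyone who wants to check that figure, here is a rough sketch of the birthday-paradox arithmetic. It assumes patron identifiers are mapped uniformly at random into a 32-bit space. The 32-bit size is an assumption, chosen because it gives a collision count close to the one quoted above; it is not stated in this thread. The function name is hypothetical.

```python
def expected_collisions(n_items: int, space_bits: int) -> float:
    """Expected number of colliding pairs when n_items values are drawn
    uniformly at random from a space of 2**space_bits possible values."""
    space_size = 2 ** space_bits
    # There are n*(n-1)/2 distinct pairs, and each pair collides with
    # probability 1/space_size.
    return n_items * (n_items - 1) / (2 * space_size)


if __name__ == "__main__":
    patrons = 300_000  # population size mentioned above
    bits = 32          # ASSUMED identifier size, not stated in the thread
    pairs = expected_collisions(patrons, bits)
    print(f"expected colliding pairs: {pairs:.1f}")
    print(f"patrons affected (about): {2 * pairs:.0f}")
    print(f"share of population: {2 * pairs / patrons:.4%}")
```

Under that assumption the result is about 10.5 colliding pairs, in line with the ~11 figure, and the patrons involved make up well under 0.01% of the population.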

I also noticed a typo:  In the green box just before the "Birthday Paradox", I think you want "represented", not "represent".  

Thanks for taking the time and doing the work.

Erich


On Friday, September 29, 2023 at 09:38, Ray Voelker eloquently inscribed:

> It's Friday again, so another update on this project! lol
> 
> I put together what I think is a pretty good explanation and framing of the
> problem and a possible solution!
> 
> https://chimpy.me/blog/
> 
> --Ray
> 
> On Fri, Sep 22, 2023 at 9:43 AM Ray Voelker wrote:
> 
>> Hi code4lib folks .. and again ... happy Friday!!
>> 
>> I just wanted to post an update to this. I wrote in to the Security Now!
>> podcast (fantastic show by the way and fully worth listening to on a
>> regular basis) about this notion, and it was made the main topic of show
>> number 940!
>> 
>> https://twit.tv/shows/security-now/episodes/940?autostart=false
>> 
>> The discussion starts around the 1:36 mark.
>> 
>> Here's what I wrote to Steve Gibson:
>> 
>>> In addition to being an avid listener to Security Now, I'm also a System
>>> Administrator for a large public library system in Ohio. Libraries
>>> often struggle with data—being especially sensitive around data
>>> related to patrons and patron behavior in terms of borrowing, library
>>> program attendance, reference questions, etc. The common practice is
>>> for libraries to aggregate and then promptly destroy this data within
>>> a short time frame—which is typically one month. However,
>>> administrators and local government officials, who are often
>>> instrumental in allocating library funding and guiding operational
>>> strategies, frequently ask questions on a larger time scale than one
>>> month to validate the library's significance and its operational
>>> strategies. Disaggregation of this data to answer these types of
>>> questions is very difficult and arguably impossible. This puts people
>>> like me, and many others like me, in a tough spot in terms of storing
>>> and later using sensitive data to provide the answers to these
>>> questions of pretty serious consequence—like, what should we spend
>>> money on, or why we should continue to exist.
>> 
>> 
>>> I’m sure you’re aware, but there are many interesting historical reasons
>>> for this sensitivity, and organizations like the American Library
>>> Association (ALA) and other international library associations have
>>> even codified the protection of patron privacy into their codes of
>>> ethics. For example, the ALA's Code of Ethics states: "We protect each
>>> library user's right to privacy and confidentiality with respect to
>>> information sought or received and resources consulted, borrowed,
>>> acquired or transmitted." While I deeply respect and admire this
>>> stance, it doesn't provide a solution for those of us wrestling with
>>> the aforementioned existential questions.
>>> 
>> 
>>> In this context, I'd be immensely grateful if you could share your
>>> insights on the technique of "Pseudonymization"
>>> (https://en.wikipedia.org/wiki/Pseudonymization <https://t.co/gVKvpmzoxp>)
>>> for PII data. Additionally, I'd appreciate a brief review of a Python
>>> module I'm developing, which aims to assist me (and potentially other
>>> library professionals) in retaining crucial data for subsequent
>>> analysis while ensuring data subject privacy.
>>> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
>>> <https://t.co/aAapRKgElr> Thank you once again, Steve, for your
>>> invaluable contributions to the security community. I eagerly await
>>> your feedback!
>>> 
>>> 
>> I think an even better solution than Pseudonymization involves the
>> Birthday Paradox. It's a direction I hadn't even thought of for this!
>> 
>> --Ray
>> 
>> On Fri, Sep 15, 2023 at 2:43 PM Ray Voelker wrote:
>> 
>>> Hi code4lib folks .. happy Friday!
>>> 
>>> I started putting together a little Python utility for doing
>>> Pseudonymization tasks
>>> (https://en.wikipedia.org/wiki/Pseudonymization). The goal is to be
>>> able to do more analysis on data related to circulation while securely
>>> maintaining patron privacy.
>>> 
>>> For a little bit of background I wanted something *like a hash* (but
>>> more secure than a hash), for replacing select fields related to
>>> patron records. I also wanted something that could possibly be
>>> reversed given an encrypted private key that would be stored well
>>> outside of the scope of the project. I'm thinking that if you wanted
>>> to geocode addresses for example, you could temporarily decrypt each
>>> field needed for the task, use the *pseudonymized* patron id as the
>>> identifier, and then send your data off to the geocoder of your
>>> choice. Another example would be to store a pseudonymized patron id as
>>> the identifier in things like circulation data used for later
>>> analysis, or for transmitting to trusted 3rd parties who may do
>>> analysis for you.
>>> 
>>> I'm humbly asking for anyone with some background in using encryption
>>> to review the code I have and maybe offer some comments / concerns /
>>> suggestions / jokes about this.
>>> 
>>> Thanks in advance!
>>> 
>>> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
>>> 
>>> --
>>> Ray Voelker
>>> 
>> 
>> 
>> --
>> Ray Voelker
>> (937) 620-1830
>> 
> 
>