Ray,

Love it! You might consider pointing out that in a population of
300,000 patrons, having ~11 collisions (i.e. 22 patrons) shouldn't have
a significant effect on any statistical data you need to report. It's a
win-win.

I also noticed a typo: in the green box just before the "Birthday
Paradox", I think you want "represented", not "represent".

Thanks for taking the time and doing the work.

Erich

On Friday, September 29, 2023 at 09:38, Ray Voelker eloquently inscribed:

> It's Friday again, so another update on this project! lol
>
> I put together what I think is a pretty good explanation and framing of the
> problem and a possible solution!
>
> https://chimpy.me/blog/
>
> --Ray
>
> On Fri, Sep 22, 2023 at 9:43 AM Ray Voelker wrote:
>
>> Hi code4lib folks .. and again ... happy Friday!!
>>
>> I just wanted to post an update to this. I wrote in to the Security Now!
>> podcast (fantastic show by the way and fully worth listening to on a
>> regular basis) about this notion, and it was made the main topic of show
>> number 940!
>>
>> https://twit.tv/shows/security-now/episodes/940?autostart=false
>>
>> The discussion starts around the 1:36 mark.
>>
>> Here's what I wrote to Steve Gibson:
>>
>>> In addition to being an avid listener to Security Now, I'm also a System
>>> Administrator for a large public library system in Ohio. Libraries
>>> often struggle with data, being especially sensitive around data
>>> related to patrons and patron behavior in terms of borrowing, library
>>> program attendance, reference questions, etc. The common practice is
>>> for libraries to aggregate and then promptly destroy this data within
>>> a short time frame, which is typically one month.
>>> However, administrators and local government officials, who are often
>>> instrumental in allocating library funding and guiding operational
>>> strategies, frequently ask questions on a larger time scale than one
>>> month to validate the library's significance and its operational
>>> strategies. Disaggregation of this data to answer these types of
>>> questions is very difficult and arguably impossible. This puts people
>>> like me, and many others like me, in a tough spot in terms of storing
>>> and later using sensitive data to provide the answers to these
>>> questions of pretty serious consequence, like what should we spend
>>> money on, or why we should continue to exist.
>>>
>>> I’m sure you’re aware, but there are many interesting historical reasons
>>> for this sensitivity, and organizations like the American Library
>>> Association (ALA) and other international library associations have
>>> even codified the protection of patron privacy into their codes of
>>> ethics. For example, the ALA's Code of Ethics states: "We protect each
>>> library user's right to privacy and confidentiality with respect to
>>> information sought or received and resources consulted, borrowed,
>>> acquired or transmitted." While I deeply respect and admire this
>>> stance, it doesn't provide a solution for those of us wrestling with
>>> the aforementioned existential questions.
>>>
>>> In this context, I'd be immensely grateful if you could share your
>>> insights on the technique of "Pseudonymization"
>>> (https://en.wikipedia.org/wiki/Pseudonymization
>>> <https://t.co/gVKvpmzoxp>) for PII data. Additionally, I'd appreciate
>>> a brief review of a Python module I'm developing, which aims to assist
>>> me (and potentially other library professionals) in retaining crucial
>>> data for subsequent analysis while ensuring data subject privacy.
>>>
>>> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
>>> <https://t.co/aAapRKgElr>
>>>
>>> Thank you once again, Steve, for your invaluable contributions to the
>>> security community. I eagerly await your feedback!
>>
>> I think the even better solution compared to Pseudonymization involves
>> the Birthday Paradox. It's a direction I hadn't even thought of for this!
>>
>> --Ray
>>
>> On Fri, Sep 15, 2023 at 2:43 PM Ray Voelker wrote:
>>
>>> Hi code4lib folks .. happy Friday!
>>>
>>> I started putting together a little Python utility for doing
>>> Pseudonymization tasks
>>> (https://en.wikipedia.org/wiki/Pseudonymization). The goal is to be
>>> able to do more analysis on data related to circulation while securely
>>> maintaining patron privacy.
>>>
>>> For a little bit of background I wanted something *like a hash* (but
>>> more secure than a hash), for replacing select fields related to
>>> patron records. I also wanted something that could possibly be
>>> reversed given an encrypted private key that would be stored well
>>> outside of the scope of the project. I'm thinking that if you wanted
>>> to geocode addresses for example, you could temporarily decrypt each
>>> field needed for the task, use the *pseudonymized* patron id as the
>>> identifier, and then send your data off to the geocoder of your
>>> choice. Another example would be to store a pseudonymized patron id as
>>> the identifier in things like circulation data used for later
>>> analysis, or for transmitting to trusted 3rd parties who may do
>>> analysis for you.
>>>
>>> I'm humbly asking for anyone with some background in using encryption
>>> to review the code I have and maybe offer some comments / concerns /
>>> suggestions / jokes about this.
>>>
>>> Thanks in advance!
>>>
>>> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
>>>
>>> --
>>> Ray Voelker
>>
>> --
>> Ray Voelker
>> (937) 620-1830
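
For anyone who wants to sanity-check the collision figure at the top of
this thread (~11 collisions among 300,000 patrons), here is a minimal
sketch of the usual birthday-paradox estimate. The 32-bit identifier
width is an assumption on my part, not something stated in the thread; I
picked it only because it lands in the same ballpark. The function name
is hypothetical, and uniform hashing is assumed.

```python
def expected_colliding_pairs(n_items: int, hash_bits: int) -> float:
    """Expected number of pairs that share a hash value.

    Assumes each item is hashed uniformly at random into 2**hash_bits
    buckets, so any given pair collides with probability 1 / 2**hash_bits.
    """
    buckets = 2 ** hash_bits
    n_pairs = n_items * (n_items - 1) / 2  # number of distinct pairs
    return n_pairs / buckets


patrons = 300_000
hash_bits = 32  # ASSUMPTION: width of the truncated identifier

pairs = expected_colliding_pairs(patrons, hash_bits)
print(f"expected colliding pairs: {pairs:.1f}")
print(f"share of patrons in a collision: {2 * pairs / patrons:.4%}")
```

Under that assumption the estimate comes out at roughly 10.5 colliding
pairs, i.e. about 21 patrons, which is well under 0.01% of the
population. Dropping hash_bits raises the collision count quickly, so the
width is the knob that trades deniability against statistical accuracy.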