Thanks for sharing this. I'm also a long-time listener to Security Now (along with a bunch of other TWiT network podcasts) and heard the response to your question yesterday. It was great to hear Steve's deep dive on a topic I've done a little work on; fortunately it confirmed the approach I had previously used when analyzing our proxy logs to troubleshoot an issue with Google Scholar blocking our proxy IP address.
Listening yesterday made me think that a necessary additional step when generating hashes from identifiable elements in our data is salting the hashes (adding a constant but random value to the values being hashed). This would prevent someone with a list of known values (e.g. user email addresses or IDs harvested from public directories) from computing the corresponding hashes and comparing them against those in a dataset, which would otherwise enable de-anonymization.
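If it helps to make that concrete, here's a rough sketch of what I have in mind (purely illustrative; the field value is a placeholder, and in practice the salt would need to be stored and protected as carefully as a key):

    import hashlib
    import secrets

    # One random salt, generated once per dataset, kept secret, and
    # stored separately from the pseudonymized data.
    salt = secrets.token_bytes(32)

    def salted_hash(value: str, salt: bytes) -> str:
        """Salted SHA-256 of an identifier such as a patron email address."""
        return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

    # Without the salt, an attacker holding a harvested list of addresses
    # can't precompute these hashes to match against the dataset.
    token = salted_hash("patron@example.org", salt)

(An HMAC over the identifier with a secret key accomplishes the same thing and is arguably the more standard construction.)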
Thanks, Karl
Schedule an appointment: Online booking page
________________________________
From: Code for Libraries <[log in to unmask]> on behalf of Ray Voelker <[log in to unmask]>
Sent: Friday, September 22, 2023 7:43:26 AM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: [CODE4LIB] Patron Data Pseudonymization Review Request ...
Hi code4lib folks .. and again ... happy Friday!!
I just wanted to post an update to this. I wrote in to the Security Now!
podcast (fantastic show by the way and fully worth listening to on a
regular basis) about this notion, and it was made the main topic of show
number 940!
https://twit.tv/shows/security-now/episodes/940?autostart=false
The discussion starts around the 1:36 mark.
Here's what I wrote to Steve Gibson:
> In addition to being an avid listener to Security Now, I'm also a System
> Administrator for a large public library system in Ohio. Libraries often
> struggle with data—being especially sensitive around data related to
> patrons and patron behavior in terms of borrowing, library program
> attendance, reference questions, etc. The common practice is for libraries
> to aggregate and then promptly destroy this data within a short time
> frame—which is typically one month. However, administrators and local
> government officials, who are often instrumental in allocating library
> funding and guiding operational strategies, frequently ask questions on a
> larger time scale than one month to validate the library's significance and
> its operational strategies. Disaggregation of this data to answer these
> types of questions is very difficult and arguably impossible. This puts
> people like me, and many others like me, in a tough spot in terms of
> storing and later using sensitive data to provide the answers to these
> questions of pretty serious consequence—like, what should we spend money
> on, or why we should continue to exist.
> I'm sure you're aware, but there are many interesting historical reasons
> for this sensitivity, and organizations like the American Library
> Association (ALA) and other international library associations have even
> codified the protection of patron privacy into their codes of ethics. For
> example, the ALA's Code of Ethics states: "We protect each library user's
> right to privacy and confidentiality with respect to information sought or
> received and resources consulted, borrowed, acquired or transmitted." While
> I deeply respect and admire this stance, it doesn't provide a solution for
> those of us wrestling with the aforementioned existential questions.
>
> In this context, I'd be immensely grateful if you could share your insights
> on the technique of "Pseudonymization" (https://en.wikipedia.org/wiki/Pseudonymization) for PII
> data. Additionally, I'd appreciate a brief review of a Python module I'm
> developing, which aims to assist me (and potentially other library
> professionals) in retaining crucial data for subsequent analysis while
> ensuring data subject privacy: https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
> Thank you once again, Steve, for your invaluable contributions to the security community.
> I eagerly await your feedback!
>
I think an even better solution than pseudonymization involves the
Birthday Paradox. It's a direction I hadn't even thought of for this!
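If I'm following the idea correctly, deliberately truncating a salted hash makes collisions likely, so any stored token maps back to many possible patrons and can't be reversed to a single person. A rough sketch of how I might try it (the number of hex digits kept, and everything else here, is a placeholder I picked, not something from the show):

    import hashlib

    def short_token(patron_id: str, salt: bytes, hex_digits: int = 4) -> str:
        """Deliberately truncated, salted SHA-256 of a patron identifier.

        Four hex digits give only 65,536 possible tokens, so by the
        Birthday Paradox collisions appear after only a few hundred
        patrons, and a library-sized population guarantees that each
        token is shared by many people, which makes reversal ambiguous.
        """
        digest = hashlib.sha256(salt + patron_id.encode("utf-8")).hexdigest()
        return digest[:hex_digits]

How much truncation gives enough ambiguity while still supporting useful analysis obviously depends on the size of the patron population, and that trade-off is the part I want to think through next.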
--Ray
On Fri, Sep 15, 2023 at 2:43 PM Ray Voelker <[log in to unmask]> wrote:
> Hi code4lib folks .. happy Friday!
>
> I started putting together a little Python utility for doing
> Pseudonymization tasks (https://en.wikipedia.org/wiki/Pseudonymization).
> The goal is to be able to do more analysis on data related to circulation
> while securely maintaining patron privacy.
>
> For a little bit of background I wanted something *like a hash* (but more
> secure than a hash), for replacing select fields related to patron records.
> I also wanted something that could possibly be reversed given an encrypted
> private key that would be stored well outside of the scope of the project.
> I'm thinking that if you wanted to geocode addresses for example, you could
> temporarily decrypt each field needed for the task, use the
> *pseudonymized* patron id as the identifier, and then send your data off
> to the geocoder of your choice. Another example would be to store a
> pseudonymized patron id as the identifier in things like circulation data
> used for later analysis, or for transmitting to trusted 3rd parties who may
> do analysis for you.
>
> I'm humbly asking for anyone with some background in using encryption to
> review the code I have and maybe offer some comments / concerns /
> suggestions / jokes about this.
>
> Thanks in advance!
>
> https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
>
> --
> Ray Voelker
>
--
Ray Voelker
(937) 620-1830