
Sorry if this has already been mentioned in this thread or related links and I’ve missed it – I believe that anonymization of this sort can be broken when other data sources contain data relating to the anonymized subjects. The idea is called trail re-identification; see work by Latanya Sweeney, Bradley Malin, and Elaine Newton at

https://dataprivacylab.org/people/sweeney/trails1.html

and

https://pubmed.ncbi.nlm.nih.gov/15196482/

- Ben

From: Code for Libraries <[log in to unmask]> on behalf of Ray Voelker <[log in to unmask]>
Date: Friday, September 22, 2023 at 3:31 PM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: [CODE4LIB] Patron Data Pseudonymization Review Request ...
> This could prevent the use of lists of known values (e.g. user email
> addresses or IDs that have been harvested from public directories) to
> calculate related hashes for comparison with those in a dataset - enabling
> de-anonymization.
>

Basically the exact problem of "Known Plaintext" (
https://en.wikipedia.org/wiki/Known-plaintext_attack)

I wonder if doing something like basing the PBKDF2HMAC iteration count on
some static integer related to the patron record (like the patron number or
id in the local ILS) would be helpful. Though I suspect that simply mixing
an "app secret" into the salt, along with other static parts of the patron
record, would be sufficient.
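A minimal sketch of that idea (a fixed iteration count, with an application secret mixed into the salt); the names `APP_SECRET`, `patron_id`, and `pseudonymize` are placeholders for illustration, not taken from Ray's gist:

```python
import hashlib
import hmac

# Hypothetical long-term secret, stored outside the dataset and the
# project (e.g. in a key vault); not derived from any patron data.
APP_SECRET = b"example-app-secret-keep-out-of-the-dataset"

def pseudonymize(patron_id: str, value: str) -> str:
    # Derive a per-record salt from the app secret plus a static part of
    # the patron record. Without the secret, an attacker cannot
    # precompute hashes for a list of known emails or IDs.
    salt = hmac.new(APP_SECRET, patron_id.encode(), hashlib.sha256).digest()
    # A fixed, deliberately high iteration count; varying it per record
    # adds little once the salt itself depends on a secret.
    digest = hashlib.pbkdf2_hmac("sha256", value.encode(), salt, 100_000)
    return digest.hex()
```

The mapping stays deterministic for a given patron, so pseudonyms can still be joined across datasets, but a harvested directory of addresses is useless to anyone who doesn't hold `APP_SECRET`.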

--Ray


On Fri, Sep 22, 2023 at 3:00 PM Karl Benedict <[log in to unmask]> wrote:

> Thanks for sharing this. I'm also a long-time listener to Security Now
> (along with a bunch of other TWiT network podcasts) and heard the response
> to your question yesterday. It was great to hear Steve's deep dive on a
> topic that I've done a little work on, fortunately confirming the approach
> that I previously used in the analysis of our proxy logs to troubleshoot an
> issue with Google Scholar blocking our proxy IP address.
>
> Listening yesterday made me think that a needed additional step in
> generating the hashes from identifiable elements in our data is salting the
> hashes (adding an additional constant-but-random value to the values being
> hashed). This could prevent the use of lists of known values (e.g. user
> email addresses or IDs that have been harvested from public directories) to
> calculate related hashes for comparison with those in a dataset, which
> would enable de-anonymization.
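The attack Karl describes is easy to demonstrate: with unsalted hashes, anyone holding a list of candidate values can test membership directly. A small illustration (the addresses are invented):

```python
import hashlib

# Unsalted pseudonyms, as a dataset might naively store them.
dataset = {hashlib.sha256(e.encode()).hexdigest()
           for e in ["alice@example.org", "bob@example.org"]}

# An attacker hashes each harvested address and checks for membership,
# re-identifying any record whose hash appears in the dataset.
harvested = ["bob@example.org", "carol@example.org"]
hits = [e for e in harvested
        if hashlib.sha256(e.encode()).hexdigest() in dataset]
# hits == ["bob@example.org"]
```

A secret salt breaks this: the attacker can no longer compute the dataset's hash function on their own.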
>
>
> Thanks, Karl
>
> ________________________________
> From: Code for Libraries <[log in to unmask]> on behalf of Ray
> Voelker <[log in to unmask]>
> Sent: Friday, September 22, 2023 7:43:26 AM
> To: [log in to unmask] <[log in to unmask]>
> Subject: Re: [CODE4LIB] Patron Data Pseudonymization Review Request ...
>
>
> Hi code4lib folks .. and again ... happy Friday!!
>
> I just wanted to post an update to this. I wrote in to the Security Now!
> podcast (fantastic show by the way and fully worth listening to on a
> regular basis) about this notion, and it was made the main topic of show
> number 940!
>
> https://twit.tv/shows/security-now/episodes/940?autostart=false
>
> The discussion starts around the 1:36 mark.
>
> Here's what I wrote to Steve Gibson:
>
> > In addition to being an avid listener to Security Now, I'm also a System
> > Administrator for a large public library system in Ohio. Libraries often
> > struggle with data—being especially sensitive around data related to
> > patrons and patron behavior in terms of borrowing, library program
> > attendance, reference questions, etc. The common practice is for
> > libraries to aggregate and then promptly destroy this data within a short
> > time frame—which is typically one month. However, administrators and
> > local government officials, who are often instrumental in allocating
> > library funding and guiding operational strategies, frequently ask
> > questions on a larger time scale than one month to validate the library's
> > significance and its operational strategies. Disaggregation of this data
> > to answer these types of questions is very difficult and arguably
> > impossible. This puts people like me, and many others like me, in a tough
> > spot in terms of storing and later using sensitive data to provide the
> > answers to these questions of pretty serious consequence—like, what
> > should we spend money on, or why we should continue to exist.
> >
> > I'm sure you're aware, but there are many interesting historical reasons
> > for this sensitivity, and organizations like the American Library
> > Association (ALA) and other international library associations have even
> > codified the protection of patron privacy into their codes of ethics. For
> > example, the ALA's Code of Ethics states: "We protect each library user's
> > right to privacy and confidentiality with respect to information sought
> > or received and resources consulted, borrowed, acquired or transmitted."
> > While I deeply respect and admire this stance, it doesn't provide a
> > solution for those of us wrestling with the aforementioned existential
> > questions.
> >
> > In this context, I'd be immensely grateful if you could share your
> > insights on the technique of "Pseudonymization"
> > (https://en.wikipedia.org/wiki/Pseudonymization) for PII data.
> > Additionally, I'd appreciate a brief review of a Python module I'm
> > developing, which aims to assist me (and potentially other library
> > professionals) in retaining crucial data for subsequent analysis while
> > ensuring data subject privacy.
> > https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
> > Thank you once again, Steve, for your invaluable contributions to the
> > security community. I eagerly await your feedback!
> >
>
> I think an even better solution than Pseudonymization involves the
> Birthday Paradox. It's a direction I hadn't even thought of for this!
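I don't know exactly which birthday-paradox construction was proposed in the episode, but if the idea is assigning patrons random identifiers, the standard birthday bound tells you how large those identifiers need to be before collisions become a worry. A rough back-of-the-envelope check (the numbers are illustrative only):

```python
import math

def collision_probability(n: int, bits: int) -> float:
    """Approximate probability that n random IDs of the given bit
    length contain at least one collision (birthday bound)."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * 2 ** bits))

# Around 77,000 random 32-bit IDs already give ~50% collision odds,
# while a million 64-bit IDs keep the risk negligible.
risky = collision_probability(77_163, 32)
safe = collision_probability(1_000_000, 64)
```

The practical takeaway is just that random pseudonyms need enough bits for the population size; 64 bits or more is comfortable for any library system.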
>
> --Ray
>
> On Fri, Sep 15, 2023 at 2:43 PM Ray Voelker <[log in to unmask]> wrote:
>
> > Hi code4lib folks .. happy Friday!
> >
> > I started putting together a little Python utility for doing
> > Pseudonymization tasks (https://en.wikipedia.org/wiki/Pseudonymization).
> > The goal is to be able to do more analysis on data related to circulation
> > while securely maintaining patron privacy.
> >
> > For a little bit of background, I wanted something *like a hash* (but
> > more secure than a hash) for replacing select fields related to patron
> > records. I also wanted something that could possibly be reversed given an
> > encrypted private key that would be stored well outside of the scope of
> > the project. I'm thinking that if you wanted to geocode addresses, for
> > example, you could temporarily decrypt each field needed for the task,
> > use the *pseudonymized* patron id as the identifier, and then send your
> > data off to the geocoder of your choice. Another example would be to
> > store a pseudonymized patron id as the identifier in things like
> > circulation data used for later analysis, or for transmitting to trusted
> > 3rd parties who may do analysis for you.
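One way to read those requirements (a keyed, hash-like forward mapping, plus a reversal path held well outside the project) can be sketched as below. The plain dict stands in for a separately encrypted reversal store, and none of these names come from Ray's actual gist:

```python
import hashlib
import hmac
import secrets

# Key material that would live entirely outside the project's scope.
KEY = secrets.token_bytes(32)

# Stand-in for a reversal store that would be encrypted and kept with
# the key, never alongside the pseudonymized dataset.
reversal_store: dict[str, str] = {}

def pseudonymize(value: str) -> str:
    # HMAC gives a deterministic, hash-like pseudonym that cannot be
    # brute-forced from lists of known values without KEY.
    pseudonym = hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()
    reversal_store[pseudonym] = value
    return pseudonym

def reverse(pseudonym: str) -> str:
    # Reversal is only possible for whoever holds the store and key.
    return reversal_store[pseudonym]
```

For the geocoding case, `reverse` would run only in the trusted environment holding the key, while downstream analysis and third parties see only the pseudonyms.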
> >
> > I'm humbly asking for anyone with some background in using encryption to
> > review the code I have and maybe offer some comments / concerns /
> > suggestions / jokes about this.
> >
> > Thanks in advance!
> >
> > https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
> >
> > --
> > Ray Voelker
> >
>
>
> --
> Ray Voelker
> (937) 620-1830
>


--
Ray Voelker
(937) 620-1830