Ben
Those are all super critical things to be aware of with data of this
nature. I think that's where having a protected app secret, and mixing
other static record data (like a patron record num / id, creation date,
etc.) into a salt would be VERY important so that data is more shielded
from these types of issues and attacks.
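
As a rough illustration of the idea above (this is not the code from the gist), here is a minimal Python sketch. It assumes a hypothetical `pseudonymize` helper, an app secret supplied as bytes from somewhere outside the dataset, and a salt built from the patron record number and creation date. The function name, the field separator, and the iteration count are all assumptions made for the example.

```python
import hashlib
import hmac


def pseudonymize(patron_id: str, created: str, value: str,
                 app_secret: bytes, iterations: int = 100_000) -> str:
    """Return a hex pseudonym for `value` (hypothetical helper).

    The salt mixes static record data (patron id, creation date) with the
    app secret, so it cannot be recomputed from the record data alone.
    """
    # Per-record salt: HMAC of the static fields, keyed by the app secret.
    salt = hmac.new(app_secret, f"{patron_id}|{created}".encode(),
                    hashlib.sha256).digest()
    # Stretch the value with PBKDF2-HMAC-SHA256 under that salt.
    digest = hashlib.pbkdf2_hmac("sha256", value.encode(), salt, iterations)
    return digest.hex()
```

The same inputs and secret always give the same pseudonym, so records can still be joined for analysis. Changing the secret or the record fields gives an unrelated value.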
But again, this data wouldn't (and shouldn't) be considered fully
anonymized: given the database of known data subjects and the application
secrets, you could possibly build a lookup table and re-attribute patrons
to their past activity. Data (even encrypted data) is
of course only private so long as you can properly protect secrets--which
should NEVER be stored along with the processed data. With this in mind,
this type of technique is still very useful for being able to perform
statistical analysis on your library data, while still maintaining and
respecting patron privacy.
https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6#limitations
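
To make that limitation concrete, here is a small Python sketch of the lookup-table attack, assuming a simplified keyed pseudonym (HMAC-SHA256 of the patron id) in place of whatever the real pipeline uses. The helper names, patron ids, and secret are made up for the example.

```python
import hashlib
import hmac


def pseudonym(patron_id: str, app_secret: bytes) -> str:
    # Simplified stand-in for the real pseudonymization step (assumption).
    return hmac.new(app_secret, patron_id.encode(), hashlib.sha256).hexdigest()


def build_lookup(known_patron_ids, app_secret: bytes) -> dict:
    # With the secret AND a list of known subjects, every pseudonym
    # can simply be recomputed and mapped back to its patron id.
    return {pseudonym(pid, app_secret): pid for pid in known_patron_ids}


secret = b"example-app-secret"  # hypothetical; never store it with the data
circulation = [
    (pseudonym("p1001", secret), "item-A"),
    (pseudonym("p1002", secret), "item-B"),
]

table = build_lookup(["p1001", "p1002", "p1003"], secret)
reattributed = [(table[p], item) for p, item in circulation]
```

Without the secret, the same list of known patron ids produces a table that matches nothing, which is why the secret has to live apart from the processed data.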
--Ray
On Fri, Sep 22, 2023 at 4:11 PM Steinberg, Benjamin <
[log in to unmask]> wrote:
> Sorry if this has already been mentioned in this thread or related links
> and I’ve missed it – I believe that anonymization of this sort can be
> broken when other data sources contain data relating to the anonymized
> subjects. The idea is called trail re-identification; see work by Latanya
> Sweeney, Bradley Malin, and Elaine Newton at
>
> https://dataprivacylab.org/people/sweeney/trails1.html
>
> and
>
> https://pubmed.ncbi.nlm.nih.gov/15196482/
>
> - Ben
>
> From: Code for Libraries <[log in to unmask]> on behalf of Ray
> Voelker <[log in to unmask]>
> Date: Friday, September 22, 2023 at 3:31 PM
> To: [log in to unmask] <[log in to unmask]>
> Subject: Re: [CODE4LIB] Patron Data Pseudonymization Review Request ...
> > This could prevent the use of lists of known values (e.g. user email
> > addresses or IDs that have been harvested from public directories) to
> > calculate related hashes for comparison with those in a dataset -
> > enabling de-anonymization.
> >
>
> Basically the exact problem of "Known Plaintext" (
> https://en.wikipedia.org/wiki/Known-plaintext_attack)
>
> I wonder if doing something like basing the PBKDF2HMAC iteration count on
> some static integer related to the patron record (like the patron number or
> id in the local ILS) would be helpful. I suspect that really just mixing an
> "app secret" into a salt derived from other static parts of the patron
> record would be sufficient.
>
> --Ray
>
>
> On Fri, Sep 22, 2023 at 3:00 PM Karl Benedict <[log in to unmask]> wrote:
>
> > Thanks for sharing this. I’m also a long time listener to Security Now
> > (along with a bunch of other TWiT network podcasts) and heard the response
> > to your question yesterday. It was great to hear Steve's deep dive on a
> > topic that I've done a little work on, fortunately confirming the approach
> > that I previously used in the analysis of our proxy logs to troubleshoot an
> > issue with Google Scholar blocking our proxy IP address.
> >
> > Listening yesterday made me think that a needed additional step in
> > generating the hashes from identifiable elements in our data is salting the
> > hashes (adding an additional constant but random value to the values being
> > hashed). This could prevent the use of lists of known values (e.g. user
> > email addresses or IDs that have been harvested from public directories) to
> > calculate related hashes for comparison with those in a dataset -
> > enabling de-anonymization.
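
The attack described here, and the effect of a secret salt, can be sketched in a few lines of Python. This is only an illustration: the email addresses are invented, and the salt is simply prepended to the value before hashing with SHA-256, which is one simple reading of "adding a constant but random value" and not a recommendation over a keyed construction such as HMAC.

```python
import hashlib
import secrets


def plain_hash(value: str) -> str:
    # Unsalted hash: anyone can recompute it for a guessed value.
    return hashlib.sha256(value.encode()).hexdigest()


def salted_hash(value: str, salt: bytes) -> str:
    # Constant but random (and secret) salt mixed into every value.
    return hashlib.sha256(salt + value.encode()).hexdigest()


def match_known(dataset_hashes, known_values) -> dict:
    # Attacker's view: hash a harvested list and compare with the dataset.
    candidates = {plain_hash(v): v for v in known_values}
    return {h: candidates[h] for h in dataset_hashes if h in candidates}


salt = secrets.token_bytes(16)  # hypothetical secret, kept out of the dataset
emails = ["ann@example.org", "bob@example.org"]
harvested = ["bob@example.org", "cat@example.org"]

unsalted_hits = match_known([plain_hash(e) for e in emails], harvested)
salted_hits = match_known([salted_hash(e, salt) for e in emails], harvested)
```

Against the unsalted hashes the harvested list recovers the one address that appears in both lists. Against the salted hashes the same list should match nothing, as long as the salt stays secret.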
> >
> >
> > Thanks, Karl
> >
> > ________________________________
> > From: Code for Libraries <[log in to unmask]> on behalf of Ray
> > Voelker <[log in to unmask]>
> > Sent: Friday, September 22, 2023 7:43:26 AM
> > To: [log in to unmask] <[log in to unmask]>
> > Subject: Re: [CODE4LIB] Patron Data Pseudonymization Review Request ...
> >
> > Hi code4lib folks .. and again ... happy Friday!!
> >
> > I just wanted to post an update to this. I wrote in to the Security Now!
> > podcast (fantastic show by the way and fully worth listening to on a
> > regular basis) about this notion, and it was made the main topic of show
> > number 940!
> >
> > https://twit.tv/shows/security-now/episodes/940?autostart=false
> >
> > The discussion starts around the 1:36 mark.
> >
> > Here's what I wrote to Steve Gibson:
> >
> > > In addition to being an avid listener to Security Now, I'm also a System
> > > Administrator for a large public library system in Ohio. Libraries often
> > > struggle with data--being especially sensitive around data related to
> > > patrons and patron behavior in terms of borrowing, library program
> > > attendance, reference questions, etc. The common practice is for libraries
> > > to aggregate and then promptly destroy this data within a short time
> > > frame--which is typically one month. However, administrators and local
> > > government officials, who are often instrumental in allocating library
> > > funding and guiding operational strategies, frequently ask questions on a
> > > larger time scale than one month to validate the library's significance and
> > > its operational strategies. Disaggregation of this data to answer these
> > > types of questions is very difficult and arguably impossible. This puts
> > > people like me, and many others like me, in a tough spot in terms of
> > > storing and later using sensitive data to provide the answers to these
> > > questions of pretty serious consequence--like, what should we spend money
> > > on, or why we should continue to exist.
> >
> >
> > > I’m sure you’re aware, but there are many interesting historical reasons
> > > for this sensitivity, and organizations like the American Library
> > > Association (ALA) and other international library associations have even
> > > codified the protection of patron privacy into their codes of ethics. For
> > > example, the ALA's Code of Ethics states: "We protect each library user's
> > > right to privacy and confidentiality with respect to information sought or
> > > received and resources consulted, borrowed, acquired or transmitted." While
> > > I deeply respect and admire this stance, it doesn't provide a solution for
> > > those of us wrestling with the aforementioned existential questions.
> > >
> >
> > > In this context, I'd be immensely grateful if you could share your insights
> > > on the technique of "Pseudonymization"
> > > (https://en.wikipedia.org/wiki/Pseudonymization <https://t.co/gVKvpmzoxp>)
> > > for PII data. Additionally, I'd appreciate a brief review of a Python module
> > > I'm developing, which aims to assist me (and potentially other library
> > > professionals) in retaining crucial data for subsequent analysis while
> > > ensuring data subject privacy.
> > > https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
> > > <https://t.co/aAapRKgElr>
> > > Thank you once again, Steve, for your invaluable contributions to the
> > > security community. I eagerly await your feedback!
> > >
> >
> > I think the even better solution compared to Pseudonymization involves the
> > Birthday Paradox. It's a direction I hadn't even thought of for this!
> >
> > --Ray
> >
> > On Fri, Sep 15, 2023 at 2:43 PM Ray Voelker <[log in to unmask]> wrote:
> >
> > > Hi code4lib folks .. happy Friday!
> > >
> > > I started putting together a little Python utility for doing
> > > Pseudonymization tasks (https://en.wikipedia.org/wiki/Pseudonymization).
> > > The goal is to be able to do more analysis on data related to circulation
> > > while securely maintaining patron privacy.
> > >
> > > For a little bit of background I wanted something *like a hash* (but more
> > > secure than a hash), for replacing select fields related to patron records.
> > > I also wanted something that could possibly be reversed given an encrypted
> > > private key that would be stored well outside of the scope of the project.
> > > I'm thinking that if you wanted to geocode addresses for example, you could
> > > temporarily decrypt each field needed for the task, use the
> > > *pseudonymized* patron id as the identifier, and then send your data off
> > > to the geocoder of your choice. Another example would be to store a
> > > pseudonymized patron id as the identifier in things like circulation data
> > > used for later analysis, or for transmitting to trusted 3rd parties who may
> > > do analysis for you.
> > >
> > > I'm humbly asking for anyone with some background in using encryption to
> > > review the code I have and maybe offer some comments / concerns /
> > > suggestions / jokes about this.
> > >
> > > Thanks in advance!
> > >
> > > https://gist.github.com/rayvoelker/80c0dfa5cb47e63c7e498bd064d3c0b6
> > >
> > > --
> > > Ray Voelker
> > >
> >
> >
> > --
> > Ray Voelker
> > (937) 620-1830
> >
>
>
> --
> Ray Voelker
> (937) 620-1830
>
--
Ray Voelker
(937) 620-1830