Copy-paste errors are also a thing in search logs: e.g. the user thinks they're pasting in their search term but whoops, their clipboard actually still has their credit card number. The only way I can think of around that would be for you to provide a specific list of acronyms/full terms you want to investigate; the admins might then be able to grep the logs for entries containing those terms. But that wouldn't catch the times people tried to search on the full term but included a typo.
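A rough sketch of how that term-list approach might look if the admins ran it on their side and handed back only counts, never the raw queries. Everything named here is an assumption for illustration: the `TERMS` list, the `count_term_hits` function, and the 0.85 similarity cutoff are hypothetical, and the fuzzy comparison (Python's standard `difflib`) is just one possible way to pick up some typos that a plain grep would miss.

```python
import difflib

# Hypothetical list of acronyms/full terms supplied by the researcher.
TERMS = [
    "copd",
    "chronic obstructive pulmonary disease",
    "mi",
    "myocardial infarction",
]


def count_term_hits(queries, terms, cutoff=0.85):
    """Return {term: number of queries containing the term or a near miss}.

    Only aggregate counts are returned, so a stray pasted card number
    never leaves the admins' system. The cutoff is an assumed value and
    would need tuning; very short acronyms effectively match exactly.
    """
    counts = {term: 0 for term in terms}
    for query in queries:
        tokens = query.lower().split()
        for term in terms:
            width = len(term.split())
            # Slide a window of the term's word count across the query.
            windows = [
                " ".join(tokens[i:i + width])
                for i in range(len(tokens) - width + 1)
            ]
            if any(
                difflib.SequenceMatcher(None, term, window).ratio() >= cutoff
                for window in windows
            ):
                counts[term] += 1
    return counts


sample = ["COPD treatment", "myocardial infraction symptoms", "heart attack"]
print(count_term_hits(sample, TERMS))
```

A looser cutoff catches more typos but also more false matches, so the counts should be read as estimates rather than exact figures.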
Very tricky to get it completely safe! Ultimately it comes down to what the admins are comfortable with sharing / what level of trust they're willing to place in you. I managed to get hold of a dataset for research purposes once by solemnly swearing I'd [anonymise/aggregate/etc/etc] but that wasn't as sensitive as this kind of data.
Deborah
-----Original Message-----
From: Code for Libraries On Behalf Of Eric Phetteplace
Sent: Wednesday, April 10, 2024 5:16 AM
Subject: Re: [CODE4LIB] Are there "safe" ways to study and share user searches?
Hi Edwin,
This seems like a fairly difficult thing to do. Are you thinking of search sessions, where you might see several searches executed by the same user in a group? Or independent search strings? The latter, with little to no other metadata, might work. Definitely no IP or user identifier, but timestamps should also probably be discarded or sanitized (e.g. the year/month instead of exact time). I don't think a randomized sample accomplishes anything here since the queries in it pose the same problems as those in the full set.
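As a minimal sketch of the stripping described above, assuming each log entry is a dict with hypothetical `timestamp`, `ip`, `user_id`, and `query` fields (the real log format is not known here): only the query and a year-month bucket are kept. The shuffle is an extra assumption, added so that the order of rows does not hint at which searches belonged to the same session.

```python
import random
from datetime import datetime


def sanitize_records(records, seed=None):
    """Keep only the query text and a coarse year-month bucket.

    IP addresses, user identifiers, and exact times are dropped by
    simply not copying them. Field names are hypothetical.
    """
    cleaned = []
    for record in records:
        stamp = datetime.fromisoformat(record["timestamp"])
        cleaned.append({
            "month": stamp.strftime("%Y-%m"),  # e.g. "2024-04"
            "query": record["query"],
        })
    # Shuffle so row order does not reveal session groupings.
    random.Random(seed).shuffle(cleaned)
    return cleaned
```

This does nothing about identifying text inside the query itself, which is a separate problem.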
Any given query could, on its own, be personally identifying. I would worry about a user including personal information in the query, though I don't know how common that is. Typing "eric phetteplace has a severe illness" seems like an odd thing to do but nonetheless a possibility to be considered. Location details in a query seem more likely and also problematic. If you could use natural language processing to remove or censor certain types of named entities (person, place) that would go a long way but also still not be 100% certain to sanitize the data.
Best,
Eric Phetteplace
Systems Librarian
California College of the Arts
libraries.cca.edu
On Tue, Apr 9, 2024 at 9:51 AM EDWIN VINCENT SPERR wrote:
> Greetings!
>
> I have a question for the collective, especially for those
> knowledgeable about patron privacy issues.
>
> I would like to know more about the search behavior of users of a
> Large Public Medical Database. The admins are loath to share any
> search logs from this resource as they are (rightly!) concerned about
> protecting user privacy. This leaves investigators (like me) unable to
> use this data to do things like, to use a completely random example,
> figure out how often end-users are using abbreviations instead of
> spelling out a medical term in their search. I'm wondering if it's
> possible to demonstrate to the proprietors that it would be possible
> to share at least some of this information with me in a safe way.
>
> I understand that sharing a full log entry that includes such
> identifiable info as an IP address or userid is obviously a Bad Idea
> in terms of privacy. My question is whether there is a model of
> sharing search info with outsiders that's far enough to the *other*
> end of the spectrum as to presumptively be okay? What if one only
> shared bare searches and timestamps without tying them to any
> information about the searcher? Would it help if you discarded the
> timestamps, and instead shared just a dump of all the search strings
> that were received between two dates? If *that* was also deemed to be
> too risky, would it matter if you were sharing instead a random sample of those searches?
>
> Any pointers for getting started on this question would be greatly
> appreciated!
>
>
> Edwin V. Sperr, MLIS
> Clinical Information Librarian
> AU/UGA Medical Partnership
> Office of Graduate Medical Education
>
> St. Mary's Hospital
> 1230 Baxter Street
> Athens, GA 30606
>
> (706) 389-3864
>