Hi Edwin,
This seems like a fairly difficult thing to do. Are you thinking of search
sessions, where you might see several searches executed by the same user in
a group? Or independent search strings? The latter, with little to no other
metadata, might work. Definitely no IP or user identifier, but timestamps
should also probably be discarded or sanitized (e.g. the year/month instead
of exact time). I don't think a randomized sample accomplishes anything
here since the queries in it pose the same problems as those in the full
set.
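
As a rough sketch of the "independent search strings" option above, something like the following could strip each log entry down to the query plus a year/month bucket. The field names (ip, user_id, timestamp, query) are hypothetical; I don't know what the real log format looks like.

```python
from datetime import datetime


def sanitize_record(record):
    """Keep only the query text and a coarse year-month bucket.

    Assumes (hypothetically) that each record is a dict with an
    ISO 8601 "timestamp" string and a "query" string. Everything
    else (IP, user identifier, exact time) is dropped.
    """
    ts = datetime.fromisoformat(record["timestamp"])
    return {"query": record["query"], "month": ts.strftime("%Y-%m")}


def sanitize_log(records):
    """Sanitize every record, then sort the result so that the output
    order no longer reflects the order in which searches arrived."""
    cleaned = [sanitize_record(r) for r in records]
    cleaned.sort(key=lambda r: (r["month"], r["query"]))
    return cleaned


# Made-up sample entries, for illustration only.
sample = [
    {"ip": "203.0.113.7", "user_id": "u123",
     "timestamp": "2024-04-09T09:51:00", "query": "mi treatment"},
    {"ip": "203.0.113.7", "user_id": "u123",
     "timestamp": "2024-04-09T09:52:30", "query": "heart attack symptoms"},
]
cleaned_sample = sanitize_log(sample)
```

Sorting is there so that two searches from the same session don't sit next to each other in the shared file. None of this addresses what is inside the query strings themselves.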
Any given query could, on its own, be personally identifying. I would worry
about a user including personal information in the query, though I don't
know how common that is. Typing "eric phetteplace has a severe illness"
seems like an odd thing to do but nonetheless a possibility to be
considered. Location details in a query seem more likely and also
problematic. If you could use natural language processing to remove or
censor certain types of named entities (person, place), that would go a
long way, though it still would not be 100% certain to sanitize the data.
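
To make the censoring idea concrete, here is a minimal sketch. Note that it is not real named entity recognition: it swaps in a couple of regular expressions plus a small hand-made term list as a stand-in, and the patterns, labels, and list contents are all made up for illustration. A trained NER model would take the place of the term list in practice.

```python
import re

# Hypothetical patterns for two obvious identifier types.
# A real deployment would need a much broader set.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"), "[PHONE]"),
]

# Hypothetical term list standing in for what an NER model would detect.
GAZETTEER = {
    "[PERSON]": ["eric phetteplace"],
    "[PLACE]": ["athens ga"],
}


def redact(query, gazetteer=GAZETTEER):
    """Replace pattern matches and listed terms with placeholder labels."""
    text = query
    for pattern, label in PATTERNS:
        text = pattern.sub(label, text)
    for label, terms in gazetteer.items():
        for term in terms:
            # Whole-word, case-insensitive match on each listed term.
            text = re.sub(r"\b" + re.escape(term) + r"\b", label,
                          text, flags=re.IGNORECASE)
    return text


print(redact("eric phetteplace has a severe illness"))
print(redact("diabetes clinic athens ga"))
```

Anything not on the list or not matching a pattern passes through untouched, which is exactly why this kind of filtering can't be 100% certain.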
Best,
Eric Phetteplace
Systems Librarian
California College of the Arts
libraries.cca.edu
On Tue, Apr 9, 2024 at 9:51 AM EDWIN VINCENT SPERR <
[log in to unmask]> wrote:
> Greetings!
>
> I have a question for the collective, especially for those knowledgeable
> about patron privacy issues.
>
> I would like to know more about the search behavior of users of a Large
> Public Medical Database. The admins are loath to share any search logs from
> this resource as they are (rightly!) concerned about protecting user
> privacy. This leaves investigators (like me) unable to use this data to do
> things like, to use a completely random example, figure out how often
> end-users are using abbreviations instead of spelling out a medical term in
> their search. I'm wondering if it's possible to demonstrate to the
> proprietors that it would be possible to share at least some of this
> information with me in a safe way.
>
> I understand that sharing a full log entry that includes such identifiable
> info as an IP address or userid is obviously a Bad Idea in terms of
> privacy. My question is whether there is a model of sharing search info
> with outsiders that's far enough to the *other* end of the spectrum as to
> presumptively be okay? What if one only shared bare searches and timestamps
> without tying them to any information about the searcher? Would it
> help if you discarded the timestamps, and instead shared just a dump of all
> the search strings that were received between two dates? If *that* was also
> deemed to be too risky, would it matter if you were sharing instead a
> random sample of those searches?
>
> Any pointers for getting started on this question would be greatly
> appreciated!
>
>
> Edwin V. Sperr, MLIS
> Clinical Information Librarian
> AU/UGA Medical Partnership
> Office of Graduate Medical Education
>
> St. Mary's Hospital
> 1230 Baxter Street
> Athens, GA 30606
>
> (706) 389-3864
> [log in to unmask] | [log in to unmask]
>