There's a lot of work in this area using deep learning and recurrent neural network techniques. discusses some policy and other approaches from a U.S. HIPAA perspective.


John P. Rees
Archivist and Digital Resources Manager
History of Medicine Division
National Library of Medicine

-----Original Message-----
From: Kyle Banerjee
Sent: Friday, May 11, 2018 7:17 PM
Subject: [CODE4LIB] Best way to partially anonymize data?

Howdy all,

We need to share large datasets containing medical imagery without revealing PHI. The images themselves don't present a problem due to their nature but the embedded metadata does.

What approaches might work?

Our first reaction was to encrypt problematic fields, embed a public key for each item in the metadata, and have the dataset owner hold a separate private key for each image that allows authorized users to decrypt fields.
Keys would be transmitted via the same secure channels that would normally be used for authorized PHI.
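
As a rough illustration of the per-item key idea, here is a minimal Python sketch. It is not the scheme exactly as described: to stay within the standard library it swaps the public/private key pair for a single random secret key per image (held by the dataset owner) and uses a toy SHAKE-256 keystream as a stand-in for a vetted cipher, so it should not be used on real PHI as written. The field names and the functions protect_item and reveal_item are hypothetical.

```python
import hashlib
import secrets

# Hypothetical field names; real imaging metadata would use its own tags.
PHI_FIELDS = {"PatientName", "PatientID"}


def _keystream(key, nonce, length):
    # Toy keystream derived with SHAKE-256. This is only a stand-in for a
    # vetted cipher from a maintained cryptography library.
    return hashlib.shake_256(key + nonce).digest(length)


def _xor(data, stream):
    return bytes(a ^ b for a, b in zip(data, stream))


def protect_item(metadata):
    """Return (shareable metadata, item key kept by the dataset owner)."""
    item_key = secrets.token_bytes(32)  # one fresh key per image
    shared = {}
    for name, value in metadata.items():
        if name in PHI_FIELDS:
            nonce = secrets.token_bytes(16)  # fresh nonce per field
            plain = value.encode("utf-8")
            cipher = _xor(plain, _keystream(item_key, nonce, len(plain)))
            shared[name] = {"nonce": nonce.hex(), "data": cipher.hex()}
        else:
            shared[name] = value  # non-sensitive fields stay readable
    return shared, item_key


def reveal_item(shared, item_key):
    """An authorized user holding the item key recovers the original fields."""
    restored = {}
    for name, value in shared.items():
        if name in PHI_FIELDS:
            nonce = bytes.fromhex(value["nonce"])
            cipher = bytes.fromhex(value["data"])
            plain = _xor(cipher, _keystream(item_key, nonce, len(cipher)))
            restored[name] = plain.decode("utf-8")
        else:
            restored[name] = value
    return restored


# Made-up sample record, for illustration only.
record = {"PatientName": "Jane Doe", "PatientID": "12345", "Modality": "MR"}
shared, key = protect_item(record)
assert reveal_item(shared, key) == record
```

Because each image gets its own key, running protect_item again on the original record produces a fresh copy with a new key, which is the re-issue step one would fall back on if keys were lost or compromised.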

There's an obvious key management problem (any ideas for this? A central store would counteract the benefits the keys offer), but I'm not sure we really have to worry about that. Significant key loss would be expected, but since the disseminated data is only a copy, a new dataset with new keys could be created from the original if keys were lost or known to be compromised.

This approach has a number of flaws, but we're thinking it may be a practical way to achieve the effect needed without compromising private data.

Any ideas would be appreciated. Thanks,