Disclaimer: Some of this is probably going to be redundant with respect to
what Becky K has already said.

Are you using DICOMs?  If so, it's pretty straightforward to anonymize
scans with pydicom <>.  You'd just need to
identify which tags contain PHI, iterate over all the DICOM files with
os.walk, read each file in as you go, replace the PHI tags on the
pydicom Dataset object with empty strings, 'None', or similar, and finally
write the modified DICOM out to a new location.  You could then distribute
the anonymized DICOMs, and as long as the traditional 18 identifiers
<> have been removed you'd be
fine, since the images would no longer be considered to contain PHI.
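The steps above could be sketched roughly like this (a minimal sketch, not production code -- the tag list here is illustrative only, and you'd need to audit your own data against the full set of 18 HIPAA identifiers):

```python
import os

try:
    import pydicom  # assumes pydicom is installed
except ImportError:
    pydicom = None

# Illustrative subset of PHI-bearing tags -- NOT exhaustive.
TAGS_TO_BLANK = [
    "PatientName", "PatientID", "PatientBirthDate",
    "PatientAddress", "InstitutionName",
]

def scrub(ds, tags=TAGS_TO_BLANK):
    """Replace the listed tags with empty strings, if present."""
    for tag in tags:
        if hasattr(ds, tag):
            setattr(ds, tag, "")
    return ds

def anonymize_tree(src_root, dst_root):
    """Walk src_root, scrub each DICOM, and write the result to a
    parallel tree under dst_root, leaving the originals untouched."""
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            ds = pydicom.dcmread(src)
            scrub(ds)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            ds.save_as(dst)
```

Writing to a separate tree (rather than in place) means the originals with PHI stay behind your existing access controls while only the scrubbed copies get shared.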

If for some reason you can't remove all the identifiers, you could set up a
system where outside researchers who want to look at the data could sign a
data usage agreement to get access to the data with PHI.  This, however,
would require a lawyer, as well as someone to manage the agreements
(although if you're interacting with PHI or even thinking about
distributing medical data in any way, you should have a lawyer available).

I've never encountered encrypting PHI fields
<> before, and I never
saw it done in any of the datasets I worked with back when I was
in science.  It seems like a good idea if you absolutely need to keep the
PHI in the headers for whatever reason, but in terms of legality,
person-hours, stress, and parsimony you're probably better off just keeping
a separate anonymized dataset and sharing that with outsiders.

Another specific question, but are you working with neuroimaging data in
particular?  If so, you might want to consider BIDS
<>.  Since BIDS is NIfTI-based, tons of headers
(including those with PHI) are discarded, and the DICOM tags that
scientists actually care about (which are typically non-identifying) are
saved in sidecar JSON files that are paired with NIfTI images.  Basically
it's a standard explicitly geared towards promoting data sharing.  You'll
also definitely want to make sure you deface
<> any images before
distribution to non-privileged researchers (assuming that you are working
with brain images) since faces are considered to be PHI.
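For illustration, a minimal BIDS tree for one (hypothetical) subject looks something like this -- the .nii.gz files carry the images with their stripped-down NIfTI headers, and the sidecar .json files carry the curated acquisition metadata:

```text
sub-01/
    anat/
        sub-01_T1w.nii.gz
        sub-01_T1w.json
    func/
        sub-01_task-rest_bold.nii.gz
        sub-01_task-rest_bold.json
```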

On Tue, May 15, 2018 at 3:42 PM, Rees, John (NIH/NLM) [E] <
[log in to unmask]> wrote:

> There's a lot of work in this area using deep learning, recurrent neural
> network techniques.
> discusses some policy and
> other approaches from a U.S. HIPAA perspective.
> John
> John P. Rees
> Archivist and Digital Resources Manager
> History of Medicine Division
> National Library of Medicine
> 301-827-4510
> -----Original Message-----
> From: Kyle Banerjee [mailto:[log in to unmask]]
> Sent: Friday, May 11, 2018 7:17 PM
> To: [log in to unmask]
> Subject: [CODE4LIB] Best way to partially anonymize data?
> Howdy all,
> We need to share large datasets containing medical imagery without
> revealing PHI. The images themselves don't present a problem due to their
> nature but the embedded metadata does.
> What approaches might work?
> Our first reaction was to encrypt problematic fields, embed a public key
> for each item in the metadata, and have the dataset owner hold a separate
> private key for each image that allows authorized users to decrypt fields.
> Keys would be transmitted via the same secure channels that would normally
> be used for authorized PHI.
> There's an obvious key management problem (any ideas for this -- central
> store would counteract the benefits the keys offer), but I'm not sure if we
> really have to worry about that. Significant key loss would be expected but
> since the data disseminated is only a copy, a new dataset with new keys
> could be created from the original if keys were lost or known to be
> compromised.
> This approach has a number of flaws, but we're thinking it may be a
> practical way to achieve the effect needed without compromising private
> data.
> Any ideas would be appreciated. Thanks,
> kyle