The only software package I've heard of that fits that bill is Varonis
<https://www.varonis.com/products/data-classification-engine/>, which seems
to be one of Spirion's competitors. Honestly, if you're seriously
considering uploading PII to your archival storage, there really is no such
thing as "overkill". I'm not fully familiar with the regulatory
consequences of leaked SSNs and credit card info, but for HIPAA/PHI at
least the fines can be quite high. Even if no regulatory penalties apply,
any leaked information could be life-altering for the individuals whose
info you've publicized, and you could easily be sued and lose more money
than Varonis or Spirion would cost.
Overall, this is a matter where you really should consult your
university's security/IT department. While you could implement a bespoke,
homegrown solution for detecting PII with regex, the stakes are high enough
that you really don't want to be messing around with this without
dedicated cybersecurity expertise.
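For what it's worth, here is roughly what a homegrown regex pass might look
like, as a minimal sketch only. It assumes you already have plain text
extracted from the PDFs (scanned pages would need OCR first), and the names
SSN_RE, CARD_RE, luhn_ok and find_pii are my own illustrations, not part of
Varonis, Spirion, Acrobat, or Bulk Extractor.

```python
import re

# Illustrative patterns only (assumptions, not a vetted rule set).
# SSN: 3-2-4 digits with optional separators, skipping never-issued ranges.
SSN_RE = re.compile(
    r"\b(?!000|666|9\d\d)\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b"
)
# Card number: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for candidate SSNs and card numbers."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):  # drop digit runs that cannot be card numbers
            hits.append(("card", m.group()))
    return hits


if __name__ == "__main__":
    print(find_pii("SSN 123-45-6789, card 4111 1111 1111 1111"))
```

Even this toy version shows the problem: it only flags candidates for a
human to review, and it will miss anything the OCR misread, numbers split
across lines, and every kind of PII that has no fixed pattern.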
On Fri, Apr 19, 2019 at 1:44 PM Lane, Jennifer (Library) <
[log in to unmask]> wrote:
> Could you use the patterns feature in Acrobat and regex?
> Jenny Lane | NPL |
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Kimberly Kennedy
> Sent: Friday, April 19, 2019 12:26 PM
> To: [log in to unmask]
> Subject: [CODE4LIB] Looking for lightweight tool to identify PII
> We are beginning a digitization project at my institution that involves
> scanning archival documents that may contain personal identifying
> information, such as social security numbers or credit card numbers. I'm
> looking for a tool that will examine the PDFs and identify the ones that
> may contain PII, so we can then redact them.
> I've experimented a bit with Bulk Extractor Viewer but haven't been able to
> get it to work on the scanned PDFs I've created. I talked to a sales rep
> at Spirion and that program seems like overkill for our purposes. Any
> suggestions for other things to try would be appreciated!
> Kimberly Kennedy
> Digital Production Coordinator
> Northeastern University Library
> [log in to unmask]