The only software package I've heard of that fits the bill is Varonis
<https://www.varonis.com/products/data-classification-engine/>, which seems
to be one of Spirion's competitors.

Honestly, if you're seriously considering uploading PII to your archival
storage, there really is no such thing as "overkill". I'm not totally
familiar with the regulatory consequences of leaking SSNs and credit card
info, but for HIPAA/PHI at least the fines can be quite high. Even if no
regulatory penalties are incurred, consider that any leaked information
could be life-altering for the individuals whose info you've exposed, and
that you could easily be sued and lose more money than Varonis or Spirion
would cost.

Overall, this is a matter where you really should consult internally with
your university's security/IT department. While it's true that you could
implement some bespoke, homegrown solution for detecting PII with regex, the
stakes are high enough that you really don't want to be messing around with
this without more finely honed cybersecurity expertise.

On Fri, Apr 19, 2019 at 1:44 PM Lane, Jennifer (Library) <[log in to unmask]> wrote:

> Could you use the patterns feature in Acrobat and regex?
> http://blogs.adobe.com/acrolaw/2011/05/creating_and_using_custom_redact/
>
> Jenny Lane | NPL |
> 615-880-1622
>
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Kimberly Kennedy
> Sent: Friday, April 19, 2019 12:26 PM
> To: [log in to unmask]
> Subject: [CODE4LIB] Looking for lightweight tool to identify PII
>
> Attention: This email originated from a source external to Metro
> Government. Please exercise caution when opening any attachments or links
> from external sources.
>
>
> Hello!
>
> We are beginning a digitization project at my institution that involves
> scanning archival documents that may contain personal identifying
> information, such as social security numbers or credit card numbers. I'm
> looking for a tool that will examine the PDFs and identify the ones that
> may contain PII, so we can then redact them.
>
> I've experimented a bit with Bulk Extractor Viewer but haven't been able to
> get it to work on the scanned PDFs I've created. I talked to a sales rep
> at Spirion and that program seems like overkill for our purposes. Any
> suggestions for other things to try would be appreciated!
>
> Thanks,
>
> Kim
>
>
> Kimberly Kennedy
> Digital Production Coordinator
> Northeastern University Library
> [log in to unmask]
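
For reference, the kind of homegrown regex check mentioned at the top of this message might look roughly like the sketch below. It assumes the text has already been pulled out of the PDFs (scanned images would need OCR first, which is not shown), and the names SSN_RE, CARD_RE, luhn_ok, and find_pii are made up for illustration. It only flags candidates for a person to review, and it will miss anything the OCR garbles or that is formatted differently.

```python
import re

# Hypothetical patterns, for illustration only.
# Dashed SSNs (NNN-NN-NNNN), skipping area/group/serial values that are never issued.
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for SSN-like and card-like strings."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):  # discard digit runs that fail the checksum
            hits.append(("card", m.group()))
    return hits
```

Calling find_pii on the extracted text of each document would give a list of candidate matches to check by hand before redacting. The Luhn checksum cuts down false positives on long digit runs such as accession numbers, but nothing here catches names, addresses, or handwritten numbers, which is a large part of why a vetted tool and your security office are the safer route.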