Hi Kim,

Could you say a bit more about the documents, the scanning process, and how reliable the OCR is?

I'd be leery of relying on OCR for identifying PII except as a secondary check (which may already be your plan). PII takes many forms which often require a trained eye to spot -- particularly when it's a combination of points that are normally harmless by themselves.

Even for well-understood data such as names, addresses, CCs, SSNs, DLs, dates (DoB, medical procedures, etc.), and the like, scan issues or other idiosyncrasies could cause you to miss all kinds of important stuff. Regexes are awesome, but they're sometimes a blunter tool than we need.

kyle

On Fri, Apr 19, 2019 at 10:27 AM Kimberly Kennedy wrote:

> Hello!
>
> We are beginning a digitization project at my institution that involves
> scanning archival documents that may contain personal identifying
> information, such as social security numbers or credit card numbers. I'm
> looking for a tool that will examine the PDFs and identify the ones that
> may contain PII, so we can then redact them.
>
> I've experimented a bit with Bulk Extractor Viewer but haven't been able to
> get it to work on the scanned PDFs I've created. I talked to a sales rep
> at Spirion and that program seems like overkill for our purposes. Any
> suggestions for other things to try would be appreciated!
>
> Thanks,
>
> Kim
>
>
> Kimberly Kennedy
> Digital Production Coordinator
> Northeastern University Library
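
To make the regex point in the reply above concrete, here is a minimal sketch of a regex pass over text that has already been extracted by OCR. Everything in it is an assumption for illustration: the two patterns, the helper names `luhn_ok` and `flag_pii`, and the sample strings are hypothetical and are not taken from Bulk Extractor, Spirion, or any other tool mentioned in this thread. It only looks for SSN-like and card-like digit runs, so it would serve at most as a secondary check alongside human review.

```python
import re

# Hypothetical, deliberately loose patterns: digits optionally separated
# by a space or hyphen. Real documents will need broader coverage.
SSN_RE = re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def flag_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs for SSN-like and card-like runs."""
    hits = [("ssn?", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):  # the checksum filters out most random digit runs
            hits.append(("card?", m.group()))
    return hits


if __name__ == "__main__":
    # Sample values only, not real data.
    clean = "SSN 078-05-1120, card 4111 1111 1111 1111"
    # Same text with typical OCR confusions: letter O for 0, letter l for 1.
    garbled = "SSN O78-05-112O, card 4111 1111 1111 111l"
    print(flag_pii(clean))    # both values are flagged
    print(flag_pii(garbled))  # [] (nothing is flagged)
```

The second call is the failure mode to worry about: a single misread character per number is enough for both patterns to miss, and the page would pass as clean. Loosening the patterns to tolerate such confusions trades missed hits for more false positives, which is one reason a pass like this works better for prioritizing pages for review than for deciding what is safe to release.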