Could you say a bit more about the documents, the scanning process, and how
reliable the OCR is?
I'd be leery of relying on OCR for identifying PII except as a secondary
check (which may already be your plan). PII takes many forms which often
require a trained eye to spot -- particularly when it's a combination of
points that are normally harmless by themselves. Even for well understood
data such as names, addresses, CCs, SSNs, DLs, dates (DoB, medical
procedures, etc), and the like, scan issues or other idiosyncrasies could
cause you to miss all kinds of important stuff. Regexes are awesome, but
they're sometimes a blunter tool than we need.
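
To make the "secondary check" idea concrete, here is a rough Python sketch of a regex pass over text that has already been extracted by OCR. The pattern names, the find_candidates helper, and the sample strings are all made up for illustration, and the patterns only cover US-style SSNs and 13-19 digit card numbers (with a Luhn checksum to cut down on false positives). Treat it as a starting point under those assumptions, not a vetted detector.

```python
import re

# Hypothetical patterns for illustration only; real collections need tuning.
# SSN-like: 3-2-4 digits, optionally separated by a hyphen or space.
SSN_RE = re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b")
# Card-like: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs that merit a human look."""
    hits = [("ssn", m.group(0)) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group(0))
        if luhn_ok(digits):
            hits.append(("card", m.group(0)))
    return hits


if __name__ == "__main__":
    clean = "SSN 123-45-6789, card 4111 1111 1111 1111"
    misread = "SSN l23-45-6789"  # OCR read the digit 1 as a lowercase L
    print(find_candidates(clean))
    print(find_candidates(misread))  # nothing found: the regex misses it
```

The second sample is the kind of scan issue I mean: a single misread character is enough for the pattern to pass over a real SSN, so an empty result shouldn't be read as "no PII here."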
On Fri, Apr 19, 2019 at 10:27 AM Kimberly Kennedy wrote:
> We are beginning a digitization project at my institution that involves
> scanning archival documents that may contain personal identifying
> information, such as social security numbers or credit card numbers. I'm
> looking for a tool that will examine the PDFs and identify the ones that
> may contain PII, so we can then redact them.
> I've experimented a bit with Bulk Extractor Viewer but haven't been able to
> get it to work on the scanned PDFs I've created. I talked to a sales rep
> at Spirion and that program seems like overkill for our purposes. Any
> suggestions for other things to try would be appreciated!
> Kimberly Kennedy
> Digital Production Coordinator
> Northeastern University Library