I'll echo Kyle's question about the nature of the documents.
Depending on who appears in the documents and what types of PII they
contain, you may need to look into multiple data classification and
redaction strategies to ensure compliance with legal regulations and
institutional policies. If you are going to make research
documentation/data, institutional records, etc. publicly available, your
institution should have classification and redaction policies that can
help. Adding onto John's suggestion to work with campus IT, your compliance
department or the person(s) responsible for risk and compliance around
those policies are good resources to consult on these matters.
It's tempting to build something in-house with regex, but unless you're
dealing with a collection that has only a limited, straightforward scope
w/r/t types of PII, it might be better to stick with out-of-the-box
solutions. Several commercial and open source data classification and
redaction/obfuscation-type products make identifying more common types of
PII easier (depending on the quality of the scanned text). I know that
Spirion is used at various academic institutions, so if your institution
uses a tool similar to that, you might be able to get the institution to
lend you a license if you decide to go down that route.
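For a sense of what the in-house regex route involves, here is a minimal
Python sketch, not a vetted tool. It assumes you already have OCR text for
each PDF, that SSNs are written with dashes (123-45-6789), and that card
numbers are 13 to 19 digits optionally separated by spaces or dashes. The
names find_pii, luhn_ok, SSN_RE, and CARD_RE are made up for illustration.
A Luhn checksum is used to cut down on false positives for card numbers.

```python
import re
from typing import List, Tuple

# Assumption: SSNs appear as 123-45-6789; area 000, 666, and 9xx are never issued.
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")
# Assumption: card numbers are 13-19 digits, optionally split by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> List[Tuple[str, str]]:
    """Return (kind, matched_text) pairs for SSN-like and card-like strings."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):  # skip digit runs that fail the checksum
            hits.append(("card", m.group()))
    return hits
```

Even this small example shows the limits: OCR errors, SSNs written without
dashes, and other kinds of PII (names, addresses, dates of birth) would all
slip through, which is why I'd lean toward an existing product for anything
beyond a narrow collection.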
Library Data Privacy Consultant
LDH Consulting Services
T: +1 206 445 0733
> On Fri, Apr 19, 2019 at 10:27 AM Kimberly Kennedy wrote:
> > Hello!
> > We are beginning a digitization project at my institution that involves
> > scanning archival documents that may contain personal identifying
> > information, such as social security numbers or credit card numbers. I'm
> > looking for a tool that will examine the PDFs and identify the ones that
> > may contain PII, so we can then redact them.
> > I've experimented a bit with Bulk Extractor Viewer but haven't been able
> > to get it to work on the scanned PDFs I've created. I talked to a sales rep
> > at Spirion and that program seems like overkill for our purposes. Any
> > suggestions for other things to try would be appreciated!
> > Thanks,
> > Kim
> > Kimberly Kennedy
> > Digital Production Coordinator
> > Northeastern University Library