Hello Kim,

I'll echo Kyle's question about the nature of the documents. Depending on who appears in the documents and what type of PII they contain, you may need to look into multiple data classification and redaction strategies to ensure compliance with legal regulations and institutional policies.

If you are going to make research documentation/data, institutional records, etc. publicly available, your institution should have classification and redaction policies that can help. Adding onto John's suggestion to work with campus IT, your compliance department or the person(s) responsible for risk and compliance around those policies are good resources to consult on these matters.

It's tempting to build something in-house with regex, but unless you're dealing with a collection that has only a limited, straightforward scope w/r/t types of PII, it might be better to stick with out-of-the-box solutions. Several commercial and open source data classification and redaction/obfuscation products make it easier to identify the more common types of PII, though results depend on the quality of the scanned text. I know that Spirion is used at various academic institutions, so if your institution already uses it or a similar tool, you might be able to get the institution to lend you a license if you decide to go down that route.

Thanks,
Becky

--------------------------
Becky Yoose (She/her) CIPP/US, MA-LIS
Library Data Privacy Consultant
LDH Consulting Services
E: [log in to unmask]
T: +1 206 445 0733
W: https://ldhconsultingservices.com

> On Fri, Apr 19, 2019 at 10:27 AM Kimberly Kennedy <[log in to unmask]> wrote:
>
> > Hello!
> >
> > We are beginning a digitization project at my institution that involves
> > scanning archival documents that may contain personal identifying
> > information, such as social security numbers or credit card numbers. I'm
> > looking for a tool that will examine the PDFs and identify the ones that
> > may contain PII, so we can then redact them.
> >
> > I've experimented a bit with Bulk Extractor Viewer but haven't been able to
> > get it to work on the scanned PDFs I've created. I talked to a sales rep
> > at Spirion and that program seems like overkill for our purposes. Any
> > suggestions for other things to try would be appreciated!
> >
> > Thanks,
> >
> > Kim
> >
> > Kimberly Kennedy
> > Digital Production Coordinator
> > Northeastern University Library
> > [log in to unmask]
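
For reference, the in-house regex approach mentioned in the reply above might look roughly like the sketch below, in the limited case Kim describes: Social Security numbers and credit card numbers in text that OCR has already extracted from the PDFs. The patterns and the names `luhn_ok` and `find_pii` are illustrative assumptions, not part of any tool named in this thread, and they will miss numbers that OCR has garbled or that are formatted differently.

```python
import re

# Hypothetical patterns for illustration only.
# SSN: three groups separated by a hyphen or space, skipping area, group,
# and serial values that are never issued (000, 666, 9xx, 00, 0000).
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}[- ](?!00)\d{2}[- ](?!0000)\d{4}\b")
# Card: 13 to 19 digits, each optionally followed by one space or hyphen.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for candidate SSNs and card numbers."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        # The checksum filters out most long digit runs that are not cards.
        if luhn_ok(digits):
            hits.append(("card", m.group()))
    return hits
```

Calling `find_pii` on the OCR text of each PDF would give a list of candidates to review by hand before redacting. Even this narrow case needs a checksum and exclusion rules to keep false positives down, which is part of why a broader range of PII types tends to favor an existing tool.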