Hi Erich,
You might try a tool to extract the OCRed text from the PDF. Xpdf includes a command-line tool, pdftotext, for example. Then you could see whether the extracted text is predictable enough to manipulate.
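As a rough sketch of that idea, the Python below shells out to pdftotext with its -layout option (which tries to keep the physical layout of the page) and then groups the output into tagged fields. The line shape it looks for, a three-digit tag at the start of a line followed by the field data with wrapped lines indented beneath it, is only an assumption about how your printed records look, and the function names are made up for illustration. A wrapped line that happens to begin with three digits would be misread as a new field, so treat the result as a first pass to inspect rather than finished MARC.

```python
import re
import subprocess

# Assumption: each field starts with a 3-digit MARC tag; anything else is a
# continuation of the previous field. Adjust to match the real printouts.
FIELD_START = re.compile(r"^\s*(\d{3})\s+(.*\S)\s*$")


def extract_text(pdf_path):
    """Run pdftotext in layout mode; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def group_fields(text):
    """Return a list of (tag, data) tuples, joining wrapped lines."""
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between blocks
        match = FIELD_START.match(line)
        if match:
            fields.append([match.group(1), match.group(2)])
        elif fields:
            # No tag at the start: treat as a continuation line.
            fields[-1][1] += " " + line.strip()
    return [(tag, data) for tag, data in fields]
```

Calling group_fields(extract_text("records.pdf")), where records.pdf is a placeholder name, would give you tag/data pairs to check by eye. If those look consistent, a MARC library could build real records from them.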
-Tod
Tod Olson <[log in to unmask]> (he/him)
Director of Integrated Library Systems
University of Chicago Library
On Jul 21, 2025, at 2:14 PM, Hammer, Erich F <[log in to unmask]> wrote:
Without going into details, we inherited a sizeable collection of physical materials from another library, and were only able to capture the unique MARC records in image (PDF) form.
Visually, they are quite readable and obviously MARC (to a human eye). They are OCR'd, but as you can imagine, the text is in blocks that, when copied together, do not paste in any usable order that would allow us to process them. Copying and pasting every little block of text into the right order would take as much time as (likely more than) simply re-typing them all, although possibly with fewer errors.
Does anyone know of a way to automatically convert these into useable MARC? It feels like something AI could do if trained, but I haven't a clue how to go about doing that.
Thanks,
Erich
--
Erich Hammer Head of Library Systems
[log in to unmask] University Libraries
518-442-3891 University @ Albany
"Belief gets in the way of learning." -- Robert Heinlein