It's a not-very-interesting story of disorganization, poor communication, too few employees and a touch of corporate greed:
A small, nearby college shuttered. Our university decided to try to scoop up the well-regarded early-education program and snag the former library's unique collection of educational "kits". The former site was scheduled for deletion in short order, and ExLibris essentially tried to extort us for a ridiculously astronomical amount to give us the records. Nobody thought to ask our sole developer (who might have been able to scrape the records in a usable format) until they had just left for a 3-month parental leave, so someone assigned a student to manually bring up all the records and capture the information. The student's solution was to generate PDFs of every page. The site and its data are no more at this point, so we have what we have.
The PDFs were generated with text, not OCR'd (as I originally suggested), so the text is accurate. However, the strings are broken up, and of course, PDF readers don't know how the text "fits" together. Thus, selected text comes out column by column, and the columns are not the same length because of line wrapping. It's a mess.
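A minimal sketch of how the broken-up strings might be stitched back together, assuming a PDF library can hand back each text fragment with its position on the page. Everything here is hypothetical: the (x, y, text) tuples, the y tolerance, the tag-column threshold, and the sample record are made up for illustration, and y is assumed to increase down the page.

```python
import re

TAG_RE = re.compile(r"^\d{3}$")  # a MARC tag is three digits


def group_into_lines(fragments, y_tol=2.0):
    """Group (x, y, text) fragments into visual lines by y position."""
    lines = []
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(lines[-1]["y"] - y) <= y_tol:
            lines[-1]["parts"].append((x, text))
        else:
            lines.append({"y": y, "parts": [(x, text)]})
    # Order each line left to right, whatever order the fragments arrived in.
    return [sorted(line["parts"]) for line in lines]


def merge_wrapped_fields(lines, tag_col_max=60.0):
    """Start a field at each tag in the left column; append wrapped lines."""
    fields = []
    for parts in lines:
        first_x, first_text = parts[0]
        texts = [t for _, t in parts]
        if first_x <= tag_col_max and TAG_RE.match(first_text):
            fields.append({"tag": first_text, "value": " ".join(texts[1:])})
        elif fields:
            # No tag in the left column: treat as a continuation line.
            joined = fields[-1]["value"] + " " + " ".join(texts)
            fields[-1]["value"] = joined.strip()
    return fields


# Made-up fragments: a 245 whose value wraps onto a second line, then a 300.
fragments = [
    (100.0, 100.0, "Counting bears kit"),
    (50.0, 100.0, "245"),
    (100.0, 112.0, "[kit]"),
    (50.0, 124.0, "300"),
    (100.0, 124.0, "1 box"),
]
fields = merge_wrapped_fields(group_into_lines(fragments))
for field in fields:
    print(field["tag"], field["value"])
```

Grouping on position rather than on selection order is the point: the wrapped line has nothing in the tag column, so it gets attached to the field above it instead of throwing the two columns out of step.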
Erich
On Monday, July 21, 2025 at 21:48, Kyle Banerjee eloquently inscribed:
> On Mon, Jul 21, 2025 at 12:20 PM Hammer, Erich F <[log in to unmask]>
> wrote:
>
>> Without going into details, we inherited a sizeable collection of physical
>> materials from another library, and were only able to capture the unique
>> MARC records in image (PDF) form.
>
> The details provide the parameters for the easiest/best methods (and
> it's hard to imagine there's not a good story behind getting stuck with
> images of records without actually having records). I assume there's a
> reason you don't just do the conversion in Acrobat or use one of the
> many utilities or services.
>
> A true OCR process is likely to be error prone, I'd be concerned about
> positional data and encoding issues even if the other stuff is right.
> Parsing for identifiers and downloading actual MARC records might prove
> faster and more reliable if these aren't local only.
>
> kyle