LISTSERV 16.5 - CODE4LIB Archives

> On May 28, 2023, at 5:03 PM, Magnus Berg wrote:
> 
> Hi Charles,
> 
> Is the PDF you're trying to extract text from a scanned document? If so,
> you likely can't highlight the text because it's technically an image. You
> can apply Optical Character Recognition (OCR) to rectify this. GIMP doesn't
> have OCR capabilities, though there are a few plugins floating around on
> GitHub. If you don't have the paid version of Acrobat, you can look into
> other OCR software options. Here is a list of projects
> <https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html>
> that use the Tesseract engine, many of which are simple drag-and-drop
> solutions.

And as there are many different ways to create PDFs, you can end up with some really weird results even when you *can* copy and paste the text.

I don’t remember which document I was dealing with, but each word in the file was a separate text box, so when you copied and pasted, you got the text minus any spaces between words.

I think that I ended up printing the documents, scanning them back in, and OCRing them all.

(These days I’d probably try one of the various PDF libraries and do it in software.)
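
For the one-word-per-text-box case, the software route might look roughly like the Python sketch below. It assumes you have already pulled the word boxes out with whatever PDF library you prefer. The `(x0, x1, y, text)` tuple shape, the `join_word_boxes` name, and the 2-point line tolerance are made up for illustration and are not any library's API.

```python
from typing import List, Tuple

# Hypothetical word box: (left x, right x, baseline y, text), in PDF points.
# This shape is an assumption for the sketch, not a real library's output.
WordBox = Tuple[float, float, float, str]


def join_word_boxes(boxes: List[WordBox], line_tol: float = 2.0) -> str:
    """Rebuild lines of text from per-word boxes, putting a space between words."""
    # PDF y coordinates grow upward, so a larger y means higher on the page.
    ordered = sorted(boxes, key=lambda b: -b[2])

    # Group boxes whose baselines are within line_tol of the previous box.
    lines: List[List[WordBox]] = []
    for box in ordered:
        if lines and abs(lines[-1][-1][2] - box[2]) <= line_tol:
            lines[-1].append(box)
        else:
            lines.append([box])

    # Within each line, order words left to right and join them with spaces.
    return "\n".join(
        " ".join(b[3] for b in sorted(line, key=lambda b: b[0]))
        for line in lines
    )


if __name__ == "__main__":
    sample = [
        (50.0, 80.0, 700.0, "world"),
        (10.0, 45.0, 700.5, "Hello"),
        (10.0, 40.0, 680.0, "again"),
    ]
    print(join_word_boxes(sample))
```

Multi-column layouts or rotated text would need more than this; it only handles plain left-to-right lines.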

Oh… and when you’re doing batch OCR conversion of all of your scans, make sure that you don’t tell it to overwrite the files. I once forgot to turn that setting off and didn’t realize it was also set to not save the image. The OCR was absolute crap because the images weren’t high enough contrast, so I had to spend many, many hours re-scanning everything.

Or maybe back up all of your scans before OCR. Storage is cheap these days.
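
A backup pass before the batch job could be as small as the Python sketch below, using only the standard library. The `backup_scans` name and the directory names in the usage line are placeholders, and skipping files that already exist in the backup is my assumption about what you would want, so a later run can never clobber an earlier good copy.

```python
import shutil
from pathlib import Path
from typing import List


def backup_scans(scan_dir: Path, backup_dir: Path) -> List[Path]:
    """Copy each file in scan_dir into backup_dir, never overwriting a backup."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied: List[Path] = []
    for src in sorted(scan_dir.iterdir()):
        if not src.is_file():
            continue  # skip subdirectories
        dest = backup_dir / src.name
        if dest.exists():
            continue  # an earlier backup exists; leave it untouched
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied
```

Something like `backup_scans(Path("scans"), Path("scans-backup"))` before kicking off the OCR run would do it, with both paths swapped for your own.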

-Joe