LISTSERV 16.5 - CODE4LIB Archives

Extracting text is a hit-and-miss affair. As indicated, scanned PDFs have
no text to extract and require OCR.

The remaining PDFs have a glyph layer, which software resolves to text via
the ToUnicode mappings in the PDF. These are font-specific and depend on
the font's cmap table being accurate and complete. PDF writers also often
subset fonts, and a poor subsetting tool can make a hash of it.
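
One rough way to spot a PDF whose mappings have gone wrong is to look at
the extracted text itself. The sketch below is only a heuristic of my own,
not something any particular PDF tool does: it assumes that broken mappings
tend to show up as U+FFFD replacement characters, private-use code points,
or stray control characters. The function name and the idea of comparing
the ratio against a threshold are illustrative assumptions.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like
    the result of a failed glyph-to-Unicode mapping (heuristic only)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = 0
    for c in chars:
        category = unicodedata.category(c)
        # U+FFFD is the replacement character; "Co" is private use,
        # "Cn" is unassigned, "Cc" is control characters.
        if c == "\ufffd" or category in ("Co", "Cn", "Cc"):
            bad += 1
    return bad / len(chars)
```

A page that scores well above zero is probably a candidate for post-editing
or OCR; where to draw the line is something you would have to tune against
your own documents.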

For a PDF written with the right software and using the right fonts, text
extraction is easy. Some PDFs' text will need a bit of post-editing; for
others ... there is OCR.

In terms of software, excluding OCR, most open source PDF readers should be
able to extract the text.
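
As one possible example, here is a minimal sketch that shells out to
pdftotext from Poppler, assuming it is installed and on your PATH. The
helper names are mine; the -layout flag and the "-" argument (write to
standard output) are pdftotext options. If the result comes back empty,
the PDF is likely a scan and needs OCR instead.

```python
import shutil
import subprocess
from typing import List, Optional


def pdftotext_command(pdf_path: str, keep_layout: bool = True) -> List[str]:
    """Build the pdftotext command line; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str) -> Optional[str]:
    """Return the extracted text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```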

On Mon, 29 May 2023, 07:55 Joe Hourclé, <[log in to unmask]> wrote:

> > On May 28, 2023, at 5:03 PM, Magnus Berg <[log in to unmask]> wrote:
> >
> >  Hi Charles,
> >
> > Is the PDF you're trying to extract text from a scanned document? If so,
> > you likely can't highlight the text because it's technically an image. You
> > can apply Optical Character Recognition (OCR) to rectify this. GIMP doesn't
> > have OCR capabilities, though there are a few plugins floating around on
> > GitHub. If you don't have the paid version of Acrobat, you can look into
> > other OCR software options. Here is a list of projects
> > <https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html>
> > that use the Tesseract engine, many of which are simple drag and drop
> > solutions.
>
> And as there are many different ways to create PDFs, you can end up with
> some really weird results even when you *can* copy and paste the text.
>
> I don’t remember what it was that I was dealing with, but each word in the
> file was a separate text box… so when you copied and pasted, you got the
> text, but minus any spaces between words.
>
> I think that I ended up printing the documents, scanning them back in, and
> OCRing it all.
>
> (I’d probably try to go through some of the various PDF libraries to try
> doing it with software these days)
>
> Oh… and when you’re doing batch OCR conversion of all of your scans, make
> sure that you don’t tell it to overwrite the files.  I accidentally missed
> turning that setting off once, and I didn’t realize it was also set to not
> save the image, and the OCR was absolute crap as the images weren’t high
> enough contrast, and I had to spend many, many hours re-scanning everything.
>
> Or maybe back up all of your scans before OCR.  Storage is cheap these
> days.
>
> -Joe
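
On the "every word is its own text box" problem: if a PDF library can give
you each word with its position, the spaces can be put back by grouping
boxes into lines and joining them in reading order. This is a minimal
sketch under assumptions of my own: the input is a list of hypothetical
(x, y, text) tuples, y grows downward, and boxes within line_tolerance
units of each other vertically belong to the same line.

```python
def join_word_boxes(boxes, line_tolerance=2.0):
    """Rebuild text from (x, y, text) word boxes, one space between words."""
    lines = []  # each entry: (y of first box on the line, [(x, text), ...])
    for x, y, text in sorted(boxes, key=lambda b: (b[1], b[0])):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))  # same line as the previous box
        else:
            lines.append((y, [(x, text)]))  # start a new line
    # Within a line, order words left to right before joining.
    return "\n".join(
        " ".join(text for _, text in sorted(words)) for _, words in lines
    )
```

Native PDF coordinates usually have y growing upward, so depending on the
library you may need to flip the sort order for lines.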