I don't think I can answer your question, but we have a similar problem.
I'm not sure about all OCR programs, but the version of Tesseract I've seen
in Islandora creates two files, one is the .txt file you would expect and
the other is an hOCR file with very interesting mark up linking words in
the transcript to coordinates on their associated jpg or tiff. For
manuscript materials, we have human-generated transcripts that can be
swapped in Islandora with the machine generated OCR, but there's no way to
easily map the words onto the image since editing the hOCR by hand is only
useful if you have a really good sense of where the coordinates fit on your
image.
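For anyone who hasn't looked inside an hOCR file: each word sits in a span whose title attribute carries a pixel bounding box (left, top, right, bottom). Here is a rough sketch of pulling those pairs out with nothing but the Python standard library. The sample markup and the hocr_words helper are my own illustration, not anything Islandora ships, and real Tesseract output may carry extra attributes that this ignores.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collects (word, (x0, y0, x1, y1)) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word span currently open
        self._text = []     # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr):
    """Hypothetical helper: return every word and its box from an hOCR string."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample in the shape Tesseract produces (values are made up).
SAMPLE = """<span class='ocr_line' title='bbox 36 92 580 116'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Dear</span>
<span class='ocrx_word' id='word_1_2' title='bbox 109 92 175 116; x_wconf 93'>Sir</span>
</span>"""

for word, box in hocr_words(SAMPLE):
    print(word, box)
```

Reading the boxes is the easy half. Getting a human transcript to line up with them is the part I don't have a good answer for.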
There are programs out there to get better coordinates for human generated
transcripts and http://www.shared-canvas.org/ seems to be one of the better
tools available for that purpose, but I haven't found DM, T-PEN, Scripto,
etc. easy to integrate across really large collections. But that kind of
transcription program lets users match words to their locations on pages.
The most rational public transcription program out there, IMO, is the DIY
History site at the University of Iowa (http://diyhistory.lib.uiowa.edu/),
but I don't see how those transcripts can get mapped onto images.
There are some uiowa.edu people on this listserv. I'm curious to know how
they make their images and transcripts speak to each other.
On Thu, Jan 16, 2014 at 11:21 AM, Padraic Stack <[log in to unmask]> wrote:
> Hi folks,
> I have a number of typescript / manuscript images on which it is quite
> time consuming to run OCR. (Or more accurately it is quite time consuming
> to correct the OCR).
> For some of these I have text files containing accurate transcriptions. In
> other cases I have TEI files with these transcriptions.
> What is a straightforward way to combine the text with overlaid images to
> create searchable pdfs?
> I know my way around the command line and can follow tutorials but I'm not
> a programmer so the more straightforward the solution the better.
> I have had a go with pdftkBuilder and a result can be seen here [
> https://www.dropbox.com/s/fxp6rnt24043aez/result3.pdf] but there are a
> number of problems:
> 1. it involves 'printing' the text to pdf and 'stamping' the image over
> it. The result entails a margin unless the image matches a standard paper
> size.
> 2. the underlying text doesn't match up to the image. I would love if it
> could but can live with it if it can't.
> 3. it is very time consuming - ideally I would like a solution that could
> be scripted and left to run.
> Any advice would be greatly appreciated.
> The best I have
> Padraic Stack | Digital Humanities Support Officer | NUI Maynooth |
> [log in to unmask] |Phone: Mon: 01 474 7187 Tue - Fri: 01 474 7197