You could try to programmatically match each hOCR text block to the
corresponding fragment of the transcript, based on textual similarity,
and then replace the hOCR text with the "real" text. Position on the page
is monotonic with offset in the transcript, i.e. (X1,Y1) < (X2,Y2) in
reading order => text1 comes before text2. That ordering constraint makes
this a sequence alignment problem, so dynamic programming might be a good fit.
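
To make the idea concrete, here is a rough sketch in Python, not a tested
pipeline. It assumes the OCR words have already been pulled out of the hOCR
in reading order and the transcript has been split into words. The names
align and correct, the gap penalty, and the 0.5 similarity threshold are all
illustrative choices of mine. The alignment is a plain Needleman-Wunsch style
table, scored with difflib from the standard library.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity between two words, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(ocr_words, transcript_words, gap=-0.25):
    """Monotone alignment of OCR words to transcript words.

    Returns a list with one entry per OCR word: the index of the matched
    transcript word, or None if the OCR word was left unmatched.
    The gap penalty is an arbitrary illustrative value.
    """
    n, m = len(ocr_words), len(transcript_words)
    # score[i][j]: best score aligning the first i OCR words
    # with the first j transcript words.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, "skip_ocr"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, "skip_text"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = similarity(ocr_words[i - 1], transcript_words[j - 1])
            candidates = [
                # Map similarity 0..1 to a match score of -1..1.
                (score[i - 1][j - 1] + 2 * sim - 1, "match"),
                (score[i - 1][j] + gap, "skip_ocr"),
                (score[i][j - 1] + gap, "skip_text"),
            ]
            # On ties, max() keeps the first candidate, so "match" wins.
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Walk back from the bottom-right corner to recover the pairing.
    pairs = [None] * n
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif move == "skip_ocr":
            i -= 1
        else:
            j -= 1
    return pairs


def correct(ocr_words, transcript_words, min_sim=0.5):
    """Replace each OCR word with its aligned transcript word.

    Unmatched words, and matches below min_sim, keep the OCR text.
    """
    result = []
    for word, j in zip(ocr_words, align(ocr_words, transcript_words)):
        if j is not None and similarity(word, transcript_words[j]) >= min_sim:
            result.append(transcript_words[j])
        else:
            result.append(word)
    return result


if __name__ == "__main__":
    ocr = ["Tbe", "qu1ck", "brovvn", "fox"]
    transcript = ["The", "quick", "brown", "fox", "jumps"]
    print(correct(ocr, transcript))
```

The table is O(n*m) in the number of words, which should be fine for a page
at a time. In a real run you would keep each word's bbox from the hOCR and
write the corrected text back into the same element, so the coordinates are
untouched.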
On Fri, Jan 17, 2014 at 7:41 PM, Daron Dierkes <[log in to unmask]> wrote:
> But Raffaele, how do you generate the hOCR in the first place if you're
> using human-generated transcripts and not OCR? Hand coding each page would
> take forever.
> On Fri, Jan 17, 2014 at 3:24 AM, raffaele messuti <
> [log in to unmask]> wrote:
> > Padraic Stack wrote:
> > > What is a straightforward way to combine the text with overlaid images
> > > to create searchable pdfs?
> > If you have the transcription in hOCR format, the tool you need is
> > hocr2pdf.
> > I never tried it for PDFs; years ago I made some DjVu files following
> > this tutorial:
> >  http://en.wikipedia.org/wiki/HOCR
> >  http://manpages.ubuntu.com/manpages/lucid/man1/hocr2pdf.1.html
> >  https://philikon.wordpress.com/2009/07/23/digitizing-books-to-djvu/
> > ciao.
> > --
> > raffaele