Registration is now open for the next Open Preservation Foundation (OPF) webinar.
OCR improvements through machine learning methods and the impact on the long term preservation of digitized content
Thursday 11 November at 15:00 CET | 14:00 GMT
Roxana Maurer and Ralph Marschall, Bibliothèque nationale du Luxembourg
The National Library of Luxembourg (Bibliothèque nationale du Luxembourg) has been digitizing its national heritage collections since the early 2000s. After a few years of image-only digitization projects, the library switched to a METS/ALTO output with multiple manifestations, gaining considerable expertise over the years in creating digitized content enriched with both Optical Character Recognition (OCR) and Optical Layout Recognition (OLR). In 2020 the eLmA (eLuxemburgensia meets AI) project was born, with the aim of correcting the full text (ALTO files) of more than 6,000,000 articles on the eluxemburgensia.lu site. The OCR quality of these articles varies for one or more reasons: the language in which the text is written (German and French and, to a lesser extent, Luxembourgish and English), the typography used (Gothic or Latin characters), or the quality of the digitization. This presentation will take a more in-depth look at the eLmA project, as well as its impact on the digital preservation of METS/ALTO content.