Hadoop Driven Digital Preservation
Austrian National Library, Vienna
There is just one week left to sign up for our next hackathon: https://hadoop-driven-digital-preservation.eventbrite.co.uk.
This hackathon will focus on using Hadoop in two digital preservation scenarios:
Web-Archiving: File Format Identification/Characterisation
A web archive usually contains a wide range of different file types. From a curatorial perspective the question is: Do I need to be worried? Is there a risk that means I should take adequate measures right now? The first step is to reliably identify and characterise the content of a web archive. Linguistic analysis can help categorise the “text/plain” content into more precise content types. A detailed analysis of “application/pdf” content can help cluster properties of the files and identify characteristics that are of special interest. Using the Hadoop framework and prepared sample projects for processing web archive content, we will be able to run any kind of processing or analysis that we come up with at large scale on a Hadoop cluster. Together we will discuss what the requirements are to enable this, and find out what still needs to be optimised.
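To make the identification step concrete, here is a minimal sketch of the map and reduce phases for counting content types across archived records. It is not one of the prepared sample projects: the signature table and the function names (`identify`, `map_phase`, `reduce_phase`, `count_formats`) are assumptions made for illustration, and a real job would rely on a dedicated identification tool and run the two phases on the cluster rather than in one process.

```python
from itertools import groupby

# Hypothetical, deliberately tiny signature table (magic bytes -> MIME type).
# Real identification tools use far larger, curated signature sets.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def identify(payload: bytes) -> str:
    """Guess a MIME type from the leading bytes of a record payload."""
    for magic, mime in SIGNATURES:
        if payload.startswith(magic):
            return mime
    # Fallback: anything that decodes as UTF-8 is treated as plain text.
    try:
        payload.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def map_phase(records):
    """Map step: emit one (mime_type, 1) pair per record."""
    for payload in records:
        yield identify(payload), 1


def reduce_phase(pairs):
    """Reduce step: sum the counts for each MIME type.

    Sorting stands in for the shuffle/sort that Hadoop performs
    between the map and reduce phases.
    """
    for mime, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield mime, sum(count for _, count in group)


def count_formats(records):
    """Run both phases locally and return {mime_type: count}."""
    return dict(reduce_phase(map_phase(records)))
```

Calling `count_formats` on a list of byte strings returns a dictionary of content types and their frequencies, which is the kind of overview a curator would need before deciding whether any format is at risk.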
Digital Books: Quality Assurance, Text Mining (OCR Quality)
The digital objects of the Austrian National Library's digital book collection consist of the aggregated book object with technical and descriptive metadata, and the images, layout and text content for the book pages. Due to the massive scale of digitisation in a relatively short time period and the fact that the digitised books are from the 18th century and older, there are different types of quality issues. Using the Hadoop framework, we provide the means to perform any kind of large-scale book processing on a book or page level. Linguistic analysis and language detection, for example, can help us determine the quality of the OCR (Optical Character Recognition), and image analysis can help detect technical or content-related issues with the book page images.
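As an illustration of how linguistic analysis might indicate OCR quality, the sketch below scores a page by the share of its words that appear in a word list. This is a simple heuristic chosen for the example, not the method used at the library; the function name `ocr_quality` and the `lexicon` parameter are hypothetical, and historical spellings in older books would need a period-appropriate word list.

```python
import re


def ocr_quality(text, lexicon):
    """Return the fraction of word tokens in `text` found in `lexicon`.

    `lexicon` is an assumed set of lower-case known words. A low score
    suggests many misrecognised characters on the page.
    """
    # Tokens are runs of letters; digits and punctuation split words,
    # so an OCR error such as "qu1ck" yields the fragments "qu" and "ck".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)
```

Because each page is scored independently, a function like this fits naturally into a map step that emits a page identifier and its score, leaving aggregation per book to the reduce step.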
Take a look at the full agenda here: http://wiki.opf-labs.org/display/SP/Agenda+-+Hadoop+Driven+Digital+Preservation.
Highlights of this hackathon include:
* Talks from our guest speaker, Jimmy Lin, University of Maryland
* Taking part in our competition for the best idea and visualisation
* A chance to gain hands-on experience carrying out identification and characterisation experiments
* Practitioners and developers working together to address digital preservation challenges
* The opportunity to share experiences and knowledge about implementing Hadoop
Who should attend?
Practitioners (digital librarians and archivists, digital curators, repository managers, or anyone responsible for managing digital collections): You will learn how Hadoop might fit your organisation and how to write requirements to guide development, and you will gain hands-on experience using the tools yourself and finding out how they work. To get the most out of this training course you will ideally have some knowledge or experience of digital preservation.
Developers of all experience levels can participate, from writing your first Hadoop jobs to working on scalable solutions for issues identified in the scenarios.
We hope to see you in Vienna!
Membership and Communications Manager