LISTSERV 16.5 - CODE4LIB Archives

Thanks, everyone! It is always nice to hear what others do, even if it
doesn't end up being what works for us.

On Thu, Jan 28, 2016 at 4:06 AM, Owen Stephens <[log in to unmask]> wrote:

> To share the practice from a project I work on - the Jisc Historical Texts
> platform[1] which provides searching across digitised texts from the 16th
> to 19th centuries. In this case we had the option to build the search
> application from scratch, rather than using a product such as ContentDM
> etc. I should say that all the technical work was done by K-Int [2] and
> Gooii [3], I was there to advise on metadata and user requirements, and so
> the following is based on my understanding of how the system works, and any
> errors are down to me :)
>
> There are currently three major collections within the Historical Texts
> platform, with different data sources behind each one. In general the data
> we have for each collection consists of MARC metadata records, full text in
> XML documents (either from transcription or from OCR processes) and image
> files of the pages.
>
> The platform is built using the Elasticsearch [4] (ES) indexing software
> (which, like Solr, is built on top of Lucene).
>
> We structure the data we index in ES in two layers - the ‘publication’
> record, which is essentially where all the MARC metadata lives (although
> not as MARC - we transform this to an internal scheme), and the ‘page’
> records - one record per page in the item. The text content lives in the
> page record, along with links to the image files for the page. The ‘page’
> records are all what ES calls ‘child’ records of the relevant publication
> record. We make this relationship through shared IDs in the MARC records
> and the XML fulltext documents.
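
The two-layer structure described above can be sketched in plain Python. This is only an illustration of the parent/child idea, not the platform's code and not the Elasticsearch API: the class names, field names, and the `link_pages` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    page_id: str
    publication_id: str  # shared ID that ties the page to its parent record
    text: str            # transcribed or OCR text for this page
    image_url: str       # link to the page image


@dataclass
class Publication:
    publication_id: str
    title: str           # hypothetical fields derived from the MARC record
    author: str
    pages: list = field(default_factory=list)


def link_pages(publications, pages):
    """Attach each page to the publication sharing its ID; return unmatched pages."""
    by_id = {pub.publication_id: pub for pub in publications}
    orphans = []
    for page in pages:
        parent = by_id.get(page.publication_id)
        if parent is None:
            orphans.append(page)  # no metadata record carries this ID
        else:
            parent.pages.append(page)
    return orphans
```

Pages whose ID matches no publication are returned rather than dropped, so a mismatch between the metadata and the full-text files is visible at load time.
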
>
> We create a whole range of indexes from this data. Obviously field
> specific searches like title or author only search the relevant metadata
> fields. But we also have a (default) ‘search all’ option which searches
> through all the metadata and fulltext. If the user wants to search the text
> only, they check an option and we limit the search to only text from
> records of the ‘page’ type.
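
One way to express the ‘search all’ versus text-only choice is as two query bodies in the Elasticsearch query DSL, built here as plain Python dictionaries. This is a sketch under assumptions: the field names (`title`, `author`, `subject`, `text`) are hypothetical, and using `has_child` so that a text match returns the parent publication is one possible design, not necessarily how the platform does it.

```python
# Hypothetical metadata field names on the 'publication' records.
METADATA_FIELDS = ["title", "author", "subject"]


def text_only_query(terms):
    """Match publications that have at least one 'page' child containing the terms."""
    return {
        "has_child": {
            "type": "page",
            "query": {"match": {"text": terms}},
        }
    }


def search_all_query(terms):
    """Match on publication metadata OR on the text of any child page."""
    return {
        "bool": {
            "should": [
                {"multi_match": {"query": terms, "fields": METADATA_FIELDS}},
                text_only_query(terms),
            ],
            "minimum_should_match": 1,
        }
    }


def build_query(terms, text_only=False):
    """Pick the query body according to the user's 'text only' checkbox."""
    return text_only_query(terms) if text_only else search_all_query(terms)
```

Either body would be sent as the `query` part of a search request; both return publication-level hits, which matches the result list described in the next paragraph.
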
>
> The results the user gets initially are always the publication level
> records - so essentially your results list is a list of books. For each
> result you can view ‘matches in text’ which shows snippets of where your
> search term appears in the fulltext. You can then either click to view the
> whole book, or click the relevant page from the list of snippets. When you
> view the book, the software retrieves all the ‘page’ records for the book,
> and from the page records can retrieve the image files. When the user goes
> to the book viewer, we also carry over the search terms from their search,
> so they can see the same text snippets of where the terms appear alongside
> the book viewer - so the user can navigate to the pages which contain the
> search terms easily.
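
The ‘matches in text’ list can be illustrated with a small stand-alone function. In a real deployment the search engine's own highlighting would normally produce the snippets; the version below is a simplified stand-in, and the function name and the `context` parameter are assumptions for the example.

```python
import re


def snippets_for_book(pages, term, context=30):
    """Return (page_number, snippet) pairs for pages whose text contains term.

    pages is an iterable of (page_number, text) tuples; the match is
    case-insensitive and only the first occurrence on each page is used.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    results = []
    for number, text in pages:
        match = pattern.search(text)
        if match is None:
            continue  # this page does not mention the term
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        results.append((number, text[start:end].strip()))
    return results
```

Because each snippet carries its page number, the same list can drive both the results view and the navigation panel beside the book viewer.
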
>
> For more on the ES indexing side of this, Rob Tice from Knowledge
> Integration did a talk about the use of ES in this context at the London
> Elasticsearch usergroup [5]. Unfortunately the interface itself requires a
> login, but if you want to get a feel for how this all works in the UI,
> there is also a screencast which gives an overview of the UI available [6].
>
> Best wishes,
>
> Owen
>
> 1. https://historicaltexts.jisc.ac.uk
> 2. http://www.k-int.com
> 3. http://www.gooii.com
> 4. https://www.elastic.co
> 5.
> http://www.k-int.com/Rob-Tice-Elastic-London-complex-modelling-of-rich-text-data-in-Elasticsearch
> 6. http://historicaltexts.jisc.ac.uk/support
>
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Email: [log in to unmask]
> Telephone: 0121 288 6936
>
> > On 27 Jan 2016, at 00:30, Laura Buchholz <[log in to unmask]> wrote:
> >
> > Hi all,
> >
> > I'm trying to understand how digital library systems work when there is a
> > need to search both metadata and item text content (plain text/full text),
> > and when the item is made up of more than one file (so, think a digitized
> > multi-page yearbook or newspaper). I'm not looking for answers to a
> > specific problem, really, just looking to know what is the current state
> > of community practice.
> >
> > In our current system (ContentDM), the "full text" of something lives in
> > the metadata record, so it is indexed and searched along with the
> > metadata, and essentially treated as if it were metadata. (Correct?) This
> > causes problems in advanced searching and muddies the relationship between
> > what is typically a descriptive metadata record and the file that is
> > associated with the record. It doesn't seem like a great model for the
> > average digital library. True? I know the answer is "it depends", but
> > humor me... :)
> >
> > If it isn't great, and there are better models, what are they? I was
> > taught METS in school, and based on that, I'd approach the metadata in a
> > METS or METS-like fashion. But I'm unclear on the steps from having a
> > bunch of METS records that include descriptive metadata and pointers to
> > text files of the OCR (we don't, but if we did...) to indexing and
> > providing results to users. I think another way of phrasing this question
> > might be: how is the full text of a compound object (in the sense of a
> > digitized yearbook or similar) typically indexed?
> >
> > The user requirements for this situation are essentially:
> > 1. User can search for something and get a list of results. If something
> > (let's say a pamphlet) appears in results based on a hit in full text,
> > the user selects the pamphlet which opens to the file (or page of the
> > pamphlet) that contains the text that was matched. This is pretty normal
> > and does work in our current system.
> > 2. In an advanced search, a user might search for a name in the "author"
> > field and a phrase in the "full text" field, and say they want both
> > conditions to be fulfilled. In our current system, this won't provide
> > results when it should, because the full text content is in one record
> > and the author's name is in another record, so the AND condition can't
> > be met.
> > 3. Librarians can link descriptive metadata records (DC in our case) to
> > particular files, sometimes one to one, sometimes many to one, sometimes
> > one to many.
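
The failure in requirement 2 can be reproduced with a few lines of Python. This is a toy model, not ContentDM's behaviour in detail: the record layout, the `object_id` key, and both function names are assumptions. It only shows why an AND evaluated per record finds nothing, while an AND evaluated per compound object succeeds.

```python
from collections import defaultdict


def and_search_per_record(records, author, phrase):
    """AND both conditions within a single record (the failing case)."""
    return [
        r["id"]
        for r in records
        if author.lower() in r.get("author", "").lower()
        and phrase.lower() in r.get("text", "").lower()
    ]


def and_search_per_object(records, author, phrase):
    """AND the conditions across all records belonging to the same object."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["object_id"]].append(r)
    hits = []
    for object_id, group in grouped.items():
        has_author = any(author.lower() in r.get("author", "").lower() for r in group)
        has_phrase = any(phrase.lower() in r.get("text", "").lower() for r in group)
        if has_author and has_phrase:
            hits.append(object_id)
    return hits


# Author lives in the descriptive record, text lives in the page record.
records = [
    {"id": "d1", "object_id": "pamphlet-1", "author": "Jane Doe"},
    {"id": "p1", "object_id": "pamphlet-1", "text": "Annual report of the society"},
    {"id": "d2", "object_id": "pamphlet-2", "author": "John Roe"},
    {"id": "p2", "object_id": "pamphlet-2", "text": "Annual report of the club"},
]
```

With these records, `and_search_per_record(records, "doe", "annual report")` finds nothing, while `and_search_per_object` with the same arguments finds `pamphlet-1`. Grouping by the shared object ID plays the role that a parent/child or join relationship plays in a search index.
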
> >
> > If this is too unclear, let me know...
> > Thanks!
> >
> > --
> > Laura Buchholz
> > Digital Projects Librarian
> > Reed College Library
> > 503-517-7629
> > [log in to unmask]
>



-- 
Laura Buchholz
Digital Projects Librarian
Reed College Library
503-517-7629
[log in to unmask]