Thanks, everyone! It is always nice to hear what others do, even if it
doesn't end up being what works for us.

On Thu, Jan 28, 2016 at 4:06 AM, Owen Stephens wrote:

> To share the practice from a project I work on - the Jisc Historical Texts
> platform [1] which provides searching across digitised texts from the 16th
> to 19th centuries. In this case we had the option to build the search
> application from scratch, rather than using a product such as ContentDM
> etc. I should say that all the technical work was done by K-Int [2] and
> Gooii [3], I was there to advise on metadata and user requirements, and so
> the following is based on my understanding of how the system works, and any
> errors are down to me :)
>
> There are currently three major collections within the Historical Texts
> platform, with different data sources behind each one. In general the data
> we have for each collection consists of MARC metadata records, full text in
> XML documents (either from transcription or from OCR processes) and image
> files of the pages.
>
> The platform is built using the Elasticsearch [4] (ES) indexing software
> (as with Solr this is built on top of Lucene).
>
> We structure the data we index in ES in two layers - the ‘publication’
> record, which is essentially where all the MARC metadata lives (although
> not as MARC - we transform this to an internal scheme), and the ‘page’
> records - one record per page in the item. The text content lives in the
> page record, along with links to the image files for the page. The ‘page’
> records are all what ES calls ‘child’ records of the relevant publication
> record. We make this relationship through shared IDs in the MARC records
> and the XML fulltext documents.
>
> We create a whole range of indexes from this data. Obviously field
> specific searches like title or author only search the relevant metadata
> fields. But we also have a (default) ‘search all’ option which searches
> through all the metadata and fulltext. If the user wants to search the text
> only, they check an option and we limit the search to only text from
> records of the ‘page’ type.
>
> The results the user gets initially are always the publication level
> records - so essentially your results list is a list of books. For each
> result you can view ‘matches in text’ which shows snippets of where your
> search term appears in the fulltext. You can then either click to view the
> whole book, or click the relevant page from the list of snippets. When you
> view the book, the software retrieves all the ‘page’ records for the book,
> and from the page records can retrieve the image files. When the user goes
> to the book viewer, we also carry over the search terms from their search,
> so they can see the same text snippets of where the terms appear alongside
> the book viewer - so the user can navigate to the pages which contain the
> search terms easily.
>
> For more on the ES indexing side of this, Rob Tice from Knowledge
> Integration did a talk about the use of ES in this context at the London
> Elasticsearch usergroup [5]. Unfortunately the interface itself requires a
> login, but if you want to get a feel for how this all works in the UI,
> there is also a screencast which gives an overview of the UI available [6].
>
> Best wishes,
>
> Owen
>
> 1. https://historicaltexts.jisc.ac.uk
> 2. http://www.k-int.com
> 3. http://www.gooii.com
> 4. https://www.elastic.co
> 5. http://www.k-int.com/Rob-Tice-Elastic-London-complex-modelling-of-rich-text-data-in-Elasticsearch
> 6. http://historicaltexts.jisc.ac.uk/support
>
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Telephone: 0121 288 6936
>
> On 27 Jan 2016, at 00:30, Laura Buchholz wrote:
> >
> > Hi all,
> >
> > I'm trying to understand how digital library systems work when there is a
> > need to search both metadata and item text content (plain text/full text),
> > and when the item is made up of more than one file (so, think a digitized
> > multi-page yearbook or newspaper). I'm not looking for answers to a
> > specific problem, really, just looking to know what is the current state of
> > community practice.
> >
> > In our current system (ContentDM), the "full text" of something lives in
> > the metadata record, so it is indexed and searched along with the metadata,
> > and essentially treated as if it were metadata. (Correct?). This causes
> > problems in advanced searching and muddies the relationship between what is
> > typically a descriptive metadata record and the file that is associated
> > with the record. It doesn't seem like a great model for the average digital
> > library. True? I know the answer is "it depends", but humor me... :)
> >
> > If it isn't great, and there are better models, what are they? I was taught
> > METS in school, and based on that, I'd approach the metadata in a METS or
> > METS-like fashion. But I'm unclear on the steps from having a bunch of METS
> > records that include descriptive metadata and pointers to text files of the
> > OCR (we don't, but if we did...) to indexing and providing results to
> > users. I think another way of phrasing this question might be: how is the
> > full text of a compound object (in the sense of a digitized yearbook or
> > similar) typically indexed?
> >
> > The user requirements for this situation are essentially:
> > 1. User can search for something and get a list of results. If something
> > (let's say a pamphlet) appears in results based on a hit in full text, the
> > user selects the pamphlet which opens to the file (or page of the pamphlet)
> > that contains the text that was matched. This is pretty normal and does
> > work in our current system.
> > 2. In an advanced search, a user might search for a name in the "author"
> > field and a phrase in the "full text" field, and say they want both
> > conditions to be fulfilled. In our current system, this won't provide
> > results when it should, because the full text content is in one record and
> > the author's name is in another record, so the AND condition can't be met.
> > 3. Librarians can link description metadata records (DC in our case) to
> > particular files, sometimes one to one, sometimes many to one, sometimes
> > one to many.
> >
> > If this is too unclear, let me know...
> > Thanks!
> >
> > --
> > Laura Buchholz
> > Digital Projects Librarian
> > Reed College Library
> > 503-517-7629
>

--
Laura Buchholz
Digital Projects Librarian
Reed College Library
503-517-7629
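
P.S. A minimal sketch of how the publication/page (parent/child) model Owen describes could serve requirement 2 in the quoted message (author AND full text). It only builds a request body using Elasticsearch's has_child query with inner_hits; it does not talk to a server. The child type name "page" follows Owen's description, but the field names "author" and "text", the function name, and the sample search terms are assumptions for illustration, not the Historical Texts platform's actual schema or code.

```python
import json


def author_and_fulltext_query(author, phrase):
    """Build a search body for publications (parents) whose metadata matches
    `author` AND which have at least one page (child) containing `phrase`.

    Hypothetical names: the "author" and "text" fields are placeholders.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Condition on the parent (publication) metadata.
                    {"match": {"author": author}},
                    # Condition on the children (pages); the parent is
                    # returned if any child matches.
                    {
                        "has_child": {
                            "type": "page",
                            "query": {"match_phrase": {"text": phrase}},
                            # Return the matching pages with highlighted
                            # snippets alongside each publication hit.
                            "inner_hits": {
                                "highlight": {"fields": {"text": {}}}
                            },
                        }
                    },
                ]
            }
        }
    }


body = author_and_fulltext_query("Defoe", "desert island")
print(json.dumps(body, indent=2))
```

Because the hits are parent records, the result list stays a list of books, while inner_hits carries the matching pages that a 'matches in text' view could draw on. The exact mapping syntax for declaring the parent/child relationship differs between Elasticsearch versions, so it is left out here.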