I've messed around with UIMA, and it's a nice general architecture, but don't even think about trying to use it without the Eclipse workflow GUI. I have a slight preference for GATE (http://gate.ac.uk), but there is little practical difference in functionality between the two, since UIMA has a plugin to embed GATE workflows, and GATE has a plugin to embed UIMA workflows. ("Don't cross the streams!") Both have wrappers around all sorts of third-party text-mangling and machine learning libraries. The BBC has used GATE for text mining and annotation <http://3roundstones.com/led_book/led-raimond-et-al.html> in their World Cup coverage and their Wildlife Finder collection <http://www.bbc.co.uk/ontologies/wildlife/2010-11-04.shtml>.

It's a good idea to pick a small set of documents that you know well to play with, and to test them with some of the sample workflows. The results will surprise you (both better and worse than your first guesses). Tuning and tweaking existing workflows will give you a much better feel for the tools than starting from scratch.

Text mining definitely benefits from a domain-specific word list, thesaurus, or ontology, as well as rules for specialized types of entity. Sometimes these can be purchased, though often at prohibitive cost. LAM applications often require some hand crafting; for example, older texts may contain names in formats that differ from current ones. In the legal domain, you could build rules for recognizing citations to sections of a state code in case reports, then learn associations between terms found in the vicinity of those citations (both which phrases in the code section are most commonly elided, and which terms not in the code have a high degree of linkage).

GATE has tools for comparing the results of different workflows and generating metrics; I don't know whether UIMA's are better at the moment. The documentation for GATE is of mixed quality; UIMA's documentation is more consistent. There are some good tutorial videos for GATE.
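To make the citation idea concrete, here's a minimal sketch in plain Python of the two steps: a rule that recognizes citations, and a count of the terms found near them. The citation format, sample text, and window size are all invented for illustration; in GATE you'd express the recognition rule as a JAPE grammar over annotations rather than a raw regex.

```python
import re
from collections import Counter

# Hypothetical snippet of a case report; a real project would run over a corpus.
text = ("The agency relied on Fla. Stat. 120.57 in denying the hearing. "
        "Petitioner argued that Fla. Stat. 120.57 requires an evidentiary "
        "hearing whenever material facts are disputed.")

# Rule for one (assumed) citation format: "Fla. Stat. 120.57".
cite_re = re.compile(r"Fla\. Stat\. (\d+\.\d+)")

# Collapse each citation into a single placeholder token so we can take a
# simple word window around it.
marked = cite_re.sub(lambda m: "CITE_" + m.group(1), text)
tokens = marked.split()

window = 4  # words of context on each side of a citation
vicinity = Counter()
for i, tok in enumerate(tokens):
    if tok.startswith("CITE_"):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for term in context:
            vicinity[term.strip(".,").lower()] += 1

# The most frequent vicinity terms are candidates for learned associations.
print(vicinity.most_common(3))
```

Obviously this toy version skips stopword removal, tokenization edge cases, and any actual statistics; the point is just that "rules plus vicinity counts" is a small amount of machinery, and the frameworks mostly add annotation plumbing and corpus management around it.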
Both speak Lucene; I don't think they're directly Solr-ized. Both have a lot of plugins for machine learning and for different kinds of robust parsers. You may want to read some texts on these subjects, since a lot of the docs assume a base understanding. Morgan Kaufmann has a nice book on data mining <http://www.cs.waikato.ac.nz/ml/weka/book.html> by Ian Witten et al. of WEKA <http://www.cs.waikato.ac.nz/ml/weka/index.html> fame; it's nicely structured. The first part covers topics at a fairly high level; the second part, which covers those topics in much more detail, can be skipped without breaking the flow.

Simon

On Jun 2, 2012 11:36 PM, "Wilhelmina Randtke" <[log in to unmask]> wrote:
> Has anyone out there used Apache's UIMA (Unstructured Information
> Management Architecture) to index documents or in any other way?
> What I am interested in is knowing whether any libraries are using this,
> and how they are using it. Seeing any example projects would be great. I
> also want to know how difficult it was to implement, and get a feel for the
> quality of the metadata produced.
>
> Are there any library projects which maybe aren't using this, but which are
> using coding to index documents and create metadata?
>
> -Wilhelmina Randtke