I've messed around with UIMA, and it's a nice general architecture, but
don't even think about trying to use it without the Eclipse workflow GUI.
I have a slight preference for GATE (http://gate.ac.uk), but there is
effectively no difference in functionality between the two, since UIMA has
a plugin to embed GATE workflows, and GATE has a plugin to embed UIMA
workflows ("Don't cross the streams!"). Both have wrappers around all
sorts of third-party text-mangling and machine learning libraries.
The BBC has used GATE to do text mining and annotations
(http://3roundstones.com/led_book/led-raimond-et-al.html) for their World
Cup coverage and their Wildlife Finder collection
(http://www.bbc.co.uk/ontologies/wildlife/2010-11-04.shtml).
It's a good idea to pick a small set of documents that you know well to
play with, and to test them out with some of the sample workflows. The
results will surprise you in both directions (better and worse than your
first guesses). Tuning and tweaking an existing workflow will give you a
much better feel than starting from scratch.
Text mining definitely benefits from a domain-specific word-list,
thesaurus, or ontology, as well as rules for specialized types of entity.
Sometimes these can be purchased, though often at a prohibitive cost. LAM
(library, archive, and museum) applications often require some hand
crafting; for example, older texts may have names in formats that differ
from current ones.
In the legal domain, you could build rules for recognizing citations to
sections of a state code in case reports, then learn associations between
terms found in the vicinity of those citations (both finding which phrases
in the code section are most commonly elided, and which terms not in the
code have a high degree of linkage).
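To make that first step concrete, here's a minimal sketch of what a citation-recognition rule might look like. I've written it as a Python regex rather than in GATE's JAPE grammar language just to keep it readable, and the citation formats (e.g. "Cal. Penal Code § 187") are invented for illustration; a real rule set would need the actual citation conventions of your jurisdiction:

```python
import re

# Hypothetical pattern for statute citations like "Cal. Penal Code § 187"
# or "Fla. Stat. § 768.28". The abbreviation styles here are illustrative,
# not a complete inventory of state citation formats.
CITE_RE = re.compile(
    r"\b(?P<state>[A-Z][a-z]+\.)\s+"                       # e.g. "Cal."
    r"(?P<code>(?:[A-Z][a-z]+\.?\s+)*(?:Code|Stat\.))\s+"  # e.g. "Penal Code"
    r"(?:§|Sec\.)\s*"                                      # section marker
    r"(?P<section>\d+(?:\.\d+)*)"                          # e.g. "768.28"
)

def find_citations(text):
    """Return (state, code, section) tuples for each citation found."""
    return [(m.group("state"), m.group("code"), m.group("section"))
            for m in CITE_RE.finditer(text)]
```

In GATE you'd express the same idea as a JAPE rule over token annotations, and then the "terms in the vicinity" step becomes a matter of collecting annotations within a window around each citation match.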
GATE has tools for comparing results of different workflows and generating
metrics; I don't know if UIMA's are better at the moment.
The documentation for GATE is of mixed quality; UIMA documentation is more
consistent. There are some good tutorial videos for GATE.
Both speak Lucene; I don't think either is directly Solr-ized.
Both have a lot of plugins for machine learning, and for different kinds
of robust parsers. You may want to read some texts on these subjects, as a
lot of the docs assume a base understanding. Morgan Kaufmann has a nice
book on data mining (http://www.cs.waikato.ac.nz/ml/weka/book.html) by Ian
Witten et al. of WEKA (http://www.cs.waikato.ac.nz/ml/weka/index.html)
fame; it's nicely structured. The first part covers topics at a fairly
high level; the second part, which covers those topics in much more
detail, can be skipped without breaking the flow.
Simon
On Jun 2, 2012 11:36 PM, "Wilhelmina Randtke" <[log in to unmask]> wrote:
> Has anyone out there used Apache's UIMA (Unstructured Information
> Management Architecture) to index documents or in any other way?
> What I am interested in is knowing whether any libraries are using this,
> and how they are using it. Seeing any example projects would be great. I
> also want to know how difficult it was to implement, and get a feel for the
> quality of the metadata produced.
>
> Are there any library projects which maybe aren't using this, but which are
> using coding to index documents and create metadata?
>
> -Wilhelmina Randtke
>