On 7/20/07, Eric Hellman wrote:
> Have people been able to do a decent job of identifying parts of
> speech in natural language?

I think trying to import broad NLP findings into our narrower problem of
citation parsing is not likely to be fruitful.  On the other hand, stealing
their tools seems perfectly reasonable, and this group seems to be familiar
with several.

About 8 years ago, I made use of a parser generator called ANTLR (ANother Tool
For Language Recognition) that takes an EBNF grammar spec and builds a
parser.  Since then, developers have improved the tool with new versions
and even a GUI development environment.  The languages it is used on in
practice all seem to be well-defined programming languages, but if you
wanted to roll your own (new) parser for citations, ANTLR might help.
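
To make "grammar spec in, parser out" concrete, here is a rough hand-written
sketch of the kind of parser a generator would build from a grammar.  The toy
grammar in the comment, the class name CitationParser, and the citation layout
are all made up for illustration; this is not ANTLR syntax or ANTLR output,
just plain Python with one method per grammar rule.

```python
import re

# Hypothetical toy grammar (EBNF-style), invented for this sketch:
#   citation = author , { ";" , author } , "(" , year , ")" , title , "." ;
#   author   = surname , "," , initials ;
#   year     = 4 * digit ;


class ParseError(Exception):
    pass


class CitationParser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _match(self, pattern, what):
        """Consume one regex group at the current position, skipping whitespace."""
        m = re.compile(r"\s*" + pattern).match(self.text, self.pos)
        if m is None:
            raise ParseError(f"expected {what} at offset {self.pos}")
        self.pos = m.end()
        return m.group(1)

    def author(self):
        surname = self._match(r"([A-Z][A-Za-z'-]+)", "surname")
        self._match(r"(,)", "','")
        initials = self._match(r"((?:[A-Z]\.\s?)+)", "initials")
        return {"surname": surname, "initials": initials.strip()}

    def citation(self):
        authors = [self.author()]
        # { ";" , author } : keep reading authors while a ';' follows
        while re.compile(r"\s*;").match(self.text, self.pos):
            self._match(r"(;)", "';'")
            authors.append(self.author())
        self._match(r"(\()", "'('")
        year = int(self._match(r"(\d{4})", "year"))
        self._match(r"(\))", "')'")
        title = self._match(r"([^.]+)", "title")
        self._match(r"(\.)", "'.'")
        return {"authors": authors, "year": year, "title": title.strip()}


if __name__ == "__main__":
    sample = "Hellman, E.; Smith, J. Q. (2007) Parsing citations with grammars."
    print(CitationParser(sample).citation())
```

The appeal of a generator is that you would maintain only the few grammar
lines in the comment, and code along the lines of the class would be
produced for you.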

I think ANTLR satisfies Eric's first two criteria (flexibility and ease
of extension) and might be used to satisfy the third (broad contextual
info).  It can now back out of a rule descent and try other alternatives
in the tree if the static grammar fails.  The license is BSD.  Notably, it
supports Unicode, and the new version does NOT require a pre-specified
number of look-ahead tokens.  And the userbase is fairly broad for such a
specialized tool.
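
For anyone unfamiliar with the "back out and try another alternative" idea,
here is a minimal sketch of it in plain Python.  It is not ANTLR's mechanism
or API; the helper names (token, seq, mapped, first_of) and the two citation
layouts are invented for illustration.  Each parser takes (text, pos) and
returns (value, new_pos) or None, so a failed alternative leaves the start
position untouched and the next alternative simply retries from there.

```python
import re


def token(pattern):
    """Build a parser that matches one regex after optional whitespace."""
    rx = re.compile(r"\s*(" + pattern + r")")

    def parse(text, pos):
        m = rx.match(text, pos)
        return (m.group(1), m.end()) if m else None

    return parse


def seq(*parsers):
    """All parsers must succeed in order, else the whole sequence fails."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None  # the caller still holds the original pos
            value, pos = result
            values.append(value)
        return values, pos

    return parse


def mapped(parser, fn):
    """Post-process a successful parse result with fn."""
    def parse(text, pos):
        result = parser(text, pos)
        if result is None:
            return None
        value, pos = result
        return fn(value), pos

    return parse


def first_of(*alternatives):
    """Ordered choice: each alternative restarts from the same position."""
    def parse(text, pos):
        for alt in alternatives:
            result = alt(text, pos)  # retrying at pos is the "backing out"
            if result is not None:
                return result
        return None

    return parse


surname = token(r"[A-Z][a-z]+")
year = token(r"\d{4}")
title = token(r"[^.]+")
dot = token(r"\.")

# Two hypothetical layouts: "Author (Year) Title." and "Author. Title. Year."
year_first = mapped(
    seq(surname, token(r"\("), year, token(r"\)"), title, dot),
    lambda v: {"author": v[0], "year": v[2], "title": v[4].strip()},
)
year_last = mapped(
    seq(surname, dot, title, dot, year, dot),
    lambda v: {"author": v[0], "year": v[4], "title": v[2].strip()},
)
citation = first_of(year_first, year_last)


def parse_citation(text):
    result = citation(text, 0)
    return result[0] if result else None
```

With this, parse_citation("Hellman. Parsing citations. 2007.") first tries
the year-first layout, fails at the missing "(", and then succeeds with the
year-last layout from the same starting point.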

This might be considered an incongruous solution inasmuch as you are asking
for parser characteristics and I am recommending a parser generator that
*could* produce the kind of parser you want.  But I think that is
appropriate for the task described.