On Jul 18, 2007, at 10:04 PM, Eric Hellman wrote:

> Anyway, almost all parsers rely on a set of heuristics. I have not
> seen any parsers that do a good job of managing their heuristics in a
> scalable way. A successful open-source attack on this problem would
> have the following characteristics:
> 1. able to efficiently handle and manage large numbers of parsing and
> scoring heuristics
> 2. easy for contributors to add parsing and scoring heuristics
> 3. able to use contextual information (is the citation from a physics
> article or from a history monograph?) in application and scoring of
> heuristics

One of the more problematic things is that we don't always get the contextual information about where a citation occurred -- in fact, it's quite rare to get it.

Also, even in (many) scholarly journals, editorial consistency is almost unbelievably poor -- a lot of the time, the style rules just aren't followed. Punctuation gets dropped, journal names (especially abbreviations!) are misspelled, and so on. Rule-based and heuristic systems are always going to have trouble in those cases.

In a lot of ways, I think the problem is fundamentally similar to identifying parts of speech in natural language (which has many of the same ambiguities) -- and the same techniques that succeed there will probably yield the most robust results for citation parsing.

-n