On Jul 18, 2007, at 10:04 PM, Eric Hellman wrote:

> Anyway, almost all parsers rely on a set of heuristics. I have not
> seen any parsers that do a good job of managing their heuristics in a
> scalable way. A successful open-source attack on this problem would
> have the following characteristics:
> 1. able to efficiently handle and manage large numbers of parsing and
> scoring heuristics
> 2. easy for contributors to add parsing and scoring heuristics
> 3. able to use contextual information (is the citation from a physics
> article or from a history monograph?) in application and scoring of
> heuristics

One of the more problematic things is that we don't always get the contextual information about where a citation occurred -- in fact, it's quite rare to get it.

Also, even in (many) scholarly journals, editorial consistency is almost unbelievably poor -- a lot of the time, the style rules just aren't followed. Punctuation gets dropped, journal names (especially abbreviations!) are misspelled, and so on. Rule-based and heuristic systems are always going to have trouble in those cases.

In a lot of ways, I think the problem is fundamentally similar to identifying parts of speech in natural language (which has many of the same ambiguities) -- and the same techniques that succeed there will probably yield the most robust results for citation parsing.

-n