LISTSERV 16.5 - CODE4LIB Archives

>On Jul 18, 2007, at 10:04 PM, Eric Hellman wrote:
>Also, even in (many) scholarly journals, editorial consistency is
>almost unbelievably poor -- lots of times, the rules just aren't
>followed. Punctuation gets missed, journal names (especially
>abbreviations!) are misspelled... and so on. Rule-based and heuristic
>systems are always going to have problems in those cases.

Heuristics are perhaps the only way to deal with a lack of consistent
format (e.g., "a cluster of words including 'journal of' is likely to
contain a journal name").
If you have a halfway decent journal name parser (such as the one in
our OpenURL software), it already contains a large list of journal
misspellings.
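
To make that concrete, here is a rough sketch of how the two ideas might combine: check a table of known variants first, then fall back on the "journal of" heuristic. The table contents and the names KNOWN_VARIANTS, JOURNAL_OF and guess_journal are made up for illustration; this is not the parser in our software, just one way such a thing could look.

```python
import re

# Hypothetical table of known misspellings/abbreviations -> canonical name.
# Illustrative entries only; a real parser would carry a much larger list.
KNOWN_VARIANTS = {
    "jounal of the american chemical society": "Journal of the American Chemical Society",
    "j. am. chem. soc.": "Journal of the American Chemical Society",
}

# Heuristic: "journal of" followed by a run of capitalised words
# (allowing small connecting words) is likely a journal name.
JOURNAL_OF = re.compile(r"\b[Jj]ournal\s+of(?:\s+(?:the|and|of|&|[A-Z][\w-]*))+")


def normalize(text):
    """Lowercase and collapse whitespace so lookups tolerate sloppy spacing."""
    return re.sub(r"\s+", " ", text.strip().lower())


def guess_journal(citation):
    """Return (journal_name, method) for a citation string, or None."""
    norm = normalize(citation)
    # 1. Known variants win: they map straight to a canonical name.
    for variant, canonical in KNOWN_VARIANTS.items():
        if variant in norm:
            return canonical, "lookup"
    # 2. Otherwise fall back on the "journal of" word-cluster heuristic.
    match = JOURNAL_OF.search(citation)
    if match:
        return match.group(0), "heuristic"
    return None


if __name__ == "__main__":
    print(guess_journal("Doe, A. J. Am. Chem. Soc. 2001, 123, 45."))
    print(guess_journal("Smith J. (1999). Journal of Applied Physics 85, 123."))
```

The lookup runs first because a known misspelling gives a canonical name outright, while the heuristic only returns whatever text happened to follow "journal of" and can over- or under-shoot.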


>In a lot of ways, I think the problem is fundamentally similar to
>identifying parts of speech in natural language (which has lots of
>the same ambiguities) -- and the same techniques that succeed there
>will probably yield the most robust results for citation parsing.

Have people been able to do a decent job of identifying parts of
speech in natural language?
--

Eric Hellman, Director
OCLC Openly Informatics Division
2 Broad St., Suite 208
Bloomfield, NJ 07003
tel 1-973-509-7800 fax 1-734-468-6216
http://openly.oclc.org/1cate/
1 Click Access To Everything