On Jul 20, 2007, at 9:14 AM, Eric Hellman wrote:

> Heuristics are perhaps the only way to deal with lack of consistent
> format. (i.e. "a cluster of words including "journal of" is likely
> to contain a journal name")

You're right; in a lot of ways, it depends on what you consider a
heuristic. Every algorithm involves human intervention to describe
what features it should consider; there's no magic there. In this
case, I'm positing that a system in which we manually identify and
weight rules isn't gonna scale; we'll need to use some kind of
machine learning.

In this specific case, I might instead try a feature that's more like
"the occurrence of this word in journal titles versus the occurrence
of this word in ordinary text" and then let some ML algorithm train
and weight that feature.
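
To make that concrete, here's a rough sketch of what such a feature
might look like: a smoothed log-ratio of how often a word shows up in
journal titles versus ordinary text. The function names, the add-one
smoothing, and the toy word lists are all my own assumptions for
illustration, not anything from an existing system; a real version
would need actual corpora and probably a better tokenizer.

```python
import math
from collections import Counter


def word_counts(texts):
    """Count lowercased, whitespace-separated tokens across a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def title_log_odds(word, title_counts, text_counts, smoothing=1.0):
    """Log of (smoothed frequency in titles) / (smoothed frequency in ordinary text).

    Positive values mean the word is relatively more common in journal titles.
    Additive smoothing (an assumption here) keeps unseen words from dividing by zero.
    """
    word = word.lower()
    vocab_size = len(set(title_counts) | set(text_counts))
    p_title = (title_counts[word] + smoothing) / (
        sum(title_counts.values()) + smoothing * vocab_size
    )
    p_text = (text_counts[word] + smoothing) / (
        sum(text_counts.values()) + smoothing * vocab_size
    )
    return math.log(p_title / p_text)


# Toy data, made up purely for illustration.
journal_titles = [
    "Journal of Applied Physics",
    "Journal of Machine Learning Research",
    "Annals of Mathematics",
]
ordinary_text = [
    "the cat sat on the mat",
    "we study the physics of the cat",
    "a research of sorts",
]

title_counts = word_counts(journal_titles)
text_counts = word_counts(ordinary_text)

journal_score = title_log_odds("journal", title_counts, text_counts)
the_score = title_log_odds("the", title_counts, text_counts)
```

With this toy data, "journal" scores positive and "the" scores
negative. The score would be one input among many; the learner, not
us, decides how much it counts.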

And "fun" problems abound in even finding delimiters between parts of
a citation -- are we using ',' or '.' or something else? Is that
delimiter used in other contexts? What happens if a citation's
missing a delimiter...?
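
Just to illustrate the ambiguity, here's a quick sketch that checks
how consistently each candidate delimiter appears across a batch of
citations. The candidate list, the "appears at least twice" threshold,
and the two sample citations are assumptions I made up for the
example; this only profiles the problem, it doesn't solve it.

```python
from collections import Counter

# Hypothetical set of delimiter candidates; a real list might be longer.
CANDIDATES = [",", ".", ";", ":"]


def delimiter_profile(citations, candidates=CANDIDATES, min_count=2):
    """For each candidate, return the fraction of citations containing it
    at least `min_count` times."""
    hits = Counter()
    for citation in citations:
        for delim in candidates:
            if citation.count(delim) >= min_count:
                hits[delim] += 1
    total = len(citations) or 1  # avoid dividing by zero on empty input
    return {delim: hits[delim] / total for delim in candidates}


# Two invented citations in different styles.
citations = [
    "Smith, J. A theory of things. Journal of Stuff. 1999.",
    "Doe, A., Roe, B., Some title, Journal of Things, 12, 2001",
]

profile = delimiter_profile(citations)
```

Here '.' passes the threshold in both citations, but in the second one
it's only marking author initials, not separating fields. That's
exactly the "is that delimiter used in other contexts?" problem.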

> Have people been able to do a decent job of identifying parts of
> speech in natural language?

Yeah... good PoS taggers (I'm looking at a paper on a Markov model-
based tagger now) run in the 95-98% accuracy range. The standard
dataset, however, seems to be a collection of Wall Street Journal
articles, which are gonna be cleaner than our citation listings. Then
again, general language is more complex than citations, so... who knows?

Oddly, the literature has been relatively quiet on this topic lately
-- lots of papers from the late '90s, but not so much in the last
couple of years. But check Scholar; there's a lot to build on.