On Jul 20, 2007, at 9:14 AM, Eric Hellman wrote:

> Heuristics are perhaps the only way to deal with lack of consistent
> format. (i.e. "a cluster of words including "journal of" is likely
> to contain a journal name")

You're right; in a lot of ways it depends on what you consider a heuristic. Every algorithm involves human intervention to decide which features it should consider; there's no magic there. In this case, I'm positing that a system in which we manually identify and weight rules isn't going to scale; we'll need some kind of machine learning. Here, I might instead try a feature more like "how often this word occurs in journal titles versus in ordinary text" and then let an ML algorithm train and weight that feature.

And "fun" problems abound even in finding the delimiters between parts of a citation -- are we using ',' or '.' or something else? Is that delimiter used in other contexts? What happens if a citation's missing a delimiter...?

> Have people been able to do a decent job of identifying parts of
> speech in natural language?

Yeah... good PoS taggers (I'm looking at a paper on a Markov-model-based tagger now) run in the 95-98% accuracy range. The standard dataset, however, seems to be a collection of Wall Street Journal articles, which are going to be cleaner than our citation listings. Then again, general language is more complex than citations, so... who knows?

Oddly, the literature has been relatively quiet on this topic for the last few years -- lots of papers from the late '90s, but not so much in the last couple of years. But check Scholar; there's a lot to build on.

-n
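P.S. A minimal sketch of the frequency-ratio feature I had in mind -- the corpora and function name here are made up for illustration, not from any real system. The idea is just: compare a word's relative frequency in journal titles against its relative frequency in ordinary text, smoothed so unseen words don't explode, and let the learner weight the resulting score.

```python
from collections import Counter
import math

def journal_likelihood(word, journal_counts, general_counts):
    """Log-ratio of a word's relative frequency in journal titles
    vs. ordinary text, with add-one smoothing for unseen words.
    Positive => the word looks more 'journal-title-like'."""
    j_total = sum(journal_counts.values())
    g_total = sum(general_counts.values())
    p_journal = (journal_counts[word] + 1) / (j_total + 1)
    p_general = (general_counts[word] + 1) / (g_total + 1)
    return math.log(p_journal / p_general)

# Toy corpora -- stand-ins for real title and body-text word counts.
journal_counts = Counter("journal of applied physics journal of chemistry".split())
general_counts = Counter("the results were published in a journal last year".split())

for w in ["journal", "of", "the"]:
    print(w, round(journal_likelihood(w, journal_counts, general_counts), 3))
```

With real counts you'd feed this score in as one feature among many and let the training procedure decide how much it matters, rather than hand-picking a threshold.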