Having written a pretty decent citation parser 10 years ago (in
AppleScript!), and having seen a lot of people take whacks at the
problem, I have to say that it's pretty easy to write one that works
on 70-80% of citations, particularly if you stick to one scholarly
subject area. On the other hand, it's really quite hard to write a
citation parser that gets better than 90% of citations right for a
general corpus.

The main problem is that scholarly works are written by creative,
ingenious people. When applied to citations, creativity and ingenuity
are disasters for automated parsers.

Parsers working on the computer science literature have come the
farthest, mostly because the convention in computer science
literature is to always include the article title. The most
impressive thing to me about Google Scholar when it was first
released was to see how far they had taken the citation parsing
outside of the computer science literature. Still, they have a ways
to go; most of the progress they've made seems to be by cheating
(i.e., backing the citation out of the linking, which means they're
just piggybacking on the work done by Inera and others).

(Hint: one of the very best performing open source citation parsers
was written (in Perl) by Steve Lawrence, who was at NEC at the time,
as part of ResearchIndex AKA CiteSeer. It was released as pseudo open
source, but not so easy to separate. It relied heavily on the
availability of the article title. Steve has been at Google for a
while. Steve apparently wasn't involved in Scholar, but you have to
assume he and Anurag did a fair amount of comparing notes.)

Anyway, almost all parsers rely on a set of heuristics. I have not
seen any parsers that do a good job of managing their heuristics in a
scalable way. A successful open-source attack on this problem would
have the following characteristics:
1. able to efficiently handle and manage large numbers of parsing and
scoring heuristics
2. easy for contributors to add parsing and scoring heuristics
3. able to use contextual information (is the citation from a physics
article or from a history monograph?) in application and scoring of
heuristics
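
To make those three characteristics a little more concrete, here is a
minimal Python sketch. It is not taken from any existing parser; every
name in it (parsing_heuristic, scoring_heuristic, Candidate, parse, the
"physics" context label) is hypothetical, the regular expressions are
deliberately naive, and the sample citations are invented. The idea is
only that heuristics register themselves, each one sees the context,
and scoring picks among whatever candidates they propose.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Fields = Dict[str, str]


@dataclass
class Candidate:
    source: str    # name of the parsing heuristic that proposed it
    fields: Fields
    score: float


# Hypothetical registries: a contributor adds a heuristic by decorating
# a function; nothing else in the code has to change.
PARSERS: List[Callable[[str, str], Optional[Fields]]] = []
SCORERS: List[Callable[[Fields, str], float]] = []


def parsing_heuristic(fn):
    PARSERS.append(fn)
    return fn


def scoring_heuristic(fn):
    SCORERS.append(fn)
    return fn


@parsing_heuristic
def authors_year_title_journal(citation, context):
    # "Doe, J. (2001). A title. Journal Name 12, 34-56."
    m = re.match(
        r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+(?P<title>.+?)\.\s+"
        r"(?P<journal>.+?)\s+(?P<volume>\d+),\s+(?P<spage>\d+)",
        citation,
    )
    return m.groupdict() if m else None


@parsing_heuristic
def journal_volume_page_year(citation, context):
    # Title-less style: "J. Doe, J. Ex. Phys. 12, 345 (2001)"
    m = re.match(
        r"^(?P<authors>.+?),\s+(?P<journal>[A-Z][\w. ]+?)\s+"
        r"(?P<volume>\d+),\s+(?P<spage>\d+)\s+\((?P<year>\d{4})\)\.?$",
        citation,
    )
    return m.groupdict() if m else None


@parsing_heuristic
def bare_year(citation, context):
    # Low-information fallback: just find something that looks like a year.
    m = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", citation)
    return {"year": m.group(1)} if m else None


@scoring_heuristic
def field_coverage(fields, context):
    # More extracted fields means a more useful candidate.
    return 0.2 * len(fields)


@scoring_heuristic
def title_expectation(fields, context):
    # Assumption for illustration: physics citations often omit the
    # article title, while other contexts usually include it.
    has_title = "title" in fields
    if context == "physics":
        return 0.0 if has_title else 0.5
    return 0.5 if has_title else 0.0


def parse(citation: str, context: str = "general") -> Optional[Candidate]:
    """Run every parsing heuristic, score each result, return the best."""
    candidates = []
    for heuristic in PARSERS:
        fields = heuristic(citation, context)
        if fields:
            score = sum(scorer(fields, context) for scorer in SCORERS)
            candidates.append(Candidate(heuristic.__name__, fields, score))
    return max(candidates, key=lambda c: c.score, default=None)
```

Calling parse("J. Doe, J. Ex. Phys. 12, 345 (2001)", context="physics")
runs all three parsing heuristics and returns the highest-scoring
candidate. A real system would need far more heuristics, plus some way
to index them so that not every one runs on every citation.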


>It's on our list of Big Problems To Solve; I'm hoping to have time to
>tackle it later this year :)
>On Jul 18, 2007, at 12:57 PM, Jonathan Rochkind wrote:
>>Ha! If it's not too difficult, then with all the time you've spent
>>"looking at it extensively", how come you don't have a solution yet?
>>Just kidding. :)
>>Nathan Vack wrote:
>>>We've looked at this pretty extensively, and we're pretty certain
>>>there's nothing downloadable that does a "good enough" job. However,
>>>it's by no means impossible -- it seems to be undergrad thesis-level
>>>work in Singapore:
>>>There used to be a paper describing this approach (essentially
>>>treating citation parsing as a natural language processing task and
>>>using a maximum entropy algorithm) online... the page even cites
>>>it... but it seems to be gone now.
>>>FWIW, it didn't look too difficult.
>>>On Jul 17, 2007, at 6:16 PM, Jonathan Rochkind wrote:
>>>>Does anyone have any decent open source code to parse a citation? I'm
>>>>talking about a completely narrative citation like someone might
>>>>cut-and-paste from a bibliography or web page. I realize there are a
>>>>number of different formats this could be in (not to mention the
>>>>error problems that always occur from human entered free text)--but
>>>>thinking about it, I suspect that with some work you could get
>>>>something that worked reasonably well (if not perfectly). So I'm
>>>>wondering if anyone has done this work.
>>>>(One of the commercial legal products--I forget if it's Lexis or
>>>>West--does this with legal citations--a more limited domain--quite
>>>>well. I'm not sure if any of the commercial bibliographic citation
>>>>management software does this?)
>>>>The goal, as you can probably guess, is a box that the user can
>>>>paste a citation into; make an OpenURL out of it; show the user where
>>>>to get the citation. I'm pretty confident something useful could be
>>>>created with enough time put into it. But sadly, it's probably more
>>>>time than anyone has individually. Unless someone's done it already?
>>Jonathan Rochkind
>>Sr. Programmer/Analyst
>>The Sheridan Libraries
>>Johns Hopkins University
>>rochkind (at)


Eric Hellman, Director
OCLC Openly Informatics Division
2 Broad St., Suite 208
Bloomfield, NJ 07003
tel 1-973-509-7800 fax 1-734-468-6216
1 Click Access To Everything