Godmar Back wrote:
> A year or so ago a couple of students looked into this for LibX. There
> are a number of systems that people have published about, although
> some are not available and none worked very well or were easy to get
> to work. The systems also varied in their computational complexity,
> with some not suitable for interactive use. Google for "libx citation
> sensing", or generally for citation extraction, automatic record
> boundary detection or extraction. (Unfortunately, pubs.dlib.vt.edu
> appears to be down at the moment - otherwise, Suresh Menon's report
> contains a useful bibliography of work. I'll ping them.)

I've tested ParaTools
<http://search.cpan.org/src/MJEWELL/Biblio-Document-Parser-1.10/docs/html/intro.html>
but after it choked on most of its own examples, I tried looking
elsewhere.

Inera's eXtyles refXpress claims to do this. You can see it in action
at <http://www.crossref.org/SimpleTextQuery/>. It did better than
ParaTools but still missed a lot of things I thought would have been
obvious. Inera said most of the issues I picked out were a problem with
CrossRef's implementation, but the cost of the product was so great
that I didn't explore further.

There was an interesting paper at JCDL 2007 on an unsupervised way of
doing this that had promising results
<http://doi.acm.org/10.1145/1255175.1255219>, but I haven't found any
of their code online.

> For citations that contain item titles (which is true for a majority,
> but definitely not all citation styles) LibX's magic button uses
> Scholar as a hidden backend to produce an actionable OpenURL. Combined
> with a similarity analysis, this "magic button" functionality
> produces a usable OpenURL in (on average) 81% of cases for a set of
> 400 randomly chosen citations from 4 widely read journals from 4
> different areas published in 2006 [1]. With some fixes, we could
> probably get this number up to 90%.
> Obviously, this approach only works for individual use; Google would
> object to large-scale batch use.

Agreed that a lookup against something like Google Scholar, Web of
Science, or a set of federated search targets may yield better results.
We've discussed it but haven't done any testing.

--SET

> - Godmar
>
> [1] Annette Bailey and Godmar Back, Retrieving Known Items with LibX.
> The Serials Librarian, 2007. To appear.
>
> On 7/17/07, Jonathan Rochkind wrote:
>> Does anyone have any decent open source code to parse a citation? I'm
>> talking about a completely narrative citation like someone might
>> cut-and-paste from a bibliography or web page. I realize there are a
>> number of different formats this could be in (not to mention the human
>> error problems that always occur with human-entered free text)--but
>> thinking about it, I suspect that with some work you could get something
>> that worked reasonably well (if not perfectly). So I'm wondering if
>> anyone has done this work.
>>
>> (One of the commercial legal products--I forget if it's Lexis or
>> West--does this with legal citations--a more limited domain--quite
>> well. I'm not sure if any of the commercial bibliographic citation
>> management software does this?)
>>
>> The goal, as you can probably guess, is a box that the user can paste a
>> citation into; make an OpenURL out of it; show the user where to get the
>> citation. I'm pretty confident something useful could be created here,
>> with enough time put into it. But sadly, it's probably more time than
>> anyone has individually. Unless someone's done it already?
>>
>> Hopefully,
>> Jonathan
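
For what it's worth, the last step of the goal described in the original
message ("make an OpenURL out of it") is the easy part once the fields
have been pulled out of the citation. Below is a rough Python sketch,
assuming the parsing has already happened by some other means. The
function name build_openurl and the resolver address
resolver.example.edu are made up for illustration; the field values come
from reference [1] above, and the keys are the usual OpenURL 1.0 KEV
journal keys (rft.atitle, rft.jtitle, and so on).

```python
from urllib.parse import urlencode


def build_openurl(resolver_base, fields):
    """Build an OpenURL 1.0 (Z39.88-2004) KEV link for a journal article.

    resolver_base is a hypothetical link-resolver address; fields maps
    KEV journal keys (without the "rft." prefix) to already-parsed values.
    """
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    }
    for key, value in fields.items():
        if value:  # skip fields the parser could not fill in
            params["rft." + key] = value
    return resolver_base + "?" + urlencode(params)


# Fields taken from reference [1] in this thread; volume is unknown.
fields = {
    "aulast": "Bailey",
    "aufirst": "Annette",
    "atitle": "Retrieving Known Items with LibX",
    "jtitle": "The Serials Librarian",
    "date": "2007",
    "volume": "",
}
url = build_openurl("https://resolver.example.edu/openurl", fields)
print(url)
```

The hard part, of course, is still getting from the pasted text to that
fields dictionary, which is what the tools discussed above attempt.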