LISTSERV 16.5 - CODE4LIB Archives

Godmar Back wrote:
> A year or so ago a couple of students looked into this for LibX. There
> are a number of systems that people have published about, although
> some are not available and none worked very well or were easy to get
> to work. The systems also varied in their computational complexity,
> with some not suitable for interactive use. Google for "libx citation
> sensing", or generally for citation extraction, automatic record
> boundary detection or extraction. (Unfortunately, pubs.dlib.vt.edu
> appears to be down at the moment - otherwise, Suresh Menon's report
> contains a useful bibliography of work. I'll ping them.)

I've tested ParaTools
<http://search.cpan.org/src/MJEWELL/Biblio-Document-Parser-1.10/docs/html/intro.html>
but after it choked on most of its own examples, I tried looking elsewhere.

Inera's eXtyles refXpress claims to do this. You can see it in action
at: <http://www.crossref.org/SimpleTextQuery/>. Better than ParaTools
but still missed a lot of things I thought would have been obvious.
Inera said most of the issues I picked out were a problem with
CrossRef's implementation, but the cost of the product was so great that
I didn't explore further.

There was an interesting paper at JCDL 2007 on an unsupervised way of
doing this that had promising results
<http://doi.acm.org/10.1145/1255175.1255219> but I haven't found any of
their code online.

> For citations that contain item titles (which is true for a majority,
> but definitely not all citation styles) LibX's magic button uses
> Scholar as a hidden backend to produce an actionable OpenURL. Combined
> with a similarity analysis, this "magic button" functionality
> produces a usable OpenURL in (on average) 81% of cases for a set of
> 400 randomly chosen citations from 4 widely read journals from 4
> different areas published in 2006 [1].  With some fixes, we could
> probably get this number up to 90%. Obviously, this approach only
> works for individual use; Google would object to large-scale batch
> use.

Agreed that a lookup against something like Google Scholar, Web of
Science, or a set of federated search targets may yield better
results. We've discussed it but haven't done any testing.
        --SET
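
For illustration only, here is a minimal Python sketch of the kind of
similarity check discussed above: given a pasted citation and the titles
that some lookup service returned, keep the best-matching title only if
most of its words appear, in order, in the citation. This is not LibX's
actual algorithm; the function names, the word-level comparison, and the
0.8 threshold are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def best_match(citation: str, candidate_titles: List[str],
               threshold: float = 0.8) -> Optional[Tuple[str, float]]:
    """Return (title, score) for the best candidate, or None.

    The score is the fraction of a candidate title's words that
    SequenceMatcher aligns, in order, with words of the citation.
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    citation_words = normalize(citation)
    best = None
    for title in candidate_titles:
        title_words = normalize(title)
        if not title_words:
            continue
        matcher = SequenceMatcher(None, title_words, citation_words,
                                  autojunk=False)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        score = matched / len(title_words)
        if score >= threshold and (best is None or score > best[1]):
            best = (title, score)
    return best
```

A caller would pass the raw citation text plus whatever titles the
backend returned, and treat a None result as "no confident match"
rather than guessing.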


> - Godmar
>
> [1] Annette Bailey and Godmar Back, Retrieving Known Items with LibX.
> The Serials Librarian, 2007. To appear.
>
> On 7/17/07, Jonathan Rochkind wrote:
>> Does anyone have any decent open source code to parse a citation? I'm
>> talking about a completely narrative citation like someone might
>> cut-and-paste from a bibliography or web page. I realize there are a
>> number of different formats this could be in (not to mention the human
>> error problems that always occur from human-entered free text)--but
>> thinking about it, I suspect that with some work you could get something
>> that worked reasonably well (if not perfect). So I'm wondering if anyone
>> has done this work.
>>
>> (One of the commercial legal products--I forget if it's Lexis or
>> West--does this with legal citations--a more limited domain--quite
>> well.  I'm not sure if any of the commercial bibliographic citation
>> management software does this?)
>>
>> The goal, as you can probably guess, is a box that the user can paste a
>> citation into; make an OpenURL out of it; show the user where to get the
>> citation.  I'm pretty confident something useful could be created here,
>> with enough time put into it. But sadly, it's probably more time than
>> anyone has individually. Unless someone's done it already?
>>
>> Hopefully,
>> Jonathan
>>
>
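
As a rough sketch of the "paste a citation, get an OpenURL" box
described in the original question, the Python below pulls a few easily
recognized fields (DOI, year, volume/issue/pages) out of free text with
regular expressions and encodes them as OpenURL 1.0 key/value pairs. It
is a naive heuristic, not any of the systems mentioned in this thread:
it does not attempt author, title, or journal extraction, the patterns
will misfire on many citation styles, and the resolver address is a
made-up placeholder.

```python
import re
from urllib.parse import urlencode

# Hypothetical link resolver address; substitute your institution's own.
RESOLVER_BASE = "https://resolver.example.edu/openurl"

DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")
# Matches patterns such as "12(3), 45-67" or "12(3): 45-67".
VOL_ISSUE_PAGES_RE = re.compile(
    r"\b(\d+)\s*\((\d+)\)\s*[:,]\s*(\d+)\s*[-\u2013]\s*(\d+)")
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def parse_citation(text):
    """Heuristically extract OpenURL journal fields from a free-text citation."""
    fields = {}
    doi = DOI_RE.search(text)
    if doi:
        fields["rft_id"] = "info:doi/" + doi.group(1).rstrip(".,;")
        # Blank out the DOI so digits inside it are not mistaken for a year.
        text = text[:doi.start()] + " " + text[doi.end():]
    vip = VOL_ISSUE_PAGES_RE.search(text)
    if vip:
        (fields["rft.volume"], fields["rft.issue"],
         fields["rft.spage"], fields["rft.epage"]) = vip.groups()
        # Likewise blank out the page range before looking for a year.
        text = text[:vip.start()] + " " + text[vip.end():]
    year = YEAR_RE.search(text)
    if year:
        fields["rft.date"] = year.group(1)
    return fields


def build_openurl(fields, base=RESOLVER_BASE):
    """Encode the extracted fields as an OpenURL 1.0 (KEV, journal format) link."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    }
    params.update(fields)
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    # Made-up citation used purely for demonstration.
    sample = ("Doe, J. (2006). An example article. Journal of Examples, "
              "12(3), 45-67. doi:10.1000/example.123")
    print(build_openurl(parse_citation(sample)))
```

Fields the heuristics cannot find are simply left out of the link, so
the resolver gets a sparse but valid request instead of wrong metadata.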