LISTSERV 16.5 - CODE4LIB Archives

At Fri, 11 Jul 2008 14:55:18 -0500,
Steve Oberg wrote:
> 
> One example:
> 
> Here's the citation I have in hand:
> 
> Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al.
> The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone
> Metabolism and Disease in CKD: association with mortality in dialysis
> patients. American Journal of Kidney Diseases 2005; 46(5):925-932.
> 
> Here's the output from ParsCit. Note the problem with the article title:
>
> […]

The output is a little different from what I get from the parsCit web
service. The parsCit authors recently published a paper on a new
version of their system, which uses a new engine; you might want to
look at it [1].

> There's more but basically it isn't accurate enough. It's very good but not
> good enough for what I need at this juncture.  OpenURL resolvers like SFX
> are generally only as good as the metadata they are given to parse.  I need
> a high level of accuracy.
> 
> Maybe that's a pipe dream.

I doubt that the software provided by Inera performs better than
parsCit. Inera does find a DOI for that citation but that is not
nearly so hard as determining which parts of a citation are which.
parsCit is pretty cutting edge & provides some of the best numbers I
have seen. The Flux-CiM system [2] also has pretty good numbers, but
the code for it is not available. I’ve also done a little bit of work
on this, which you might want to have a look at. [3]

One of the problems may be that the parsCit you are dealing with has
been trained on the Cora dataset of computer science citations. It is
a reasonably heterogeneous dataset of citations but it doesn’t have a
lot that looks like that health sciences format. If your citations are
largely drawn from the health sciences you might see about training it
on a health sciences dataset; you will probably get much better
results.
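
To make the retraining idea a bit more concrete, here is a rough
sketch in Python of what hand-labelled training data for your example
citation might look like once it is flattened into (token, label)
pairs, which is the general shape of input a sequence labeller learns
from. The label names, the feature names and both helper functions are
my own illustration; this is not parsCit's actual tag set, feature set
or training file format, so check its documentation before building
anything on this.

```python
import re

# Hand-labelled fields for the citation quoted earlier in this thread.
# ASSUMPTION: these label names are illustrative, not parsCit's tag set.
LABELLED = [
    ("author", "Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, "
               "Bos WJ, Krediet RT et al."),
    ("title", "The Kidney Disease Outcomes Quality Initiative (K/DOQI) "
              "Guideline for Bone Metabolism and Disease in CKD: "
              "association with mortality in dialysis patients."),
    ("journal", "American Journal of Kidney Diseases"),
    ("date", "2005;"),
    ("volume", "46(5):"),
    ("pages", "925-932."),
]


def to_token_labels(fields):
    """Flatten (label, text) fields into one (token, label) pair per
    whitespace-separated token."""
    return [(tok, label) for label, text in fields for tok in text.split()]


def token_features(token):
    """A few surface features of the kind a sequence labeller might use.
    ASSUMPTION: a hypothetical feature set chosen for illustration."""
    return {
        "lower": token.lower(),
        # Bare initials such as "M," or "JC," follow the surname in this
        # citation style.
        "is_initials": bool(re.fullmatch(r"[A-Z]{1,3},?", token)),
        "has_digit": any(c.isdigit() for c in token),
        "ends_with_period": token.endswith("."),
    }


pairs = to_token_labels(LABELLED)
for token, label in pairs[:4]:
    print(label, token, token_features(token)["is_initials"])
```

A model that has rarely seen bare initials after the surname has
little to go on when deciding where the author list ends and the title
begins, which is the kind of mistake labelled examples in your own
format should help with.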

best,
Erik Hetzner

1. Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008) ParsCit: An
open-source CRF reference string parsing package. In Proceedings of
the Language Resources and Evaluation Conference (LREC 08), Marrakesh,
Morocco, May. Available from <http://wing.comp.nus.edu.sg/parsCit/#p>

2. Eli Cortez C. Vilarinho, Altigran Soares da Silva, Marcos André
Gonçalves, Filipe de Sá Mesquita, Edleno Silva de Moura. FLUX-CIM:
flexible unsupervised extraction of citation metadata. In Proceedings
of the 8th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2007),
pp. 215-224.

3. Erik Hetzner. A simple method for citation metadata extraction
using hidden Markov models. In Proc. of the Joint Conf. on Digital
Libraries (JCDL 2008), Pittsburgh, Pa., 2008.
<http://gales.cdlib.org/~egh/hmm-citation-extractor/>