At Fri, 11 Jul 2008 14:55:18 -0500,
Steve Oberg wrote:
>
> One example:
>
> Here's the citation I have in hand:
>
> Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al.
> The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone
> Metabolism and Disease in CKD: association with mortality in dialysis
> patients. American Journal of Kidney Diseases 2005; 46(5):925-932.
>
> Here's the output from ParsCit. Note the problem with the article title:
>
> […]

The output is a little different from what I get from the parsCit web
service. The parsCit authors recently published a new paper on a new
version of their systems with a new engine, which you might want to
look at [1].

> There's more but basically it isn't accurate enough. It's very good but not
> good enough for what I need at this juncture. OpenURL resolvers like SFX
> are generally only as good as the metadata they are given to parse. I need
> a high level of accuracy.
>
> Maybe that's a pipe dream.

I doubt that the software provided by Inera performs better than
parsCit. Inera does find a DOI for that citation but that is not
nearly so hard as determining which parts of a citation are which.

parsCit is pretty cutting edge & provides some of the best numbers I
have seen. The Flux-CiM system [2] also has pretty good numbers, but
the code for it is not available. I’ve also done a little bit of work
on this, which you might want to have a look at. [3]

One of the problems may be that the parsCit you are dealing with has
been trained on the Cora dataset of computer science citations. It is
a reasonably heterogeneous dataset of citations but it doesn’t have a
lot that looks like that health sciences format. If your citations are
largely drawn from the health sciences you might see about training it
on a health sciences dataset; you will probably get much better
results.

best,
Erik Hetzner

1. Isaac G. Councill, C. Lee Giles, Min-Yen Kan.
   (2008) ParsCit: An open-source CRF reference string parsing
   package. In Proceedings of the Language Resources and Evaluation
   Conference (LREC 08), Marrakesh, Morocco, May. Available from
   <http://wing.comp.nus.edu.sg/parsCit/#p>

2. Eli Cortez C. Vilarinho, Altigran Soares da Silva, Marcos André
   Gonçalves, Filipe de Sá Mesquita, Edleno Silva de Moura. FLUX-CIM:
   flexible unsupervised extraction of citation metadata. In
   Proceedings of the 8th ACM/IEEE Joint Conference on Digital
   Libraries (JCDL 2007), pp. 215-224.

3. A simple method for citation metadata extraction using hidden
   Markov models. In Proc. of the Joint Conf. on Digital Libraries
   (JCDL 2008), Pittsburgh, Pa., 2008.
   <http://gales.cdlib.org/~egh/hmm-citation-extractor/>