LISTSERV 16.5 - CODE4LIB Archives

Hi Jonathan,

> PS: And indeed, mapping to OpenURL 1.0 is _exactly_ what I need to  
> do. Sounds like I should look into L8X?

There is a demo/testing site at http://www.lemon8.org ; you might want  
to try playing around there with some citations to get a feel for how  
it works without having to download or install anything.

> It would be convenient if there were a way to choose which parsers  
> to use with L8X, via an API or configuration if I install the  
> software locally. I'm not sure I'll need to pass the citation to  
> _all_ of them. I am going to be doing this in realtime while the  
> user is waiting, so speed matters. But just ParsCit alone isn't  
> doing the job, perhaps ParsCit+regex plus maybe one more would be  
> good enough.

Absolutely -- setting a list of default parsers to use, and the
ability to turn them on/off on-the-fly (i.e., while editing any
particular citation) is something that's been on the to-do list for a
while.  I'm hoping to have it done in the next week or two.
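
As a rough illustration of what "default parsers plus a per-citation override" could look like, here is a minimal Python sketch. It is not L8X code (L8X is PHP, and this feature is not written yet): the registry, the stub parsers, and the names `REGISTRY`, `DEFAULT_PARSERS` and `run_parsers` are all assumptions made up for the example.

```python
from typing import Callable, Dict, List, Optional

# A parser takes a raw citation string and returns a dict of fields.
Parser = Callable[[str], Dict[str, str]]


def _stub(name: str) -> Parser:
    """Return a placeholder parser; a real one would call the actual service."""
    def parse(citation: str) -> Dict[str, str]:
        return {"parsed_by": name, "raw": citation}
    return parse


# Hypothetical registry keyed by parser name (names borrowed from this thread).
REGISTRY: Dict[str, Parser] = {
    name: _stub(name) for name in ("parscit", "freecite", "regex")
}

# Hypothetical site-wide default, e.g. loaded from a config file.
DEFAULT_PARSERS: List[str] = ["parscit", "regex"]


def run_parsers(
    citation: str, enabled: Optional[List[str]] = None
) -> Dict[str, Dict[str, str]]:
    """Run only the enabled parsers; fall back to the defaults if none given."""
    names = DEFAULT_PARSERS if enabled is None else enabled
    unknown = [n for n in names if n not in REGISTRY]
    if unknown:
        raise ValueError(f"unknown parser(s): {unknown}")
    return {name: REGISTRY[name](citation) for name in names}
```

Calling `run_parsers(citation)` uses the defaults, while `run_parsers(citation, ["freecite"])` overrides them for a single citation, which keeps a realtime lookup from waiting on parsers it does not need.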

I should add that, having just added ParsCit, I've actually found that
it doesn't do nearly as good a job as some of the other parsers, but
that may just be on the citation formats that I happen to work with.
L8X is designed to assign each parser a simple statistical score that
estimates how accurately it performed; one feature I've been planning
is a threshold, so that results from parsers which have done a poor
job on that particular citation are ignored.
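
The threshold idea can be sketched in a few lines of Python. How L8X actually computes its score is not described here, so this example simply assumes each parser result already carries a score between 0.0 and 1.0; `ParseResult`, `filter_by_score` and the 0.5 default are hypothetical.

```python
from typing import Dict, NamedTuple


class ParseResult(NamedTuple):
    fields: Dict[str, str]
    score: float  # assumed to lie in 0.0-1.0; higher means more trustworthy


def filter_by_score(
    results: Dict[str, ParseResult], threshold: float = 0.5
) -> Dict[str, ParseResult]:
    """Keep only the parser results whose score meets the threshold."""
    return {
        name: result
        for name, result in results.items()
        if result.score >= threshold
    }


# Made-up results for a single citation.
results = {
    "parscit": ParseResult({"volume": "12"}, 0.35),
    "regex": ParseResult({"volume": "12", "issue": "3"}, 0.80),
}
kept = filter_by_score(results)
```

With these made-up numbers only the `regex` result survives, so a parser that did badly on this one citation is dropped without being disabled globally.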

There is some additional functionality to take a parsed citation and  
look it up in a number of online indexes, and attempt to fetch  
"correct" information, both to supplement, say, an incomplete  
citation, and provide an additional level of quality improvement, but  
that's a somewhat more complex topic that I'm hoping to make the  
subject of a submission to the Code4Lib journal.  :-)

MJ

>>>> MJ Suhonos 11/14/08 3:18 PM >>>
> Hi all,
>
> John, the supplemented approach you describe is how we go about it in
> our Lemon8-XML (L8X) software (http://pkp.sfu.ca/lemon8). L8X passes
> the original unparsed string to a number of different parsers in turn
> (FreeCite, each of the 3 Paracite parsers, and a home-grown regex
> parser), does a little cleaning and normalization, and then hands the
> results to the user to select the correct values for each element.
>
> Most of the time, it actually does a pretty good job of detecting the
> right elements -- in fact, numeric fields like volume, issue and pages
> tend to be more accurate than names and titles, mostly because of
> the larger variance in the latter.  Our experience has been that
> relying on a single approach (machine-learning vs. format-rule-based
> vs. regular-expression) is less reliable than getting partial matches
> from various approaches, and then assembling them.  In this case, the
> whole is in fact greater than the sum of the parts.
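
One way to picture "assembling partial matches" is a per-field vote across parsers. This is only a sketch: in L8X the user picks the correct value for each element, so the automatic majority vote below is a swapped-in simplification, and `normalize` and `merge_by_vote` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List


def normalize(value: str) -> str:
    """Collapse whitespace and trim stray trailing punctuation."""
    return " ".join(value.split()).strip(" .,;")


def merge_by_vote(results: List[Dict[str, str]]) -> Dict[str, str]:
    """For each field, keep the normalized value most parsers agree on."""
    votes: Dict[str, Counter] = {}
    for fields in results:
        for key, value in fields.items():
            cleaned = normalize(value)
            if cleaned:
                votes.setdefault(key, Counter())[cleaned] += 1
    # On a tie, most_common keeps the value that was seen first.
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


# Made-up output from three parsers for the same citation.
merged = merge_by_vote([
    {"volume": "12", "title": "A Study."},
    {"volume": "12", "title": "A Study of"},
    {"volume": "21", "title": "A Study", "issue": "3"},
])
```

Here two of three parsers agree on the volume and on the normalized title, and the issue comes from the only parser that found one, so no single parser needs to get everything right.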
>
> I haven't added the ParsCit web service explicitly since a SOAP-based
> interface is a bit more cumbersome in PHP than FreeCite's POST-type
> interface, but I'll make a point of doing so now.  Incrementally
> adding services that all map to the same citation elements (we use the
> OpenURL 1.0 fields, with a few aberrations) means it's very easy to
> increase the accuracy by simply adding another parsing plugin/service.
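
To make the OpenURL 1.0 mapping concrete, here is a small Python sketch that turns a dict of parsed fields into a key/encoded-value (KEV) query string for a journal article. The `rft.*` keys and the `info:ofi/fmt:kev:mtx:journal` format identifier come from the OpenURL 1.0 (Z39.88-2004) journal format; the internal field names on the left of `FIELD_TO_KEV` are invented for the example and are not L8X's actual names.

```python
from typing import Dict
from urllib.parse import urlencode

# Hypothetical internal field names mapped to OpenURL 1.0 journal KEV keys.
FIELD_TO_KEV = {
    "article_title": "rft.atitle",
    "journal_title": "rft.jtitle",
    "author_last": "rft.aulast",
    "year": "rft.date",
    "volume": "rft.volume",
    "issue": "rft.issue",
    "first_page": "rft.spage",
    "last_page": "rft.epage",
}


def to_openurl_query(fields: Dict[str, str]) -> str:
    """Build an OpenURL 1.0 KEV query string from parsed citation fields."""
    pairs = [
        ("url_ver", "Z39.88-2004"),
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    for name, kev_key in FIELD_TO_KEV.items():
        value = fields.get(name)
        if value:  # skip fields the parsers could not fill
            pairs.append((kev_key, value))
    return urlencode(pairs)


query = to_openurl_query({"volume": "12", "issue": "3", "first_page": "45"})
```

The resulting string can be appended to a link resolver's base URL after a `?`. Because every parser plugin fills the same field names, only this one function has to know about OpenURL.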
>
> You'd have to pull out the relevant classes from L8X to get a
> standalone parser, but since this is one of the more appealing aspects
> of the software for many people, we're looking at making a simple API
> in L8X to just do the citation parsing, possibly without the UI to
> take it from semi-automated to completely automatic.
>
> MJ
>
> On 14-Nov-08, at 12:07 AM, Jonathan Rochkind wrote:
>
>> Thanks Min, this is a great project, that I keep trying to find time
>> to investigate more. Don't apologize for keeping us updated, please
>> continue to!
>>
>> Do you know if any of the improvements have improved detection of
>> volume/issue/page# information? For what I want to use it for,
>> reasonably accurate parsing of volume/issue/page# is needed, and so
>> far whenever I've looked at demos, this seems to be something that
>> all of these machine-learning-type approaches do pretty awfully at.
>> (I wonder if you are not including this in your training much,
>> because it isn't necessary for your purposes to have
>> volume/issue/page#?)
>>
>> I also have wondered if it would make sense to take a
>> machine-learning-type approach to begin with, but then supplement it
>> with formal-rule-based parsing to attempt to get vol/issue/page#
>> according to common patterns?
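
The "supplement with rules" idea might look something like the Python sketch below. It assumes the machine-learning parser has already returned a dict of fields, and it only recognizes one common pattern, `volume(issue), start-end`; the regex, the field names, and the `supplement` function are illustrative assumptions, not code from any of the parsers discussed.

```python
import re
from typing import Dict

# Matches e.g. "12(3), 45-67" or "12(3): pp. 45-67" (hyphen or en dash).
VOL_ISSUE_PAGES = re.compile(
    r"(?P<volume>\d+)\s*\((?P<issue>[^)]+)\)\s*[:,]\s*"
    r"(?:pp?\.\s*)?(?P<spage>\d+)\s*[-\u2013]\s*(?P<epage>\d+)"
)


def supplement(ml_fields: Dict[str, str], citation: str) -> Dict[str, str]:
    """Fill in volume/issue/pages from a rule only where the ML result is empty."""
    merged = dict(ml_fields)
    match = VOL_ISSUE_PAGES.search(citation)
    if match:
        for key, value in match.groupdict().items():
            if not merged.get(key):
                merged[key] = value
    return merged


citation = "Smith, J. (2008). A title. Journal of Things, 12(3), 45-67."
# Pretend the ML parser found the title and volume but missed the rest.
fields = supplement({"title": "A title", "volume": "12"}, citation)
```

The rule never overwrites a value the first parser already produced. A real version would need several patterns, since citation styles differ in how they punctuate volume, issue and pages.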
>>
>> I don't have too much time to try to work on this myself, but if anyone
>> who is working on these various citation parsing efforts could
>> improve volume/issue/page# to a reasonable level, it would make the
>> libraries useful for a much greater range of applications.
>>
>> Jonathan
>>
>>
>>>>> Min-Yen Kan 11/13/08 8:30 PM >>>
>> Dear all:
>>
>> (Sorry to resurrect an old thread...)
>>
>> We've seen the release of several new freely available reference
>> string parsers in recent months.
>> The ParsCit team has also been updating the ParsCit package, and is
>> happy to announce a new version that improves on classification
>> accuracy, and adds training data in Italian, German and French and
>> for a different discipline of humanities. We've updated the
>> classification model to reflect these changes, which should be as
>> easy to use as the original ParsCit.
>>
>> You can either download a copy of ParsCit for your own use, or use it
>> through a web services interface. We welcome your feedback and hope
>> that if you use ParsCit or any other freely available reference
>> string parsing tool, you can contribute annotated data to help
>> make these models more robust.
>>
>> ParsCit is available from: http://wing.comp.nus.edu.sg/parsCit/
>> Current Distribution: http://wing.comp.nus.edu.sg/parsCit/parscit-080917.zip
>>
>> and is a collaboration between Pennsylvania State University
>> (the folks who brought you CiteSeerX) and the National
>> University of Singapore.
>>
>> Cheers,
>>
>> Min
>>
>> P.S. Integration with other freely available parsing systems is
>> hopefully in the works too. If you have something to contribute,
>> we'll be happy to commit some bandwidth into getting it integrated
>> with ParsCit.