LISTSERV 16.5 - CODE4LIB Archives

Thanks, sounds like I need to look into Lemon8. If you did expose that as an API, it would be very useful.

For my purposes, vol/issue/page# are actually MORE important than author and title, and I find that ParsCit alone does very poorly with them. Interesting to see that even your software doesn't use ParsCit alone!

Jonathan


>>> MJ Suhonos 11/14/08 3:18 PM >>>
Hi all,

John, the supplemented approach you describe is how we go about it in
our Lemon8-XML (L8X) software (http://pkp.sfu.ca/lemon8). L8X passes
the original unparsed string to a number of different parsers in turn
(FreeCite, each of the 3 ParaCite parsers, and a home-grown regex
parser), does a little cleaning and normalization, and then hands the
results to the user to select the correct values for each element.
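
As a rough illustration of the flow described above (this is not L8X code, which is PHP), the sketch below passes one citation string to several parsers in turn, lightly normalizes what comes back, and collects the candidate values for each element so a user could pick among them. The two parsers, the sample citation, and all function names are hypothetical stand-ins for the real services.

```python
import re
from typing import Callable, Dict, List

Parser = Callable[[str], Dict[str, str]]


def volume_issue_pages_parser(citation: str) -> Dict[str, str]:
    """Hypothetical regex parser for the common 'vol(issue), start-end' pattern."""
    m = re.search(r"(\d+)\s*\((\d+)\)\s*[:,]\s*(\d+)\s*-\s*(\d+)", citation)
    if not m:
        return {}
    return {"volume": m.group(1), "issue": m.group(2),
            "spage": m.group(3), "epage": m.group(4)}


def year_parser(citation: str) -> Dict[str, str]:
    """Hypothetical parser that only finds a parenthesised four-digit year."""
    m = re.search(r"\((\d{4})\)", citation)
    return {"date": m.group(1)} if m else {}


def normalize(value: str) -> str:
    """Light cleanup: trim whitespace and stray trailing punctuation."""
    return value.strip().rstrip(".,;:")


def collect_candidates(citation: str, parsers: List[Parser]) -> Dict[str, List[str]]:
    """Run every parser on the same raw string; keep distinct values per element."""
    candidates: Dict[str, List[str]] = {}
    for parse in parsers:
        for field, value in parse(citation).items():
            cleaned = normalize(value)
            if not cleaned:
                continue
            seen = candidates.setdefault(field, [])
            if cleaned not in seen:
                seen.append(cleaned)
    return candidates


# Made-up citation, used only to exercise the sketch.
CITATION = "Smith, J. (2008). A title. Journal of Things, 12(3), 45-67."
candidates = collect_candidates(CITATION, [volume_issue_pages_parser, year_parser])
```

Each element ends up with a list of distinct candidate values, which is the shape a selection UI would need.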

Most of the time, it actually does a pretty good job of detecting the
right elements -- in fact, numeric stuff like volume, issue, pages,
etc. tends to be more accurate than names and titles, mostly because of
the larger variance in the latter.  Our experience has been that
relying on a single approach (machine-learning vs. format-rule-based  
vs. regular-expression) is less reliable than getting partial matches  
from various approaches, and then assembling them.  In this case, the  
whole is in fact greater than the sum of the parts.
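
One possible way to assemble partial matches automatically is sketched below. It assumes a simple per-element majority vote, which is an illustrative choice and not a description of how L8X combines results (L8X hands the candidates to the user). The parser outputs are made-up data.

```python
from collections import Counter
from typing import Dict, List


def assemble(results: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge partial parser outputs, keeping the most frequent value per element."""
    votes: Dict[str, Counter] = {}
    for result in results:
        for field, value in result.items():
            votes.setdefault(field, Counter())[value] += 1
    # most_common(1) returns the top (value, count) pair for that element.
    return {field: counter.most_common(1)[0][0] for field, counter in votes.items()}


# Hypothetical outputs from three parsers; none is complete on its own,
# and the third has mistaken the year for the volume.
parser_outputs = [
    {"volume": "12", "atitle": "A title"},
    {"volume": "12", "issue": "3"},
    {"volume": "2008", "issue": "3", "spage": "45"},
]
merged = assemble(parser_outputs)
```

The merged record covers more elements than any single parser returned, and the one wrong volume is outvoted.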

I haven't added the ParsCit web service explicitly since a SOAP-based  
interface is a bit more cumbersome in PHP than FreeCite's POST-type  
interface, but I'll make a point of doing so now.  Incrementally  
adding services that all map to the same citation elements (we use the  
OpenURL 1.0 fields, with a few aberrations) means it's very easy to  
increase the accuracy by simply adding another parsing plugin/service.
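
A minimal sketch of the plugin idea follows: each service keeps its own output keys, and a small per-plugin map translates them to a shared set of element names. The target names are a handful of OpenURL 1.0 journal-format keys. The `ParserPlugin` class, the two fake services, and their output keys are all assumptions made for illustration, not the real FreeCite or ParsCit response formats.

```python
from typing import Callable, Dict

# Shared target vocabulary: a few OpenURL 1.0 journal-format keys.
COMMON_FIELDS = {"atitle", "jtitle", "volume", "issue", "spage", "epage", "date"}


class ParserPlugin:
    """Wraps one parsing service and maps its keys to the shared element names."""

    def __init__(self, name: str, parse: Callable[[str], Dict[str, str]],
                 key_map: Dict[str, str]) -> None:
        unknown = set(key_map.values()) - COMMON_FIELDS
        if unknown:
            raise ValueError(f"{name}: unknown target fields {sorted(unknown)}")
        self.name = name
        self.parse = parse
        self.key_map = key_map

    def run(self, citation: str) -> Dict[str, str]:
        raw = self.parse(citation)
        # Keys the map does not mention are dropped.
        return {self.key_map[k]: v for k, v in raw.items() if k in self.key_map}


def fake_service_a(citation: str) -> Dict[str, str]:
    """Hypothetical service with its own key names and an extra score field."""
    return {"journal": "Journal of Things", "vol": "12", "score": "0.9"}


def fake_service_b(citation: str) -> Dict[str, str]:
    """Hypothetical second service using different key names again."""
    return {"Volume": "12", "Number": "3"}


# Adding another service means adding one more entry here.
PLUGINS = [
    ParserPlugin("service_a", fake_service_a, {"journal": "jtitle", "vol": "volume"}),
    ParserPlugin("service_b", fake_service_b, {"Volume": "volume", "Number": "issue"}),
]

results = [plugin.run("any citation string") for plugin in PLUGINS]
```

Because every plugin emits the same element names, downstream code that compares or merges results does not need to change when a service is added.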

You'd have to pull out the relevant classes from L8X to get a  
standalone parser, but since this is one of the more appealing aspects  
of the software for many people, we're looking at making a simple API  
in L8X to just do the citation parsing, possibly without the UI to  
take it from semi-automated to completely automatic.

MJ

On 14-Nov-08, at 12:07 AM, Jonathan Rochkind wrote:

> Thanks Min, this is a great project, that I keep trying to find time  
> to investigate more. Don't apologize for keeping us updated, please  
> continue to!
>
> Do you know if any of the improvements have improved detection of  
> volume/issue/page# information? For what I want to use it for,  
> reasonably accurate parsing of volume/issue/page# is needed, and so  
> far whenever I've looked at demos, this seems to be something that  
> all of these machine-learning-type approaches do pretty awfully at.  
> (I wonder if you are not including this in your training much,
> because it isn't necessary for your purposes to have
> volume/issue/page#?)
>
> I also have wondered if it would make sense to take a
> machine-learning-type approach to begin with, but then supplement it
> with formal-rule-based parsing to attempt to get vol/issue/page#
> according to common patterns?
>
> I don't have too much time to try to work on this myself, but if anyone
> who is working on these various citation parsing efforts could  
> improve volume/issue/page# to a reasonable level, it would make the  
> libraries useful for a much greater range of applications.
>
> Jonathan
>
>
>>>> Min-Yen Kan 11/13/08 8:30 PM >>>
> Dear all:
>
> (Sorry to resurrect an old thread...)
>
> We've seen the release of several new freely available reference
> string parsers in recent months.
> The ParsCit team has also been updating the ParsCit package, and is
> happy to announce a new version that improves on classification
> accuracy, and adds training data in Italian, German and French and for
> a different discipline of humanities. We've updated the classification
> model to reflect these changes, which should be as easy to use as the
> original ParsCit.
>
> You can either download a copy of ParsCit for your own use, or use it
> through a web services interface. We welcome your feedback and hope
> that if you use ParsCit or any other freely available reference string
> parsing tool that you can contribute annotated data to help make these
> models more robust.
>
> ParsCit is available from: http://wing.comp.nus.edu.sg/parsCit/
> Current Distribution: http://wing.comp.nus.edu.sg/parsCit/parscit-080917.zip
>
> and is a joint collaboration between Pennsylvania State University
> (the folks who brought you CiteSeerX) as well as the National
> University of Singapore.
>
> Cheers,
>
> Min
>
> P.S. Integration with other freely available parsing systems is
> hopefully in the works too. If you have something to contribute, we'll
> be happy to commit some bandwidth into getting it integrated with
> ParsCit.