Hi Jonathan,
> PS: And indeed, mapping to OpenURL 1.0 is _exactly_ what I need to
> do. Sounds like I should look into L8X?
There is a demo/testing site at http://www.lemon8.org ; you might want
to try playing around there with some citations to get a feel for how
it works without having to download or install anything.
> It would be convenient if there were a way to choose which parsers
> to use with L8X, via an API or configuration if I install the
> software locally. I'm not sure I'll need to pass the citation to
> _all_ of them. I am going to be doing this in realtime while the
> user is waiting, so speed matters. But just ParsCit alone isn't
> doing the job, perhaps ParsCit+regex plus maybe one more would be
> good enough.
Absolutely -- setting a list of default parsers to use, and the
ability to turn them on/off on-the-fly (i.e., while editing any
particular citation) is something that's been on the to-do list for a
while. I'm hoping to have it done in the next week or two.
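
For illustration only, here is a minimal Python sketch of what a configurable default parser list with per-call overrides could look like. This is not L8X code (L8X is written in PHP, and this feature is still on the to-do list); every name here (REGISTRY, DEFAULT_PARSERS, run_parsers, regex_parser, stub_remote_parser) is hypothetical, and the regex parser is a toy stand-in.

```python
import re
from typing import Callable, Dict, List, Optional

# A parser takes a raw citation string and returns the fields it found.
Parser = Callable[[str], Dict[str, str]]


def regex_parser(citation: str) -> Dict[str, str]:
    """Toy rule-based parser: finds a four-digit year and a 'volume(issue)' pair."""
    fields: Dict[str, str] = {}
    year = re.search(r"\b(1[89]\d{2}|20\d{2})\b", citation)
    if year:
        fields["date"] = year.group(1)
    vol_issue = re.search(r"\b(\d+)\s*\((\d+)\)", citation)
    if vol_issue:
        fields["volume"] = vol_issue.group(1)
        fields["issue"] = vol_issue.group(2)
    return fields


def stub_remote_parser(citation: str) -> Dict[str, str]:
    """Placeholder for a slower web-service parser; it returns nothing here."""
    return {}


# Hypothetical registry of every available parser, keyed by name.
REGISTRY: Dict[str, Parser] = {
    "regex": regex_parser,
    "remote_stub": stub_remote_parser,
}

# Hypothetical default list, e.g. loaded from a configuration file.
DEFAULT_PARSERS: List[str] = ["regex"]


def run_parsers(
    citation: str,
    enabled: Optional[List[str]] = None,
    registry: Dict[str, Parser] = REGISTRY,
) -> Dict[str, Dict[str, str]]:
    """Run only the selected parsers; unknown names are silently skipped."""
    names = DEFAULT_PARSERS if enabled is None else enabled
    return {name: registry[name](citation) for name in names if name in registry}
```

Calling run_parsers(citation) uses the defaults, while run_parsers(citation, enabled=["regex", "remote_stub"]) overrides them for one citation, which keeps the slower parsers out of the realtime path unless they are asked for.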
I should add that having just added ParsCit, I've actually found that
it doesn't do nearly as good a job as some of the other parsers, but
that may just be on the citation formats that I happen to work with.
Part of the way L8X is designed is to assign a simple statistical
score to estimate how accurately each parser performs; one feature
I've been planning is to simply allow a threshold to ignore results
from parsers which have done a poor job on that particular citation.
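
A rough sketch of the threshold idea, in Python for illustration. The scoring method is not described in this message, so the score below is an assumed stand-in (the fraction of expected fields a parser filled in), and EXPECTED_FIELDS, completeness_score and filter_by_threshold are hypothetical names. The field keys follow the OpenURL 1.0 journal format.

```python
from typing import Dict

# Assumed set of fields a "complete" journal citation would carry.
EXPECTED_FIELDS = ("aulast", "atitle", "jtitle", "date", "volume", "issue", "spage")


def completeness_score(parsed: Dict[str, str]) -> float:
    """Assumed stand-in score: share of expected fields that are non-empty."""
    filled = sum(1 for field in EXPECTED_FIELDS if parsed.get(field))
    return filled / len(EXPECTED_FIELDS)


def filter_by_threshold(
    results: Dict[str, Dict[str, str]], threshold: float = 0.5
) -> Dict[str, Dict[str, str]]:
    """Keep only the parsers whose score on this citation meets the threshold."""
    return {
        name: parsed
        for name, parsed in results.items()
        if completeness_score(parsed) >= threshold
    }
```

Because the score is computed per citation, a parser that does badly on one reference format is dropped only for that citation and is still consulted for the next one.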
There is some additional functionality to take a parsed citation and
look it up in a number of online indexes, and attempt to fetch
"correct" information, both to supplement, say, an incomplete
citation, and provide an additional level of quality improvement, but
that's a somewhat more complex topic that I'm hoping to make the
subject of a submission to the Code4Lib journal. :-)
MJ
>>>> MJ Suhonos <[log in to unmask]> 11/14/08 3:18 PM >>>
> Hi all,
>
> John, the supplemented approach you describe is how we go about it in
> our Lemon8-XML (L8X) software (http://pkp.sfu.ca/lemon8). The way L8X
> handles parsing is it passes the original unparsed string to a number
> of different parsers in turn (Freecite, each of the 3 Paracite
> parsers, and a home-grown regex parser), does a little cleaning and
> normalization, and then hands the results to the user to select the
> correct values for each element.
>
> Most of the time, it actually does a pretty good job of detecting the
> right elements -- in fact, numeric elements like volume, issue, pages,
> etc. tend to be more accurate than names and titles, mostly because of
> the larger variance in the latter. Our experience has been that
> relying on a single approach (machine-learning vs. format-rule-based
> vs. regular-expression) is less reliable than getting partial matches
> from various approaches, and then assembling them. In this case, the
> whole is in fact greater than the sum of the parts.
>
> I haven't added the ParsCit web service explicitly since a SOAP-based
> interface is a bit more cumbersome in PHP than FreeCite's POST-type
> interface, but I'll make a point of doing so now. Incrementally
> adding services that all map to the same citation elements (we use the
> OpenURL 1.0 fields, with a few aberrations) means it's very easy to
> increase the accuracy by simply adding another parsing plugin/service.
>
> You'd have to pull out the relevant classes from L8X to get a
> standalone parser, but since this is one of the more appealing aspects
> of the software for many people, we're looking at making a simple API
> in L8X to just do the citation parsing, possibly without the UI to
> take it from semi-automated to completely automatic.
>
> MJ
>
> On 14-Nov-08, at 12:07 AM, Jonathan Rochkind wrote:
>
>> Thanks Min, this is a great project, that I keep trying to find time
>> to investigate more. Don't apologize for keeping us updated, please
>> continue to!
>>
>> Do you know if any of the improvements have improved detection of
>> volume/issue/page# information? For what I want to use it for,
>> reasonably accurate parsing of volume/issue/page# is needed, and so
>> far whenever I've looked at demos, this seems to be something that
>> all of these machine-learning-type approaches do pretty awfully at.
>> (I wonder if you are not including this in your training much,
>> because it isn't necessary for your purposes to have
>> volume/issue/page#?)
>>
>> I also have wondered if it would make sense to take a machine-
>> learning-type approach to begin with, but then supplement it with
>> formal-rule-based parsing to attempt to get vol/issue/page#
>> according to common patterns?
>>
>> I don't have too much time to try to work on this myself, but if anyone
>> who is working on these various citation parsing efforts could
>> improve volume/issue/page# to a reasonable level, it would make the
>> libraries useful for a much greater range of applications.
>>
>> Jonathan
>>
>>
>>>>> Min-Yen Kan <[log in to unmask]> 11/13/08 8:30 PM >>>
>> Dear all:
>>
>> (Sorry to resurrect an old thread...)
>>
>> We've seen the release of several new freely available reference
>> string parsers in recent months.
>> The ParsCit team has also been updating the ParsCit package, and is
>> happy to announce a new version that improves on classification
>> accuracy, and adds training data in Italian, German and French and for
>> a different discipline of humanities. We've updated the classification
>> model to reflect these changes, which should be as easy to use as the
>> original ParsCit.
>>
>> You can either download a copy of ParsCit for your own use, or use it
>> through a web services interface. We welcome your feedback and hope
>> that if you use ParsCit or any other freely available reference string
>> parsing tool, you can contribute annotated data to help make these
>> models more robust.
>>
>> ParsCit is available from: http://wing.comp.nus.edu.sg/parsCit/
>> Current Distribution: http://wing.comp.nus.edu.sg/parsCit/parscit-080917.zip
>>
>> and is a joint collaboration between Pennsylvania State University
>> (the folks who brought you CiteSeerX) as well as the National
>> University of Singapore.
>>
>> Cheers,
>>
>> Min
>>
>> P.S. Integration with other freely available parsing systems is
>> hopefully in the works too. If you have something to contribute, we'll
>> be happy to commit some bandwidth into getting it integrated with
>> ParsCit.