Hi Jonathan,

> PS: And indeed, mapping to OpenURL 1.0 is _exactly_ what I need to
> do. Sounds like I should look into L8X?

There is a demo/testing site at http://www.lemon8.org ; you might want to try playing around there with some citations to get a feel for how it works without having to download or install anything.

> It would be convenient if there were a way to choose which parsers
> to use with L8X, via an API or configuration if I install the
> software locally. I'm not sure I'll need to pass the citation to
> _all_ of them. I am going to be doing this in realtime while the
> user is waiting, so speed matters. But just ParsCit alone isn't
> doing the job, perhaps ParsCit+regex plus maybe one more would be
> good enough.

Absolutely -- setting a list of default parsers to use, and the ability to turn them on/off on-the-fly (i.e. while editing any particular citation) is something that's been on the to-do list for a while. I'm hoping to have it done in the next week or two.

I should add that having just added ParsCit, I've actually found that it doesn't do nearly as good a job as some of the other parsers, but that may just be on the citation formats that I happen to work with. Part of the way L8X is designed is to assign a simple statistical score to estimate how accurately each parser performs; one feature I've been planning is to simply allow a threshold to ignore results from parsers which have done a poor job on that particular citation.

There is some additional functionality to take a parsed citation and look it up in a number of online indexes, and attempt to fetch "correct" information, both to supplement, say, an incomplete citation, and provide an additional level of quality improvement, but that's a somewhat more complex topic that I'm hoping to make the subject of a submission to the Code4Lib journal.
:-)

MJ

>>>> MJ Suhonos 11/14/08 3:18 PM >>>
> Hi all,
>
> John, the supplemented approach you describe is how we go about it in
> our Lemon8-XML (L8X) software (http://pkp.sfu.ca/lemon8). L8X handles
> parsing by passing the original unparsed string to a number of
> different parsers in turn (FreeCite, each of the 3 Paracite parsers,
> and a home-grown regex parser), doing a little cleaning and
> normalization, and then handing the results to the user to select the
> correct values for each element.
>
> Most of the time, it actually does a pretty good job of detecting the
> right elements -- in fact, numeric stuff like volume, issue, pages,
> etc. tends to be more accurate than names and titles, mostly because of
> the larger variance in the latter. Our experience has been that
> relying on a single approach (machine-learning vs. format-rule-based
> vs. regular-expression) is less reliable than getting partial matches
> from various approaches, and then assembling them. In this case, the
> whole is in fact greater than the sum of the parts.
>
> I haven't added the ParsCit web service explicitly since a SOAP-based
> interface is a bit more cumbersome in PHP than FreeCite's POST-type
> interface, but I'll make a point of doing so now. Incrementally
> adding services that all map to the same citation elements (we use the
> OpenURL 1.0 fields, with a few aberrations) means it's very easy to
> increase the accuracy by simply adding another parsing plugin/service.
>
> You'd have to pull out the relevant classes from L8X to get a
> standalone parser, but since this is one of the more appealing aspects
> of the software for many people, we're looking at making a simple API
> in L8X to just do the citation parsing, possibly without the UI to
> take it from semi-automated to completely automatic.
>
> MJ
>
> On 14-Nov-08, at 12:07 AM, Jonathan Rochkind wrote:
>
>> Thanks Min, this is a great project, that I keep trying to find time
>> to investigate more. Don't apologize for keeping us updated, please
>> continue to!
>>
>> Do you know if any of the improvements have improved detection of
>> volume/issue/page# information? For what I want to use it for,
>> reasonably accurate parsing of volume/issue/page# is needed, and so
>> far whenever I've looked at demos, this seems to be something that
>> all of these machine-learning-type approaches do pretty awfully at.
>> (I wonder if you are not including this in your training much,
>> because it isn't necessary for your purposes to have
>> volume/issue/page#?)
>>
>> I also have wondered if it would make sense to take a
>> machine-learning-type approach to begin with, but then supplement it
>> with formal-rule-based parsing to attempt to get vol/issue/page#
>> according to common patterns?
>>
>> I don't have too much time to try to work on this myself, but if
>> anyone who is working on these various citation parsing efforts could
>> improve volume/issue/page# to a reasonable level, it would make the
>> libraries useful for a much greater range of applications.
>>
>> Jonathan
>>
>>
>>>>> Min-Yen Kan 11/13/08 8:30 PM >>>
>> Dear all:
>>
>> (Sorry to resurrect an old thread...)
>>
>> We've seen the release of several new freely available reference
>> string parsers in recent months.
>> The ParsCit team has also been updating the ParsCit package, and is
>> happy to announce a new version that improves on classification
>> accuracy, and adds training data in Italian, German and French and
>> for a different discipline of humanities. We've updated the
>> classification model to reflect these changes, which should be as
>> easy to use as the original ParsCit.
>>
>> You can either download a copy of ParsCit for your own use, or use it
>> through a web services interface.
>> We welcome your feedback and hope that, if you use ParsCit or any
>> other freely available reference string parsing tool, you can
>> contribute annotated data to help make these models more robust.
>>
>> ParsCit is available from: http://wing.comp.nus.edu.sg/parsCit/
>> Current Distribution: http://wing.comp.nus.edu.sg/parsCit/parscit-080917.zip
>>
>> and is a joint collaboration between Pennsylvania State University
>> (the folks who brought you CiteSeerX) and the National University of
>> Singapore.
>>
>> Cheers,
>>
>> Min
>>
>> P.S. Integration with other freely available parsing systems is
>> hopefully in the works too. If you have something to contribute,
>> we'll be happy to commit some bandwidth into getting it integrated
>> with ParsCit.
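
For anyone skimming the thread, the approach discussed above (run a chosen subset of parsers, score each result, ignore results below a threshold, then assemble the rest field by field) can be sketched roughly as follows. This is only an illustrative Python sketch, not L8X or ParsCit code. Everything in it is an assumption made for the example: the two toy parsers, the field names (loosely modelled on OpenURL 1.0 journal keys), the score (simply the fraction of fields a parser filled in, a stand-in for whatever statistical score L8X actually uses), and the made-up sample citation. It also merges automatically, whereas L8X as described hands the candidate values to the user.

```python
import re
from typing import Callable, Dict, List

Citation = Dict[str, str]
Parser = Callable[[str], Citation]

# Hypothetical field names, loosely modelled on OpenURL 1.0 journal keys.
FIELDS = ("atitle", "jtitle", "date", "volume", "issue", "spage", "epage")


def regex_parser(raw: str) -> Citation:
    """Rule-based parser: only looks for 'vol(issue), start-end' and a year."""
    out: Citation = {}
    m = re.search(r"\b(\d+)\s*\((\d+)\)\s*[:,]\s*(\d+)\s*-+\s*(\d+)", raw)
    if m:
        out["volume"], out["issue"], out["spage"], out["epage"] = m.groups()
    year = re.search(r"\b(?:1[89]|20)\d{2}\b", raw)
    if year:
        out["date"] = year.group(0)
    return out


def naive_title_parser(raw: str) -> Citation:
    """Toy stand-in for a trained parser: guesses titles after '(year).'."""
    m = re.search(r"\(\d{4}\)\.\s*([^.]+)\.\s*([^,]+),", raw)
    if not m:
        return {}
    return {"atitle": m.group(1).strip(), "jtitle": m.group(2).strip()}


def score(result: Citation) -> float:
    """Assumed score: the fraction of known fields this parser filled in."""
    return sum(1 for f in FIELDS if result.get(f)) / len(FIELDS)


def parse_citation(raw: str, parsers: Dict[str, Parser],
                   enabled: List[str], threshold: float = 0.2) -> Citation:
    """Run only the enabled parsers, drop low scorers, merge the rest."""
    scored = []
    for name in enabled:
        result = parsers[name](raw)
        s = score(result)
        if s >= threshold:          # ignore parsers that did a poor job
            scored.append((s, result))
    # Highest-scoring parser gets first say on each field.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    merged: Citation = {}
    for _, result in scored:
        for field, value in result.items():
            merged.setdefault(field, value)
    return merged


PARSERS: Dict[str, Parser] = {
    "regex": regex_parser,
    "title": naive_title_parser,
}

if __name__ == "__main__":
    # Made-up citation, for illustration only.
    sample = ("Smith, J. (2008). Parsing citations. "
              "Journal of Examples, 12(3), 45-67.")
    print(parse_citation(sample, PARSERS, enabled=["regex", "title"]))
```

Restricting `enabled` to one or two names is the "choose which parsers to use" case raised earlier in the thread, and raising `threshold` drops parsers that returned little for a given citation. A real deployment would swap the toy functions for calls to the actual parsing services.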