LISTSERV 16.5 - CODE4LIB Archives

Hi Steve, all:

I'm the key developer of ParsCit.  I'm glad to hear your feedback
about what doesn't work with ParsCit.  Erik is correct in saying that
we have only trained the system for what data we have correct answers
for, namely computer science.  As such it doesn't perform well with
other data (especially health sciences citations, which we have also
done some pilot tests on).  I note that there are other citation
parsers out there, including Erik's own HMM parser (I think Erik
mentioned it as well; it is available from his website here:
http://gales.cdlib.org/~egh/hmm-citation-extractor/).

Anyways, I've tried your citation too, and got the same results from
the demo -- it doesn't handle the authors correctly in this case.  I
would very much love to have as many example cases of incorrectly
parsed citations as the community is willing to share with us so we
can improve ParsCit (it's open source so all can benefit from
improvements to ParsCit).

We are trying to be as proactive as possible about maintaining and
improving ParsCit.  I know of at least two groups that have said they
are willing to contribute more citations (with correct markings) to us
so that we can re-train ParsCit, and there is interest in porting it
to other languages (i.e., German right now).  We would love to get
samples of your data too, where the program does go wrong, to help
improve our system.  We would also like feedback on other fields that
need to be parsed as well: ISSNs, ISBNs, volumes, and issues.

We are also looking to make the output of the ParsCit system
compatible with EndNote and BibTeX.  We actually have an internal project
to try to hook up ParsCit to find references on arbitrary web pages
(to form something like Zotero that's neither site-specific nor
template-based).  If and when this project comes to fruition we'll
be announcing it to the list.

If anyone has used ParsCit and has feedback on what can be further
improved we'd love to hear from you.  You are our target audience!

Cheers,

Min

-- 
Min-Yen KAN (Dr) :: Assistant Professor :: National University of
Singapore :: School of Computing, AS6 05-12, Law Link, Singapore
117590 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
[log in to unmask] (E) :: www.comp.nus.edu.sg/~kanmy (W)

PS: Hi Erik, still planning on studying your HMM package for improving
ParsCit ... It's on my agenda.
Thanks again.

On Sat, Jul 12, 2008 at 5:36 AM, Steve Oberg <[log in to unmask]> wrote:
> Yeah, I am beginning to wonder, based on these really helpful replies, if I
> need to scale back to what is "doable" and "reasonable." And reassess
> ParsCit.
>
> Thanks to all for this additional information.
>
> Steve
>
> On Fri, Jul 11, 2008 at 4:18 PM, Nate Vack <[log in to unmask]> wrote:
>
>> On Fri, Jul 11, 2008 at 3:57 PM, Steve Oberg <[log in to unmask]> wrote:
>>
>> > I fully realize how much of a risk that is in terms of reliability and
>> > maintenance.  But right now I just want a way to do this in bulk with a
>> > high level of accuracy.
>>
>> How bad is it, really, if you get some (5%?) bad requests into your
>> document delivery system? Customers submit poor quality requests by
>> hand with some frequency, last I checked...
>>
>> Especially if you can hack your system to deliver the original
>> citation all the way into your doc delivery system, you may be able to
>> make the case that 'this is a good service to offer; let's just deal
>> with the bad parses manually.'
>>
>> Trying to solve this via pure technology is gonna get into a world of
>> diminishing returns. A surprising number of citations in references
>> sections are wrong. Some correct citations are really hard to parse,
>> even by humans who look at a lot of citations.
>>
>> ParsCit has, in my limited testing, worked as well as anything I've
>> seen (commercial or OSS), and much better than most.
>>
>> My $0.02,
>> -Nate
>>
>