Min, Erik, and others working in this domain -
have you considered designing your software as a scalable web service
from the get-go, using such frameworks as Google App Engine? You may
be able to use Montepython for the CRF computations
(http://montepython.sourceforge.net/).
I know Min offers a WSDL wrapper around their software, but that's
simply a gateway to one single-machine installation, and it's not
intended as a production service at that.
- Godmar
On Sat, Jul 12, 2008 at 3:20 AM, Min-Yen Kan <[log in to unmask]> wrote:
> Hi Steve, all:
>
> I'm the key developer of ParsCit. I'm glad to hear your feedback
> about what doesn't work with ParsCit. Erik is correct in saying that
> we have only trained the system for what data we have correct answers
> for, namely computer science. As such it doesn't perform well with
> other data (especially health sciences citations, which we have also
> done some pilot tests on). I note that there are other citation
> parsers out there, including Erik's own HMM parser (I think Erik
> mentioned it as well, available from his website here:
> http://gales.cdlib.org/~egh/hmm-citation-extractor/)
>
> Anyways, I've tried your citation too, and got the same results from
> the demo -- it doesn't handle the authors correctly in this case. I
> would very much love to have as many example cases of incorrectly
> parsed citations as the community is willing to share with us so we
> can improve ParsCit (it's open source so all can benefit from
> improvements to ParsCit).
>
> We are trying to be as proactive as possible about maintaining and
> improving ParsCit. I know of at least two groups that have said they
> are willing to contribute more citations (with correct markings) to us
> so that we can re-train ParsCit, and there is interest in porting it
> to other languages (i.e. German right now). We would love to get
> samples of your data too, where the program does go wrong, to help
> improve our system. We would also like feedback on other fields that
> need to be parsed as well: ISSNs, ISBNs, volumes, and issues.
>
> We are also looking to make the output of the ParsCit system
> compatible with EndNote and BibTeX. We actually have an internal project
> to try to hook up ParsCit to find references on arbitrary web pages
> (to form something like Zotero that's not site specific and
> non-template based). If and when this project comes to fruition we'll
> be announcing it to the list.
>
> If anyone has used ParsCit and has feedback on what can be further
> improved we'd love to hear from you. You are our target audience!
>
> Cheers,
>
> Min
>
> --
> Min-Yen KAN (Dr) :: Assistant Professor :: National University of
> Singapore :: School of Computing, AS6 05-12, Law Link, Singapore
> 117590 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
> [log in to unmask] (E) :: www.comp.nus.edu.sg/~kanmy (W)
>
> PS: Hi Erik, still planning on studying your HMM package for improving
> ParsCit ... It's on my agenda.
> Thanks again.
>
> On Sat, Jul 12, 2008 at 5:36 AM, Steve Oberg <[log in to unmask]> wrote:
>> Yeah, I am beginning to wonder, based on these really helpful replies, if I
>> need to scale back to what is "doable" and "reasonable." And reassess
>> ParsCit.
>>
>> Thanks to all for this additional information.
>>
>> Steve
>>
>> On Fri, Jul 11, 2008 at 4:18 PM, Nate Vack <[log in to unmask]> wrote:
>>
>>> On Fri, Jul 11, 2008 at 3:57 PM, Steve Oberg <[log in to unmask]> wrote:
>>>
>>> > I fully realize how much of a risk that is in terms of reliability and
>>> > maintenance. But right now I just want a way to do this in bulk with a
>>> > high level of accuracy.
>>>
>>> How bad is it, really, if you get some (5%?) bad requests into your
>>> document delivery system? Customers submit poor quality requests by
>>> hand with some frequency, last I checked...
>>>
>>> Especially if you can hack your system to deliver the original
>>> citation all the way into your doc delivery system, you may be able to
>>> make the case that 'this is a good service to offer; let's just deal
>>> with the bad parses manually.'
>>>
>>> Trying to solve this via pure technology is gonna get into a world of
>>> diminishing returns. A surprising number of citations in references
>>> sections are wrong. Some correct citations are really hard to parse,
>>> even by humans who look at a lot of citations.
>>>
>>> ParsCit has, in my limited testing, worked as well as anything I've
>>> seen (commercial or OSS), and much better than most.
>>>
>>> My $0.02,
>>> -Nate
>>>
>>
>