LISTSERV 16.5 - CODE4LIB Archives

On Wed, Oct 19, 2011 at 2:57 PM, Jonathan Rochkind wrote:
> If someone else were getting started and didn't want to assemble their own
> training data -- do you think it would be likely useful for them to
> aggregate your training data _and_ Brown's training data together and
> generate a new model?  Was there a particular reason you chose not to use
> Brown's training data and add on to it, but start over from scratch?

As far as I understand how Brown's model was created (which is not
completely obvious - there are lots of files in the repo that seem
related to the parser but do not appear to be used), we're using
theirs plus our data.  I can't really give out our training data - not
because we're guarding anything, but simply because it (will be)
user-supplied, and that seems like it would present a host of social
complications.  The trained model is available, though.
>
> Forgive me if this is a stupid question, I'm still trying to learn about
> this stuff.
>
> And start to figure out how I'm going to deal with it when I get around to
> using FreeCite, which I surely will. Would it maybe make sense to actually
> separate the training data and trained model into a separate library, so
> people could even pick and choose which already-built trained model they
> want to use, or build their own, without dealing with repo conflicts?

Well, you can tell FreeCite to use a different model file.  The model
out of ParsCit's repo can be used, for example, and you'll get
different results (mostly better, some much worse).  I am a little
skeptical that models will be particularly useful across
implementations (my gut feeling is that the devil really is in the
local details), but the only way we'll know is to try.

-Ross.
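
A rough sketch of the "aggregate two sets of training data and generate a
new model" idea raised above. It assumes both sets have been exported to
CRF++'s plain-text training format (one token per line, sequences separated
by a blank line); the thread does not say FreeCite's data is stored that
way (Ross notes his is in a database), so the file layout and the helper
names below are hypothetical. The only real external piece is the
crf_learn command line (template file, training file, model file), which
is built here but not run.

```python
from pathlib import Path


def read_sequences(path):
    """Return the labelled sequences in a CRF++-style training file.

    Each sequence is a list of non-empty lines; blank lines separate
    one sequence (here, one citation) from the next.
    """
    sequences, current = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sequences.append(current)
            current = []
    if current:  # the file did not end with a blank line
        sequences.append(current)
    return sequences


def merge_training_files(paths, out_path):
    """Concatenate training files, skipping exact duplicate sequences.

    Returns the number of sequences written to out_path.
    """
    seen, merged = set(), []
    for path in paths:
        for seq in read_sequences(path):
            key = tuple(seq)
            if key not in seen:
                seen.add(key)
                merged.append(seq)
    # Every sequence is followed by a blank line, as CRF++ expects.
    text = "".join("\n".join(seq) + "\n\n" for seq in merged)
    Path(out_path).write_text(text, encoding="utf-8")
    return len(merged)


def crf_learn_command(template, train, model):
    """Build the argument list for CRF++'s crf_learn (not executed here)."""
    return ["crf_learn", str(template), str(train), str(model)]
```

The command list could be handed to subprocess.run on a machine with CRF++
installed. Dropping exact duplicates is a design choice for this sketch,
not something the thread prescribes; whether a merged model actually beats
a locally trained one is the open question Ross describes.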
>
> The training data is not currently under source control (it's in the
> database), but the trained model is.
>
> That's, admittedly, a bit of a downside to my fork (although the model being
> checked into git is true of the original, as well) since you'd always be in
> conflict with my trained model if you train your own.
>
> -Ross.
>
> On Monday, October 17, 2011, Jonathan Rochkind wrote:
>> When you say you've added to the training data, have you shared your
>> additions back with Brown, or is your new, improved training data only in
>> your fork? Or is it only held locally by you and isn't even in your GitHub
>> fork?  Please clarify, thanks!
>>
>> On 10/13/2011 8:52 PM, Ross Singer wrote:
>>>
>>> Yeah, we've been doing a lot with (and putting a lot of updates into)
>>> FreeCite.  We only use the webservice (although we don't use the
>>> OpenURL context object and instead added a JSON response).  It works
>>> pretty well (not always great, but certainly better than nothing) -
>>> especially for giving us something "good enough" to throw against some
>>> OpenLibrary and Crossref data to look for matches.  Basically what
>>> we're using it for is to go from a citation string to an RDF graph.
>>>
>>> BTW, there have been no problems with post-2000 dates (not to say that
>>> there aren't plenty of other problems) - this might have been either a
>>> training issue or something a later version of CRF++ worked out.  We
>>> also add the citations it couldn't parse correctly to its training
>>> data, which might help this.
>>>
>>> Anyway, yeah, if anybody is interested, feel free to try it out.  One
>>> thing my fork does is remove the PostgreSQL dependency, if that's an
>>> issue for anybody.  It's kind of handy to be able to just use SQLite
>>> or MySQL or whatever to try it out.
>>>
>>> -Ross.
>>>
>>> On Thu, Oct 13, 2011 at 7:42 PM, Avram Lyon wrote:
>>>>
>>>> On Thu, Oct 13, 2011 at 2:33 PM, Will Kurt wrote:
>>>>>
>>>>> I always think that Brown's FreeCite API is underutilized.
>>>>> http://freecite.library.brown.edu/
>>>>> It's far from perfect, but I'm sure more use could be made of it.
>>>>>
>>>>> A few months back I threw together a copy/paste citation look-up with
>>>>> it:
>>>>> CiteBox
>>>>> http://willkurt.github.com/CiteBox/
>>>>>
>>>>> Of course I don't think anyone is really making use of it, but I've
>>>>> also done nothing to really promote it either ;)
>>>>
>>>> The FreeCite parser had major issues for a while with post-2000 dates,
>>>> and I believe the installation at Brown still does, but, to judge by
>>>> the GitHub activity (most active fork here:
>>>> https://github.com/rsinger/free_cite/), some enterprising folks have
>>>> picked it up after a period of apparent dormancy. This is great to
>>>> see, and vital to any project that hopes to use its API for anything
>>>> serious.
>>>>
>>>> By the way, the rarely-used XML representation of OpenURL
>>>> ContextObjects that FreeCite produces is supported by Zotero as a
>>>> full-fledged input format, a fact that might come in handy if you're
>>>> hoping to have your API produce something that Zotero users can
>>>> import.
>>>>
>>>> Avram
>>>>
>>>> UCLA Slavic, Zotero community dev
>>>>
>>
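
Ross's description of going "from a citation string to an RDF graph" can be
illustrated with a small sketch. It assumes the parser has already returned
a dictionary of fields; the key names (title, authors, year) and the choice
of Dublin Core terms are assumptions made for illustration, not the actual
shape of the FreeCite JSON response or the mapping Ross's project uses. The
output is N-Triples text with the citation as a blank node.

```python
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical mapping from parsed-citation keys to Dublin Core properties.
FIELD_TO_PROPERTY = {
    "title": "title",
    "authors": "creator",
    "year": "date",
}


def escape_literal(value):
    """Escape a value for use inside an N-Triples string literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def citation_to_ntriples(parsed, node="c1"):
    """Turn a dict of parsed citation fields into N-Triples lines.

    List-valued fields (e.g. several authors) produce one triple each;
    fields missing from the dict are skipped.
    """
    lines = []
    for field, prop in FIELD_TO_PROPERTY.items():
        values = parsed.get(field)
        if values is None:
            continue
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            lines.append(
                f'_:{node} <{DCTERMS}{prop}> "{escape_literal(value)}" .'
            )
    return "\n".join(lines)
```

Plain literals keep the sketch dependency-free; a real pipeline would more
likely use an RDF library and, as Ross describes, match the parsed fields
against OpenLibrary or Crossref records before settling on identifiers.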