
Linux has ps2ascii (part of Ghostscript), which extracts ASCII text from either .ps or .pdf files. Also, Nutch (http://lucene.apache.org/nutch/) comes with a PDF parse plugin.
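
In case it helps, here is a minimal sketch of how ps2ascii could be driven from a batch script. It assumes Ghostscript is installed and ps2ascii is on the PATH; the function names (build_ps2ascii_command, convert_to_text) are made up for illustration and are not part of any existing tool.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_ps2ascii_command(source: Path, target: Path) -> List[str]:
    """Build the argv for `ps2ascii input output.txt` (hypothetical helper)."""
    return ["ps2ascii", str(source), str(target)]


def convert_to_text(source: Path, target: Optional[Path] = None) -> Path:
    """Convert a .ps or .pdf file to plain text by calling ps2ascii.

    Assumption: ps2ascii is available on the PATH (it ships with Ghostscript).
    """
    if source.suffix.lower() not in {".ps", ".pdf"}:
        raise ValueError(f"expected a .ps or .pdf file, got {source.name}")
    if shutil.which("ps2ascii") is None:
        raise RuntimeError("ps2ascii not found on PATH")
    # Default to writing the text next to the input, e.g. paper.pdf -> paper.txt
    target = target or source.with_suffix(".txt")
    # check=True raises CalledProcessError if ps2ascii exits with an error
    subprocess.run(build_ps2ascii_command(source, target), check=True)
    return target
```

Something like convert_to_text(Path("paper.pdf")) should then leave the extracted text in paper.txt. Scanned PDFs that contain only page images will most likely still need an OCR step, since there is no text layer for ps2ascii to extract.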

/Jacob Larsen




> -----Original message-----
> From: Code for Libraries [mailto:[log in to unmask]] On behalf of Jose
> Manuel Barrueco
> Sent: 20 April 2010 08:57
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Reference string parsing and document logical structure
> software available: ParsCit 100401
> 
> 
> We have been using this software with great results in our
> citation extraction project, CitEc (Citations in Economics)
> (http://citec.repec.org). The only problem we have is the
> quality of the input data. We are using a commercial OCR engine from
> Vividata Inc, but it cannot handle all types of PDFs. Does
> anyone have experience with converting PDF to ASCII? Thanks for your
> help. Regards,
> 
> 
> 
> 
> On Mon, 19 Apr 2010, Min-Yen Kan wrote:
> 
> > Dear all:
> >
> > The ParsCit team has been updating the ParsCit package and is
> > happy to announce a new version that improves classification
> > accuracy.  This version also adds a fully integrated module for
> > document logical structure parsing, which classifies each line of the
> > input into one of 23 logical structure categories (e.g., page
> > number, title, section header, figure, table, figureCaption). The
> > structure can be extracted from either plain text or XML output files
> > that come from an OCR engine.  This version also benefits from a number
> > of user-contributed fixes and training data.
> >
> > You can either download a copy of ParsCit for your own use or use it
> > through a web services interface. We welcome your feedback and hope
> > that, if you use ParsCit or any other freely available reference string
> > parsing tool, you can contribute annotated data to help make these
> > models more robust.
> >
> > ParsCit (and its online demos) is available from:
> > http://wing.comp.nus.edu.sg/parsCit/
> > Current Distribution: http://wing.comp.nus.edu.sg/parsCit/parscit-100401.zip
> >
> > Cheers,
> >
> > Min
> >
> >
> 
> 
> ---
> José Manuel Barrueco
> 	http://www.uv.es/=barrueco