Linux has ps2ascii, which extracts ASCII text from either .ps or .pdf files. Also, Nutch (http://lucene.apache.org/nutch/) comes with a PDF parse plugin.

/Jacob Larsen

> -----Original message-----
> From: Code for Libraries [mailto:[log in to unmask]] On behalf of Jose Manuel Barrueco
> Sent: 20 April 2010 08:57
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Reference string parsing and document logical structure software available: ParsCit 100401
>
> We have been using this software with great performance in our
> citation extraction project: CitEc (Citations in Economics)
> (http://citec.repec.org). The only problem we have is related to the
> quality of the input data. We are using a commercial OCR engine from
> Vividata Inc, but it is not able to deal with all types of PDFs. Does
> anyone have experience with conversion from PDF to ASCII? Thanks for
> your help. Regards,
>
> On Mon, 19 Apr 2010, Min-Yen Kan wrote:
>
> > Dear all:
> >
> > The ParsCit team has also been updating the ParsCit package, and is
> > happy to announce a new version that improves on classification
> > accuracy. This version also adds a fully integrated module for
> > document logical structure parsing, so that each line of the input
> > is classified into one of 23 logical structure categories (e.g., page
> > number, title, section header, figure, table, figureCaption). The
> > logical structure can be extracted from either plain text or XML
> > output files that come from an OCR engine. The version also benefits
> > from a number of user-contributed fixes and training data.
> >
> > You can either download a copy of ParsCit for your own use, or use it
> > through a web services interface. We welcome your feedback and hope
> > that if you use ParsCit or any other freely available reference string
> > parsing tool, you can contribute annotated data to help make these
> > models more robust.
> >
> > ParsCit (and its online demos) are available from:
> > http://wing.comp.nus.edu.sg/parsCit/
> >
> > Current Distribution: http://wing.comp.nus.edu.sg/parsCit/parscit-100401.zip
> >
> > Cheers,
> >
> > Min
>
> ---
> José Manuel Barrueco
> http://www.uv.es/=barrueco