You might try http://www.sno.phy.queensu.ca/~phil/exiftool/, a Perl library for reading and writing embedded metadata.
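For what it's worth, here is a rough sketch of calling it from Python rather than Perl. It assumes the `exiftool` command is installed and on your PATH. The helper names (`exiftool_command`, `read_metadata`) are made up for illustration. The `-j` option asks for JSON output and `-G` prefixes each tag name with its group.

```python
import json
import shutil
import subprocess


def exiftool_command(path):
    """Build the argument list for one file (hypothetical helper)."""
    # -j: JSON output; -G: prefix each tag with its group (e.g. XMP, IPTC).
    return ["exiftool", "-j", "-G", path]


def read_metadata(path):
    """Return the metadata ExifTool reports for `path` as a dict."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool is not on PATH")
    result = subprocess.run(exiftool_command(path), capture_output=True, check=True)
    # ExifTool prints a JSON array with one object per input file.
    return json.loads(result.stdout.decode("utf-8", errors="replace"))[0]
```

Running `read_metadata("page.tif")` on one of the pages should show which groups, if any, hold the XML and the full text.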
Greg Reser
UC San Diego Library
9500 Gilman Drive, 0175K
La Jolla, CA 92093-0175
Phone: 858.246.0998
Skype: gregreser
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Stuart Yeates
Sent: Monday, May 12, 2014 3:26 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Extracting Text From .tiff Files
Your first step is to pin down the format. TIFF is a container format (like zip) and can contain pretty much anything. Likely candidates for your format include https://en.wikipedia.org/wiki/IPTC_Information_Interchange_Model and https://en.wikipedia.org/wiki/Extensible_Metadata_Platform.
Your second step is to find a library / tool for your platform that supports your format.
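To make the two steps concrete, here is a minimal sketch that uses only the Python standard library to list the fields in a TIFF and pull out their raw bytes. It assumes a classic (non-BigTIFF) file and only looks at the first image directory. Tags 700 and 33723 are where XMP and IPTC-IIM are conventionally stored, but whether ResCarta uses either of them is an assumption I have not checked. The function name `read_tiff_fields` is made up for illustration.

```python
import struct

# Byte sizes of the TIFF 6.0 field types (1=BYTE ... 12=DOUBLE).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}

# Tags that commonly carry embedded text or metadata.
KNOWN_TAGS = {270: "ImageDescription", 700: "XMP", 33723: "IPTC-IIM"}


def read_tiff_fields(data):
    """Return {tag: raw_bytes} for the first IFD of a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled here)")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    fields = {}
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, ftype, count = struct.unpack_from(endian + "HHI", data, entry)
        size = TYPE_SIZES.get(ftype, 1) * count
        if size <= 4:
            start = entry + 8  # small values are stored inline in the entry
        else:
            (start,) = struct.unpack_from(endian + "I", data, entry + 8)
        fields[tag] = data[start:start + size]
    return fields
```

Read the file with `open(path, "rb").read()`, pass the bytes in, and look at which keys of `KNOWN_TAGS` show up. If the text sits under some other tag number, that number tells you what to search for when picking a library.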
Cheers
stuart
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Gavin Spomer
Sent: Tuesday, 13 May 2014 10:01 a.m.
To: [log in to unmask]
Subject: [CODE4LIB] Extracting Text From .tiff Files
Hello folks,
I'm in the process of migrating a student newspaper collection, currently implemented with ResCarta, into our new bepress institutional repository. ResCarta has each page of a newspaper stored as a tiff file. Not only does the tiff file contain the graphics data, but it has some metadata in xml format and the fulltext of the page. I know this because I opened up some of the tiffs with a plain-text editor (Vim).
Although I can see the text in the file, I've only been about 90% accurate in extracting it with a script. Some of those "weird" characters seem to do some wonky things when doing file IO for some reason. Is there a more reliable way to extract text stored in a tiff file? I've Googled and Googled and have pulled up almost nothing. But there's got to be a way, since ResCarta stores it there and can extract it.
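One likely cause of the wonky behavior is reading the file in text mode, where the image bytes get decoded and newline-translated along with the text. A minimal sketch of the alternative in Python: read the raw bytes, cut out the span between two markers, and decode only that span. The function names are made up for illustration, and the markers are placeholders to be swapped for whatever actually surrounds the text in these files. UTF-8 is also an assumption.

```python
def extract_between(data, start_marker, end_marker):
    """Return the bytes from start_marker through end_marker, or None."""
    start = data.find(start_marker)
    if start == -1:
        return None
    end = data.find(end_marker, start)
    if end == -1:
        return None
    return data[start:end + len(end_marker)]


def read_embedded_text(path, start_marker, end_marker, encoding="utf-8"):
    """Read `path` as bytes and decode only the marked span."""
    # "rb" avoids newline translation and decode errors on the image bytes.
    with open(path, "rb") as handle:
        data = handle.read()
    chunk = extract_between(data, start_marker, end_marker)
    if chunk is None:
        return None
    # errors="replace" keeps going past bytes that are not valid in `encoding`.
    return chunk.decode(encoding, errors="replace")
```

For example, `read_embedded_text("page.tif", b"<root>", b"</root>")` would return the first `<root>` element as a string, with `<root>` standing in for the real element name.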
Any ideas?
Gavin Spomer
Systems Programmer
Brooks Library
Central Washington University