LISTSERV 16.5 - CODE4LIB Archives

Thanks for the suggestion. I have it downloaded and installed on my test server and have run it with various options on one of the tiff files. Even if this doesn't work for me, what a fantastic tool; I may have applications for it later. :)

Can't seem to get the text out of the tiff file though. Here's what I was able to get: 

# exiftool -a -u 00000001.tif  
ExifTool Version Number         : 9.60 
File Name                       : 00000001.tif 
Directory                       : . 
File Size                       : 936 kB 
File Modification Date/Time     : 2013:07:17 15:59:31-07:00 
File Access Date/Time           : 2014:05:13 10:07:30-07:00 
File Inode Change Date/Time     : 2014:04:30 09:24:55-07:00 
File Permissions                : rw-r--r-- 
File Type                       : TIFF 
MIME Type                       : image/tiff 
Exif Byte Order                 : Little-endian (Intel, II) 
Subfile Type                    : Full-resolution Image 
Image Width                     : 4802 
Image Height                    : 7189 
Bits Per Sample                 : 1 
Compression                     : T6/Group 4 Fax 
Photometric Interpretation      : WhiteIsZero 
Fill Order                      : Normal 
Document Name                   : The Observer 
Strip Offsets                   : (Binary data 195 bytes, use -b option to extract) 
Orientation                     : Horizontal (normal) 
Samples Per Pixel               : 1 
Rows Per Strip                  : 256 
Strip Byte Counts               : (Binary data 166 bytes, use -b option to extract) 
X Resolution                    : 300 
Y Resolution                    : 300 
Page Name                       : 1 
T6 Options                      : (none) 
Resolution Unit                 : inches 
Software                        : ResCarta SDK v3.1.6 
Modify Date                     : 2013:06:28 19:13:26 
Exif 0x1637                     : 60 63 120 109 108 32 118 101 114 115 105 111 110 61 34 [...] 
Exif 0x1638                     : 226 128 162 97 108 117 10 67 101 110 116 114 97 108 10 [...] 
Exif 0x1639                     : 133 156 203 142 38 57 110 133 247 245 44 57 64 232 70 7[...] 
Image Size                      : 4802x7189 

Not sure any of this helps me. 
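
For what it's worth, the three unnamed tags at the bottom of that listing (Exif 0x1637, 0x1638, 0x1639) look like the interesting part. exiftool prints their values as decimal bytes, and the visible prefixes can be decoded by hand. A minimal Python sketch, assuming the bytes are UTF-8 (the variable and function names are just for illustration):

```python
# Leading bytes exactly as exiftool printed them; the "[...]" in the
# listing means the rest of each value was truncated.
tag_0x1637 = "60 63 120 109 108 32 118 101 114 115 105 111 110 61 34"
tag_0x1638 = "226 128 162 97 108 117 10 67 101 110 116 114 97 108 10"


def decode_decimal_bytes(listing: str) -> str:
    """Turn a space-separated list of decimal byte values into text."""
    raw = bytes(int(n) for n in listing.split())
    # UTF-8 is an assumption; errors="replace" keeps going if a
    # multi-byte sequence was cut off by the truncation.
    return raw.decode("utf-8", errors="replace")


print(repr(decode_decimal_bytes(tag_0x1637)))
print(repr(decode_decimal_bytes(tag_0x1638)))
```

The first prefix decodes to the opening of an XML declaration and the second to readable text starting with a bullet character, which suggests (but does not prove) that 0x1637 holds the XML metadata and 0x1638 the page text. The 0x1639 prefix does not decode to anything readable, so it is left out here.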


- Gavin 


>>> "Reser, Gregory" <[log in to unmask]> 5/12/2014 3:30 PM >>>
You might try http://www.sno.phy.queensu.ca/~phil/exiftool/ , a Perl library to read and write embedded metadata.

Greg Reser
UC San Diego Library
9500 Gilman Drive, 0175K
La Jolla, CA 92093-0175

Phone: 858.246.0998
Skype: gregreser



-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Stuart Yeates
Sent: Monday, May 12, 2014 3:26 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Extracting Text From .tiff Files

Your first step is to pin down the format. TIFF is a container format (like zip) and can contain pretty much anything. Likely candidates for your format include https://en.wikipedia.org/wiki/IPTC_Information_Interchange_Model and https://en.wikipedia.org/wiki/Extensible_Metadata_Platform

Your second step is to find a library / tool for your platform that supports your format.
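
If no ready-made tool fits, the tag can also be read directly, since a TIFF file is a header followed by a directory (IFD) of 12-byte tag entries. Below is a rough sketch using only the Python standard library. It assumes a classic (non-BigTIFF) file, looks only at the first IFD, and assumes the text sits in a private tag such as the 0x1638 that shows up in the exiftool listing elsewhere in this thread. The function names are made up for illustration. Opening the file in binary mode ("rb") also sidesteps the odd behavior that text-mode IO can show on non-ASCII bytes.

```python
import struct

# Size in bytes of one value for each TIFF 6.0 field type
# (1=BYTE, 2=ASCII, 3=SHORT, 4=LONG, 5=RATIONAL, 7=UNDEFINED, ...).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def read_tiff_tag(data: bytes, wanted_tag: int):
    """Return the raw bytes of one tag from the first IFD, or None."""
    # "II" = little-endian, "MM" = big-endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (magic number is not 42)")
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        tag, ftype, count = struct.unpack(order + "HHI", data[start:start + 8])
        if tag != wanted_tag:
            continue
        size = TYPE_SIZES.get(ftype, 1) * count
        if size <= 4:
            # Small values are stored inline in the entry itself.
            return data[start + 8:start + 8 + size]
        # Larger values live elsewhere; the entry holds their offset.
        (value_offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
        return data[value_offset:value_offset + size]
    return None


def extract_text(path: str, tag: int = 0x1638):
    """Read one tag from a file and decode it as UTF-8 (an assumption)."""
    with open(path, "rb") as fh:  # binary mode: no newline or codec surprises
        raw = read_tiff_tag(fh.read(), tag)
    return None if raw is None else raw.decode("utf-8", errors="replace")
```

Something like extract_text("00000001.tif") would then return the decoded tag contents, or None if the tag is not in the first IFD. Whether the value really is plain UTF-8 text would need checking against a few pages.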

Cheers
stuart

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Gavin Spomer
Sent: Tuesday, 13 May 2014 10:01 a.m.
To: [log in to unmask]
Subject: [CODE4LIB] Extracting Text From .tiff Files

Hello folks,

I'm in the process of migrating a student newspaper collection, currently implemented with ResCarta, into our new bepress institutional repository. ResCarta stores each page of a newspaper as a tiff file. Not only does the tiff file contain the graphics data, but it also has some metadata in XML format and the full text of the page. I know this because I opened up some of the tiffs with a plain-text editor (Vim).

Although I can see the text in the file, I've only been about 90% accurate in extracting it with a script. Some of those "weird" characters seem to do some wonky things when doing file IO for some reason. Is there a more reliable way to extract text stored in a tiff file? I've Googled and Googled and have pulled up almost nothing. But there's got to be a way, since ResCarta stores it there and can extract it.

Any ideas?
Gavin Spomer
Systems Programmer
Brooks Library
Central Washington University