LISTSERV 16.5 - CODE4LIB Archives

Gavin,

It looks like ExifTool is extracting the XML metadata, but isn't
translating it into ASCII - 60 63 120 109 108 is the "<?xml" header, and
I'm sure that the rest of those values are the fulltext that you're looking
for.  According to the FAQ (
http://www.sno.phy.queensu.ca/~phil/exiftool/faq.html), the -charset
exif=CHARSET option will tell it to convert to your character set of choice.
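
To check what those numbers say without any extra tooling, here is a minimal
Python sketch that turns the decimal values from the listing back into text.
The function name decode_exif_values is made up for illustration, and UTF-8 is
an assumption (the 226 128 162 at the start of the second tag looks like a
UTF-8 bullet character). The listing is truncated at "[...]", so this only
decodes the part that was printed.

```python
def decode_exif_values(listing: str, encoding: str = "utf-8") -> str:
    """Turn space-separated decimal byte values into text.

    `listing` is the value column copied from the ExifTool output;
    the encoding is an assumption, not something the file declares here.
    """
    # Drop the "[...]" marker that shows where the printed value was cut off.
    tokens = listing.replace("[...]", " ").split()
    raw = bytes(int(tok) for tok in tokens)
    # errors="replace" avoids an exception if the cut fell inside a
    # multi-byte character.
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # Values copied from the Exif 0x1637 and Exif 0x1638 lines in the listing.
    print(decode_exif_values(
        "60 63 120 109 108 32 118 101 114 115 105 111 110 61 34 [...]"))
    print(repr(decode_exif_values(
        "226 128 162 97 108 117 10 67 101 110 116 114 97 108 10 [...]")))
```

The first call should show the start of the XML declaration and the second the
start of the page text. Getting the complete values would still need the full
tag contents rather than the truncated listing.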

Regards,
Alex


On Tue, May 13, 2014 at 1:29 PM, Gavin Spomer <[log in to unmask]> wrote:

> Thanks for the suggestion. I have it downloaded and installed on my test
> server and have run it with various options on one of the tiff files. Even
> if this doesn't work for me, what a fantastic tool; I may have applications
> for it later. :)
>
> Can't seem to get the text out of the tiff file though. Here's what I was
> able to get:
>
> # exiftool -a -u 00000001.tif
> ExifTool Version Number         : 9.60
> File Name                       : 00000001.tif
> Directory                       : .
> File Size                       : 936 kB
> File Modification Date/Time     : 2013:07:17 15:59:31-07:00
> File Access Date/Time           : 2014:05:13 10:07:30-07:00
> File Inode Change Date/Time     : 2014:04:30 09:24:55-07:00
> File Permissions                : rw-r--r--
> File Type                       : TIFF
> MIME Type                       : image/tiff
> Exif Byte Order                 : Little-endian (Intel, II)
> Subfile Type                    : Full-resolution Image
> Image Width                     : 4802
> Image Height                    : 7189
> Bits Per Sample                 : 1
> Compression                     : T6/Group 4 Fax
> Photometric Interpretation      : WhiteIsZero
> Fill Order                      : Normal
> Document Name                   : The Observer
> Strip Offsets                   : (Binary data 195 bytes, use -b option to
> extract)
> Orientation                     : Horizontal (normal)
> Samples Per Pixel               : 1
> Rows Per Strip                  : 256
> Strip Byte Counts               : (Binary data 166 bytes, use -b option to
> extract)
> X Resolution                    : 300
> Y Resolution                    : 300
> Page Name                       : 1
> T6 Options                      : (none)
> Resolution Unit                 : inches
> Software                        : ResCarta SDK v3.1.6
> Modify Date                     : 2013:06:28 19:13:26
> Exif 0x1637                     : 60 63 120 109 108 32 118 101 114 115 105
> 111 110 61 34 [...]
> Exif 0x1638                     : 226 128 162 97 108 117 10 67 101 110 116
> 114 97 108 10 [...]
> Exif 0x1639                     : 133 156 203 142 38 57 110 133 247 245 44
> 57 64 232 70 7[...]
> Image Size                      : 4802x7189
>
> Not sure any of this helps me.
>
>
> - Gavin
>
>
> >>> "Reser, Gregory" <[log in to unmask]> 5/12/2014 3:30 PM >>>
> You might try http://www.sno.phy.queensu.ca/~phil/exiftool/ , a Perl
> library to read and write embedded metadata.
>
> Greg Reser
> UC San Diego Library
> 9500 Gilman Drive, 0175K
> La Jolla, CA 92093-0175
>
> Phone: 858.246.0998
> Skype: gregreser
>
>
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Stuart Yeates
> Sent: Monday, May 12, 2014 3:26 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Extracting Text From .tiff Files
>
> Your first step is to pin down the format. TIFF is a container format (like
> zip) and can contain pretty much anything. Likely candidates for your format
> include https://en.wikipedia.org/wiki/IPTC_Information_Interchange_Model and
> https://en.wikipedia.org/wiki/Extensible_Metadata_Platform
>
> Your second step is to find a library / tool for your platform that
> supports your format.
>
> Cheers
> stuart
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Gavin Spomer
> Sent: Tuesday, 13 May 2014 10:01 a.m.
> To: [log in to unmask]
> Subject: [CODE4LIB] Extracting Text From .tiff Files
>
> Hello folks,
>
> I'm in the process of migrating a student newspaper collection, currently
> implemented with ResCarta, into our new bepress institutional repository.
> ResCarta has each page of a newspaper stored as a tiff file. Not only does
> the tiff file contain the graphics data, but it has some metadata in xml
> format and the fulltext of the page. I know this because I opened up some
> of the tiffs with a plain-text editor (Vim).
>
> Although I can see the text in the file, I've only been about 90% accurate
> in extracting it with a script. Some of those "weird" characters seem to do
> some wonky things when doing file IO for some reason. Is there a more
> reliable way to extract text stored in a tiff file? I've Googled and
> Googled and have pulled up almost nothing. But there's got to be a way,
> since ResCarta stores it there and can extract it.
>
> Any ideas?
> Gavin Spomer
> Systems Programmer
> Brooks Library
> Central Washington University
>