Hello Code4Libbers,

I'm working with a faculty member and trying to help them formalize
their data collection practices. Part of this process is also going through
old data and trying to assess what they currently have. This particular
faculty member has been doing research for 10 years without any kind of
structure or regular method. So far we have over 2 TB of data in various
states. (With more to come.)

I've got a programmer working with me to do the following (see the sketch after the list):
a) identify file types
b) count how many files of each type
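
Here is a minimal Python sketch of that step, just to show the kind of
inventory we're after. The root path is a placeholder, and it assumes
everything sits under one directory tree, which is a simplification.

#!/usr/bin/env python3
"""Walk a directory tree and count files (and bytes) by extension."""
import os
from collections import Counter

ROOT = "/path/to/faculty_data"  # placeholder, not a real path

counts = Counter()  # number of files per extension
sizes = Counter()   # total bytes per extension

for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        ext = os.path.splitext(name)[1].lower() or "(no extension)"
        counts[ext] += 1
        try:
            sizes[ext] += os.path.getsize(os.path.join(dirpath, name))
        except OSError:
            pass  # unreadable file; skip it in the tally

for ext, n in counts.most_common():
    print(f"{ext}\t{n} files\t{sizes[ext] / 1e9:.2f} GB")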

We are now working on de-duping and assessing file size, focusing on the
JPEGs first. With over 300,000 of them... it might take a while. (Of
course they aren't following any kind of file naming structure,
either...It's a mess.)
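
The obvious first pass at de-duping seems to be grouping files by a
hash of their contents, so that byte-identical copies surface no matter
what they're named. A rough Python sketch of that idea is below; the
SHA-256 choice and the root path are illustrative, not something we've
settled on.

#!/usr/bin/env python3
"""Group JPEGs by SHA-256 of their contents to flag exact duplicates."""
import hashlib
import os
from collections import defaultdict

ROOT = "/path/to/faculty_data"  # placeholder, not a real path

def sha256_of(path, chunk=1 << 20):
    """Hash a file in 1 MB chunks so large files never load whole into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

groups = defaultdict(list)  # digest -> list of paths with that digest
for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        if name.lower().endswith((".jpg", ".jpeg")):
            full = os.path.join(dirpath, name)
            try:
                groups[sha256_of(full)].append(full)
            except OSError:
                pass  # unreadable file; note it and move on

# Only digests with more than one path are exact duplicates.
for digest, paths in groups.items():
    if len(paths) > 1:
        print(f"{len(paths)} copies ({digest[:12]}...):")
        for p in paths:
            print("  " + p)

Note that this only catches exact, byte-for-byte copies; re-saved or
resized versions won't hash the same, which is part of why I'm asking
about image recognition below.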

Any tips, tricks, or tools that you might know of to help speed up this
process? Is there a good image recognition tool you could suggest that
would help us with automation?

Thanks,

Carmen Mitchell
Institutional Repository Librarian
Cal State San Marcos