Carmen,

The following code may be able to help.

https://github.com/Georgetown-University-Libraries/File-Analyzer

This application can scan a file system and report counts of files by type.
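
To give a rough idea of what a count-by-type report involves, here is a
minimal Python sketch. It is not File Analyzer's own code; the function
name count_by_extension is made up for illustration, and it assumes the
file extension is a good enough stand-in for file type.

```python
import os
from collections import Counter


def count_by_extension(root):
    """Walk the tree under root and count files by lowercased extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return counts
```

Calling count_by_extension on the top-level data directory returns a
Counter, so most_common() lists the largest groups first.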

The application can also report on files by checksum.  If you are trying to
find exact file duplicates, the checksum report will identify exact
duplicates found across a file system.
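
The idea behind a checksum report can be sketched in a few lines of
Python. This is only an illustration under my own assumptions, not how
File Analyzer is implemented: the names file_sha256 and find_duplicates
are hypothetical, and SHA-256 is just one reasonable choice of checksum.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by checksum; keep groups with 2+ files."""
    by_checksum = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_checksum[file_sha256(path)].append(path)
    return {c: sorted(p) for c, p in by_checksum.items() if len(p) > 1}
```

Because the grouping depends only on file contents, it works even when
the files follow no naming convention. On a collection of several TB,
grouping by file size first and hashing only the files that share a size
would cut down the amount of data read.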

I will be presenting an overview of this application during the virtual
lightning talks session on April 3.

If this looks useful to you, I will be glad to give you an overview of the
application.

Terry


On Tue, Mar 19, 2013 at 4:51 PM, Carmen Mitchell
<[log in to unmask]> wrote:

> Hello Code4Libbers,
>
> I'm working with a faculty member and trying to help them to formalize
> their data collection practices. Part of this process is also going through
> old data and trying to assess what they currently have. This particular
> faculty member has been doing research for 10 years without any kind of
> structure or regular method. So far we have over 2 TB of data in various
> states. (With more to come.)
>
> I've got a programmer working with me to:
> a) identify file types
> b) count how many files of each type
>
> We are now working on de-duping and assessing file size, focusing on the
> JPEGs first. With over 300,000 of them...it might take a while. (Of
> course they aren't following any kind of file naming structure,
> either...It's a mess.)
>
> Any tips or tricks or tools that you might know of to help speed up this
> process? Is there a good image recognition tool that you could suggest that
> would help us with automation?
>
>  Thanks,
>
> Carmen Mitchell
> Institutional Repository Librarian
> Cal State San Marcos
>



-- 
Terry Brady
Applications Programmer Analyst
Lauinger Information Technology
202-687-7053