LISTSERV 16.5 - DLF-ANNOUNCE Archives

FYI, I am quoting the blog post below in full (everything below the line is
a citation and quoted text).

 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

http://zzzoot.blogspot.com/2008/02/hadoop-ec2-s3-super-alternatives-for.html

"Hadoop + EC2 + S3 = Super alternatives for researchers (& real people too!)"

"I recently discovered and have been inspired by a real-world and non-trivial
(in space and in time) application of Hadoop (Open Source implementation of
Google's MapReduce) combined with the Amazon Simple Storage Service (Amazon
S3) and the Amazon Elastic Compute Cloud (Amazon EC2). The project was to
convert pre-1922 New York Times articles-as-scanned-TIFF-images into PDFs of
the articles:

    Recipe:
    4 TB of data loaded to S3 (TIFF images)
    + Hadoop (+ Java Advanced Imaging and various glue)
    + 100 EC2 instances
    + 24 hours
    = 11M PDFs, 1.5 TB on S3


Unfortunately, the developer (Derek Gottfrid) did not say how much this cost
the NYT. But here is my back-of-the-envelope calculation (using the Amazon
S3/EC2 FAQ):

    EC2: $0.10 per instance-hour x 100 instances x 24hrs = $240
    S3: $0.15 per GB-Month x 4500 GB x ~1.5/31 months = ~$33
    + $0.10 per GB of data transferred in x 4000 GB = $400
    + $0.13 per GB of data transferred out x 1500 GB = $195
    Total: = ~$868

Not unreasonable at all! Of course this does not include the cost of
bandwidth that the NYT needed to upload/download their data.
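
The back-of-the-envelope arithmetic above can be reproduced with a short
Python sketch. The function name `estimate_cost` and its parameter names are
hypothetical, chosen only for illustration, and the prices are the figures
quoted in the calculation above, not current Amazon pricing.

```python
# Prices as quoted in the calculation above (assumed, not current pricing).
EC2_PER_INSTANCE_HOUR = 0.10   # USD per instance-hour
S3_PER_GB_MONTH = 0.15         # USD per GB-month of storage
TRANSFER_IN_PER_GB = 0.10      # USD per GB transferred in
TRANSFER_OUT_PER_GB = 0.13     # USD per GB transferred out


def estimate_cost(instances, hours, stored_gb, stored_days,
                  gb_in, gb_out, days_in_month=31):
    """Return a per-item cost breakdown (USD) plus the total."""
    costs = {
        "ec2": EC2_PER_INSTANCE_HOUR * instances * hours,
        # Storage is billed per GB-month, so scale by the fraction of a month.
        "storage": S3_PER_GB_MONTH * stored_gb * stored_days / days_in_month,
        "transfer_in": TRANSFER_IN_PER_GB * gb_in,
        "transfer_out": TRANSFER_OUT_PER_GB * gb_out,
    }
    costs["total"] = sum(costs.values())
    return costs


# The NYT figures used in the estimate above.
nyt = estimate_cost(instances=100, hours=24, stored_gb=4500,
                    stored_days=1.5, gb_in=4000, gb_out=1500)

for item, usd in nyt.items():
    print(f"{item:>12}: ${usd:,.2f}")
```

Changing the instance count, run time, or data volumes gives a quick estimate
for a different job under the same assumed prices.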

I've known about MapReduce and Hadoop for quite a while now, but this is the
first use outside of Google (MapReduce) and Yahoo (Hadoop), and in
combination with Amazon services, in which I've seen such a real problem
solved so smoothly, and one that wasn't web indexing or a toy example.

As much of my work in information retrieval and knowledge discovery involves
a great deal of space and even more CPU, I am looking forward to
experimenting with this sort of environment (Hadoop, local or in a service
cloud) for some of the more extreme experiments I am working on. And by
using Hadoop locally, if the problem gets too big for our local resources, we
can always buy capacity like the NYT example with a minimum of effort!

This is also something that various commercial organizations (and even
individuals?) with specific high CPU / high storage / high bandwidth (oh,
transfers between S3 and EC2 are free) compute needs should be considering.
Of course security and privacy concerns apply."