Hi Eric,

This sounds like an interesting project!

You have multiple choices. Hadoop is at least two things: a
distributed file system (HDFS) and a distributed computation engine
with its own API. If you upload files to the file system, Hadoop
distributes them across the nodes in a safe way. The basic idea of the
computation is that the program is sent to the nodes where the files
are stored, rather than the files being moved to the program. For
processing you can use the Hadoop API, which is based on the MapReduce
paradigm, or you can use other tools built on Hadoop; I have
experience with Apache Spark. Either way, you can read and write data
from the file system or from other sources that have a Hadoop or
Spark interface (Cassandra, MongoDB, etc.). The basic workflow is that
you submit a .jar file to Hadoop and it distributes the jar to the
three nodes. You can even exploit the multicore nature of these nodes:
Hadoop and Spark are very effective at parallel processing across
cores out of the box, and you do not need to put the usual Java
multithreading code into your own program. By default, the output is
also stored in the Hadoop file system.
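
To make the MapReduce idea a bit more concrete, here is a minimal
sketch in plain Python. It does not use the Hadoop or Spark API at
all; it only imitates the three phases (map, shuffle, reduce) in a
single process with a word count. The function names and the sample
documents are my own assumptions for illustration. On a real cluster
the map calls would run on the nodes that hold the files, and the
framework would do the shuffle for you.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input standing in for files stored in HDFS.
SAMPLE_DOCUMENTS = ["distant reading of texts", "reading texts at scale"]
```

Calling word_count(SAMPLE_DOCUMENTS) returns a dictionary that maps
each word to its count. Because each map call only looks at one
document and each reduce call only looks at one key, the calls are
independent of each other, which is what lets a framework spread them
over many cores and machines.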

If you plan to use XSEDE, I suggest you check out this project:
https://www.xsede.org/news/science-stories/-/asset_publisher/9JovW1UTN10Q/content/ruby-mendenhall-charts-progress-using-hpc-big-data-to-flag-unidentified-historical-sources-on-african-american-women%E2%80%99s-lives/10165.
Prof. Ruby Mendenhall used XSEDE to do distant reading (for those who
don't know, the term was coined by Franco Moretti, a Stanford
professor) on HathiTrust and JSTOR data to reveal traces of Black
women's lives in the first half of the 19th century.

I am happy to share some more technical tips if needed.

Best,
Péter

Eric Lease Morgan <[log in to unmask]> wrote (on Mon, Dec 17, 2018, 18:53):
>
> What is your experience with Apache Hadoop?
>
> I have very recently been granted root privileges on as many as three virtual machines. Each machine has forty-four cores, and more hard disk space & RAM than I really know how to exploit. I got access to these machines to work on a project I call The Distant Reader, and The Distant Reader implements a lot of map/reduce computing.†
>
> Can I use Apache Hadoop to accept jobs on one machine, send them to either of the other two machines, and then save the results in some sort of common/shared file system?
>
> † In reality, The Distant Reader is ultimately intended to be an XSEDE science gateway --> https://www.xsede.org. The code for the Reader is available on GitHub --> https://github.com/ericleasemorgan/reader
>
> --
> Eric Morgan
> University of Notre Dame



-- 
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly