Péter provided a good start; I just wanted to mention that using the
"streaming" option you can write code in pretty much whatever you want,
certainly Python and Perl. I've even mixed and matched, where my mapping
program is in Python and my reducing program (optional since the mapper
might just be writing data out to disk that doesn't need reducing) in Perl.
Hadoop doesn't really care, even though Hadoop itself is written in Java.
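To make the streaming idea concrete, here is a minimal word-count sketch with both halves in Python. It relies only on the streaming contract as I understand it: each program reads lines on stdin and writes tab-separated key/value lines on stdout, and the reducer receives its input sorted by key. The function names and the "map"/"reduce" command-line switch are made up for illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Hadoop Streaming sorts mapper output by key before the reducer
    sees it, so all lines for one word arrive next to each other.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical switch: pass "map" or "reduce" as the only argument.
    step = {"map": map_lines, "reduce": reduce_lines}[sys.argv[1]]
    for out in step(sys.stdin):
        print(out)
```

Saved as, say, wordcount.py (a hypothetical name), this can be tried without a cluster by piping a text file through "python wordcount.py map", then "sort", then "python wordcount.py reduce". On the cluster the same two commands go to the streaming jar's -mapper and -reducer options; where that jar lives depends on your installation.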
Also, make sure you know how to check the logs: they are not only
helpful for monitoring the processes, they are essential for debugging.
Sometimes I think I've crashed as many times as I've had jobs run
successfully...but then that's just me. ;-)
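For streaming jobs, one concrete way to leave a trail in those logs: whatever a mapper or reducer writes to stderr ends up in the task's stderr log, and stderr lines of the form reporter:counter:<group>,<counter>,<amount> update a job counter. A small sketch, in which the function name, the counter group "demo" and the counter name "bad_lines" are my own inventions:

```python
import sys


def map_with_logging(lines, log=sys.stderr):
    """Pass through well-formed 'key<TAB>value' lines and report the rest."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if "\t" not in line:
            # Free-form message: ends up in the task's stderr log.
            print(f"skipping line {number}: no tab found", file=log)
            # Counter update: Hadoop Streaming parses this stderr format.
            print("reporter:counter:demo,bad_lines,1", file=log)
            continue
        yield line


# In a real mapper the input would be sys.stdin:
#     for out in map_with_logging(sys.stdin):
#         print(out)
```

Counting the bad lines instead of crashing on them means one malformed record does not take down the whole job, and the counter total tells you afterwards how much was skipped.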
Roy
On Mon, Dec 17, 2018 at 10:27 AM Péter Király <[log in to unmask]> wrote:
> Hi Eric,
>
> Sounds like an interesting project!
>
> You have multiple choices. Hadoop is at least two kinds of things: a
> distributed file system and a distributed computation engine with its
> own API. If you upload files to the file system Hadoop will distribute
> them in a safe way. The basic idea of the computation is that once the
> files are located, the program is distributed to the files. For
> processing you can use the Hadoop API which is based on the MapReduce
> paradigm, or you can use other tools built on Hadoop; I have
> experience with Apache Spark. Either way, you can read and write
> data from the file system or from other sources that have a Hadoop or
> Spark interface (Cassandra, MongoDB etc.). The basic workflow is that
> you submit a .jar file to Hadoop and it distributes the jar to the
> three nodes. You can even exploit the multicore nature of these nodes:
> Hadoop and Spark are very effective at multithreaded processing out of
> the box, and you do not need to put the usual Java multithread handling
> parts into your code. The output is also stored in the Hadoop file system
> by default.
>
> If you plan to use XSEDE, I suggest you check out this project:
>
> https://www.xsede.org/news/science-stories/-/asset_publisher/9JovW1UTN10Q/content/ruby-mendenhall-charts-progress-using-hpc-big-data-to-flag-unidentified-historical-sources-on-african-american-women%E2%80%99s-lives/10165
> .
> Prof. Ruby Mendenhall used XSEDE to do distant reading (for those who
> don't know, this term was coined by Franco Moretti, a Stanford
> professor) on HathiTrust and JSTOR data to reveal traces of Black
> women's lives in the first half of the 19th century.
>
> I am happy to share some more technical tips with you if needed.
>
> Best,
> Péter
>
> Eric Lease Morgan <[log in to unmask]> wrote (on Mon, Dec 17, 2018,
> 18:53):
> >
> > What is your experience with Apache Hadoop?
> >
> > I have very recently been granted root privileges on as many as three
> virtual machines. Each machine has forty-four cores, and more hard disk
> space & RAM than I really know how to exploit. I got access to these
> machines to work on a project I call The Distant Reader, and The Distant
> Reader implements a lot of map/reduce computing.†
> >
> > Can I use Apache Hadoop to accept jobs on one machine, send them to either of
> the other two machines, and then save the results in some sort of
> common/shared file system?
> >
> > † In reality, The Distant Reader is ultimately intended to be an XSEDE
> science gateway --> https://www.xsede.org. The code for the Reader is
> available on GitHub --> https://github.com/ericleasemorgan/reader
> >
> > --
> > Eric Morgan
> > University of Notre Dame
>
>
>
> --
> Péter Király
> software developer
> GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
> http://linkedin.com/in/peterkiraly
>