Hi Eric,

My experience is extremely limited (one small set of tests), but I wanted to share a couple of quick book recommendations that helped me run MapReduce jobs in Hadoop.

First, with a shout-out to the Spark in the Dark reading club that formed after last year's conference, see Chapter 10, "Batch Processing," of Designing Data-Intensive Applications [1]. It is a good introduction to the basic design of MapReduce and how it is rooted in the Unix philosophy.

After that, I worked through the first few chapters of Hadoop in Action [2], a tutorial-style, follow-along coding book. It is a bit outdated now, but if you want to run simple Hadoop jobs it will get you up and running. Another good thing about it is that, in addition to Java, it walks you through a few examples in which the data processing is done either by Unix command-line jobs or by very simple Python scripts. I found this very instructive: for jobs that need distributed processing but are computationally simple, you can see what the alternative to working in Java looks like. I ended up using Java as well, but comprehending all three approaches really solidified my understanding of how Hadoop works.
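To give you a flavor of the streaming approach, here is a minimal, hypothetical word-count pair of the sort those chapters walk through. The file names and the word-count task are my own illustration, not taken from the book, but the contract is the whole story: read lines on stdin, write tab-separated key/value pairs on stdout, and let Hadoop (or a plain Unix pipe, for local testing) handle everything in between.

  #!/usr/bin/env python
  # mapper.py -- minimal Hadoop Streaming mapper (hypothetical example)
  # Reads raw text lines from stdin and emits "word<TAB>1" pairs on stdout.
  import sys

  for line in sys.stdin:
      for word in line.strip().split():
          print("%s\t%d" % (word.lower(), 1))

  #!/usr/bin/env python
  # reducer.py -- minimal Hadoop Streaming reducer (hypothetical example)
  # Hadoop sorts the mapper's output by key before this script runs, so all
  # the lines for a given word arrive together and can be summed in one pass.
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t")
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print("%s\t%d" % (current_word, current_count))
          current_word, current_count = word, int(count)
  if current_word is not None:
      print("%s\t%d" % (current_word, current_count))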

Finally, if you are completely new, I recommend paying close attention to the fact that Hadoop MR jobs include an explicit sort step between the map step and the reduce step. This is one of the key bits that makes the programming paradigm possible, but it is easy to miss because 1) it is not called Map/Sort/Reduce, and 2) Hadoop sorts your map output by default (I think), so while you write map code and reduce code, you never explicitly program the sort step. The paradigm is built around what you can do while streaming through sorted data (i.e., it can be processed ridiculously fast with low memory), so it has a big impact on how you structure your code. Once I realized this, the concept of moving code to the data clicked.
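If it helps to see it, here is a toy, in-process simulation of that map/sort/reduce cycle (my own sketch, not Hadoop code). The punch line is that once the map output is sorted by key, the reduce step only ever needs to hold one key's running total in memory, no matter how large the input is.

  # toy map/sort/reduce simulation (illustrative sketch, not Hadoop code)
  lines = ["the cat sat", "the cat ran"]

  # map: emit (key, value) pairs
  mapped = [(word, 1) for line in lines for word in line.split()]

  # sort: Hadoop performs this step for you between map and reduce
  mapped.sort(key=lambda pair: pair[0])

  # reduce: stream through the sorted pairs, one key at a time
  current, total = None, 0
  for word, count in mapped:
      if word != current:
          if current is not None:
              print(current, total)
          current, total = word, 0
      total += count
  if current is not None:
      print(current, total)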

Not sure if this is what you were asking for, but this is my don't-fight-the-framework experience, similar to letting go of certain programming tasks in a model-view-controller framework like Rails.

-Steve

[1] https://dataintensive.net/

[2] https://www.manning.com/books/hadoop-in-action

> On Dec 17, 2018, at 1:20 PM, Roy Tennant <[log in to unmask]> wrote:
> 
> Péter provided a good start; I just wanted to mention that using the
> "streaming" option you can write code in pretty much whatever you want,
> certainly Python and Perl. I've even mixed and matched, where my mapping
> program is in Python and my reducing program (optional since the mapper
> might just be writing data out to disk that doesn't need reducing) in Perl.
> Hadoop doesn't really care, although it is inherently Java.
> 
> Also, you want to make sure to know how to check the logging, as not only
> is it helpful to monitor the processes, but it's essential for debugging.
> Sometimes I think I've crashed as many times as I've had jobs run
> successfully...but then that's just me. ;-)
> Roy
> 
> On Mon, Dec 17, 2018 at 10:27 AM Péter Király <[log in to unmask]> wrote:
> 
>> Hi Eric,
>> 
>> sounds like an interesting project!
>> 
>> You have multiple choices. Hadoop is at least two kinds of things: a
>> distributed file system and a distributed computation engine with its
>> own API. If you upload files to the file system, Hadoop will distribute
>> them in a safe way. The basic idea of the computation is that once the
>> files are in place, the program is distributed to the files. For
>> processing you can use the Hadoop API, which is based on the MapReduce
>> paradigm, or you can use other tools built on top of Hadoop; I have
>> experience with Apache Spark. Either way, you can read and write data
>> from the file system or from other sources that have a Hadoop or Spark
>> interface (Cassandra, MongoDB, etc.). The basic workflow is that you
>> submit a .jar file to Hadoop and it distributes the jar to the three
>> nodes. You can even exploit the multicore nature of these nodes: Hadoop
>> and Spark are very effective at multithreaded processing out of the
>> box, and you do not need to put the usual Java multithread-handling
>> code into your program. The output is also stored in the Hadoop file
>> system by default.
>> 
>> If you plan to use XSEDE I suggest you check out this project:
>> 
>> https://www.xsede.org/news/science-stories/-/asset_publisher/9JovW1UTN10Q/content/ruby-mendenhall-charts-progress-using-hpc-big-data-to-flag-unidentified-historical-sources-on-african-american-women%E2%80%99s-lives/10165
>> Prof. Ruby Mendenhall used XSEDE to do distant reading (for those who
>> don't know, the term was coined by Franco Moretti, a Stanford
>> professor) on HathiTrust and JSTOR data to reveal traces of Black
>> women's lives in the first half of the 19th century.
>> 
>> I am happy to share some more technical tips if needed.
>> 
>> Best,
>> Péter
>> 
>> Eric Lease Morgan <[log in to unmask]> wrote (on Monday, Dec 17, 2018,
>> at 18:53):
>>> 
>>> What is your experience with Apache Hadoop?
>>> 
>>> I have very recently been granted root privileges on as many as three
>>> virtual machines. Each machine has forty-four cores, and more hard disk
>>> space & RAM than I really know how to exploit. I got access to these
>>> machines to work on a project I call The Distant Reader, and The Distant
>>> Reader implements a lot of map/reduce computing.†
>>> 
>>> Can I use Apache Hadoop to accept jobs on one machine, send them to
>>> either of the other two machines, and then save the results in some
>>> sort of common/shared file system?
>>> 
>>> † In reality, The Distant Reader is ultimately intended to be an XSEDE
>>> science gateway --> https://www.xsede.org. The code for the Reader is
>>> available on GitHub --> https://github.com/ericleasemorgan/reader
>>> 
>>> --
>>> Eric Morgan
>>> University of Notre Dame
>> 
>> 
>> 
>> --
>> Péter Király
>> software developer
>> GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
>> http://linkedin.com/in/peterkiraly
>>