These days, in my industry experience, the predominant choice for distributed work is Spark rather than Hadoop.
Since many of you have your catalog in Solr (and Blacklight, blush, thank you), you could straightforwardly leverage our open source spark-solr library. https://github.com/LucidWorks/spark-solr
Slicing, dicing, and processing what you have in Solr has never been this powerful. And Spark can connect to RDBMS backends and other repositories as well.
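
For a feel of what that looks like, here is a minimal Scala sketch of reading a Solr collection into a Spark DataFrame with spark-solr and running a simple aggregation. The ZooKeeper address, collection name, and field name below are placeholders, not anything from your setup, so adjust them accordingly:

// Minimal sketch: read a Solr collection into a Spark DataFrame via spark-solr,
// then aggregate. The zkhost, collection, and "format" field are assumed values.
import org.apache.spark.sql.SparkSession

object SolrCatalogDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("solr-catalog-demo")
      .getOrCreate()

    // spark-solr registers a "solr" data source; zkhost and collection are its
    // core options, and "query" limits which documents get pulled over.
    val docs = spark.read
      .format("solr")
      .option("zkhost", "localhost:9983")   // assumed ZooKeeper address
      .option("collection", "catalog")      // assumed collection name
      .option("query", "*:*")
      .load()

    // Any stored Solr field becomes a DataFrame column; count records per format.
    docs.groupBy("format").count().show()

    spark.stop()
  }
}

From there the full Spark API (DataFrames, SQL, MLlib) is available over your index, and results can be written back to Solr, to files, or to a database.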
Erik
> On Dec 17, 2018, at 12:52, Eric Lease Morgan <[log in to unmask]> wrote:
>
> What is your experience with Apache Hadoop?
>
> I have very recently been granted root privileges on as many as three virtual machines. Each machine has forty-four cores, and more hard disk space & RAM than I really know how to exploit. I got access to these machines to work on a project I call The Distant Reader, and The Distant Reader implements a lot of map/reduce computing.†
>
> Can I use Apache Hadoop to accept jobs on one machine, send them to either of the other two machines, and then save the results in some sort of common/shared file system?
>
> † In reality, The Distant Reader is ultimately intended to be an XSEDE science gateway --> https://www.xsede.org. The code for the Reader is available on GitHub --> https://github.com/ericleasemorgan/reader
>
> --
> Eric Morgan
> University of Notre Dame