Alan,
if you are looking for data mining software that runs well in Hadoop, I would definitely recommend looking into Apache Mahout [1]. This software is specifically focused on categorization and clustering, and these algorithms tend to work well in the distributed architecture of a Hadoop-based system. If you are looking for parsers, taggers, or tokenizers, then a different system (GATE / OpenNLP / UIMA) would be more appropriate.

-Aaron

[1] http://mahout.apache.org


On Aug 27, 2013, at 7:47 PM, Alan Darnell <[log in to unmask]> wrote:

> Do any of these work in Hadoop using MapReduce as a programming model? It seems like Hadoop would be a natural use case for text mining and analysis.  
> 
> Alan
> 
> On Aug 27, 2013, at 7:44 PM, "Riley, Jenn" <[log in to unmask]> wrote:
> 
>> This is still command-line, but Mallet is heavily used in the DH
>> community: http://mallet.cs.umass.edu/. I think MONK
>> (http://monkproject.org/) has a UI, but I'm not overly familiar with its
>> features.
>> 
>> Jenn
>> 
>> --------------------------------
>> Jenn Riley
>> Head, Carolina Digital Library and Archives
>> The University of North Carolina at Chapel Hill
>> http://cdla.unc.edu/
>> http://www.lib.unc.edu/users/jlriley
>> 
>> [log in to unmask]
>> (919) 843-5910
>> 
>> 
>> 
>> 
>> 
>> On 8/27/13 11:24 AM, "Eric Lease Morgan" <[log in to unmask]> wrote:
>> 
>>> What sorts of text mining software do y'all support / use in your
>>> libraries?
>>> 
>>> We here in the Hesburgh Libraries at the University of Notre Dame have
>>> all but opened a place called the Center For Digital Scholarship. We are
>>> / will be providing a number of different services to a number of
>>> different audiences. These services include but are not necessarily
>>> limited exactly to:
>>> 
>>> * data management consultation
>>> * data analysis and visualization
>>> * geographic information systems support
>>> * text mining investigations
>>> * referrals to other "centers" across campus
>>> 
>>> I am expected to support the text mining investigations. I have
>>> traditionally used open source tools to do my work. Many of these tools
>>> require some sort of programming in order to exploit. To some degree I am
>>> expected to mount text mining software on our local Windows and Macintosh
>>> computers here in our Center. I am familiar with the lists of tools
>>> available at Bamboo as well as Hermeneuti.ca. [0, 1] TAPoRware is good
>>> too, but a bit long in the tooth. [2]
>>> 
>>> Do you know of other sets of tools to choose from? Are you familiar with
>>> SAS® Text Analytics, STATISTICA Data Miner, or RapidMiner? [3, 4, 5]
>>> 
>>> [0] Bamboo Dirt - http://dirt.projectbamboo.org
>>> [1] Hermeneuti.ca - http://hermeneuti.ca/voyeur/tools
>>> [2] TAPoRware - http://taporware.ualberta.ca
>>> [3] Text Analytics - http://www.sas.com/text-analytics/
>>> [4] Data Miner - http://www.statsoft.com/Products/STATISTICA/Data-Miner/
>>> [5] RapidMiner - http://rapid-i.com/content/view/181/190/
>>> 
>>> --
>>> Eric Lease Morgan, Digital Initiatives Librarian
>>> Hesburgh Libraries
>>> University of Notre Dame
>>> 
>>> 574/631-8604