Hi Katie,
I've been playing with natural language processing in both Python and R. There are lots of books and webpages out there with advice, but for me it's easy to get sucked into doing a manipulation that you *can* do instead of what you *should* do to answer your research question (in my case) or for business purposes (sounds like your case).

I just saw the post mentioning Perl - from what I've seen, it looks a lot easier in Python with NLTK and other packages.
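
For instance, just as a sketch (assuming NLTK is installed and its 'punkt' and 'stopwords' data have been downloaded; the input file name is a placeholder), pulling the most frequent content words out of a document is only a few lines:

    import nltk
    from collections import Counter
    from nltk.corpus import stopwords

    # one-time setup: nltk.download('punkt'); nltk.download('stopwords')
    text = open('sample.txt').read()               # placeholder input file
    tokens = nltk.word_tokenize(text.lower())
    stops = set(stopwords.words('english'))
    words = [t for t in tokens if t.isalpha() and t not in stops]
    print(Counter(words).most_common(20))          # 20 most frequent content words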

Christina

------
Christina K. Pikas
Librarian
The Johns Hopkins University Applied Physics Laboratory
Baltimore: 443.778.4812
D.C.: 240.228.4812
[log in to unmask]




-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Katie
Sent: Tuesday, July 01, 2014 9:13 AM
To: [log in to unmask]
Subject: [CODE4LIB] Natural language programming

Hello,

Does anyone here have experience with natural language processing (combined with information retrieval techniques)?

I'm currently trying to develop a tool that will:

1. take a pdf and extract the text (paying no attention to images or formatting) - see the sketch after this list
2. analyze the text via term weighting, inverse document frequency, and other natural language processing techniques
3. assemble a list of suggested terms and concepts that are weighted heavily in that document
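
For step 1, something along these lines works (just a sketch; it assumes pdfminer.six is installed, and the file name is a placeholder):

    from pdfminer.high_level import extract_text

    # pip install pdfminer.six -- extract_text returns the plain text,
    # ignoring images and most of the layout
    text = extract_text('example.pdf')   # placeholder file name
    print(text[:200])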

Step 1 is straightforward and I've had much success there. Step 2 is the problem child. I've played around with a few APIs (like AlchemyAPI), but they have character-length limitations or other shortcomings that keep me looking.
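
To be concrete about the weighting in step 2, a purely offline approach would avoid the character limits entirely. A minimal sketch (assuming scikit-learn is installed; the document strings are placeholders for the extracted pdf texts) might look like:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # placeholder corpus; in practice these would be the extracted pdf texts
    docs = ["text of the first pdf", "text of the second pdf", "text of a third pdf"]

    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf = vectorizer.fit_transform(docs)        # documents x terms sparse matrix

    # ten most heavily weighted terms in the first document
    terms = vectorizer.get_feature_names_out()
    row = tfidf[0].toarray().ravel()
    top = row.argsort()[::-1][:10]
    print([(terms[i], round(float(row[i]), 3)) for i in top])

From there, step 3 would just be keeping the top-weighted terms for each document.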

The background to this project is that I work for a digital library with a large pre-existing collection of pdfs with rudimentary metadata. The aforementioned tool will be used to classify and group the pdfs according to the themes of the library. Our CMS is Drupal, so depending on my level of ambition, this *might* develop into a module.

Does this sound like a project that has been done/attempted before? Any suggested tools or reading materials?