Hello,
Does anyone here have experience with natural language processing (NLP), particularly in combination with information retrieval techniques?
I'm currently trying to develop a tool that will:
1. take a PDF and extract the text (ignoring images and formatting)
2. analyze the text via term weighting, inverse document frequency, and other natural language processing techniques
3. assemble a list of suggested terms and concepts that are weighted heavily in that document
Step 1 is straightforward and I've had much success there. Step 2 is the problem child. I've played around with a few APIs (like AlchemyAPI) but they have character length limitations or other shortcomings that keep me looking.
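To make steps 2 and 3 concrete, here is a rough sketch of the kind of weighting I have in mind, done locally so there is no character limit. It assumes the text has already been extracted to plain strings, and it uses the simplest textbook form of TF-IDF (relative term frequency times log(N / document frequency)). The function names `tokenize` and `top_terms` and the sample documents are my own placeholders, not from any library, and a real version would want stop-word removal and stemming.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(docs, doc_index, k=5):
    """Return the k highest-scoring (term, tf-idf) pairs for docs[doc_index].

    Assumption: plain TF-IDF with tf = count / document length and
    idf = log(N / df). Other weighting variants exist.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Term frequency within the document of interest.
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    # Hypothetical stand-ins for text extracted from three PDFs.
    sample_docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog",
    ]
    for term, score in top_terms(sample_docs, 0, k=3):
        print(f"{term}\t{score:.3f}")
```

Terms that appear in every document (like "the" above) get a score of zero, while terms unique to one document rise to the top, which is roughly the behavior I want for suggesting terms per PDF.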
The background to this project is that I work for a digital library with a large pre-existing collection of PDFs that have only rudimentary metadata. The tool will be used to classify and group the PDFs according to the themes of the library. Our CMS is Drupal, so depending on my level of ambition, this *might* develop into a module.
Does this sound like something that has been done or attempted before? Any suggested tools or reading material?