Andromeda Yelton writes
> I'm doing a project to prototype machine-learning-driven interfaces
> to MIT's thesis collection, and my preprocessing step would really
> benefit from a tokenizer that is aware of common multi-word
> scientific tokens
In my machine-learning work with the RePEc digital library, I use
the "Keywords: " field of free keywords in the RePEc records. I
parse that field for multi-word terms and then scan documents for
those terms. You may have bibliographic database sources with
similar metadata fields that you can mine. Without having gathered
systematic evidence, my impression is that introducing these
multi-word phrases did not significantly improve performance for
most of my users. If you use sufficiently sophisticated machine
learning, you can probably skip such preprocessing. In my recent
work on PubMed I don't use such multi-word terms at all.
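
For what it's worth, a minimal sketch of that idea in Python. The
field layout and delimiters here are assumptions for illustration,
not the actual RePEc record format:

    import re

    def extract_multiword_terms(records):
        """Collect multi-word phrases from free-keyword fields.

        Assumes each record is a string containing a line such as
        'Keywords: inertial mass; general relativity' -- the exact
        field layout in your data may differ.
        """
        terms = set()
        for record in records:
            match = re.search(r"^Keywords:\s*(.+)$", record, re.MULTILINE)
            if not match:
                continue
            for keyword in re.split(r"[;,]", match.group(1)):
                keyword = keyword.strip().lower()
                if " " in keyword:          # keep only multi-word terms
                    terms.add(keyword)
        return terms

    def scan_for_terms(text, terms):
        """Return the multi-word terms that occur in the text."""
        lowered = text.lower()
        return {t for t in terms if t in lowered}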
> (e.g. "inertial mass" should definitely be one token, not two).
I'd say it should emit three tokens: "inertial", "mass" and
"inertial mass".
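
Something along these lines, to make that concrete; the phrase list
and tokenizer here are hypothetical, just to show the overlapping
output, not a production tokenizer:

    def tokenize_with_phrases(text, phrases):
        """Emit single-word tokens plus any known multi-word phrases.

        'phrases' is a set of lowercased multi-word terms,
        e.g. {"inertial mass"}.
        """
        words = text.lower().split()
        tokens = list(words)
        for n in (2, 3):                  # check bigrams and trigrams
            for i in range(len(words) - n + 1):
                candidate = " ".join(words[i:i + n])
                if candidate in phrases:
                    tokens.append(candidate)
        return tokens

    # tokenize_with_phrases("Inertial mass is conserved", {"inertial mass"})
    # -> ['inertial', 'mass', 'is', 'conserved', 'inertial mass']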
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel