Andromeda Yelton writes

> I'm doing a project to prototype machine-learning-driven interfaces
> to MIT's thesis collection, and my preprocessing step would really
> benefit from a tokenizer that is aware of common multi-word
> scientific tokens

In my machine-learning work with the RePEc digital library, I use the
"Keywords: " field of free keywords in the RePEc records. I parse those
for multi-word terms and then scan the text for them. You may have
bibliographic database sources that you can mine for similar metadata
fields.

Without having gathered systematic evidence, however, it feels like
introducing these multi-word phrases did not significantly improve
performance for most of my users. If you use sophisticated machine
learning, you can probably skip such preprocessing. In my recent work
on PubMed I don't use such multi-word terms at all.

> (e.g. "inertial mass" should definitely be one token, not two).

I'd say it has "inertial", "mass" and "inertial mass".

--

  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel
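P.S. A minimal sketch of the tokenization I am suggesting, for
illustration only (the function name, the lowercasing, and the
restriction to two-word phrases are my simplifying assumptions, not
anything from RePEc):

```python
import re

# Hypothetical illustration: emit each word as a token, and
# additionally emit any known multi-word phrase it takes part in,
# so "inertial mass" yields "inertial", "mass" and "inertial mass".
def tokenize(text, phrases):
    words = re.findall(r"[a-z]+", text.lower())
    tokens = list(words)
    # scan adjacent word pairs against the known phrase list
    for a, b in zip(words, words[1:]):
        candidate = a + " " + b
        if candidate in phrases:
            tokens.append(candidate)
    return tokens

print(tokenize("The inertial mass of the body", {"inertial mass"}))
# ['the', 'inertial', 'mass', 'of', 'the', 'body', 'inertial mass']
```

The phrase set would come from whatever metadata field you mine, e.g.
the free keywords mentioned above.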