I'm doing a project to prototype machine-learning-driven interfaces to
MIT's thesis collection, and my preprocessing step would really benefit
from a tokenizer that is aware of common multi-word scientific tokens (e.g.
"inertial mass" should definitely be one token, not two).
My somewhat cursory research didn't turn up any such tokenizer, and a
conversation in the code4lib Slack just now shows I'm not the only one with
this problem. Does anyone have anything handy to suggest? Thanks.
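
For concreteness, the behavior described above (keeping "inertial mass"
as one token) could be sketched roughly as follows. This is only an
illustration, not an existing tool: the phrase list, the `tokenize` function,
and the greedy longest-match strategy are all assumptions made up for the
example, and a real phrase list would have to come from some domain
vocabulary.

```python
import re

# Hypothetical phrase list, for illustration only; each phrase is a tuple of
# lowercase words. A real list would come from a scientific vocabulary.
PHRASES = {
    ("inertial", "mass"),
    ("black", "hole"),
    ("signal", "to", "noise", "ratio"),
}
MAX_LEN = max(len(p) for p in PHRASES)


def tokenize(text, phrases=PHRASES, max_len=MAX_LEN):
    """Split text into words, merging known multi-word phrases into one token."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        # Greedy longest match: try the longest candidate phrase first.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in phrases:
                tokens.append(" ".join(candidate))
                i += n
                break
        else:
            # No phrase starts at this position; emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens


print(tokenize("The inertial mass of a black hole"))
```

A lookup like this only recognizes phrases it already knows about, so the
hard part remains finding or building a good list of scientific multi-word
terms.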
Senior Software Engineer, MIT Libraries: https://libraries.mit.edu/
President, Library & Information Technology Association: http://www.lita.org