I'm doing a project to prototype machine-learning-driven interfaces to MIT's thesis collection, and my preprocessing step would really benefit from a tokenizer that is aware of common multi-word scientific terms (e.g. "inertial mass" should definitely be one token, not two).

My somewhat cursory research didn't turn any up, and a conversation in the code4lib Slack just now shows I'm not the only one with this problem. Does anyone have anything handy to suggest?

Thanks.

--
Andromeda Yelton
Senior Software Engineer, MIT Libraries: https://libraries.mit.edu/
President, Library & Information Technology Association: http://www.lita.org
http://andromedayelton.com
@ThatAndromeda <http://twitter.com/ThatAndromeda>