I'm doing a project to prototype machine-learning-driven interfaces to
MIT's thesis collection, and my preprocessing step would really benefit
from a tokenizer that is aware of common multi-word scientific tokens (e.g.
"inertial mass" should definitely be one token, not two).
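
To make the desired behavior concrete, here is a minimal sketch of a greedy
longest-match phrase tokenizer. Everything in it is an assumption for
illustration: the PHRASES list is a hypothetical hand-made stand-in for a real
scientific vocabulary, and the tokenize function and the underscore-joining
convention are my own choices, not an existing library's API.

```python
import re

# Hypothetical phrase list, for illustration only; a real one would come from
# a domain vocabulary or be learned from the corpus.
PHRASES = {("inertial", "mass"), ("machine", "learning")}
MAX_LEN = max(len(p) for p in PHRASES)


def tokenize(text):
    """Split text into lowercase word tokens, merging known multi-word phrases."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in PHRASES:
                tokens.append("_".join(words[i:i + n]))
                i += n
                break
        else:
            # No known phrase starts here, so emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens


print(tokenize("The inertial mass of a body"))
```

The hard part, of course, is the phrase list itself, which is what I am
hoping someone has already built for scientific text.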

My somewhat cursory research didn't turn any up, and a conversation in the
code4lib Slack just now shows I'm not the only one with this problem. Does
anyone have anything handy to suggest? Thanks.

-- 
Andromeda Yelton
Senior Software Engineer, MIT Libraries: https://libraries.mit.edu/
President, Library & Information Technology Association: http://www.lita.org
http://andromedayelton.com
@ThatAndromeda <http://twitter.com/ThatAndromeda>