
  Andromeda Yelton writes

> I'm doing a project to prototype machine-learning-driven interfaces
> to MIT's thesis collection, and my preprocessing step would really
> benefit from a tokenizer that is aware of common multi-word
> scientific tokens

  In my machine-learning work with the RePEc digital library, I use
  the "Keywords: " field of free keywords in the RePEc records. I
  parse those for multi-word terms and then scan texts for them. You
  may have bibliographic database sources that you can mine for
  similar metadata fields. Without having gathered systematic
  evidence, however, my impression is that introducing these
  multi-word phrases did not significantly improve performance for
  most of my users. If you use sophisticated machine learning you
  can probably skip such preprocessing. In my recent work on PubMed
  I don't use such multi-word terms at all.
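  For illustration, a minimal sketch of that kind of preprocessing.
  The "Keywords:" field name and the semicolon/comma separators are
  assumptions for the example, not the actual RePEc record layout:

```python
import re

def multiword_terms(records):
    """Collect multi-word phrases from free-keyword metadata lines.

    Each record is assumed to be a string that may contain a line
    such as 'Keywords: inertial mass; gravity; dark matter'.
    """
    terms = set()
    for record in records:
        match = re.search(r"^Keywords:\s*(.+)$", record, re.MULTILINE)
        if not match:
            continue
        for keyword in re.split(r"[;,]", match.group(1)):
            keyword = keyword.strip().lower()
            if len(keyword.split()) > 1:   # keep only multi-word terms
                terms.add(keyword)
    return terms
```

  The resulting set can then be matched against the thesis texts in
  a later pass.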

> (e.g. "inertial mass" should definitely be one token, not two).

  I'd say it has "inertial", "mass" and "inertial mass".
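  A tokenizer along those lines would emit the unigrams plus the
  known phrase. A minimal sketch, with bigram-only matching against
  a supplied phrase list as a deliberate simplification:

```python
def tokenize_with_phrases(text, phrases):
    """Emit single-word tokens plus any known multi-word phrases.

    'inertial mass' yields 'inertial', 'mass', and 'inertial mass'
    when the phrase appears in the supplied phrase set.
    """
    words = text.lower().split()
    tokens = list(words)
    # scan every bigram; longer phrases omitted for brevity
    for i in range(len(words) - 1):
        bigram = " ".join(words[i:i + 2])
        if bigram in phrases:
            tokens.append(bigram)
    return tokens
```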


-- 

  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel