As a rule of thumb, to use word embedding effectively, how big ought my corpus to be, and what ought the size of each word's vector to be?

I have been experimenting with word embedding (a la word2vec). [1] My corpus (a subset of the EEBO, ECCO, and Sabin collections) contains approximately 2.3 billion words. It can be logically & easily sub-divided into smaller corpora. Even considering the high-performance computing resources at my disposal, creating a word2vec binary file is not trivial. The process requires a lot of RAM, disk space, and CPU cycles. Once I create a word2vec binary file, I can easily query it with the word2vec tools or with a library such as the one provided by Gensim. [2] I am getting interesting results. For example, based on models created from different centuries of content, I can demonstrate changes in politics as well as changes in the definition of love.
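As an illustration of the querying step, here is a minimal sketch using Gensim's KeyedVectors. It assumes Gensim 4.x, and the two binary file names (one model per century) are made up for the example:

# A minimal sketch, assuming Gensim 4.x and two hypothetical
# word2vec binary files, one built from each century of content.
from gensim.models import KeyedVectors

# Load the pre-computed binary models (the file names are placeholders).
model_1700s = KeyedVectors.load_word2vec_format("corpus-1700s.bin", binary=True)
model_1800s = KeyedVectors.load_word2vec_format("corpus-1800s.bin", binary=True)

# Compare the nearest neighbors of a word across the two models.
for label, model in [("1700s", model_1700s), ("1800s", model_1800s)]:
    print(label)
    for word, similarity in model.most_similar("love", topn=10):
        print(f"  {word}\t{similarity:.3f}")

Comparing the neighborhoods of the same word across models is what lets me see the shifts in politics and in the definition of love mentioned above.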

I want to use word embedding on smaller corpora, but I don't know how small is too small. Nor do I have an idea of how large each word's vector must be in order to be useful. To what degree will word embedding work on something the size of a novel, and if it can be effective on a document that small, then what might be a recommended vector size when creating the model? Similarly, if my corpus is a billion words in size, then how many dimensions ought each vector to have?
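One way I could imagine getting a feel for the answer is to sweep the vector size empirically and eyeball the resulting neighborhoods. Below is a rough Gensim (4.x) sketch; the corpus file name, frequency cutoff, and candidate sizes are placeholders of mine, not recommendations:

# A rough sketch, assuming Gensim 4.x; "novel.txt" and the parameter
# values are placeholders, not recommendations.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One sentence per line, whitespace-tokenized.
sentences = LineSentence("novel.txt")

# Train several models, varying only the dimensionality of the vectors.
for vector_size in (25, 50, 100, 300):
    model = Word2Vec(
        sentences=sentences,
        vector_size=vector_size,  # called "size" in Gensim 3.x
        window=5,
        min_count=2,   # a small corpus may need a lower frequency cutoff
        workers=4,
        epochs=20,     # more passes can help when data is scarce
    )
    print(vector_size, model.wv.most_similar("love", topn=5))

But eyeballing only goes so far, hence my question about rules of thumb.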

Fun with natural language processing and machine learning.

[1] word2vec - https://github.com/tmikolov/word2vec
[2] Gensim - https://radimrehurek.com/gensim/models/word2vec.html

--
Eric Morgan