Word2Vec [1] is a technique for learning vector representations of words that capture their syntax and semantics. These word vectors have several interesting properties; here are a few:
Addition and subtraction of word vectors capture word semantics:
e.g. $king - man + woman = queen$
This example shows that the semantics of $king$ and $queen$ are well captured by their word vectors.
Similar words have similar word vectors: e.g. $king$ is most similar to $queen$, $duke$, and $duchess$. (A short sketch of these queries follows this list.)
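With a pretrained model loaded through gensim's downloader API, these properties can be queried directly. This is just a sketch; the model name below (word2vec-google-news-300) is an example and is not something used later in this post:

```python
import gensim.downloader as api

# Load a pretrained Word2Vec model (example model name; any Word2Vec-style
# model exposing KeyedVectors works).
wv = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Words most similar to "king"
print(wv.most_similar("king", topn=5))
```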
One of the limitations of the Word2Vec algorithm is that it cannot add new words to the vocabulary after the initial training. This "frozen vocabulary" approach does not work in situations where we need to train the model in an online manner, adding and training on new words as they are encountered. Briefly, an online algorithm is one that processes its input piece by piece as it arrives, rather than requiring the entire input to be available up front.
In this post, I will discuss an online word2vec implementation that I have developed and how to use it to update the vocabulary and learn new word vectors in an online manner. I maintain the code here: https://github.com/rutum/gensim
1) Download or clone the code from the repository linked above.
2) On your local machine, browse to the location of the downloaded code and install it by typing:
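The exact install command is not shown above; for a source checkout of gensim from that era, installation is typically something like the following (or `pip install .` from the repository root):

```
python setup.py install
```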
3) Now run the following lines of code from IPython or a separate Python file:
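The original code listing is not reproduced here, but a minimal sketch of this initial training step, assuming gensim's standard Word2Vec API and the text8-rest file described later in this post (the text8 sentences that do not contain the word queen), looks like this; attribute and parameter names may vary slightly between gensim versions:

```python
from gensim.models.word2vec import Word2Vec, LineSentence

# Initial (offline) training on the sentences that do NOT contain "queen".
# "text8-rest" is the split of text8 described later in this post.
sentences = list(LineSentence('text8-rest'))

model = Word2Vec(min_count=5)      # default vector size and window
model.build_vocab(sentences)       # build the initial, "frozen" vocabulary
model.train(sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs)

# Sanity check: "queen" is not in the vocabulary yet.
print(model.wv.most_similar('king', topn=5))
```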
OK. So far so good.
You will notice that I did some more evaluation on this data by testing it against the word-analogy question set that Google released, to compute the syntactic and semantic relationships between words. As text8 is a small dataset, we don't expect it to achieve very high accuracy on this task; however, it will help us see the difference between learning words in an online manner and learning everything in one sitting. You can download the script that I ran from here.
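The script itself is not reproduced here; a rough equivalent using gensim's built-in analogy evaluation looks like the following (questions-words.txt is the standard Google analogy question file; older gensim versions exposed this as model.accuracy() instead):

```python
# Evaluate the trained vectors on the Google word-analogy question set.
# Adjust the path to wherever you keep questions-words.txt.
score, sections = model.wv.evaluate_word_analogies('questions-words.txt')
print('overall analogy accuracy: %.4f' % score)
```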
Now let's update the model with all the sentences containing queen and see if the vector for $queen$ is similar to those of $king$ and $duke$. Notice that the build_vocab function now takes an additional argument, update=True, which adds the new words to the existing vocabulary.
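A minimal sketch of this update step, continuing from the model trained above (file names follow the text8 split described below):

```python
# Online update: add the held-out "queen" sentences to the existing model.
new_sentences = list(LineSentence('text8-queen'))

model.build_vocab(new_sentences, update=True)   # grow the existing vocabulary
model.train(new_sentences,
            total_examples=len(new_sentences),
            epochs=model.epochs)

# "queen" should now sit close to "king", "duke", etc.
print(model.wv.most_similar('queen', topn=5))
```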
BINGO! It looks like the model learned the vector for $queen$ quite well.
Here is how the files are divided: all sentences from text8 that contain queen are in text8-queen, and the remaining sentences are in text8-rest. The file text8-all is a concatenation of text8-rest and text8-queen.
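One simple way to produce such a split (a sketch that assumes text8 is chunked into fixed-length "sentences" the way gensim's Text8Corpus does it; the original post may have split the data differently) is:

```python
from gensim.models.word2vec import Text8Corpus

# Split text8 into "sentences" (fixed-length word chunks) and route each
# chunk depending on whether it contains the word "queen".
with open('text8-queen', 'w') as f_queen, open('text8-rest', 'w') as f_rest:
    for sentence in Text8Corpus('text8'):
        out = f_queen if 'queen' in sentence else f_rest
        out.write(' '.join(sentence) + '\n')
```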
Here are the output accuracies achieved if we train the entire model in one go, as opposed to piecemeal in an online manner. Note that, since the amount of data we are using is small, the accuracy will vary slightly with the initialization parameters.
As you can see, the output score drops a little when the model is updated in an online manner, as opposed to training everything in one go. The PR for my code can be found here: https://github.com/piskvorky/gensim/pull/435