# Dr. Rutu Mulkar-Mehta

## Online Word2Vec for Gensim

deep-learning word2vec gensim

Word2Vec [1] is a technique for creating vector representations of words that capture their syntax and semantics. These word vectors have several interesting properties; here are a few:

• Addition and subtraction of vectors reflect how word semantics are captured: e.g. $king - man + woman = queen$. This example shows that the relationship between the semantics of $king$ and $queen$ is captured by the word vectors.

• Similar words have similar word vectors: e.g. $king$ is most similar to $queen$, $duke$, and $duchess$.
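Both properties are easy to illustrate with toy vectors. The numbers below are hand-crafted for illustration only, not learned embeddings (real word2vec vectors typically have 100–300 dimensions), but they show the arithmetic and the cosine-similarity lookup involved:

```python
import numpy as np

# Hand-crafted 4-d toy vectors -- NOT learned embeddings, just an
# illustration of the arithmetic; real vectors have 100-300 dimensions.
vectors = {
    "king":    np.array([0.9, 0.8, 0.1, 0.2]),
    "queen":   np.array([0.9, 0.1, 0.8, 0.2]),
    "man":     np.array([0.1, 0.9, 0.1, 0.1]),
    "woman":   np.array([0.1, 0.2, 0.8, 0.1]),
    "duke":    np.array([0.8, 0.7, 0.2, 0.3]),
    "duchess": np.array([0.8, 0.2, 0.7, 0.3]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # prints "queen"
```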

Here is the description of Gensim Word2Vec, along with a few blog posts that describe how to use it: Deep Learning with Word2Vec, Deep learning with word2vec and gensim, Word2Vec Tutorial, Word2vec in Python, Part Two: Optimizing, Bag of Words Meets Bags of Popcorn.

One of the limitations of the Word2Vec implementation is that it cannot add more words to the vocabulary after the initial training. This 'frozen vocabulary' approach does not work in situations where we need to train the model in an online manner, adding and training on new words as they are encountered. Here is a quick description of an online algorithm.

In this post, I will discuss an online word2vec implementation that I have developed and how to use it to update the vocabulary and learn new word vectors in an online manner. I maintain the code here: https://github.com/rutum/gensim

How to use online word2vec:

1) Download or clone the code from https://github.com/rutum/gensim

2) On your local machine, browse to the location of the downloaded code and install it by typing:
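The install command itself does not appear in this copy of the post; assuming the fork uses gensim's standard `setup.py`-based layout, the install step would look something like this (run from inside the downloaded directory; `sudo` may be needed depending on your Python setup):

```shell
# from inside the downloaded/cloned gensim directory
python setup.py install

# or, equivalently, with pip:
pip install .
```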

3) Now run the following lines of code from IPython or a separate Python file:

OK. So far so good.

You will notice that I did some more evaluation on this data, testing the model against the question set that Google released to measure how well the vectors capture syntactic and semantic relationships between words. As text8 is a small dataset, we don't expect very high accuracy on this task; however, it will help us discern the difference between learning words in an online manner and learning them all in one sitting. You can download the script that I ran from here

Now let's update the model with all the sentences containing $queen$ and see if the vector for $queen$ is similar to those of $king$ and $duke$. Notice that the `build_vocab` function now has an additional argument `update=True`, which adds more words to the existing vocabulary.

BINGO! Looks like it learned the vector for $queen$ quite well.