Online Word2Vec for Gensim

Word2Vec [1] is a technique for learning vector representations of words that capture their syntax and semantics. These word vectors have several interesting properties; here are a few:

  • Addition and subtraction of vectors show how word semantics are captured: e.g. $king - man + woman = queen$. This example shows that the relationship between $king$ and $queen$ (the same as that between $man$ and $woman$) is encoded in the word vectors.

  • Similar words have similar word vectors: e.g. $king$ is most similar to $queen$, $duke$, and $duchess$. Both properties are illustrated in the snippet below.
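Both properties can be queried directly in gensim. Here is a minimal sketch, assuming model is an already trained Word2Vec model (in 2015-era gensim these methods live on the model itself; newer versions expose them through model.wv):

# analogy query: king - man + woman ~= ?
model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)

# nearest neighbours of a word
model.most_similar("king", topn=3)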

Here is the description of Gensim Word2Vec, along with a few blogs that describe how to use it: Deep Learning with Word2Vec, Deep learning with word2vec and gensim, Word2Vec Tutorial, Word2vec in Python, Part Two: Optimizing, Bag of Words Meets Bags of Popcorn.

One of the issues with the Word2Vec implementation is that it cannot add more words to the vocabulary after the initial training. This 'frozen vocabulary' approach does not work in situations where we need to train the model in an online manner, adding and training on new words as they are encountered. Here is a quick description of an online algorithm.

In this post, I will discuss an online word2vec implementation that I have developed and how to use it to update the vocabulary and learn new word vectors in an online manner. I maintain the code here: https://github.com/rutum/gensim

How to use online word2vec:

1) Download the source code from here: https://github.com/rutum/gensim

2) On your local machine, navigate to the location of the downloaded code and install it by typing:

# clean any existing install
sudo rm -rf build dist gensim/*.pyc

# installation
sudo python setup.py install
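To make sure Python picks up the patched install rather than a previously installed gensim, a quick sanity check like this can help (a minimal sketch):

import gensim

# should point at the freshly installed copy, not an old system-wide one
print(gensim.__file__)
print(gensim.__version__)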

3) Now run the following lines of code from IPython or a separate Python file:

import gensim.models

# setup logging
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# train the basic model with text8-rest, which contains all the sentences
# that do not contain the word "queen"
model = gensim.models.Word2Vec()
sentences = gensim.models.word2vec.LineSentence("text8-rest")
model.build_vocab(sentences)
model.train(sentences)
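In a real online workflow the partially trained model usually has to be persisted between updates. Gensim models can be saved and reloaded before new data arrives; a minimal sketch (the file name is illustrative):

# persist the model trained on text8-rest so it can be updated later
model.save("w2v-text8-rest.model")

# ... later, possibly in a different session:
model = gensim.models.Word2Vec.load("w2v-text8-rest.model")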

# Evaluation
>>> model.n_similarity(["king"], ["duke"])
0.68208604377750204
>>> model.n_similarity(["king"], ["queen"])
KeyError: 'queen'
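The KeyError is expected: queen never occurs in text8-rest, so the model has no vector for it yet. This is easy to confirm (in the gensim version this post targets, the vocabulary is a dict on the model itself; newer releases expose it as model.wv.vocab):

>>> "king" in model.vocab
True
>>> "queen" in model.vocab
False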

# text8-rest:
>>> model.accuracy("questions-words.txt")

2015-08-21 10:56:49,781 : INFO : precomputing L2-norms of word weight vectors
2015-08-21 10:56:56,346 : INFO : capital-common-countries: 33.2% (168/506)
2015-08-21 10:57:12,728 : INFO : capital-world: 17.5% (254/1452)
2015-08-21 10:57:15,807 : INFO : currency: 6.0% (16/268)
2015-08-21 10:57:32,402 : INFO : city-in-state: 15.0% (235/1571)
2015-08-21 10:57:35,197 : INFO : family: 50.0% (136/272)
2015-08-21 10:57:43,378 : INFO : gram1-adjective-to-adverb: 6.0% (45/756)
2015-08-21 10:57:46,406 : INFO : gram2-opposite: 12.4% (38/306)
2015-08-21 10:57:59,972 : INFO : gram3-comparative: 34.6% (436/1260)
2015-08-21 10:58:05,865 : INFO : gram4-superlative: 13.2% (67/506)
2015-08-21 10:58:17,331 : INFO : gram5-present-participle: 18.2% (181/992)
2015-08-21 10:58:31,446 : INFO : gram6-nationality-adjective: 37.5% (514/1371)
2015-08-21 10:58:45,533 : INFO : gram7-past-tense: 18.3% (244/1332)
2015-08-21 10:58:55,660 : INFO : gram8-plural: 30.2% (300/992)
2015-08-21 10:59:02,508 : INFO : gram9-plural-verbs: 20.3% (132/650)
2015-08-21 10:59:02,509 : INFO : total: 22.6% (2766/12234)

OK. So far so good.

You will notice that I did some more evaluation on this data, testing it against the question set that Google released for computing syntactic and semantic relationships between words. As text8 is a small dataset, we don't expect it to achieve very high accuracy on this task; however, it will help us discern the difference between learning words in an online manner and learning everything in one sitting. You can download the script that I ran from here.

Now let's update the model with all the sentences containing queen and see if the vector for $queen$ is similar to those of $king$ and $duke$. Notice that the build_vocab function now has an additional argument, update=True, which adds the new words to the existing vocabulary.

sentences2 = gensim.models.word2vec.LineSentence("text8-queen")
model.build_vocab(sentences2, update=True)
model.train(sentences2)

# Evaluation
>>> model.n_similarity(["king"], ["duke"])
0.47693305301957223
>>> model.n_similarity(["king"], ["queen"])
0.68197327708244115
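As an aside, n_similarity computes the cosine similarity between the means of two sets of word vectors, so for single words it is equivalent to the plain pairwise call:

>>> model.similarity("king", "queen")  # same value as n_similarity above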

# text8-rest + text8-queen (after the vocabulary update)
>>> model.accuracy("questions-words.txt")
2015-08-21 11:00:42,571 : INFO : precomputing L2-norms of word weight vectors
2015-08-21 11:00:47,892 : INFO : capital-common-countries: 23.3% (118/506)
2015-08-21 11:01:02,583 : INFO : capital-world: 14.1% (205/1452)
2015-08-21 11:01:05,521 : INFO : currency: 4.5% (12/268)
2015-08-21 11:01:21,348 : INFO : city-in-state: 13.2% (208/1571)
2015-08-21 11:01:24,349 : INFO : family: 46.4% (142/306)
2015-08-21 11:01:31,891 : INFO : gram1-adjective-to-adverb: 6.2% (47/756)
2015-08-21 11:01:34,925 : INFO : gram2-opposite: 13.4% (41/306)
2015-08-21 11:01:47,631 : INFO : gram3-comparative: 32.4% (408/1260)
2015-08-21 11:01:52,768 : INFO : gram4-superlative: 11.7% (59/506)
2015-08-21 11:02:02,831 : INFO : gram5-present-participle: 18.0% (179/992)
2015-08-21 11:02:16,823 : INFO : gram6-nationality-adjective: 35.2% (483/1371)
2015-08-21 11:02:31,937 : INFO : gram7-past-tense: 17.1% (228/1332)
2015-08-21 11:02:42,960 : INFO : gram8-plural: 26.8% (266/992)
2015-08-21 11:02:49,822 : INFO : gram9-plural-verbs: 19.2% (125/650)
2015-08-21 11:02:49,823 : INFO : total: 20.5% (2521/12268)

BINGO! It looks like the model learned the vector for $queen$ quite well.
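As an additional spot-check, we can look at the nearest neighbours of the newly learned word; on a dataset this small the exact neighbours and scores will vary between runs:

>>> model.most_similar("queen", topn=5)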

NOTE: text8-rest, text8-queen, and text8-all can be downloaded here: http://rutumulkar.com/data/onlinew2v/text8-files.zip.

Here is how the files are divided: all sentences from text8 that contain queen are in text8-queen, and the remaining sentences are in text8-rest. The file text8-all is a concatenation of text8-rest and text8-queen.
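A split like this is straightforward to reproduce. Here is a minimal sketch of the idea, assuming a line-per-sentence version of text8 (called text8-sentences below, a name I made up, since the original text8 file is one long line):

# separate the sentences containing "queen" from the rest
with open("text8-sentences") as src, \
        open("text8-queen", "w") as queen_out, \
        open("text8-rest", "w") as rest_out:
    for line in src:
        if "queen" in line.split():
            queen_out.write(line)
        else:
            rest_out.write(line)

# text8-all is just the concatenation of the two parts
with open("text8-all", "w") as all_out:
    for part in ("text8-rest", "text8-queen"):
        with open(part) as f:
            all_out.write(f.read())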

Here are the accuracies achieved if we train the entire model in one go, as opposed to piecemeal in an online manner. Note that because the amount of data is small, the accuracy will vary slightly between runs due to random initialization.

# train a fresh model on the full dataset in one go
model = gensim.models.Word2Vec()
sentences = gensim.models.word2vec.LineSentence("text8-all")
model.build_vocab(sentences)
model.train(sentences)

# text8-all
>>> model.accuracy("questions-words.txt")

2015-08-21 11:07:53,811 : INFO : precomputing L2-norms of word weight vectors
2015-08-21 11:07:58,595 : INFO : capital-common-countries: 36.0% (182/506)
2015-08-21 11:08:12,343 : INFO : capital-world: 18.9% (275/1452)
2015-08-21 11:08:14,757 : INFO : currency: 4.9% (13/268)
2015-08-21 11:08:28,813 : INFO : city-in-state: 16.4% (257/1571)
2015-08-21 11:08:31,542 : INFO : family: 48.4% (148/306)
2015-08-21 11:08:38,486 : INFO : gram1-adjective-to-adverb: 6.9% (52/756)
2015-08-21 11:08:41,268 : INFO : gram2-opposite: 16.7% (51/306)
2015-08-21 11:08:52,507 : INFO : gram3-comparative: 34.4% (434/1260)
2015-08-21 11:08:57,148 : INFO : gram4-superlative: 12.8% (65/506)
2015-08-21 11:09:06,475 : INFO : gram5-present-participle: 19.1% (189/992)
2015-08-21 11:09:18,681 : INFO : gram6-nationality-adjective: 40.0% (548/1371)
2015-08-21 11:09:30,722 : INFO : gram7-past-tense: 18.2% (243/1332)
2015-08-21 11:09:39,516 : INFO : gram8-plural: 32.7% (324/992)
2015-08-21 11:09:45,498 : INFO : gram9-plural-verbs: 17.1% (111/650)
2015-08-21 11:09:45,499 : INFO : total: 23.6% (2892/12268)

As you can see, the overall accuracy drops a little when the model is updated in an online manner (20.5%), as opposed to training everything in one go (23.6%). The PR for my code can be found here: https://github.com/piskvorky/gensim/pull/435

References:

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781, 2013.