
Online Word2Vec for Gensim

Word2Vec [1] is a technique for creating vector representations of words that capture their syntax and semantics. These vectors have several interesting properties. Here are a few: Addition and subtraction of vectors show how word semantics are captured, e.g. \(king - man + woman = queen\). This example capt...


Understanding your Data - Basic Statistics

Have you ever had to deal with a lot of data and not known where to start? If so, this post is for you. In this post I will try to guide you through some basic approaches and operations you can use to analyze your data, make some basic sense of it, and decide on your approach for deeper analysis. I will use Python and a small s...


An Introduction to Probability

This post is an introduction to probability theory. Probability theory is the backbone of AI, and this post attempts to cover these fundamentals and bring us to Naive Bayes, a simple generative algorithm for text classification. Random Variables In this world things keep happening around us. Each event occurring is...


Build Your Own Search Engine

In this post, I will take you through the steps for calculating the $tf \times idf$ values for all the words in a given document. To implement this, we use a small dataset (or corpus, as NLPers like to call it) from the Project Gutenberg Catalog. This is just a simple toy example on a very small dataset. In real life we use much larger corpora, ...


The Math behind Lucene

Lucene is an open source search engine that you can use on top of custom data to create your own search engine, like your own personal Google. In this post, we will go over the basic math behind Lucene and how it ranks documents against the input search query. THE BASICS - TF*IDF The analysis of language often brings us into situations where we a...
