INFORMATION RETRIEVAL

Approximate time to read: 6 min
Build your own search Engine

In this post, I will take you through the steps for calculating the $tf \times idf$ values for all the words in a given document. To implement this, we use...

Approximate time to read: 6 min
The Math behind Lucene

Lucene is an open source search engine that one can use on top of custom data to create your own search engine - like your own personal Google. In this...

LUCENE

Approximate time to read: 6 min
Build your own search Engine

In this post, I will take you through the steps for calculating the $tf \times idf$ values for all the words in a given document. To implement this, we use...

Approximate time to read: 6 min
The Math behind Lucene

Lucene is an open source search engine that one can use on top of custom data to create your own search engine - like your own personal Google. In this...

PROBABILITY

Approximate time to read: 9 min
Probability Theory for Natural Language Processing

A lot of work in Natural Language Processing (NLP), such as the creation of Language Models, is based on probability theory. For the purpose of NLP, knowing about probabilities of words...

Approximate time to read: 11 min
An Introduction to Probability

This post is an introduction to probability theory. Probability theory is the backbone of AI, and this post attempts to cover these fundamentals and bring us to Naive Bayes,...

STATISTICS

Approximate time to read: 10 min
Understanding your Data - Basic Statistics

Have you ever had to deal with a lot of data and not known where to start? If yes, then this post is for you. In this post I will...

WORD2VEC

Approximate time to read: 14 min
Online Word2Vec for Gensim

Word2Vec [1] is a technique for creating vectors of word representations to capture the syntax and semantics of words. The vectors used to represent the words have several interesting features,...

REPRESENTATION LEARNING

Approximate time to read: 14 min
Online Word2Vec for Gensim

Word2Vec [1] is a technique for creating vectors of word representations to capture the syntax and semantics of words. The vectors used to represent the words have several interesting features,...

NLP INTRODUCTION

Approximate time to read: 7 min
What is Natural Language Processing (NLP)?

Last year I wrote a highly popular blog post about Natural Language Processing, Machine Learning, and Deep Learning.

Approximate time to read: 4 min
Natural Language Processing vs. Machine Learning vs. Deep Learning

NLP, Machine Learning and Deep Learning are all parts of Artificial Intelligence, which is a part of the greater field of Computer Science. The following image visually illustrates CS, AI...

MACHINE LEARNING

Approximate time to read: 3 min
Managing Machine Learning Experiments

I run Machine Learning experiments for a living and I run an average of 50 experiments per stage of a project. For each experiment I write code for training models,...

Approximate time to read: 4 min
Natural Language Processing vs. Machine Learning vs. Deep Learning

NLP, Machine Learning and Deep Learning are all parts of Artificial Intelligence, which is a part of the greater field of Computer Science. The following image visually illustrates CS, AI...

DEEP LEARNING

Approximate time to read: 4 min
Natural Language Processing vs. Machine Learning vs. Deep Learning

NLP, Machine Learning and Deep Learning are all parts of Artificial Intelligence, which is a part of the greater field of Computer Science. The following image visually illustrates CS, AI...

EXPERIMENT MANAGEMENT

Approximate time to read: 3 min
Managing Machine Learning Experiments

I run Machine Learning experiments for a living and I run an average of 50 experiments per stage of a project. For each experiment I write code for training models,...

TOKENIZATION

Approximate time to read: 3 min
What is Byte-Pair Encoding for Tokenization?

Tokenization is the concept of dividing text into tokens - words (unigrams), or groups of words (n-grams) or even characters. Morphology traditionally defines morphemes as the smallest semantic unit. e.g....

NLP

Approximate time to read: 16 min
The Comprehensive Guide to Logistic Regression

In Natural Language Processing (NLP), Logistic Regression is the baseline supervised ML algorithm for classification. It also has a very close relationship with neural networks (If you are new to...

CLASSIFICATION

Approximate time to read: 16 min
The Comprehensive Guide to Logistic Regression

In Natural Language Processing (NLP), Logistic Regression is the baseline supervised ML algorithm for classification. It also has a very close relationship with neural networks (If you are new to...

LANGUAGE MODELS

Approximate time to read: 12 min
The Foundations of Language Models

Language Models are models that are trained to predict the next word, given a set of words that are already uttered or written. e.g. Consider the sentence: “Don’t eat that...