(WIP) Tokenization for Large Language Models

TikTokenizer shows a view of the tokenizers used by different language models. Unicode: Nathan Reed’s Programmer’s guide to Unicode. Unicode aims to faithfully represent the entire world’s writing systems; it supports 135 different scripts, covering some 1100 languages, with over 100 scripts still unsupported. The Unicode Character Database. Backwards c...

Read more

What we do not know about LLMs

Although Deep Learning in general, and Transformers in particular, have made tremendous strides in applications and downstream tasks (such as question answering, text summarization, object detection, etc.), there are still many gaps in our understanding of Deep Learning, and of LLMs in particular. For instance, Generative Large...

Read more

From Machine Learning to Large Language Models - A Survey

Starting in the early 2000s, improvements in hardware to support deep learning networks led to a leap in modern deep learning approaches. Deep Learning (Hinton et al. 2006; Bengio et al. 2007), an extension of neural networks, uses an input layer, an output layer, and a large number of hidden layers between them. This ty...

Read more

Probability Theory for Natural Language Processing

A lot of work in Natural Language Processing (NLP), such as the creation of Language Models, is based on probability theory. For NLP, knowing the probabilities of words can help us predict the next word, understand how rare a word is, and decide when to ignore common words in a given context - e.g. articles ...
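
To make the idea concrete, here is a minimal sketch (mine, not the post’s code) of estimating word probabilities from a toy corpus by simple counting; the corpus and names are illustrative assumptions.

```python
from collections import Counter

# Toy corpus; the text and counts are purely illustrative.
corpus = "the cat sat on the mat the cat ate".split()

# Unigram probabilities: P(w) = count(w) / total tokens.
unigram_counts = Counter(corpus)
total = sum(unigram_counts.values())
p_unigram = {w: c / total for w, c in unigram_counts.items()}

# Bigram counts support next-word prediction: P(w2 | w1) = count(w1 w2) / count(w1).
bigram_counts = Counter(zip(corpus, corpus[1:]))

def next_word_probs(w1):
    """Conditional distribution over the words that follow w1."""
    return {w2: c / unigram_counts[w1]
            for (first, w2), c in bigram_counts.items() if first == w1}

print(p_unigram["the"])        # a common word gets a high probability
print(next_word_probs("the"))  # {'cat': 0.666..., 'mat': 0.333...}
```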

Read more

The Foundations of Language Models

Language Models are models trained to predict the next word, given the words that have already been uttered or written. e.g. Consider the sentence: “Don’t eat that because it looks…“ The next word is most likely to be “disgusting” or “bad”, and is very unlikely to be “table” or “chair”. Language Models are models that assi...
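
As a rough illustration (a sketch of the idea, not the post’s own model), a language model scores candidate next words with conditional probabilities; the toy probability table below is invented for this example.

```python
import math

# Invented toy table of conditional probabilities P(next | last two words);
# the numbers are made up purely for this example.
cond_prob = {
    ("it", "looks"): {"disgusting": 0.40, "bad": 0.35,
                      "table": 0.001, "chair": 0.001},
}

def rank_next_words(context, candidates):
    """Rank candidate next words by log-probability under the toy model."""
    dist = cond_prob[tuple(context[-2:])]
    scored = [(w, math.log(dist[w])) for w in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

print(rank_next_words(["because", "it", "looks"],
                      ["disgusting", "bad", "table", "chair"]))
# "disgusting" and "bad" score far above "table" and "chair".
```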

Read more

The Comprehensive Guide to Logistic Regression

In Natural Language Processing (NLP), Logistic Regression is the baseline supervised ML algorithm for classification. It also has a very close relationship with neural networks (if you are new to neural networks, start with Logistic Regression to understand the basics). Introduction: Logistic Regression is a discriminative classifier. Discrim...
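
For readers who want the core computation up front, here is a minimal sketch of the logistic regression decision rule (a sigmoid over a weighted feature sum); the weights and features are made-up placeholders, not values from the post.

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """P(y = 1 | x) = sigmoid(w . x + b): the core of logistic regression."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical weights for two features, e.g. counts of positive/negative words.
w = np.array([1.5, -2.0])
b = -0.5
x = np.array([3.0, 1.0])  # hypothetical feature vector for one document

p = predict_proba(x, w, b)
print(p, "-> class", int(p >= 0.5))  # ~0.88 -> class 1
```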

Read more

What is Byte-Pair Encoding for Tokenization?

Tokenization is the process of dividing text into tokens - words (unigrams), groups of words (n-grams), or even characters. Morphology traditionally defines the morpheme as the smallest semantic unit. e.g. The word Unfortunately can be broken down as un - fortun - ate - ly: \([[\text{un } [[\text{fortun(e)}]_{ROOT}\ \text{ate}]_{STEM}]_{STEM}\ \text{ly}]_{WORD}\)...
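
As a preview of the algorithm the post covers, here is a minimal sketch of the BPE merge loop in the spirit of Sennrich et al.’s published pseudocode; the toy vocabulary and the number of merges are illustrative assumptions.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across a space-separated vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary with the given pair fused into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy word frequencies, with each word pre-split into characters.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(step, best)  # merges like ('e', 's'), ('es', 't'), ('l', 'o'), ...
```

Each iteration greedily fuses the most frequent adjacent pair, so frequent subwords such as “est” emerge before rarer ones.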

Read more

Managing Machine Learning Experiments

I run Machine Learning experiments for a living, and I run an average of 50 experiments per stage of a project. For each experiment I write code to train models, identify the right test cases and metrics, and find the right preprocessors - the list goes on. So how do I manage these experiments? Here are a few of my criteria: Compat...

Read more