(WIP) Tokenization for Large Language Models
TikTokenizer offers an interactive view of the tokenizers used by different language models.
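As a quick illustration of what such a tokenizer does (a minimal sketch using OpenAI’s open-source tiktoken library; the sample string is my own):

```python
# Minimal tiktoken sketch (pip install tiktoken); the sample text is illustrative.
import tiktoken

text = "Tokenization is not as simple as splitting on spaces."

# cl100k_base is the encoding used by GPT-3.5/GPT-4 era models.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode(text)
print(ids)                             # integer token ids
print([enc.decode([i]) for i in ids])  # the string piece behind each id
print(enc.decode(ids) == text)         # encoding round-trips losslessly
```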
Unicode
Nathan Reed’s Programmer’s guide to Unicode
Unicode aims to faithfully represent the entire world’s writing systems. It supports 135 different scripts, covering some 1100 languages, while over 100 scripts remain unsupported.
Unicode Character Database
Backwards c...
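To make the difference between code points and encoded bytes concrete, here is a small Python sketch of my own (illustrative string, not from the guide):

```python
# A code point is an abstract number; UTF-8 is one variable-length byte encoding of it.
s = "héllo 👋"
print([hex(ord(ch)) for ch in s])      # code points, e.g. '0xe9' for 'é'
print(s.encode("utf-8"))               # UTF-8 bytes: 1 to 4 bytes per code point
print(len(s), len(s.encode("utf-8")))  # 7 code points become 11 bytes
```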
What we do not know about LLMs
Although Deep Learning in general, and Transformers in particular, have made tremendous strides in applications and downstream tasks (such as Question Answering, Text Summarization, object detection, etc.), there are still many gaps in our understanding of Deep Learning, and of LLMs in particular. For instance, Generative Large...
From Machine Learning to Large Language Models - A Survey
Starting in the early 2000s, improvements in hardware to support deep learning networks led to a leap in modern deep learning approaches. Deep Learning (Hinton et al. 2006), (Bengio et al. 2007), an extension of neural networks, uses networks that contain an input layer, an output layer, and a large number of hidden layers between them. This ty...
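To picture that input/hidden/output layer structure (a toy forward pass with made-up layer sizes, not code from the survey):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

# Hypothetical sizes: 4 inputs -> two hidden layers of 8 units -> 1 output (biases omitted).
sizes = [4, 8, 8, 1]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes, sizes[1:])]

def forward(x):
    for W in weights[:-1]:
        x = relu(x @ W)     # each hidden layer applies a nonlinearity
    return x @ weights[-1]  # linear output layer

print(forward(rng.normal(size=4)))
```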
Probability Theory for Natural Language Processing
A lot of work in Natural Language Processing (NLP), such as the creation of Language Models, is based on probability theory. For the purposes of NLP, knowing the probabilities of words can help us predict the next word, understand the rarity of words, and know when to ignore common words in a given context - e.g. articles ...
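For instance (a toy sketch of my own, not the post’s code), simple relative frequencies already give word probabilities:

```python
from collections import Counter

# Toy corpus; in practice you would count over a large text collection.
tokens = "the cat sat on the mat the cat ran".split()
counts = Counter(tokens)
total = len(tokens)

# Unigram probability P(w) = count(w) / total tokens.
p = {w: c / total for w, c in counts.items()}
print(p["the"])  # 3/9 ~ 0.33: common words like articles dominate
print(p["mat"])  # 1/9 ~ 0.11: rarer, hence more informative in context
```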
The Foundations of Language Models
Language Models are models that are trained to predict the next word, given a sequence of words that has already been uttered or written. For example, consider the sentence: “Don’t eat that because it looks…”
The next word will most likely be “disgusting” or “bad”, but will probably not be “table” or “chair”. Language Models are models that assi...
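A minimal sketch of that idea (toy data of my own, not the post’s code): a bigram model that ranks candidate next words by their conditional probability given the previous word.

```python
from collections import Counter, defaultdict

# Toy training text; a real Language Model is estimated from a large corpus.
tokens = "it looks bad . it looks disgusting . it looks bad .".split()

# Count bigrams to estimate P(next | prev) = count(prev, next) / count(prev).
bigrams = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigrams[prev][nxt] += 1

def next_word_probs(prev):
    counts = bigrams[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("looks"))  # 'bad' ~0.67, 'disgusting' ~0.33; 'table' never appears
```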
The Comprehensive Guide to Logistic Regression
In Natural Language Processing (NLP), Logistic Regression is the baseline supervised ML algorithm for classification. It also has a very close relationship with neural networks (if you are new to neural networks, start with Logistic Regression to understand the basics).
Introduction
Logistic Regression is a discriminative classifier.
Discrim...
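As a rough sketch of the decision rule (hand-set illustrative weights, not the guide’s code): logistic regression passes a weighted sum of the features through the sigmoid and thresholds the resulting probability at 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights for two features, plus a bias term.
w = [1.5, -2.0]
b = 0.1

def classify(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)  # P(y = 1 | x)
    return (1 if p > 0.5 else 0), p

print(classify([2.0, 0.5]))  # z = 1.5*2.0 - 2.0*0.5 + 0.1 = 2.1 -> class 1
```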
What is Byte-Pair Encoding for Tokenization?
Tokenization is the concept of dividing text into tokens - words (unigrams), groups of words (n-grams), or even characters.
Morphology traditionally defines morphemes as the smallest semantic units, e.g. the word Unfortunately can be broken down as un - fortun - ate - ly:
\([[\text{un } [[\text{fortun(e)}]_{ROOT} \text{ ate}]_{STEM}]_{STEM} \text{ ly}]_{WORD}\)...
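A minimal sketch of the core BPE training loop, in the spirit of Sennrich et al.’s original formulation (toy word counts of my own, not the post’s code): repeatedly merge the most frequent adjacent pair of symbols.

```python
import re
from collections import Counter

# Toy word frequencies; words are pre-split into symbols, </w> marks a word boundary.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

def get_pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    # Rewrite every occurrence of the pair as a single merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```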
Managing Machine Learning Experiments
I run Machine Learning experiments for a living, averaging about 50 experiments per stage of a project. For each experiment I write code for training models, identify the right test cases and metrics, find the right preprocessors - the list goes on.
So how do I manage all these experiments? Here are a few criteria that I have:
Compat...