TikTokenizer is a web app that visualizes how the tokenizers used by different language models split a given piece of text.
Unicode
Nathan Reed’s “A Programmer’s Introduction to Unicode”
- Unicode aims to faithfully represent the entire world’s writing systems
- Unicode supports 135 different scripts, covering some 1100 languages
- over 100 scripts are still unsupported
- Unicode Character Database
- Backwards compatible with ASCII
Unicode Codespace
- The basic elements of Unicode are called code points
- Code points are identified by number, customarily written in hexadecimal with the prefix “U+”, such as U+0041 “A” latin capital letter a or U+03B8 “θ” greek small letter theta.
- Each code point also has a short name, and quite a few other properties, specified in the Unicode Character Database.
- The set of all possible code points is called the codespace.
- The codespace is divided into 17 planes of 65,536 (2^16) code points each, often drawn as a 2D map (see the sketch after this list)
- Plane 0 is called “Basic Multilingual Plane”, or BMP. The BMP contains essentially all the characters needed for modern text in any script, including Latin, Cyrillic, Greek, Han (Chinese), Japanese, Korean, Arabic, Hebrew, Devanagari (Indian), and many more.
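To make the plane structure concrete, here is a minimal Python sketch (nothing assumed beyond built-ins): a code point’s plane is just its value divided by 2^16.

```python
# Which plane does a code point live in? plane = code point // 2**16
for ch in "Aθ😄":
    cp = ord(ch)
    print(f"{ch} U+{cp:04X} -> plane {cp >> 16}")
# A U+0041 -> plane 0 (BMP), θ U+03B8 -> plane 0, 😄 U+1F604 -> plane 1
```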
Unicode Encodings:
- Unicode code points are abstractly identified by their index in the codespace, ranging from U+0000 to U+10FFFF
- Unicode itself doesn’t directly specify the storage size per character. Instead, it provides various encoding schemes, known as Unicode Transformation Formats (UTFs), that determine how characters are stored as binary data
- UTF-8 (most common) - each code point is stored using 1 to 4 bytes, based on its index value. Code points below 128 (the ASCII characters) are encoded as single bytes, so UTF-8 is backwards compatible with ASCII, and widely used string programming idioms, such as null termination or delimiters (newlines, tabs, commas, slashes, etc.), just work on UTF-8 strings. (The sketch after this list compares byte counts across the UTFs.)
- UTF-16 - a variable-length encoding where a single Unicode code point is represented by one or two 16-bit code units (two or four bytes). UTF-16’s words can be stored either little-endian or big-endian; this matters when exchanging data, which is why UTF-16 text often begins with a byte-order mark (BOM, U+FEFF)
- UTF-32 (rarely used) - fixed-width, four bytes per code point
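To see these size trade-offs concretely, a minimal sketch using Python’s built-in str.encode (the -le codec names pin the byte order and avoid emitting a BOM):

```python
# Compare how many bytes the same characters take under each UTF.
for text in ["A", "θ", "안", "🙂"]:
    for enc in ["utf-8", "utf-16-le", "utf-32-le"]:
        data = text.encode(enc)
        print(f"{text!r:6} {enc:10} {len(data)} bytes {list(data)}")
```

ASCII "A" is 1 byte in UTF-8 but 4 in UTF-32; the emoji is 4 bytes in every encoding (a surrogate pair in UTF-16), which is why UTF-8 wins for mostly-ASCII text.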
Combining multiple code points can produce a single character - e.g. Devanagari and Korean have vowels that can be joined with a consonant to form a single character (मु and 무 in Devanagari and Korean/Hangul respectively)
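A small sketch with the standard unicodedata module makes this visible: the Devanagari syllable is literally two code points, and the precomposed Hangul syllable splits into its jamo under NFD normalization.

```python
import unicodedata

# Devanagari मु is two code points: consonant म (U+092E) + vowel sign ु (U+0941).
syllable = "\u092e\u0941"
print(syllable, [hex(ord(c)) for c in syllable])   # मु ['0x92e', '0x941']

# Hangul 무 is stored precomposed (U+BB34), but NFD decomposes it into jamo.
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "무")])
# ['0x1106', '0x116e']  (ᄆ + ᅮ)
```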
Token Representation
ord('A') returns the Unicode code point for a character (here 65). chr(232) converts a code point back into a one-character string (here 'è').
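A quick sanity check in a Python REPL (a trivial sketch, nothing assumed beyond built-ins):

```python
print(ord("A"))        # 65
print(chr(232))        # è  (U+00E8)
print(chr(ord("θ")))   # θ  -- ord and chr are inverses
```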
If you see b'...' byte literals when you encode strings to UTF-8, but the strings themselves look unchanged when printed inside a list, that is purely about how Python displays str and bytes objects: print() shows a string’s str() form, list elements are rendered with repr(), and bytes objects always display as b'...' literals with hex escapes for non-ASCII bytes. The underlying data is the same either way.
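A short sketch of that display behavior (assuming a terminal that renders UTF-8):

```python
s = "é"
b = s.encode("utf-8")
print(b)       # b'\xc3\xa9'   -- bytes always display as a b'...' literal
print([s])     # ['é']         -- list elements are shown via repr()
print([b])     # [b'\xc3\xa9'] -- the escapes don't disappear for bytes
print(s == b.decode("utf-8"))  # True: only the display differs, not the data
```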
There is a way to feed bytes directly into a Transformer (the MegaByte paper), but it requires architectural changes to the Transformer architecture: the model consumes raw byte streams instead of tokens. The approach is not yet proven at scale.
BPE - compresses the byte-level representation into a shorter token sequence
Byte Pair Encoding (BPE)
Bytes as Numbers: a byte can be seen as a container holding a number between 0 and 255. When iterating through a bytes object, or using a function like map(int, my_bytes), Python directly presents the numerical value of each byte; no calculation or lookup is needed, because the integer value is the byte. Bytes objects are sequences of integers (per the linked DataCamp article).
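A tiny sketch of this: iterating a bytes object yields plain ints, so map(int, ...) is effectively a no-op.

```python
b = "Hi!".encode("utf-8")
for value in b:            # iteration yields ints, not 1-byte objects
    print(value)           # 72, 105, 33
print(list(map(int, b)))   # [72, 105, 33] -- already integers
print(b[0])                # 72: indexing also returns an int
```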
Emojis and other non-ASCII characters become multiple bytes, and hence often multiple tokens. An encoding is a representation of a Unicode string: it defines a byte or byte sequence for every Unicode code point; essentially a translation table.
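To tie these threads together, here is a minimal sketch of a single BPE merge step over UTF-8 bytes. The helper names most_common_pair and merge are illustrative, not from any particular library, and "aaabdaaabac" is the classic BPE example string.

```python
from collections import Counter

def most_common_pair(ids):
    """Return the most frequent pair of adjacent ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# One emoji already costs four bytes before any merging happens:
print(list("👍".encode("utf-8")))   # [240, 159, 145, 141]

ids = list("aaabdaaabac".encode("utf-8"))
pair = most_common_pair(ids)        # (97, 97), i.e. 'aa'
ids = merge(ids, pair, 256)         # 256 is the first id beyond the byte range
print(pair, ids)                    # (97, 97) [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

Repeating this loop, always merging the current most frequent pair and minting a new token id, is the whole training procedure; the resulting merge table is what compresses byte sequences into tokens.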