
What is Tokenization? (Tokenization in LLM)

Tokenization is fundamental to how Large Language Models (LLMs) work. When you interact with an LLM, whether through a chatbot or an AI-powered writing assistant, the text you input isn’t processed directly as whole words or sentences. Instead, the model first breaks the text down into smaller units called tokens before generating a response.

This article explains what tokenization is and how it works in LLMs. In addition, we provide a practical example you can run in Google Colab.

What is Tokenization in LLMs?

A token is a unit of text that can be a word, subword, or even a character, depending on the tokenizer used by the model. Some key points about tokens:

  • A short word may be a single token, while longer words are often split into multiple tokens.
  • Common words may be stored as whole tokens in the tokenizer’s vocabulary.
  • Punctuation marks are typically treated as separate tokens.
  • Special tokens, such as <s> (start of the sequence) and <|assistant|>, mark structure in the conversation.

Example of Tokenization

Let’s take an example sentence and see how it gets tokenized using the OpenAI Platform:

Input:

“Explain how artificial intelligence is transforming healthcare.”

Tokenized Output:

[Tokenizer output visualization]
Source: https://platform.openai.com/tokenizer

Notice how the sentence is split into multiple tokens rather than processed word by word.

How Does Tokenization Work in Large Language Models?

Tokenization is a critical process in Large Language Models (LLMs) that dictates how text is broken down before being processed by the model. In this section, we will explore the key factors influencing tokenization, different tokenization methods, and their implications.

How Does the Tokenizer Break Down Text?

There are three major factors that dictate how a tokenizer processes an input prompt:

Tokenization Method

At the design stage of an LLM, the creators choose a tokenization method. Popular methods include:

  • Byte Pair Encoding (BPE): Used by GPT models, it efficiently compresses common word fragments.
  • WordPiece: Used by BERT, it breaks words into subword units to create a compact and effective vocabulary.

Although both methods aim to optimize token efficiency, they differ in implementation and segmentation strategies.
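To make the BPE idea concrete, here is a minimal sketch of how BPE learns its merge rules: start from characters, repeatedly count adjacent symbol pairs, and merge the most frequent pair into a new vocabulary entry. The toy corpus and the number of merges are illustrative assumptions, not taken from any real model.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters, mapped to its frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

merges = []
for _ in range(4):  # learn 4 merge rules for illustration
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)  # first learned merge is ('e', 'r'): the most frequent pair
```

Production tokenizers learn tens of thousands of merges from massive corpora, but the loop is the same: the merges that are learned first correspond to the most frequent fragments in the training data.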

Tokenizers convert model output token IDs into readable words or tokens.

Tokenizer Design Choices

After selecting a tokenization method, several design choices must be made, such as:

  • Vocabulary size: Determines the number of unique tokens the model can recognize.
  • Special tokens: Used for structuring input, like <s> for sequence start and <|assistant|> for chat models.

Training on a Dataset

The tokenizer is trained on a specific dataset to establish an optimal vocabulary. A tokenizer trained on English text will differ from one trained on code or multilingual datasets, as the frequency of words and characters varies across domains.

Beyond processing input, tokenizers also convert model-generated token IDs back into readable words or tokens.
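This two-way mapping can be sketched with a toy vocabulary (the entries here are invented for illustration; a real tokenizer's vocabulary has tens of thousands of them):

```python
# Hypothetical toy vocabulary mapping token strings to integer IDs.
vocab = {"<s>": 0, "Explain": 1, " how": 2, " AI": 3, " works": 4, ".": 5}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Map a list of token strings to their integer IDs (model input)."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Map model-generated token IDs back into a readable string."""
    return "".join(id_to_token[i] for i in ids)

ids = encode(["Explain", " how", " AI", " works", "."])
print(ids)          # [1, 2, 3, 4, 5]
print(decode(ids))  # Explain how AI works.
```

Note that the leading spaces are stored inside the tokens themselves, which is why simple concatenation during decoding reproduces the original spacing.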

Tokenization Methods: Word, Subword, Character, and Byte Tokens

Tokenization can be categorized into four main types:

Word Tokenization

This method, used in early NLP models like word2vec, treats each word as a token. While effective for some applications, it has limitations:

  • Cannot handle unseen words effectively.
  • Results in large vocabularies with redundant tokens (e.g., “apology,” “apologize,” “apologetic”).

Subword Tokenization

Subword tokenization balances vocabulary size and expressivity. Instead of treating each word separately, it breaks words into meaningful subcomponents. For instance:

  • “Apologetic” may be split into apolog- and -etic.
  • “Unhappiness” can be split into un-, happi-, and -ness.

This method improves flexibility, as new words can be constructed from existing subword pieces.
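A common way to apply a learned subword vocabulary is greedy longest-match-first segmentation, as used by WordPiece. The sketch below assumes a tiny hand-picked vocabulary (hypothetical, for illustration only) and uses the `##` prefix convention for word-internal pieces:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation, WordPiece-style.

    Continuation pieces carry a '##' prefix; words with no valid
    segmentation map to a single [UNK] token.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary.
vocab = {"un", "happy", "##happi", "##ness"}
print(wordpiece_tokenize("unhappiness", vocab))  # ['un', '##happi', '##ness']
```

Because segmentation is greedy, the tokenizer always prefers the longest vocabulary entry at each position, which keeps common words intact while still decomposing rare ones.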

Character Tokenization

Character tokenization breaks text into individual characters (e.g., “play” becomes “p-l-a-y”). While this allows models to handle any word, it increases the sequence length, making modeling more challenging.

Character-based tokenization requires more computational resources, as the model must infer word meaning from sequences of letters.

Byte Tokenization

Byte-level tokenization breaks text into raw byte representations rather than linguistic units. This approach is used in CANINE and ByT5 models, which aim for tokenization-free encoding. Some subword tokenizers (e.g., GPT-2, RoBERTa) also use byte-level fallback tokens for unknown characters, improving robustness.

Example: Tokenization in Action

Given the prompt:

prompt = "Explain how artificial intelligence is transforming healthcare."

Different tokenization methods process it differently:

Word Tokenization:
["Explain", "how", "artificial", "intelligence", "is", "transforming", "healthcare", "."]

Subword Tokenization (BPE/WordPiece):
["Explain", "how", "artificial", "intel", "##ligence", "is", "trans", "##forming", "health", "##care", "."]

Character Tokenization:
["E", "x", "p", "l", "a", "i", "n", " ", "h", "o", "w", " ", "a", "r", "t", …]

Byte Tokenization:
[69, 120, 112, 108, 97, 105, 110, 32, 104, 111, 119, 32, 97, 114, 116, …]
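The word, character, and byte variants above can be reproduced with a few lines of standard-library Python (subword tokenization is omitted here because it requires a learned vocabulary):

```python
import re

prompt = "Explain how artificial intelligence is transforming healthcare."

# Word tokenization: split on word boundaries, keeping punctuation separate.
word_tokens = re.findall(r"\w+|[^\w\s]", prompt)

# Character tokenization: every character (including spaces) is a token.
char_tokens = list(prompt)

# Byte tokenization: the raw UTF-8 byte values of the text.
byte_tokens = list(prompt.encode("utf-8"))

print(word_tokens[:3])  # ['Explain', 'how', 'artificial']
print(char_tokens[:4])  # ['E', 'x', 'p', 'l']
print(byte_tokens[:4])  # [69, 120, 112, 108]
```

For ASCII text the byte values match the character codes one-to-one; for accented characters, emoji, or non-Latin scripts, a single character can expand to several bytes, which is why byte-level tokenizers can represent any input.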

Understanding tokenization is essential when working with LLMs. Different tokenization methods have trade-offs, affecting vocabulary efficiency, model size, and performance.

To explore tokenization hands-on, try experimenting with tokenizers in Python using the transformers library from Hugging Face. By understanding how text is broken down, you can optimize inputs for better results in NLP applications.