Tokenization is a fundamental concept in how Large Language Models (LLMs) work. When you interact with an LLM, whether through a chatbot or an AI-powered writing assistant, the text you input isn’t processed directly as whole words or sentences. Instead, the model first breaks the text down into smaller units called tokens before generating a response.
This article explains what tokenization is and how it works in LLMs, and provides a practical example in Google Colab.
A token is a unit of text that can be a word, a subword, or even a single character, depending on the tokenizer used by the model.
Example of Tokenization
Let’s take an example sentence and see how it gets tokenized using the OpenAI Platform:
Input:
“Explain how artificial intelligence is transforming healthcare.”
Tokenized Output:

Notice how the sentence is split into tokens rather than being kept as whole words.
Tokenization is a critical process in Large Language Models (LLMs) that dictates how text is broken down before being processed by the model. In this section, we will explore the key factors influencing tokenization, different tokenization methods, and their implications.
There are three major factors that dictate how a tokenizer processes an input prompt:
At the design stage of an LLM, the creators choose a tokenization method. Popular methods include Byte Pair Encoding (BPE) and WordPiece.
Although both methods aim to optimize token efficiency, they differ in implementation and segmentation strategies.

After selecting a tokenization method, several design choices must be made, such as the vocabulary size, which special tokens to reserve (for example, padding or end-of-sequence markers), and how to handle capitalization and whitespace.
The tokenizer is trained on a specific dataset to establish an optimal vocabulary. A tokenizer trained on English text will differ from one trained on code or multilingual datasets, as the frequency of words and characters varies across domains.
Beyond processing input, tokenizers also convert model-generated token IDs back into readable words or tokens.
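This encode/decode cycle can be sketched with a toy vocabulary in plain Python. The vocabulary below is hypothetical and tiny; a real tokenizer learns tens of thousands of entries:

```python
# Toy illustration of the tokenizer's encode/decode cycle.
# The vocabulary here is hypothetical; real tokenizers learn theirs from data.
vocab = {"Explain": 0, "how": 1, "AI": 2, "works": 3, ".": 4}
inverse_vocab = {token_id: token for token, token_id in vocab.items()}

def encode(text):
    """Map each whitespace-separated token to its integer ID."""
    return [vocab[token] for token in text.split()]

def decode(token_ids):
    """Map integer IDs back to tokens and rejoin them into text."""
    return " ".join(inverse_vocab[token_id] for token_id in token_ids)

ids = encode("Explain how AI works .")
print(ids)          # [0, 1, 2, 3, 4]
print(decode(ids))  # Explain how AI works .
```

The model itself only ever sees the integer IDs; the same lookup table is used in reverse to turn its output IDs back into text.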
Tokenization can be categorized into four main types:
This method, used in early NLP models like word2vec, treats each word as a token. While effective for some applications, it has limitations: any word outside the training vocabulary becomes an unknown token, and covering enough words requires a very large vocabulary, which inflates the model’s embedding table.
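The unknown-word limitation is easy to demonstrate with a whitespace split against a fixed vocabulary. The vocabulary below is hypothetical and deliberately incomplete:

```python
# Word-level tokenization: split on whitespace, look each word up in a
# fixed vocabulary. Words never seen during training collapse to <UNK>,
# losing their meaning entirely. The vocabulary here is hypothetical.
vocab = {"explain", "how", "intelligence", "is", "healthcare"}

def word_tokenize(text):
    return [w if w in vocab else "<UNK>" for w in text.lower().split()]

print(word_tokenize("explain how artificial intelligence is transforming healthcare"))
# ['explain', 'how', '<UNK>', 'intelligence', 'is', '<UNK>', 'healthcare']
```

Because “artificial” and “transforming” are missing from the vocabulary, their meaning is lost before the model ever sees the input.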
Subword tokenization balances vocabulary size and expressivity. Instead of treating each word separately, it breaks words into meaningful subcomponents. For instance, “transforming” can be split into “trans” and “forming”.
This method improves flexibility, as new words can be constructed from existing subword pieces.
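A greedy longest-match lookup, similar in spirit to WordPiece, sketches how such a split happens. The subword vocabulary below is hand-picked for illustration rather than learned from a corpus:

```python
# Greedy longest-match subword tokenization, WordPiece-style: pieces that
# continue a word are prefixed with "##". The vocabulary is hand-picked
# for illustration; real tokenizers learn it from corpus statistics.
subwords = {"trans", "##forming", "##form", "health", "##care", "explain"}

def subword_tokenize(word):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest possible piece first, shrinking until one matches.
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in subwords:
                tokens.append(piece)
                start = end
                break
        else:
            return ["<UNK>"]  # no piece matches: the word is unknown
    return tokens

print(subword_tokenize("transforming"))  # ['trans', '##forming']
print(subword_tokenize("healthcare"))    # ['health', '##care']
```

New words built from known pieces tokenize cleanly, which is exactly the flexibility word-level tokenization lacks.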
Character tokenization breaks text into individual characters (e.g., “play” becomes “p-l-a-y”). While this allows models to handle any word, it increases the sequence length, making modeling more challenging.
Character-based tokenization requires more computational resources, as the model must infer word meaning from sequences of letters.
Byte-level tokenization breaks text into raw byte representations rather than linguistic units. This approach is used in CANINE and ByT5 models, which aim for tokenization-free encoding. Some subword tokenizers (e.g., GPT-2, RoBERTa) also use byte-level fallback tokens for unknown characters, improving robustness.
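Byte-level representations are also easy to inspect in plain Python via UTF-8 encoding. Note that the base "vocabulary" is always exactly 256 byte values, and any character, including accented or non-Latin ones, maps to one or more of them:

```python
# Byte-level tokenization: encode the text as UTF-8 and treat each byte
# as a token. The base vocabulary is always exactly 256 entries.
ascii_bytes = list("Explain".encode("utf-8"))
accented_bytes = list("héllo".encode("utf-8"))

print(ascii_bytes)     # [69, 120, 112, 108, 97, 105, 110]
print(accented_bytes)  # [104, 195, 169, 108, 108, 111] — "é" spans two bytes
```

This is why byte-level fallback makes tokenizers robust: there is no such thing as an out-of-vocabulary byte.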
Given the prompt:
prompt = "Explain how artificial intelligence is transforming healthcare."
Different tokenization methods process it differently:
Word Tokenization:
[“Explain”, “how”, “artificial”, “intelligence”, “is”, “transforming”, “healthcare”, “.”]
Subword Tokenization (BPE/WordPiece):
[“Explain”, “how”, “artificial”, “intel”, “##ligence”, “is”, “trans”, “##forming”, “health”, “##care”, “.”]
Character Tokenization:
[“E”, “x”, “p”, “l”, “a”, “i”, “n”, “ ”, “h”, “o”, “w”, “ ”, “a”, “r”, “t”, …]
Byte Tokenization:
[69, 120, 112, 108, 97, 105, 110, 32, 104, 111, 119, 32, 97, 114, 116, …]
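The word, character, and byte splits above can be reproduced in plain Python (the subword split depends on a learned vocabulary, so it is not recomputed here):

```python
prompt = "Explain how artificial intelligence is transforming healthcare."

# Word tokenization (separating the final period, as in the list above)
words = prompt[:-1].split() + ["."]
# Character tokenization
chars = list(prompt)
# Byte tokenization (UTF-8 byte values)
byte_ids = list(prompt.encode("utf-8"))

print(words)
print(chars[:15])
print(byte_ids[:15])
```

Running this confirms the three listings: eight word tokens, one token per character, and the byte values 69, 120, 112, … for “E”, “x”, “p”, ….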
Understanding tokenization is essential when working with LLMs. Different tokenization methods have trade-offs, affecting vocabulary efficiency, model size, and performance.
To explore tokenization hands-on, try experimenting with tokenizers in Python using the transformers library from Hugging Face. By understanding how text is broken down, you can optimize inputs for better results in NLP applications.