Transformer models have revolutionized natural language processing (NLP) since their introduction in 2017. These models are widely used in chatbots, content generation, and AI-powered writing tools. This blog provides a concise look at how they generate text and why they are so effective.
At their core, Transformer-based large language models (LLMs) take an input text prompt and generate a response one token at a time. Instead of producing the entire output in a single step, the model predicts each token sequentially based on the preceding text.
Because each new token is predicted from the tokens before it and then appended to the input for the next step, this process is known as autoregressive generation. Conditioning on everything generated so far helps keep the output contextually relevant and coherent.
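To make the loop concrete, here is a minimal sketch of autoregressive generation. The `NEXT_TOKEN` lookup table and the `predict_next` and `generate` helpers are hypothetical stand-ins for a real model, which would condition on the whole prefix rather than only the last token.

```python
# Toy stand-in for a language model: a hypothetical lookup table that maps
# the last token to the next one. A real LLM conditions on the whole prefix.
NEXT_TOKEN = {
    "The": "capital",
    "capital": "of",
    "of": "Turkey",
    "Turkey": "is",
    "is": "Ankara",
    "Ankara": "<eos>",
}


def predict_next(tokens):
    """Return the next token given the sequence so far (toy rule)."""
    return NEXT_TOKEN.get(tokens[-1], "<eos>")


def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<eos>":  # stop at the end-of-sequence marker
            break
        tokens.append(next_token)  # the new token becomes part of the input
    return tokens


generated = generate(["The", "capital"])
print(" ".join(generated))  # The capital of Turkey is Ankara
```

The key point is the `tokens.append(next_token)` line: every prediction is fed back in as context for the next one.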
The tokenizer breaks down input text into smaller units (tokens) that the model understands. These tokens are then converted into numerical representations before being fed into the model.
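As a rough illustration of that mapping, the sketch below uses a hypothetical six-entry, word-level vocabulary. Real tokenizers work on subword units and have tens of thousands of entries, so treat the names `VOCAB`, `encode`, and `decode` as illustrative only.

```python
# Hypothetical word-level vocabulary; real tokenizers use subword units.
VOCAB = ["<unk>", "the", "capital", "of", "turkey", "is"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode(text):
    """Split on whitespace and map each piece to its integer ID."""
    unknown_id = TOKEN_TO_ID["<unk>"]
    return [TOKEN_TO_ID.get(piece, unknown_id) for piece in text.lower().split()]


def decode(token_ids):
    """Map integer IDs back to text."""
    return " ".join(VOCAB[i] for i in token_ids)


token_ids = encode("The capital of Turkey is")
print(token_ids)          # [1, 2, 3, 4, 5]
print(decode(token_ids))  # the capital of turkey is
print(encode("Ankara"))   # [0], unknown words fall back to <unk>
```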
A stack of Transformer layers processes the input, applying attention mechanisms to capture the context and the relationships between tokens.
The language modeling (LM) head maps the output of the final layer to a score for every token in the vocabulary. These scores are converted into probabilities that guide the model’s text generation.

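At each step, the model generates a probability distribution over possible next tokens. The token selection (decoding) strategy affects output quality: greedy decoding always picks the most probable token, while sampling draws from the distribution to add variety. The following example performs a single greedy step: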
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Select a smaller, efficient model for Google Colab T4
model_name = "microsoft/Phi-2" # Choose a lightweight LLM
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model with optimized settings
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Automatically select the best device (GPU/CPU)
    torch_dtype=torch.float16,  # Optimize memory usage
    trust_remote_code=True,
)
# Provide a prompt for text generation
prompt = "The capital of Turkey is"
# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
# Forward pass through the model
model_output = model(input_ids)
# Extract the logits (before softmax)
logits = model_output.logits
# Get the most probable next token
predicted_token_id = torch.argmax(logits[:, -1, :], dim=-1)
# Decode the token to text
predicted_word = tokenizer.decode(predicted_token_id)
print("Predicted next word:", predicted_word)
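The `torch.argmax` call above is greedy decoding. To contrast it with sampling, here is a small sketch over a made-up distribution. The candidate tokens, their probabilities, and the helper names are assumptions for illustration, not real model output.

```python
import random

# Made-up next-token probabilities for illustration only.
candidates = ["Ankara", "Istanbul", "located", "a"]
probabilities = [0.70, 0.15, 0.10, 0.05]


def greedy_pick(tokens, probs):
    """Greedy decoding: always return the most probable token."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return tokens[best_index]


def sample_pick(tokens, probs, rng):
    """Sampling: draw a token in proportion to its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
greedy_token = greedy_pick(candidates, probabilities)
sampled_tokens = [sample_pick(candidates, probabilities, rng) for _ in range(5)]
print("Greedy: ", greedy_token)
print("Sampled:", sampled_tokens)
```

Greedy decoding returns the same token every time, whereas sampling can occasionally pick a less likely candidate, which is where variety in generated text comes from.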
Let’s see how a Transformer LLM generates a blog introduction:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Select a smaller, efficient model for Google Colab T4
model_name = "microsoft/Phi-2" # Choose a lightweight LLM
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model with optimized settings
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Automatically select the best device (GPU/CPU)
    torch_dtype=torch.float16,  # Optimize memory usage
    trust_remote_code=True,
)
# Create a text generation pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # Avoid repeating input in the output
    max_new_tokens=100,  # Adjust output length
    do_sample=True,  # Enable variability in responses
    temperature=0.7,  # Control randomness (lower = more deterministic)
)
# Provide a prompt for text generation
prompt = "Write an engaging introduction for a blog about the benefits of remote work."
# Generate the text
output = generator(prompt)
# Print the result
print(output[0]['generated_text'])
By adjusting parameters like max_new_tokens and temperature, we can influence the length and style of the output.
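To see why a lower temperature makes output more deterministic, the sketch below applies the usual temperature-scaled softmax to three made-up scores. The function name `softmax_with_temperature` and the score values are assumptions for illustration.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Divide scores by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass concentrates on the top-scoring token; as it rises, the distribution flattens and less likely tokens are sampled more often.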
One of the strengths of Transformer models is their ability to process all the tokens of an input in parallel, unlike recurrent models, which process tokens one after another.
Each token follows its own computation stream, with interaction occurring during attention steps. However, models have a context length limit, which defines the maximum number of tokens they can process at once (e.g., 4K tokens for some models).
If the input exceeds this limit, it has to be shortened, typically by truncating the oldest tokens, which can affect output quality.
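A minimal sketch of that kind of truncation is shown below. The helper `fit_to_context` is hypothetical, and dropping the oldest tokens is only one possible policy; some pipelines truncate from the other end or raise an error instead.

```python
def fit_to_context(token_ids, context_length):
    """Keep only the most recent tokens that fit in the context window."""
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    return token_ids[-context_length:]


token_ids = list(range(10))  # pretend these are 10 token IDs
window = fit_to_context(token_ids, context_length=4)
print(window)  # [6, 7, 8, 9]
```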
For example, let’s examine the processing of a simple prompt:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-2", device_map="auto", torch_dtype=torch.float16)
# Define the prompt
prompt = "The capital of Turkey is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
# Extract model outputs
model_output = model.model(input_ids)
lm_head_output = model.lm_head(model_output[0])
print("Model output shape:", model_output[0].shape)
print("LM head output shape:", lm_head_output.shape)
Here, model_output[0].shape indicates that the input batch has one sequence of six tokens, each represented by a 3,072-dimensional vector. The three dimensions are batch size, sequence length, and hidden size.
The lm_head_output.shape shows that the LM head produces a score for each of 32,064 possible tokens at every position in the sequence; applying softmax turns these scores into a probability distribution. The exact values depend on the checkpoint and tokenizer in use. This demonstrates how each token undergoes complex processing before contributing to the final output.
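The shape change can be reproduced with a toy projection in plain Python. This is only a sketch: the sizes are deliberately tiny, the weights are random, and the LM head is modelled as a bias-free linear layer, which is an assumption rather than a description of any particular checkpoint.

```python
import random

HIDDEN_SIZE = 4  # toy value; the text above quotes 3,072
VOCAB_SIZE = 10  # toy value; the text above quotes 32,064
SEQ_LEN = 6      # six token positions, as in the example above

rng = random.Random(0)

# One hidden vector per token position: shape (SEQ_LEN, HIDDEN_SIZE)
hidden_states = [
    [rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(SEQ_LEN)
]
# LM head weights, one row per vocabulary entry: shape (VOCAB_SIZE, HIDDEN_SIZE)
lm_head_weights = [
    [rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(VOCAB_SIZE)
]


def lm_head(states, weights):
    """Project each hidden vector to one score per vocabulary entry."""
    return [
        [sum(h * w for h, w in zip(state, row)) for row in weights]
        for state in states
    ]


logits = lm_head(hidden_states, lm_head_weights)
print((len(hidden_states), len(hidden_states[0])))  # (6, 4)
print((len(logits), len(logits[0])))                # (6, 10)
```

Only the last dimension changes: every position keeps its place in the sequence, but its hidden vector is replaced by a vector of vocabulary scores.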