
Architecture

Large language models (LLMs) have pushed natural language processing toward near-human capabilities, and Meta AI’s LLaMA (Large Language Model Meta AI) is one of the best examples of how open and efficient large-scale models can be built. This blog post explores the LLaMA model and its training details, drawing on the original research paper from its developers.

Introduction to LLAMA

LLaMA is a family of foundation models ranging from 7 billion to 65 billion parameters. Unlike other well-known models such as GPT-3 or PaLM, LLaMA is trained exclusively on publicly available data, and this is a crucial aspect of the model. Despite being much smaller, LLaMA matches or even beats far larger models on many benchmarks, showing that efficient training and architecture can matter more than sheer model size.

Main characteristics of the architecture of LLAMA

LLaMA architecture (adapted from Umar Jamil's YouTube channel)

LLAMA’s architecture builds upon the transformer framework, incorporating innovative adjustments that enhance its performance and efficiency:

Pre-Normalization:

Following GPT-3, LLaMA normalizes the input of each transformer sub-layer rather than its output. This pre-normalization is a straightforward change that improves training stability. Instead of standard LayerNorm, LLaMA uses RMSNorm, which simply scales inputs by dividing them by their root mean square (with a learned gain), skipping the mean-centering step.

RMSNorm formula from the original paper
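To make this concrete, here is a minimal RMSNorm sketch in PyTorch; the module structure and epsilon value are illustrative rather than copied from the official LLaMA code:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: scales inputs by the reciprocal of their RMS,
    with a learned per-feature gain and no mean-centering (unlike standard LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rms(x) = sqrt(mean(x^2) + eps), computed over the feature dimension
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

# Usage: normalize the *input* of each sub-layer (pre-normalization)
norm = RMSNorm(dim=4096)
x = torch.randn(2, 16, 4096)   # (batch, sequence, hidden)
print(norm(x).shape)           # torch.Size([2, 16, 4096])
```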

Activation Function:

To improve accuracy, the standard ReLU non-linearity is replaced with the SwiGLU activation function, recently used in Google’s PaLM model. To keep the parameter count comparable, the feed-forward hidden dimension is reduced to 2/3·4d instead of the 4d used in PaLM, saving computation.
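Below is a minimal sketch of a SwiGLU feed-forward block in PyTorch, assuming the 2/3·4d hidden-size rule described above; the exact rounding and layer names in the official implementation may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with a SwiGLU gate: silu(x W1) * (x W3), projected back by W2.
    The hidden size is roughly 2/3 of 4*dim to keep the parameter count comparable
    to a standard 4*dim ReLU feed-forward layer."""
    def __init__(self, dim: int):
        super().__init__()
        hidden_dim = int(2 * (4 * dim) / 3)               # ~2/3 * 4d (illustrative rounding)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

ffn = SwiGLUFeedForward(dim=4096)
print(ffn(torch.randn(2, 16, 4096)).shape)  # torch.Size([2, 16, 4096])
```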

Rotary Positional Embeddings (RoPE):

Instead of absolute positional embeddings, LLaMA uses rotary positional embeddings (RoPE), as in the GPT-Neo models. RoPE encodes position by rotating the query and key vectors, which captures relative positions and generalizes better to input lengths not seen during training.

RoPE formula from the original paper
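The sketch below applies rotary embeddings to a query tensor in PyTorch. It follows the common "rotate-half" channel pairing, which differs from the official implementation's complex-number formulation but expresses the same idea:

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to a (batch, seq_len, n_heads, head_dim) tensor.
    Each channel pair is rotated by an angle that depends on the token position, so
    relative positions are encoded directly in the query/key dot products."""
    batch, seq_len, n_heads, head_dim = x.shape
    half = head_dim // 2
    # Per-pair rotation frequencies, as in the RoPE formulation
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2-D rotation applied to each (x1, x2) channel pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 32, 128)       # (batch, seq, heads, head_dim)
print(rotary_embedding(q).shape)      # torch.Size([2, 16, 32, 128])
```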

Model Scaling:

The LLaMA series spans four parameter sizes: 7B, 13B, 33B, and 65B. As the models scale up, the number of attention heads, the number of layers, and the hidden dimension grow in step with the parameter count.
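As a rough reference, the sketch below collects the width and depth settings reported in the original paper in a small, purely illustrative config structure (the dataclass itself is not the official config format):

```python
from dataclasses import dataclass

@dataclass
class LlamaConfig:
    dim: int        # hidden size
    n_heads: int    # attention heads
    n_layers: int   # transformer blocks

# Width/depth values as reported in the original LLaMA paper
LLAMA_CONFIGS = {
    "7B":  LlamaConfig(dim=4096, n_heads=32, n_layers=32),
    "13B": LlamaConfig(dim=5120, n_heads=40, n_layers=40),
    "33B": LlamaConfig(dim=6656, n_heads=52, n_layers=60),
    "65B": LlamaConfig(dim=8192, n_heads=64, n_layers=80),
}

print(LLAMA_CONFIGS["65B"])
```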

Hyperparameters and Training Setup

Hyperparameters are the training-time settings, such as the optimizer, learning-rate schedule, and batch size, that must be tuned to get the best results out of the network.

Training loss over training tokens for the LLaMA models

The LLaMA models are trained on a massive corpus totaling up to 1.4 trillion tokens. A few notable hyperparameter choices:

LLaMA hyperparameter configurations by model size
  • Optimizer: AdamW with β1 = 0.9 and β2 = 0.95, paired with a cosine learning-rate schedule.
  • Training: the batch size is 4 million tokens, with gradient clipping and memory-friendly techniques such as activation checkpointing (a minimal setup is sketched after this list).
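Here is a minimal sketch of that optimizer setup in PyTorch; the model, learning rate, and schedule length are placeholders, while the AdamW betas, gradient clipping, and cosine schedule follow the paper:

```python
import torch

# Hypothetical stand-in model; the real setup wraps the full LLaMA transformer.
model = torch.nn.Linear(4096, 4096)

# AdamW with beta1=0.9 and beta2=0.95 as reported in the paper
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

# Cosine learning-rate schedule over the training run (length is illustrative)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)

for step in range(3):                                   # skeleton training loop
    loss = model(torch.randn(8, 4096)).pow(2).mean()    # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```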

The next section covers these optimizations, which for the purpose of this article I group under the heading of innovative efficiency techniques.

Innovative Efficiency Techniques

LLaMA places a strong emphasis on efficiency at both the training and inference stages. Key optimizations include:

Causal Multi-Head Attention:

A memory-efficient implementation of causal multi-head attention avoids storing the full attention-weight matrix and skips computing scores that are masked out by causality.
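LLaMA uses the xformers library for this; the sketch below illustrates the same idea with PyTorch's fused scaled_dot_product_attention, which also applies the causal mask inside the kernel instead of materializing the full attention matrix:

```python
import torch
import torch.nn.functional as F

# Shapes: (batch, n_heads, seq_len, head_dim)
q = torch.randn(2, 32, 256, 128)
k = torch.randn(2, 32, 256, 128)
v = torch.randn(2, 32, 256, 128)

# Fused causal attention: the causal mask is handled inside the kernel and the
# (seq_len x seq_len) attention weights are never returned to the caller.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 32, 256, 128])
```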

Activation Checkpointing:

Selected intermediate activations are discarded during the forward pass and recomputed in the backward pass, keeping memory usage moderate at the cost of a modest amount of extra computation.
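A minimal illustration with PyTorch's torch.utils.checkpoint (the block being checkpointed here is a stand-in, not a LLaMA layer):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Illustrative "expensive" sub-layer standing in for a transformer block
block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)

x = torch.randn(8, 4096, requires_grad=True)

# checkpoint() drops the block's intermediate activations in the forward pass
# and recomputes them during backward, trading extra compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```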

Parallelism:

The models also use sequence and model parallelism to split computation across GPUs, which makes it possible to train the largest model, the 65B variant, on 2048 GPUs in about 21 days.
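As a conceptual illustration of model (tensor) parallelism, the single-process sketch below splits a weight matrix column-wise into shards, computes each partial output separately, and checks that concatenating them reproduces the unsharded result; in a real setup each shard would live on a different GPU and the concatenation would be a collective communication step:

```python
import torch

x = torch.randn(8, 4096)        # activations
w = torch.randn(4096, 4096)     # full weight matrix of a linear layer

# Split the weight column-wise into 4 shards, one per hypothetical GPU
shards = w.chunk(4, dim=1)
partial_outputs = [x @ shard for shard in shards]   # each shard computes a slice of the output
y_parallel = torch.cat(partial_outputs, dim=1)      # gather the slices

assert torch.allclose(y_parallel, x @ w, atol=1e-4)  # matches the unsharded result
```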

Performance Highlights

Competitiveness:

Remarkably, the 13B model outperforms the 175B-parameter GPT-3 on most benchmarks, demonstrating that careful architectural and training design can beat sheer parameter count.

Accessibility:

Fortunately, LLaMA-13B can be run on a single GPU, putting high-quality LLMs within reach of far more researchers and practitioners.

LLaMA demonstrates how careful design, a thoughtful approach to data, and a strong architecture can produce impressively capable language models. Through pre-normalization, efficient activations, and a scalable design, the LLaMA models deliver state-of-the-art performance while using far fewer resources. This makes LLaMA a reference point for future research into accessible and affordable AI for everyone.

We will continue to post about how these models evolve and how they are used in practice!
