Architecture
LLMs have pushed natural language processing to near-human capability, and Meta AI's LLaMA (Large Language Model Meta AI) is one of the best examples of how open and efficient large-scale models can be built. This blog post explores the LLaMA model and the details behind it, drawing on the original developers' research paper.
Introduction to LLaMA
LLaMA is a family of foundation models ranging from 7 billion to 65 billion parameters. Unlike other well-known models such as GPT-3 or PaLM, LLaMA is trained exclusively on publicly available data, which is a defining aspect of the project. Despite being much smaller than those models, LLaMA matches or exceeds them on many benchmarks, showing that efficient training and architecture can matter as much as sheer model size.
Main characteristics of the LLaMA architecture

LLaMA's architecture builds upon the transformer framework, incorporating adjustments that enhance its performance and efficiency:
Pre-Normalization:
Following GPT-3, LLaMA applies normalization to the input of each transformer sub-layer rather than to its output. This straightforward change improves training stability. The normalization function used is RMSNorm, which simply scales inputs by dividing them by their root mean square.
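To make the pre-normalization step concrete, here is a minimal PyTorch sketch of RMSNorm and of a sub-layer whose input is normalized before the residual connection. The class and function names are illustrative, not taken from Meta's code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scales inputs by their root mean square, with a learned gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rms(x) = sqrt(mean(x^2)); divide by it, then apply the learned scale
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

def pre_norm_residual(x: torch.Tensor, sublayer: nn.Module, norm: RMSNorm) -> torch.Tensor:
    # Pre-normalization: normalize the *input* of the sub-layer, not its output
    return x + sublayer(norm(x))
```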

Activation Function:
To improve accuracy, the ReLU non-linearity is replaced with the SwiGLU activation function, recently used in Google's PaLM model. SwiGLU enhances model performance, and to keep computation in check the feed-forward hidden dimension is set to 2/3·4d instead of the usual 4d.
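A minimal sketch of a SwiGLU feed-forward block under the 2/3·4d hidden size mentioned above; the layer names (`w1`, `w2`, `w3`) are placeholders, not the original identifiers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with a SwiGLU gate: (SiLU(x W1) * x W3) W2."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * (4 * dim) / 3)  # 2/3 * 4d instead of 4d
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```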
Rotary Positional Embeddings (RoPE):
Absolute positional embeddings are removed; instead, rotary positional embeddings (RoPE) are added at each layer, as in the GPT-Neo models. These embeddings help the model generalize positional information to different input lengths.
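The core idea can be sketched as below: pairs of feature dimensions are rotated by a position-dependent angle before queries and keys are compared. This is a simplified, illustrative implementation (the "rotate-half" form), not the original code.

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (seq_len, dim); dim must be even."""
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per pair of dimensions
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```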

Model Scaling:
The LLaMA series spans 7B, 13B, 33B, and 65B parameters, with the number of attention heads, layers, and hidden dimension scaled proportionally across the sizes.
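In code form, the scaling looks roughly like the configuration table below. The per-model dimensions, heads, and layers are as reported in the LLaMA paper; the dataclass itself is just an illustrative way to hold them.

```python
from dataclasses import dataclass

@dataclass
class LlamaConfig:
    dim: int        # hidden dimension
    n_heads: int    # attention heads
    n_layers: int   # transformer layers

# Model sizes as reported in the LLaMA paper
LLAMA_CONFIGS = {
    "7B":  LlamaConfig(dim=4096, n_heads=32, n_layers=32),
    "13B": LlamaConfig(dim=5120, n_heads=40, n_layers=40),
    "33B": LlamaConfig(dim=6656, n_heads=52, n_layers=60),
    "65B": LlamaConfig(dim=8192, n_heads=64, n_layers=80),
}
```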
Hyperparameters and Training Setup
Hyperparameters and the training setup refer to the configuration choices of the network that have to be tuned in order to achieve the best results.

The LLaMA models are trained on a massive corpus totaling up to 1.4 trillion tokens. A few notable hyperparameter choices are:

- Optimizer: AdamW with β1 = 0.9 and β2 = 0.95, combined with a cosine learning-rate schedule (a minimal setup sketch follows this list).
- Training: A batch size of 4 million tokens is used, along with common memory-friendly techniques such as gradient clipping and activation checkpointing.
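A minimal sketch of the optimizer setup described above, using standard PyTorch components; the placeholder `model`, the learning rate, and the step counts are illustrative, not the paper's exact values.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(4096, 4096)  # placeholder for the actual transformer

# AdamW with beta1=0.9, beta2=0.95 as described above
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

# Cosine learning-rate schedule; total step count here is a placeholder
scheduler = CosineAnnealingLR(optimizer, T_max=100_000)

for step in range(100):  # toy training loop
    loss = model(torch.randn(8, 4096)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```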
Equally important are the efficiency techniques described next.
Innovative Efficiency Techniques
LLaMA focuses on efficiency at both the training and inference stages. Key optimizations include:
Causal Multi-Head Attention:
First, a memory-optimized implementation avoids storing the attention weights and skips computing the key/query scores that the causal mask would discard anyway.
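The original work relies on the xformers library for this; a rough stand-in can be sketched with PyTorch's built-in fused attention, which likewise avoids materializing the full attention-weight matrix. Shapes and sizes below are arbitrary examples.

```python
import torch
import torch.nn.functional as F

# q, k, v: (batch, n_heads, seq_len, head_dim) -- example sizes
q = torch.randn(1, 32, 2048, 128)
k = torch.randn(1, 32, 2048, 128)
v = torch.randn(1, 32, 2048, 128)

# Fused causal attention: the full attention-weight matrix is never stored
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```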
Activation Checkpointing:
Selected intermediate activations are recomputed during the backward pass, keeping memory usage moderate at the cost of a modest amount of extra computation.
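A sketch of what this looks like with PyTorch's built-in checkpointing utility; the wrapped block is a placeholder, not a LLaMA layer.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(          # placeholder for one transformer layer
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)

x = torch.randn(8, 4096, requires_grad=True)

# Activations inside `block` are not kept; they are recomputed during backward
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```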
Parallelism:
The models also use model and sequence parallelism to divide the computation across GPUs, which makes it possible to train the largest models: the 65B model was trained on 2048 GPUs in roughly 21 days.
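The idea behind model (tensor) parallelism can be illustrated by splitting a weight matrix column-wise across devices. The single-process sketch below only shows the arithmetic; a real multi-GPU setup would place each shard on its own device and gather the partial outputs with collective communication.

```python
import torch

dim_in, dim_out = 1024, 4096
x = torch.randn(8, dim_in)
w = torch.randn(dim_in, dim_out)

# Split the weight matrix column-wise, as if each shard lived on its own GPU
w_shards = torch.chunk(w, chunks=4, dim=1)

# Each "device" computes its partial output independently...
partial_outputs = [x @ shard for shard in w_shards]

# ...and the shards are concatenated (an all-gather in a real distributed run)
y_parallel = torch.cat(partial_outputs, dim=1)

assert torch.allclose(y_parallel, x @ w, atol=1e-5)
```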
Performance Highlights
Competitiveness:
Notably, the 13B model outperforms the 175B-parameter GPT-3 on most benchmarks, showing that careful training and architectural design can beat sheer parameter count.
Accessibility:
LLaMA-13B can run on a single GPU, putting high-quality LLMs within reach of far more researchers and practitioners.
LLaMA demonstrates how good design, careful use of data, and a strong architecture can produce impressive language models. Through pre-normalization, efficient activations, and a scalable design, the LLaMA models deliver state-of-the-art performance while using fewer resources. This makes LLaMA a reference point for future research into accessible and affordable AI for everyone.
We will continue to post about how these models evolve and how they are used in practice!