
Gemini Architecture

With new capabilities steadily emerging in artificial intelligence, the ability to process and generate multiple types of data has become a headline feature. Google's Gemini is an example of a multimodal LLM (large language model) that can handle different data types within a single system. Built around a transformer core, Gemini's architecture can be extended to a wide range of tasks by adding decoders for specific modalities. Let's walk through how Gemini works, using the reference image below.

Gemini Architecture

1. Multimodal Input Processing

Fundamentally, Gemini works with multiple types of input, all of which are represented as sequences. These input modalities include:

1- Text (Aa): Written input such as questions, commands, or conversational prompts.

2- Audio (Waveform): Acoustic input for tasks such as speech recognition, speaker recognition, dictation, or sound analysis.

3- Images (Landscape Icon): Visual input for recognition, analysis, or image-related tasks such as generating art.

4- Videos (Play Icon): Frame sequences that provide temporal data, allowing the model to reason about changes in visual content over time.

All of these input types are tokenized and converted into sequences of embeddings, which are then fed into the core transformer block.
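To make this concrete, here is a minimal sketch of how different modalities might be turned into one sequence of embeddings of a shared width. Everything here (the embedding size, patch size, frame length, and the random, untrained projection matrices) is invented for illustration and is not Gemini's actual tokenizer or code:

```python
import numpy as np

D_MODEL = 512  # shared embedding width (illustrative value)

def embed_text(token_ids, vocab_size=32000, rng=np.random.default_rng(0)):
    """Look up each text token id in a (random, untrained) embedding table."""
    table = rng.normal(size=(vocab_size, D_MODEL))
    return table[token_ids]                      # (num_tokens, D_MODEL)

def embed_image(image, patch=16, rng=np.random.default_rng(1)):
    """Split an image into patches and project each patch to D_MODEL."""
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    proj = rng.normal(size=(patch * patch * c, D_MODEL))
    return patches @ proj                        # (num_patches, D_MODEL)

def embed_audio(waveform, frame=400, rng=np.random.default_rng(2)):
    """Chop a waveform into fixed-size frames and project each frame."""
    n = len(waveform) // frame * frame
    frames = waveform[:n].reshape(-1, frame)
    proj = rng.normal(size=(frame, D_MODEL))
    return frames @ proj                         # (num_frames, D_MODEL)

# One interleaved sequence of embeddings is what the transformer core consumes.
text_emb  = embed_text(np.array([5, 17, 942]))
image_emb = embed_image(np.zeros((64, 64, 3)))
audio_emb = embed_audio(np.zeros(16000))
sequence  = np.concatenate([text_emb, image_emb, audio_emb], axis=0)
print(sequence.shape)  # (3 + 16 + 40, 512)
```

The key point of the sketch is that once every modality is mapped to vectors of the same width, the transformer core can treat them as one uniform sequence.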

2. The Transformer Core

The transformer is the backbone of Gemini's architecture. It acts as a general-purpose processor that transforms and analyses the input embeddings and identifies patterns and relationships across them. Here's why the transformer is crucial:

  1. Unified Representation: By mapping all multimodal inputs into a single shared token space, the transformer processes every modality in the same way.
  2. Cross-Modal Understanding: The transformer lets each input modality interact with the others. For instance, it can associate a description with an image, or audio with video frames.
  3. Scalability: Transformers scale well with data and compute, which lets Gemini handle large datasets and demanding problems.

Here’s an in-depth look at the transformer’s role and functionality:

a. Unified Representation

As shown in the figure, the transformer core accepts the combined input modalities, text, audio, images, and even videos, and encodes them into a single shared representation. Because these different types of data are processed in one uniform, organized way, the model can exchange information across modalities. For example:

Image and text inputs can be combined to generate captions, for example describing a chart.

Audio and video can be synchronized for tasks such as subtitling or video verification.

b. Attention Mechanism

At the heart of the transformer architecture is the self-attention mechanism, which lets the model focus on the most relevant information. It works as follows:

1- Queries, Keys, and Values: For each input token, the transformer computes a "query", "key", and "value" vector and uses them to measure how strongly tokens relate to one another.

2- Contextual Understanding: The model assigns attention weights to different parts of the input, emphasizing the elements that matter most in the given context.

3- Cross-Attention for Multimodal Inputs: In multimodal settings, cross-attention lets the model connect data from different domains (for instance, associating a region of an image with its textual description).

This process is useful for tasks such as video summarization or generating text from an image input.
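The following NumPy sketch shows the core computation: scaled dot-product attention used once as self-attention (text attending to text) and once as cross-attention (text attending to image patches). It is a single-head toy example with random vectors and no learned projections, not Gemini's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention: weight 'values' by query-key similarity."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)   # how relevant each key is to each query
    weights = softmax(scores, axis=-1)         # attention weights sum to 1 per query
    return weights @ values                    # weighted mix of values

rng = np.random.default_rng(0)
d = 64
text_tokens  = rng.normal(size=(6, d))         # e.g. a 6-token caption
image_tokens = rng.normal(size=(16, d))        # e.g. 16 image patches

# Self-attention: text tokens attend to other text tokens.
self_out = attention(text_tokens, text_tokens, text_tokens)

# Cross-attention: text queries attend over image-patch keys/values,
# letting a word "look at" the image regions it describes.
cross_out = attention(text_tokens, image_tokens, image_tokens)
print(self_out.shape, cross_out.shape)  # (6, 64) (6, 64)
```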

c. Hierarchical Tokenization

Gemini's transformer core also uses hierarchical tokenization, especially for dense data types such as images and videos. For instance:

An image can be divided into patches, with each patch represented as a token.

A video may be tokenized spatially, frame by frame, with additional temporal tokens that capture motion dynamics.

The transformer processes these tokens layer by layer, progressively interpreting the structure of the input.
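As a toy illustration of this idea (not Gemini's actual tokenizer), the sketch below splits a short video clip into spatial patches per frame while recording each patch's frame, row, and column so that temporal order is preserved:

```python
import numpy as np

def tokenize_video(video, patch=16):
    """Hierarchical tokenization sketch: split every frame into spatial patches,
    and remember (frame, row, col) so temporal order is preserved."""
    t, h, w, c = video.shape
    tokens, positions = [], []
    for f in range(t):
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                tokens.append(video[f, i:i+patch, j:j+patch].reshape(-1))
                positions.append((f, i // patch, j // patch))  # temporal + spatial index
    return np.stack(tokens), positions

video = np.zeros((4, 32, 32, 3))        # 4 frames of a 32x32 RGB clip
tokens, positions = tokenize_video(video)
print(tokens.shape)                     # (16, 768): 4 frames x 4 patches, 16*16*3 values each
print(positions[:3])                    # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
```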

d. Multimodal Fusion

The transformer performs multimodal fusion, combining information from different data types to improve understanding and context. For example:

Combining text and image inputs lets the model produce richer, composite descriptions of images.

Fusing video with audio makes it possible to interpret dynamic scenes in detail.

This fusion process is central to Gemini's ability to handle complex, multimodal tasks.
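One common way to fuse modalities, sketched below purely for illustration, is to add a learnable "modality type" vector to each token and concatenate everything into one joint sequence, so that inside the transformer every token can attend to every other token regardless of its source:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
text_tokens  = rng.normal(size=(6, d))
image_tokens = rng.normal(size=(16, d))

# "Modality type" vectors tell the transformer which tokens came from where.
modality_embedding = {"text": rng.normal(size=(d,)), "image": rng.normal(size=(d,))}

fused = np.concatenate([
    text_tokens  + modality_embedding["text"],
    image_tokens + modality_embedding["image"],
], axis=0)                                   # one joint sequence of 22 tokens

# Every position in 'fused' can now attend to every other position,
# so text tokens can draw on image patches and vice versa.
print(fused.shape)  # (22, 64)
```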

3. Specialized Decoders for Output

Once the transformer processes the inputs, the output is routed through specialized decoders tailored to specific modalities:

Image Decoder: Produces image outputs or image-related results. It is used for tasks such as image captioning, object recognition, or generating new images from prompts.

Text Decoder: Produces responses, explanations, summaries, or findings in natural language. This decoder is especially important for conversational AI, summarization, and text-based queries.

Each decoder is fine-tuned to its modality, so the outputs are precise, semantically coherent, and well suited to the end task.
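As a rough sketch of this routing (the output heads, sizes, and pooling here are invented for illustration and are far simpler than real decoders), the transformer's hidden states are sent through the head for the requested output modality:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, VOCAB, PIXELS = 512, 32000, 64 * 64 * 3

# Hypothetical modality-specific output heads (randomly initialised for the sketch).
text_head  = rng.normal(size=(D_MODEL, VOCAB))    # hidden state -> vocabulary logits
image_head = rng.normal(size=(D_MODEL, PIXELS))   # hidden state -> flattened pixels

def decode(hidden_states, target_modality):
    """Route the transformer's output through the decoder for the requested modality."""
    pooled = hidden_states.mean(axis=0)            # crude pooling, just for the sketch
    if target_modality == "text":
        return pooled @ text_head                  # next-token logits
    if target_modality == "image":
        return (pooled @ image_head).reshape(64, 64, 3)
    raise ValueError(f"unsupported modality: {target_modality}")

hidden = rng.normal(size=(22, D_MODEL))            # output of the transformer core
print(decode(hidden, "text").shape)                # (32000,)
print(decode(hidden, "image").shape)               # (64, 64, 3)
```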

4. Key Features of Gemini's Architecture

Key Features of Gemini Architecture

a. Multimodal Integration

What sets Gemini apart is its ability to work with different types of data at once. For example, it can transform images into textual descriptions or map audio to video frames, capabilities that serve education, accessibility for the visually impaired, and creative work by artists and designers.

b. Unified Training

The model is trained on datasets spanning multiple modalities to improve generalization. This unified training approach means Gemini can perform well on a wide range of tasks without requiring separate models.

c. Fine-Tuned Decoders

Specialized decoders extend Gemini's functionality by adapting outputs to different modalities, whether that means writing human-like text or generating photorealistic images.

d. Scalability and Efficiency

Building on the transformer's scalability, Gemini can handle very large datasets and demanding enterprise-scale applications.

Technologies enhancing Gemini's performance

a) Decoder-only Transformer model:

Like many generative AI models, Gemini models are built on decoder-only transformers (base model [20]). However, the standard decoder-only architecture was modified to improve efficiency, stabilize training at scale, and optimize inference on Google's TPUs. The models employ multi-query attention, a method that improves on multi-head attention's efficiency by allowing the attention heads to share key and value vectors. Additionally, Gemini leverages several optimization and architectural tricks, such as the Lion optimizer, Low Precision Layer Normalization, Flash Attention, and Flash Decoding (built on top of Flash Attention).
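The NumPy sketch below illustrates the multi-query attention idea: each head keeps its own query projection, but all heads share a single key projection and a single value projection, which shrinks the key/value cache at inference time. The shapes and single-layer setup are simplified and do not reflect Gemini's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, wq_per_head, wk, wv):
    """Each head has its own query projection, but all heads share one key and
    one value projection (the defining property of multi-query attention)."""
    keys   = x @ wk                                    # shared across heads
    values = x @ wv                                    # shared across heads
    outputs = []
    for wq in wq_per_head:                             # one query projection per head
        q = x @ wq
        scores = q @ keys.T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ values)
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
d_model, d_head, n_heads, seq = 128, 32, 4, 10
x  = rng.normal(size=(seq, d_model))
wq = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
wk = rng.normal(size=(d_model, d_head))
wv = rng.normal(size=(d_model, d_head))
print(multi_query_attention(x, wq, wk, wv).shape)      # (10, 128)
```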

b) TPU Accelerators:

Gemini models were trained using TPUv4 and TPUv5e, depending on their size and configuration. These specially designed AI accelerators are core to Google's AI-driven products and enable cost-effective training of AI models. TPUv4 includes SparseCores, specialized dataflow processors that speed up models dependent on embeddings by 5 to 7 times while consuming only about 5% of the die area and power. Training Gemini Ultra, which used a large number of TPUv4 accelerators spread across multiple data centers, came with a proportional decrease in the mean time between hardware failures across the system. TPUv5e is the newest generation of AI accelerators and the successor to TPUv4 lite. It features a compact 256-chip configuration per Pod, whereas TPUv4 has 4,096 chips per Pod. These Pods are tailored for training, fine-tuning, and deploying transformer-based, text-to-image, and CNN-based models. TPUv5e enables Google to run inference on models larger than OpenAI's at roughly the same cost as OpenAI's smaller models.

c) Retrieval-Augmented Generation (RAG):

Fundamentally, RAG is an AI framework designed to optimize the information given to the model: it minimizes irrelevant content by feeding the model more relevant, external information. Because of its limited context window, Gemini Pro integrates RAG-based information retrieval with text generation, resulting in more factually grounded outputs. RAG extracts useful passages from the source (for example, a book), indexes them using TF-IDF, and stores the results in an external database. Using cosine similarity, the passages are re-ranked and the most relevant ones are retrieved (up to 4k tokens). The retrieved passages are then placed into the context following a temporal ordering.
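Here is a minimal sketch of the retrieval step described above, using TF-IDF indexing and cosine-similarity re-ranking. The passages, query, and use of scikit-learn are all illustrative; this is not Gemini's internal retrieval stack:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy "external database" of passages (in practice, chunks of a long document or book).
passages = [
    "Chapter 1: the protagonist moves to the coast and opens a bookshop.",
    "Chapter 2: a storm floods the town and the bookshop is damaged.",
    "Chapter 3: neighbours help rebuild the shop over the summer.",
]

# Index the passages with TF-IDF.
vectorizer = TfidfVectorizer()
passage_vectors = vectorizer.fit_transform(passages)

def retrieve(query, top_k=2):
    """Re-rank passages by cosine similarity to the query, keep the best ones,
    then restore the original (temporal) order before adding them to the context."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, passage_vectors)[0]
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    best = sorted(ranked[:top_k])            # temporal order, not score order
    return [passages[i] for i in best]

context = "\n".join(retrieve("What happened to the bookshop in the storm?"))
print(context)  # the two most relevant passages, in their original order
```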

Highly efficient architecture

Gemini 1.5 is built upon Google's leading research on Transformer and Mixture-of-Experts (MoE) architectures. While a traditional Transformer functions as one large neural network, MoE models are divided into smaller "expert" neural networks.

MoE Architecture

Gemini 1.5 can complete complex tasks faster and with less computing load thanks to the MoE architecture (read the paper if you’re interested). Consider it essentially as a constellation of “expert” neural networks that selectively activate the most pertinent pathways based on the input, greatly increasing efficiency. Compared to earlier models, this allows for far more complex thinking and problem-solving skills.

Depending on the type of input given, MoE models learn to selectively activate only the most relevant expert pathways in their neural network. This specialization massively enhances the model's efficiency. Google has been an early adopter and pioneer of the MoE technique for deep learning through research such as Sparsely-Gated MoE, GShard-Transformer, Switch-Transformer, M4 and more.
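A toy sketch of the routing idea follows (random "experts", a single token, and a simple top-k gate, purely for illustration): a small gating network scores all experts, only the top-k experts actually run, and their outputs are mixed by gate weight, which is where the compute savings come from.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_experts, top_k = 32, 8, 2

# Each "expert" is a small feed-forward block (here just one random matrix).
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router  = rng.normal(size=(d, n_experts))          # gating network

def moe_layer(token):
    """Route a token to its top-k experts and mix their outputs by gate weight.
    Only k of the n experts run for this token."""
    gate = softmax(token @ router)                 # probability per expert
    chosen = np.argsort(gate)[-top_k:]             # indices of the top-k experts
    out = np.zeros(d)
    for i in chosen:
        out += gate[i] * (token @ experts[i])      # weighted expert output
    return out

token = rng.normal(size=(d,))
print(moe_layer(token).shape)  # (32,)
```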

The latest innovations in model architecture allow Gemini 1.5 to learn complex tasks more quickly and maintain quality, while being more efficient to train and serve. These efficiencies are helping the teams iterate, train and deliver more advanced versions of Gemini faster than ever before.

Greater context, more helpful capabilities

An AI model’s “context window” is made up of tokens, which are the building blocks used for processing information. Tokens can be entire parts or subsections of words, images, videos, audio or code. The bigger a model’s context window, the more information it can take in and process in a given prompt — making its output more consistent, relevant and useful.

Through a series of machine learning innovations, we’ve increased 1.5 Pro’s context window capacity far beyond the original 32,000 tokens for Gemini 1.0. We can now run up to 1 million tokens in production.

This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens.
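A back-of-envelope check of those figures, using only the numbers quoted above (roughly 700,000 words in a 1,000,000-token window, i.e. about 0.7 words per token; this ratio is implied by the article, not an exact tokenization rule):

```python
CONTEXT_TOKENS = 1_000_000
WORDS_PER_TOKEN = 700_000 / 1_000_000      # ratio implied by the figures above

def words_that_fit(context_tokens=CONTEXT_TOKENS):
    return int(context_tokens * WORDS_PER_TOKEN)

def tokens_needed(word_count):
    return int(word_count / WORDS_PER_TOKEN)

print(words_that_fit())                  # 700000 words fit in a 1M-token window
print(tokens_needed(80_000))             # ~114285 tokens for an 80,000-word novel
print(tokens_needed(80_000) <= 32_000)   # False: would not fit in Gemini 1.0's 32k window
```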

Gemini context length