
AI models need to process and understand complex data like text, images, and audio. However, raw data cannot be used directly; it must first be converted into numerical representations that AI models can process efficiently.
Embeddings are dense vector representations of data that capture semantic meaning and relationships between different entities. They are widely used in Natural Language Processing (NLP), computer vision, recommendation systems, and retrieval-based AI models.
An embedding is a vector representation of an entity (such as a word, image, or user behavior) in a continuous space where similar entities are placed closer together.
Example: In a word embedding model, the words "king" and "queen" will have similar vector representations because they are semantically related.
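Closeness in the embedding space is usually measured with cosine similarity. A minimal sketch of the idea, using invented 3-dimensional vectors purely for illustration (real embeddings have hundreds of dimensions, and the values here are not from any trained model):

```python
import numpy as np

# Hypothetical embeddings, invented for illustration only
king = np.array([0.8, 0.6, 0.1])
queen = np.array([0.7, 0.7, 0.2])
apple = np.array([0.1, 0.2, 0.9])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(king, queen))  # high: semantically related words
print(cosine_similarity(king, apple))  # lower: unrelated words
```

Because cosine similarity depends only on direction, not magnitude, it is the standard choice for comparing embeddings of different scales.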
Embeddings are typically learned using deep learning models that analyze relationships between data points.
Example: Google Translate uses embeddings to represent words across languages.
Example: CLIP by OpenAI enables AI to search for images using text descriptions.
Example: Spotify’s AI models use song embeddings to recommend personalized playlists.
Example: Retrieval-based AI search engines compare query and document embeddings to deliver more context-aware and relevant results.
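Embedding-based retrieval reduces to a nearest-neighbour search: embed the query, then rank documents by cosine similarity. A small sketch with invented document vectors (the filenames and values are hypothetical, chosen only to illustrate the ranking step):

```python
import numpy as np

# Hypothetical document embeddings (values invented for illustration)
doc_embeddings = {
    "intro_to_ai.txt":   np.array([0.9, 0.1, 0.0]),
    "cooking_pasta.txt": np.array([0.0, 0.2, 0.9]),
    "ml_basics.txt":     np.array([0.6, 0.5, 0.2]),
}

def search(query_vec, docs, top_k=2):
    """Rank documents by cosine similarity to the query embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(docs, key=lambda name: cos(query_vec, docs[name]), reverse=True)
    return ranked[:top_k]

query = np.array([0.85, 0.2, 0.05])  # a hypothetical "AI-related" query embedding
print(search(query, doc_embeddings))  # AI-related documents rank first
```

Production systems replace the linear scan with approximate nearest-neighbour indexes, but the ranking principle is the same.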
Training word embeddings using Word2Vec
```python
from gensim.models import Word2Vec

# Sample dataset
sentences = [
    ["I", "love", "AI"],
    ["AI", "is", "amazing"],
    ["Embeddings", "capture", "meaning"]
]

# Function to train Word2Vec model
def train_word2vec(sentences, vector_size=10, window=5, min_count=1, workers=4):
    model = Word2Vec(sentences, vector_size=vector_size, window=window,
                     min_count=min_count, workers=workers)
    return model

# Train the Word2Vec model
model = train_word2vec(sentences)

# Function to get the embedding of a word
def get_word_embedding(model, word):
    if word in model.wv:
        return model.wv[word]
    else:
        raise ValueError(f"The word '{word}' is not in the vocabulary.")

# Get vector representation of "AI"
try:
    ai_embedding = get_word_embedding(model, "AI")
    print("AI Embedding:", ai_embedding)
except ValueError as e:
    print(e)
```

Output (exact values vary with random initialization):

```
AI Embedding: [-0.00536227 0.00236431 0.0510335 0.09009273 -0.0930295 -0.07116809 0.06458873 0.08972988 -0.05015428 -0.03763372]
```
The gensim.models.Word2Vec module is used to train word embeddings, which represent words as numerical vectors.
A small list of sentences is defined, where each sentence is a list of words (tokens).
The train_word2vec function trains a Word2Vec model using the following parameters: vector_size=10 (the dimensionality of each word vector), window=5 (the maximum distance between the current and predicted word), min_count=1 (words occurring fewer times than this are ignored), and workers=4 (the number of parallel training threads).
The function is called with the dataset to create a trained model.
The get_word_embedding function retrieves the vector representation of a word if it exists in the model’s vocabulary. If the word is missing, an error is raised.
The script attempts to fetch the embedding for "AI". If successful, it prints the vector; otherwise, it displays an error message.
This code essentially trains a simple Word2Vec model and demonstrates how to extract meaningful vector representations of words.