Embeddings play a crucial role in Natural Language Processing (NLP) by converting words, sentences, and entire documents into numerical representations. These embeddings allow models to capture semantic meaning, relationships, and context effectively.
In this article, we’ll explore text embeddings, pre-trained word embeddings, and training our own word embeddings with word2vec.
Let’s dive in!
While token embeddings help models process individual tokens, text embeddings allow us to work with entire sentences, paragraphs, or even full documents.
A text embedding model converts a sentence into a single numerical vector that represents its meaning. These embeddings power applications such as:
We can generate text embeddings using sentence-transformers, a powerful Python package that loads pre-trained embedding models.
Here’s an example using Google Colab:
# Install sentence-transformers (if not already installed)
!pip install -q sentence-transformers
# Import required libraries
from sentence_transformers import SentenceTransformer
# Load a pre-trained embedding model
model = SentenceTransformer("all-MiniLM-L6-v2") # A lightweight and efficient model
# Convert a sentence into an embedding vector
text = "Artificial intelligence is revolutionizing healthcare."
vector = model.encode(text)
# Print vector shape
print("Embedding Shape:", vector.shape)
Output:
Embedding Shape: (384,)
Here, our sentence embedding is represented as a 384-dimensional numerical vector. Larger models (like all-mpnet-base-v2) can produce embeddings with 768 dimensions for richer representations.
💡 Use Case: These embeddings can be used for document similarity, where we compare vector distances to find related texts.
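As a minimal sketch of that comparison, the snippet below ranks a few texts by cosine similarity to a query. The 4-dimensional vectors are made-up stand-ins (an assumption for illustration); in practice each one would come from `model.encode(...)` and have 384 dimensions. `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, not part of sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_name, score) pairs, most similar first."""
    scores = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (hypothetical values).
doc_vecs = {
    "AI in hospitals": [0.9, 0.8, 0.1, 0.0],
    "Machine learning for diagnosis": [0.8, 0.9, 0.2, 0.1],
    "Best pasta recipes": [0.0, 0.1, 0.9, 0.8],
}
query_vec = [1.0, 0.7, 0.0, 0.1]  # stand-in for an encoded query

ranking = rank_by_similarity(query_vec, doc_vecs)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```

The two healthcare-flavored vectors point in nearly the same direction as the query, so they rank first, while the unrelated one lands last. Cosine similarity is a common choice here because it compares direction and ignores vector length.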
Beyond full-text embeddings, word embeddings capture relationships between individual words. Unlike traditional one-hot encoding, where each word is treated as an isolated entity, word embeddings map words into a vector space based on their meaning and context.
Example: Words with similar meanings will have closer vector representations:
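A toy illustration of the difference, using hand-made vectors rather than a trained model: with one-hot encoding every pair of distinct words has a similarity of exactly zero, while dense vectors can express that some words are closer than others. The three-word vocabulary and all numeric values below are assumptions chosen for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

vocab = ["king", "queen", "banana"]

# One-hot: each word gets its own axis, so distinct words are always orthogonal.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense toy embeddings (made-up values, not taken from GloVe or any real model).
dense = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

for name, table in [("one-hot", one_hot), ("dense", dense)]:
    king_queen = cosine_similarity(table["king"], table["queen"])
    king_banana = cosine_similarity(table["king"], table["banana"])
    print(f"{name:8s} king~queen={king_queen:.2f}  king~banana={king_banana:.2f}")
```

In the one-hot table both similarities are zero, so "king" is no closer to "queen" than to "banana". In the dense table the first pair scores far higher than the second, which is the property word embeddings are meant to provide.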
We can use the Gensim library to download and explore pre-trained GloVe embeddings trained on large text datasets.
# Install Gensim (if not already installed)
!pip install --no-cache-dir --upgrade gensim
import gensim.downloader as api
# Load GloVe embeddings
model = api.load("glove-wiki-gigaword-50")
# Find similar words to "king" but filter out Roman numerals
similar_words = [
    (word, score) for word, score in model.most_similar("king", topn=5)
    if not word.isdigit() and word.lower() not in ["ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"]
]
# print similar words
print(similar_words)
These results show that “prince” and “queen” are semantically close to “king” based on their word embeddings.
Unlike pre-trained embeddings, the word2vec algorithm allows us to train our own word embeddings based on a given dataset. It works using two key techniques:
We can use Gensim to train a simple word2vec model on a custom dataset.
# Install necessary libraries
!pip install -q gensim nltk
# Import required modules
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
# Download the tokenizer
nltk.download('punkt_tab')
# Sample text corpus
text_corpus = [
    "Artificial intelligence is changing the world.",
    "Deep learning and machine learning are subsets of AI.",
    "Natural language processing allows machines to understand text.",
    "Neural networks are powerful for deep learning tasks.",
]
# Tokenize sentences
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in text_corpus]
# Train Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_corpus, vector_size=50, window=2, min_count=1, workers=4)
# Find similar words to "learning"
print(word2vec_model.wv.most_similar("learning"))
Output:
[('artificial', 0.1886581927537918), ('world', 0.18857981264591217), ('allows', 0.16100163757801056), ('natural', 0.16039778292179108), ('processing', 0.1384490579366684), ('the', 0.1285376399755478), ('machines', 0.12337520718574524), ('of', 0.08614791929721832), ('deep', 0.0684993714094162), ('tasks', 0.0357402078807354)]
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
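With only four sentences to learn from, these similarity scores are all low (below 0.2) and largely reflect noise rather than meaning: the top matches for “learning” are “artificial”, “world”, and “allows”, while “deep” scores only about 0.07. The model needs a much larger corpus before its nearest neighbors become semantically meaningful, and the exact numbers may vary between runs.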
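To make the `window=2` setting used in the training call more concrete, here is a rough sketch of how (center, context) word pairs can be drawn from a tokenized sentence, assuming the skip-gram formulation of word2vec. `context_pairs` is a hypothetical helper written for illustration; it is not a Gensim function, and Gensim's internal sampling differs in detail (for example, it can randomly shrink the window for each word).

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["neural", "networks", "are", "powerful"]
pairs = context_pairs(tokens, window=2)
for center, context in pairs:
    print(center, "->", context)
```

With a window of 2, "neural" is paired with "networks" and "are" but never with "powerful", which sits three positions away. A wider window pulls in more distant words as context, which tends to capture broader topical similarity at the cost of more training pairs.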