Text and Word Embeddings in NLP Using LLMs

Embeddings play a crucial role in Natural Language Processing (NLP) by converting words, sentences, and entire documents into numerical representations. These embeddings allow models to capture semantic meaning, relationships, and context effectively.

In this article, we’ll explore:

  • Text embeddings – used for processing entire sentences or documents
  • Word embeddings – capturing relationships between individual words
  • The word2vec algorithm – a powerful technique for training word embeddings

Let’s dive in!

Text Embeddings: Representing Sentences and Documents

While token embeddings help models process individual tokens, text embeddings allow us to work with entire sentences, paragraphs, or even full documents.
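One common way to go from per-token vectors to a single text vector is mean pooling, which simply averages the token vectors. The sketch below illustrates the idea with made-up 3-dimensional token vectors; the function name `mean_pool` and the numbers are assumptions for illustration, and real models add steps such as attention masking and normalization.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one text vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Hypothetical token embeddings for the text "AI helps doctors"
token_vectors = [
    [1.0, 0.0, 2.0],  # "AI"
    [0.0, 3.0, 1.0],  # "helps"
    [2.0, 3.0, 0.0],  # "doctors"
]

sentence_vector = mean_pool(token_vectors)
print(sentence_vector)  # [1.0, 2.0, 1.0]
```

However many tokens go in, the result has the same number of dimensions as a single token vector, which is what makes texts of different lengths comparable.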

A text embedding model converts a sentence into a single numerical vector that represents its meaning. These embeddings power applications such as:

  • Semantic search
  • Text classification
  • Information retrieval
  • Question-answering systems

How to Generate Text Embeddings?

We can generate text embeddings using sentence-transformers, a powerful Python package that loads pre-trained embedding models.

Here’s an example using Google Colab:

# Install sentence-transformers (if not already installed)
!pip install -q sentence-transformers

# Import required libraries
from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")  # A lightweight and efficient model

# Convert a sentence into an embedding vector
text = "Artificial intelligence is revolutionizing healthcare."
vector = model.encode(text)

# Print vector shape
print("Embedding Shape:", vector.shape)

Output:

Embedding Shape: (384,)

Here, our sentence embedding is represented as a 384-dimensional numerical vector. Larger models (like all-mpnet-base-v2) can produce embeddings with 768 dimensions for richer representations.

💡 Use Case: These embeddings can be used for document similarity, where we compare vector distances to find related texts.
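A minimal sketch of that comparison, using cosine similarity to rank documents against a query. The 3-dimensional vectors and document titles are hypothetical stand-ins; in practice each vector would come from `model.encode(...)` and have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings (made-up numbers standing in for model.encode output)
doc_vectors = {
    "AI in hospitals": [0.9, 0.1, 0.0],
    "Football results": [0.0, 0.2, 0.9],
    "Machine learning for diagnosis": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.2, 0.0]  # stand-in for an embedded search query

# Rank documents from most to least similar to the query
ranked = sorted(
    doc_vectors,
    key=lambda title: cosine_similarity(query_vector, doc_vectors[title]),
    reverse=True,
)
for title in ranked:
    score = cosine_similarity(query_vector, doc_vectors[title])
    print(f"{score:.3f}  {title}")
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing text embeddings.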

Word Embeddings: Understanding Individual Words

Beyond full-text embeddings, word embeddings capture relationships between individual words. Unlike traditional one-hot encoding, where each word is treated as an isolated entity, word embeddings map words into a vector space based on their meaning and context.

Example: Words with similar meanings will have closer vector representations:

  • “King” and “Queen” are close in embedding space.
  • “Apple” and “Orange” are closer than “Apple” and “Car”.
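The contrast with one-hot encoding can be sketched in a few lines. The tiny vocabulary and the 2-dimensional dense vectors below are made up for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math

vocab = ["apple", "orange", "car"]


def one_hot(word):
    """Return a vector with 1.0 at the word's vocabulary position, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]


# With one-hot vectors, every pair of distinct words is equally far apart
onehot_apple_orange = math.dist(one_hot("apple"), one_hot("orange"))
onehot_apple_car = math.dist(one_hot("apple"), one_hot("car"))

# Hypothetical dense embeddings (illustrative values, not from a trained model)
dense = {
    "apple": [0.9, 0.1],
    "orange": [0.8, 0.2],
    "car": [0.1, 0.9],
}
dense_apple_orange = math.dist(dense["apple"], dense["orange"])
dense_apple_car = math.dist(dense["apple"], dense["car"])

print("one-hot:", onehot_apple_orange, onehot_apple_car)
print("dense:  ", dense_apple_orange, dense_apple_car)
```

The one-hot distances are identical, so they carry no information about meaning, whereas the dense vectors place “apple” nearer to “orange” than to “car”.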

Using Pre-trained Word Embeddings (GloVe)

We can use the Gensim library to download and explore pre-trained GloVe embeddings trained on large text datasets.

# Install Gensim (if not already installed)
!pip install --no-cache-dir --upgrade gensim

import gensim.downloader as api

# Load GloVe embeddings (50-dimensional vectors)
model = api.load("glove-wiki-gigaword-50")

# Roman numerals often appear next to "king" (e.g. "Henry VIII"), so skip them
roman_numerals = {"ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"}

# Find similar words to "king"; filtering may leave fewer than 5 results
similar_words = [
    (word, score) for word, score in model.most_similar("king", topn=5)
    if not word.isdigit() and word.lower() not in roman_numerals
]

# Print similar words
print(similar_words)

Output:

[Figure: Pre-trained Word Embeddings (GloVe) output]

These results show that “prince” and “queen” are semantically close to “king” based on their word embeddings.

The Word2Vec Algorithm: Training Custom Word Embeddings

Unlike pre-trained embeddings, the word2vec algorithm allows us to train our own word embeddings on a given dataset. It comes in two architectures, skip-gram and continuous bag-of-words (CBOW). The skip-gram variant relies on two key techniques:

  • Skip-gram Model – Predicts surrounding words given a central word.
  • Negative Sampling – Helps distinguish correct word relationships from random noise.
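To make these two ideas concrete, here is a simplified sketch of how skip-gram training data could be built: every word is paired with its neighbors inside a window, and each real pair is accompanied by a few randomly drawn "negative" words. The helper names are hypothetical, and the negatives are drawn uniformly for simplicity, whereas word2vec draws them from a smoothed word-frequency distribution.

```python
import random


def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def negative_samples(center, context, vocab, k, rng):
    """Draw k words that are neither the center nor the true context word.

    Simplification: uniform sampling instead of a frequency-based distribution.
    """
    candidates = [w for w in vocab if w not in (center, context)]
    return rng.sample(candidates, k)


tokens = ["deep", "learning", "is", "fun"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for center, context in pairs[:2]:
    negs = negative_samples(center, context, tokens, k=2, rng=rng)
    print(f"positive: ({center}, {context})  negatives: {negs}")
```

During training, the model learns to score the positive pairs highly and the negative pairs low, which avoids computing a probability over the whole vocabulary at every step.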

Training Word2Vec on Custom Text

We can use Gensim to train a simple word2vec model on a custom dataset.

# Install necessary libraries
!pip install -q gensim nltk

# Import required modules
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

# Download the tokenizer data used by word_tokenize
nltk.download('punkt_tab')

# Sample text corpus
text_corpus = [
    "Artificial intelligence is changing the world.",
    "Deep learning and machine learning are subsets of AI.",
    "Natural language processing allows machines to understand text.",
    "Neural networks are powerful for deep learning tasks."
]

# Tokenize sentences (lowercased so that "Deep" and "deep" share one token)
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in text_corpus]

# Train Word2Vec model. Gensim defaults to CBOW (sg=0) with negative sampling;
# pass sg=1 to train the skip-gram variant instead.
word2vec_model = Word2Vec(sentences=tokenized_corpus, vector_size=50, window=2, min_count=1, workers=4)

# Find similar words to "learning"
print(word2vec_model.wv.most_similar("learning"))

Output:

[('artificial', 0.1886581927537918), ('world', 0.18857981264591217), ('allows', 0.16100163757801056), ('natural', 0.16039778292179108), ('processing', 0.1384490579366684), ('the', 0.1285376399755478), ('machines', 0.12337520718574524), ('of', 0.08614791929721832), ('deep', 0.0684993714094162), ('tasks', 0.0357402078807354)]
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!

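With only four sentences of training data, these scores are low (all below 0.2) and largely reflect random initialization rather than meaning: the top neighbors of “learning” are “artificial” and “world”, while “deep” ranks near the bottom. Meaningful neighbors such as “deep” or “machine” only emerge when the model is trained on a much larger corpus, and exact results can vary between runs.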

Key takeaways:

  • Text embeddings represent full sentences or documents as single vectors.
  • Word embeddings capture relationships between words in a vector space.
  • Pre-trained embeddings (GloVe, Word2Vec) allow models to leverage vast linguistic knowledge.
  • Training custom embeddings tailors representations to domain-specific tasks, provided the training corpus is large enough.