Traditional keyword-based search engines can only take us so far. What if we could search based on meaning rather than matching exact words? That’s where semantic search comes in, powered by modern language models and embeddings. In this post, we’ll walk through the core concepts, explain how dense retrieval works, and build a semantic search tool using sentence-transformers and FAISS.
Semantic search focuses on understanding the intent behind a query and matching it with semantically similar content. Instead of comparing words, we compare embeddings—vector representations of text—generated by a language model.
Think of each sentence as a point in a high-dimensional space. Texts with similar meanings will be closer together in this space.
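To make that geometry concrete, here is a toy sketch using made-up 3-dimensional vectors as stand-ins for real embeddings (which typically have hundreds of dimensions): related "meanings" yield a higher cosine similarity.

```python
import numpy as np

# Hypothetical 3-d "embeddings" for illustration only; real models
# produce vectors with hundreds of dimensions, but the geometry is the same.
film = np.array([0.9, 0.1, 0.0])
movie = np.array([0.8, 0.2, 0.1])   # close in meaning to "film"
banana = np.array([0.0, 0.1, 0.9])  # unrelated

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(film, movie))   # high: the points are close
print(cosine_similarity(film, banana))  # low: the points are far apart
```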
Here’s how dense retrieval works in practice:

1. Embed every document (or sentence) into a vector using a language model.
2. Store those vectors in a vector index.
3. Embed the incoming query with the same model.
4. Return the documents whose vectors are nearest to the query vector.
Let’s build it step-by-step using the Wikipedia summary of Interstellar.
Install the libraries (rank-bm25 is for the keyword baseline we'll build later):

```shell
pip install sentence-transformers faiss-cpu pandas rank-bm25
```

## Step 1: Prepare the Text
Let’s work with a sample document about Interstellar:
text = """
Interstellar is a 2014 science fiction film directed by Christopher Nolan.
The film stars Matthew McConaughey and Anne Hathaway.
Set in a dystopian future, it follows astronauts searching for a new home for humanity.
Kip Thorne, a theoretical physicist, was a scientific consultant on the film.
Interstellar premiered in 2014 and was praised for scientific accuracy and visual effects.
It grossed over $677 million worldwide.
"""
# Split and clean sentences
sentences = [s.strip() for s in text.split('.') if s.strip()]Step 2: Generate Embeddings
We’ll use sentence-transformers for generating embeddings:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
sentence_embeddings = model.encode(sentences)
```

## Step 3: Create a FAISS Search Index
FAISS is a library optimized for fast vector search:
```python
import faiss

dimension = sentence_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # exact search by L2 (Euclidean) distance
index.add(np.array(sentence_embeddings).astype('float32'))
```

## Step 4: Search by Semantic Meaning
Let’s define a simple search function:
```python
import pandas as pd

def semantic_search(query, k=3):
    query_embedding = model.encode([query])
    distances, indices = index.search(np.array(query_embedding).astype('float32'), k)
    results = pd.DataFrame({
        'Text': [sentences[i] for i in indices[0]],
        'Distance': distances[0]
    })
    return results
```
```python
# Try it out:
semantic_search("how accurate was the science")
```

You'll get a ranked list of the most relevant sentences, based not on exact word matches but on semantic similarity.
For comparison, let’s also look at a simple keyword-based method using BM25:
```python
from rank_bm25 import BM25Okapi
import string

def tokenize(text):
    return [word.strip(string.punctuation).lower() for word in text.split() if word]

tokenized_corpus = [tokenize(s) for s in sentences]
bm25 = BM25Okapi(tokenized_corpus)

def keyword_search(query, k=3):
    tokenized_query = tokenize(query)
    scores = bm25.get_scores(tokenized_query)
    top_indices = np.argsort(scores)[::-1][:k]
    results = pd.DataFrame({
        'Text': [sentences[i] for i in top_indices],
        'Score': [scores[i] for i in top_indices]
    })
    return results
```
```python
# Now compare both methods:
semantic_search("how accurate was the science")
keyword_search("how accurate was the science")
```

Since models can't handle unlimited input length, we split long documents into chunks before embedding them.
Overlapping chunks preserve context across chunk boundaries:

```python
def chunk_text(text, chunk_size=3, overlap=1):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    chunks = []
    # Slide a window of chunk_size sentences, stepping by chunk_size - overlap
    for i in range(0, len(sentences), chunk_size - overlap):
        chunk = '. '.join(sentences[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
```

You can further improve results by fine-tuning embedding models using positive and negative query-document pairs:
The model learns to bring positive pairs closer and push irrelevant ones farther apart.
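This training objective, used for example by sentence-transformers' MultipleNegativesRankingLoss, can be sketched in plain NumPy (the toy vectors below are made up for illustration, and this is a simplified version of what the library computes): a softmax cross-entropy over scaled cosine similarities, where the positive document should score highest.

```python
import numpy as np

def contrastive_loss(query, positive, negatives, scale=20.0):
    """Cross-entropy over scaled cosine similarities: low when the
    positive document is closest to the query, high otherwise."""
    docs = np.vstack([positive] + list(negatives))
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = scale * (d @ q)                        # scaled cosine similarities
    log_probs = sims - np.log(np.sum(np.exp(sims)))
    return float(-log_probs[0])                   # positive pair sits at index 0

# Toy vectors: training would nudge q toward pos and away from the negatives
q = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
negs = [np.array([0.0, 1.0]), np.array([-1.0, 0.2])]
print(contrastive_loss(q, pos, negs))  # small: the positive is already closest
```

Minimizing this loss over many labeled pairs is what pulls positive pairs together and pushes irrelevant ones apart in the embedding space.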
Semantic search opens up a new dimension for information retrieval. You’re no longer bound by keywords—now your queries can be understood in context. And with tools like sentence-transformers and FAISS, it’s easier than ever to build your own intelligent search system.