Embeddings are the bridge between human-readable text and machine-processable vectors. In this lesson, we'll take a deep dive into embedding models: how they work, how to choose the right one, and best practices for generation and storage.
Embeddings are dense vector representations of text that capture semantic meaning. Similar concepts map to nearby points in vector space, enabling similarity-based search.
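As a toy illustration of "nearby points" (hand-made 2D vectors, not real embeddings), semantically related items score higher under cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-made 2D "embeddings" for illustration only
vectors = {
    "dog":   np.array([0.9, 0.1]),
    "puppy": np.array([0.8, 0.2]),
    "car":   np.array([0.1, 0.9]),
}

print(cosine_similarity(vectors["dog"], vectors["puppy"]))  # high
print(cosine_similarity(vectors["dog"], vectors["car"]))    # low
```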
Static word embeddings (such as word2vec or GloVe) assign each word a single vector, regardless of context. The limitation: "bank" has the same embedding whether it means the bank of a river or a financial institution.
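A minimal sketch of the problem, using a hypothetical lookup table in place of a trained word2vec model:

```python
import numpy as np

# A static embedding model is essentially a fixed lookup table
# (toy 3-d vectors here; real models use hundreds of dimensions)
static_embeddings = {
    "bank":  np.array([0.2, 0.7, 0.1]),
    "river": np.array([0.9, 0.3, 0.0]),
    "money": np.array([0.1, 0.2, 0.8]),
}

def embed_word(word: str) -> np.ndarray:
    return static_embeddings[word]

# The same vector comes back regardless of the surrounding sentence
riverside = embed_word("bank")   # "we sat on the river bank"
financial = embed_word("bank")   # "the bank raised interest rates"
assert np.array_equal(riverside, financial)
```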
Contextual embeddings (such as those from BERT-style models) give each token a different embedding based on its surrounding context. These models produce per-token embeddings; for RAG, we need sentence- or document-level vectors, which means the token embeddings must be pooled into a single vector.
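Mean pooling is the most common way to collapse per-token embeddings into one sentence vector. A sketch with dummy token vectors standing in for a transformer's output:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the token vectors, ignoring padding positions."""
    mask = attention_mask[:, None]               # shape (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = mask.sum()
    return summed / count

# Dummy per-token embeddings: 4 tokens, 3 dims, last position is padding
tokens = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [9.0, 9.0, 9.0],   # padding - excluded by the mask
])
mask = np.array([1, 1, 1, 0])
sentence_vector = mean_pool(tokens, mask)   # → [1/3, 1/3, 1/3]
```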
Sentence transformers are specifically trained to produce meaningful sentence-level representations.
Bi-encoders (the standard for RAG) encode query and document independently:
```
Query    → Encoder → Query Vector ──┐
                                    ├── Similarity Score
Document → Encoder → Doc Vector ────┘
```
Advantage: Documents can be pre-embedded and cached.
Disadvantage: No direct query-document interaction.

Sentence transformers are typically trained with contrastive learning:
```python
# Contrastive (InfoNCE) loss example
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, neg_embs, temperature=0.05):
    # Positive similarity, sharpened by the temperature
    pos_sim = F.cosine_similarity(query_emb, pos_emb, dim=0) / temperature
    # Negative similarities
    neg_sims = [F.cosine_similarity(query_emb, neg, dim=0) / temperature for neg in neg_embs]
    # InfoNCE loss: -log(exp(pos) / (exp(pos) + sum of exp(neg)))
    numerator = torch.exp(pos_sim)
    denominator = numerator + sum(torch.exp(neg) for neg in neg_sims)
    return -torch.log(numerator / denominator)
```

Popular hosted embedding APIs:

| Model | Dimensions | Max Tokens | Cost |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | $0.13/1M tokens |
| OpenAI text-embedding-3-small | 1536 | 8191 | $0.02/1M tokens |
| Cohere embed-v3 | 1024 | 512 | $0.10/1M tokens |
| Voyage AI voyage-2 | 1024 | 4000 | $0.10/1M tokens |
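Dimension counts translate directly into storage costs: at float32 (4 bytes per dimension), a corpus's raw vector footprint can be estimated as:

```python
def vector_storage_mb(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Approximate raw storage for float32 vectors, ignoring index overhead."""
    return num_vectors * dims * bytes_per_dim / 1024 / 1024

# 1M chunks with text-embedding-3-large (3072 dims) vs -small (1536 dims)
large = vector_storage_mb(1_000_000, 3072)   # ≈ 11719 MB (~11.4 GB)
small = vector_storage_mb(1_000_000, 1536)   # ≈ 5859 MB (~5.7 GB)
```

Real vector stores add index overhead on top of this, but the linear scaling in dimensions holds.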

Popular open-source models for local inference:

| Model | Dimensions | Parameters | Notes |
|---|---|---|---|
| BGE-large-en-v1.5 | 1024 | 335M | Top MTEB performer, instruction-tuned |
| E5-large-v2 | 1024 | 335M | Prefix-based (query:/passage:) |
| GTE-large | 1024 | 335M | Strong multilingual support |
| all-MiniLM-L6-v2 | 384 | 22M | Fast, lightweight, good for prototyping |
| jina-embeddings-v2 | 768 | 137M | 8K context length, good for late chunking |
When comparing models, check the MTEB (Massive Text Embedding Benchmark) leaderboard, filtering for retrieval tasks, since retrieval performance is what matters for RAG.
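MTEB's retrieval scores aggregate metrics like recall@k; as a toy illustration of how such a metric is computed (hypothetical similarity scores, single query):

```python
import numpy as np

def recall_at_k(similarities: np.ndarray, relevant_idx: int, k: int) -> float:
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    top_k = np.argsort(similarities)[::-1][:k]
    return float(relevant_idx in top_k)

# Toy query-to-document similarity scores (e.g., cosine similarities)
sims = np.array([0.2, 0.9, 0.4, 0.7])
print(recall_at_k(sims, relevant_idx=1, k=1))  # 1.0 - doc 1 is top-ranked
print(recall_at_k(sims, relevant_idx=0, k=2))  # 0.0 - doc 0 not in top 2
```

Benchmark suites average this over many queries; the mechanics are the same.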
```python
from openai import OpenAI

client = OpenAI()

def get_embeddings(texts: list[str], model: str = "text-embedding-3-small"):
    response = client.embeddings.create(
        model=model,
        input=texts,
        encoding_format="float"  # or "base64" for a more compact response
    )
    return [item.embedding for item in response.data]

# Batch for efficiency (max 2048 inputs per request)
embeddings = get_embeddings(["Hello world", "Goodbye world"])
```

```python
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# For BGE models, add the instruction prefix for queries
def embed_query(query: str):
    instruction = "Represent this sentence for searching relevant passages: "
    return model.encode(instruction + query)

def embed_documents(documents: list[str]):
    # Documents don't need the instruction prefix
    return model.encode(documents, show_progress_bar=True)

# Batch processing on GPU
embeddings = model.encode(
    documents,
    batch_size=64,
    device="cuda",
    convert_to_tensor=True,
    normalize_embeddings=True  # L2-normalize for cosine similarity
)
```

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

# OpenAI
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Hugging Face (local)
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={'device': 'cuda'},
    encode_kwargs={'normalize_embeddings': True}
)

# Embed
vectors = embeddings.embed_documents(documents)
```

Some models use different encoding strategies for queries than for documents:
```python
# E5 models use prefixes for both queries and passages
query = "query: What is machine learning?"
document = "passage: Machine learning is a subset of AI..."

# BGE models use an instruction prefix for queries only
query_instruction = "Represent this sentence for searching relevant passages: "
query_embedding = model.encode(query_instruction + "What is machine learning?")
doc_embedding = model.encode("Machine learning is a subset of AI...")
```

```python
# Cohere has an explicit input_type parameter
import cohere

co = cohere.Client()

# For documents
doc_embeddings = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document"
).embeddings

# For queries
query_embedding = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query"
).embeddings[0]
```

OpenAI's embedding-3 models support dimension reduction via the `dimensions` parameter:
```python
# Reduce from 3072 to 512 dimensions
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=512  # Reduce dimensionality
)
```

Trade-off: lower dimensions mean faster search and less storage, at a small cost in retrieval quality.
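If you already have full-size vectors, a client-side approximation of this reduction is to truncate and re-normalize. A sketch with a random stand-in vector (not a real embedding):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components, then re-normalize to unit length."""
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a 3072-dim embedding
full = np.random.default_rng(0).normal(size=3072)
reduced = truncate_embedding(full, 512)
print(reduced.shape)            # (512,)
print(np.linalg.norm(reduced))  # ~1.0
```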
Always L2-normalize embeddings when using cosine similarity:
```python
import numpy as np

def normalize(embeddings):
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

# With normalized vectors, cosine similarity reduces to a dot product
similarity = np.dot(query_embedding, document_embedding)
```

```python
def batch_embed(texts: list[str], batch_size: int = 100):
    """Embed texts in batches to manage memory and API limits."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = model.encode(batch)
        all_embeddings.extend(embeddings)
    return all_embeddings
```

Once generated, embeddings need to be stored efficiently for retrieval.
```python
# ChromaDB (embedded, good for development)
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection("documents")
collection.add(
    embeddings=embeddings,
    documents=documents,
    metadatas=[{"source": "web"} for _ in documents],
    ids=[f"doc_{i}" for i in range(len(documents))]
)
```

```python
# Pinecone (managed, production)
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("my-index")
index.upsert(
    vectors=[
        {"id": f"doc_{i}", "values": emb, "metadata": {"source": "web"}}
        for i, emb in enumerate(embeddings)
    ]
)
```

```python
# Qdrant (self-hosted, production)
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=i, vector=emb, payload={"text": doc})
        for i, (emb, doc) in enumerate(zip(embeddings, documents))
    ]
)
```

Cache embeddings to avoid regenerating them for the same content:
```python
import hashlib
import json
from pathlib import Path

class EmbeddingCache:
    def __init__(self, cache_dir: str = ".embedding_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def _get_key(self, text: str, model: str) -> str:
        content = f"{model}:{text}"
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, text: str, model: str) -> list | None:
        key = self._get_key(text, model)
        cache_file = self.cache_dir / f"{key}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text())
        return None

    def set(self, text: str, model: str, embedding: list):
        key = self._get_key(text, model)
        cache_file = self.cache_dir / f"{key}.json"
        cache_file.write_text(json.dumps(embedding))

    def get_or_compute(self, text: str, model: str, compute_fn):
        cached = self.get(text, model)
        if cached is not None:
            return cached
        embedding = compute_fn(text)
        self.set(text, model, embedding)
        return embedding
```

For domain-specific applications, you can fine-tune embedding models on your own data:
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare training data as (query, relevant_document) pairs;
# MultipleNegativesRankingLoss treats the other in-batch documents as negatives
train_examples = [
    InputExample(texts=["What is RAG?", "RAG combines retrieval and generation..."]),
    InputExample(texts=["How does chunking work?", "Chunking splits documents into..."]),
    # ... more examples
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Use MultipleNegativesRankingLoss
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine_tuned_embeddings"
)
```

For RAG over images, tables, and mixed content, multimodal models embed text and images into a shared vector space:
```python
# CLIP for image embeddings
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Embed text
text_embedding = model.encode("a photo of a cat")

# Embed image
image = Image.open("cat.jpg")
image_embedding = model.encode(image)

# Text and image embeddings share a vector space, so they can be compared directly
similarity = util.cos_sim(text_embedding, image_embedding)
```

In the next lesson, we'll explore implementing the retrieval layer, including vector store operations and search patterns.