Vector Databases & Embeddings

ChromaDB, Milvus, and Other Options

This lesson covers ChromaDB for development, Milvus for enterprise scale, and other vector database options including pgvector and FAISS.

ChromaDB

ChromaDB is an open-source embedding database designed for simplicity and developer experience. It is well suited to prototyping and small-scale applications.

Key Features

  • Embedded - Runs in-process, no server needed
  • Simple API - Minimal boilerplate to get started
  • Built-in Embeddings - Optional automatic embedding
  • LangChain Integration - First-class support

Getting Started

pip install chromadb

import chromadb
from chromadb.utils import embedding_functions

# In-memory (development)
client = chromadb.Client()

# Persistent storage
client = chromadb.PersistentClient(path="./chroma_db")

# With embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)

# Create collection
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}
)

Basic Operations

# Add documents (automatically generates embeddings)
collection.add(
    documents=[
        "Machine learning is a subset of AI...",
        "Deep learning uses neural networks...",
        "Natural language processing enables..."
    ],
    metadatas=[
        {"source": "docs", "category": "ml"},
        {"source": "docs", "category": "dl"},
        {"source": "docs", "category": "nlp"}
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Or add with your own embeddings
collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["text1", "text2"],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=5,
    where={"category": "ml"},
    include=["documents", "metadatas", "distances"]
)

print(results["documents"][0])  # Documents matching the first query
print(results["distances"][0])  # Cosine distances (lower = more similar)
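Note that Chroma returns distances, not similarity scores. With "hnsw:space" set to "cosine", the distance is 1 minus the cosine similarity, so identical directions give 0 and orthogonal vectors give 1. A minimal pure-Python sketch of that metric (the function name is illustrative, not part of Chroma's API):

```python
import math

def cosine_distance(a, b):
    """Cosine distance as used by Chroma's "cosine" space: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

If you need a similarity score, compute 1 - distance on the values Chroma returns.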

# Update
collection.update(
    ids=["doc1"],
    metadatas=[{"category": "updated"}]
)

# Delete
collection.delete(ids=["doc1"])
collection.delete(where={"category": "old"})

ChromaDB with LangChain

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Create from documents
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Search
docs = vectorstore.similarity_search("What is ML?", k=5)

# As retriever for RAG
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20}
)
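The "mmr" search type above fetches fetch_k candidates by similarity, then selects k of them by Maximal Marginal Relevance, trading relevance against redundancy. A minimal sketch of the selection rule using toy precomputed similarities (the function name and data are illustrative, not LangChain's internals):

```python
def mmr_select(query_sim, doc_sim, k, lambda_mult=0.5):
    """Pick k candidate indices by Maximal Marginal Relevance.

    query_sim[i]  : similarity of candidate i to the query
    doc_sim[i][j] : similarity between candidates i and j
    lambda_mult   : 1.0 = pure relevance, 0.0 = pure diversity
    """
    selected = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; MMR takes 0, then skips 1 for the
# distinct candidate 2 even though 1 is more relevant.
query_sim = [0.9, 0.85, 0.7]
doc_sim = [[1.0, 0.95, 0.1],
           [0.95, 1.0, 0.1],
           [0.1, 0.1, 1.0]]
print(mmr_select(query_sim, doc_sim, k=2))  # [0, 2]
```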

Milvus

Milvus is an enterprise-grade vector database designed for billion-scale similarity search.

Key Features

  • Massive Scale - Billions of vectors
  • GPU Acceleration - NVIDIA GPU support
  • Distributed Architecture - Horizontal scaling
  • Multiple Index Types - HNSW, IVF, PQ, and more
  • Attribute Filtering - Complex query conditions
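To make the IVF index type above concrete: an IVF index partitions vectors into nlist clusters and, at query time, scans only the nprobe clusters nearest the query. A toy pure-Python sketch with hand-picked centroids (real IVF trains centroids with k-means; all names here are illustrative):

```python
def l2(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Group vector indices into inverted lists keyed by nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, k=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in lists[c]]
    return sorted(candidates, key=lambda idx: l2(query, vectors[idx]))[:k]

# Two well-separated clusters; with nprobe=1 only the left cluster is scanned.
vectors = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
lists = build_ivf(vectors, centroids)
print(ivf_search([0.02, 0.02], vectors, centroids, lists, nprobe=1, k=1))  # [0]
```

Raising nprobe scans more clusters, improving recall at the cost of speed, which is the same knob exposed by Milvus and FAISS IVF indexes.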

Getting Started

pip install pymilvus

# Run Milvus with Docker (using the official milvus docker-compose.yml)
docker compose up -d

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

# Connect
connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
schema = CollectionSchema(fields, description="Document collection")

# Create collection
collection = Collection("documents", schema)

# Create index
index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 200}
}
collection.create_index("embedding", index_params)

Operations

# Insert
data = [
    ["Machine learning...", "Deep learning..."],  # content
    [embedding1, embedding2]  # embeddings
]
collection.insert(data)

# Load to memory (required before search)
collection.load()

# Search
search_params = {"metric_type": "COSINE", "params": {"ef": 100}}

results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=5,
    output_fields=["content"]
)

for hits in results:
    for hit in hits:
        print(f"ID: {hit.id}, Score: {hit.score}")
        print(f"Content: {hit.entity.get('content')}")

pgvector (PostgreSQL Extension)

Add vector capabilities to your existing PostgreSQL database.

-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

-- Create index
CREATE INDEX ON documents 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Or HNSW index (better performance)
CREATE INDEX ON documents 
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Insert
INSERT INTO documents (content, embedding)
VALUES ('Machine learning...', '[0.1, 0.2, ...]'::vector);

-- Search
SELECT content, 1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS similarity
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;

# Python usage
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(...)
register_vector(conn)
cur = conn.cursor()

# Insert
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
    ("Machine learning...", embedding)
)

# Search
cur.execute("""
    SELECT content, 1 - (embedding <=> %s) AS similarity
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_embedding))

FAISS (Library, Not Database)

FAISS (Facebook AI Similarity Search) is a highly optimized similarity search library. It provides in-memory indexes only: there is no server, persistence layer, or metadata filtering, so it is a building block rather than a database.

import faiss
import numpy as np

# Create index
dimension = 768
index = faiss.IndexFlatL2(dimension)  # Exact search

# Add vectors
embeddings = np.random.rand(10000, dimension).astype('float32')
index.add(embeddings)

# Search
query = np.random.rand(1, dimension).astype('float32')
distances, indices = index.search(query, k=5)

# For approximate search (much faster)
nlist = 100
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
index.train(embeddings)
index.add(embeddings)
index.nprobe = 10  # Search 10 clusters

# Save/Load
faiss.write_index(index, "index.faiss")
index = faiss.read_index("index.faiss")
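Conceptually, IndexFlatL2 is just an exhaustive scan: it compares the query against every stored vector and returns the k nearest by squared L2 distance (FAISS reports squared distances too). A pure-Python sketch of that behavior, for illustration only; the real library does this in optimized C++:

```python
def flat_l2_search(index_vectors, query, k):
    """Brute-force k-NN by squared L2 distance, mirroring IndexFlatL2.search."""
    scored = sorted(
        (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
        for i, vec in enumerate(index_vectors)
    )[:k]
    distances = [d for d, _ in scored]
    indices = [i for _, i in scored]
    return distances, indices

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
dists, ids = flat_l2_search(vectors, [0.9, 0.1], k=2)
print(ids)  # [1, 0]
```

Exhaustive search gives exact results but scales linearly with the number of vectors, which is why the IVF variant above trades some recall for speed.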

Comparison: When to Use What

Solution   | Scale                | Best For
-----------|----------------------|--------------------------------------------------
ChromaDB   | ~1M vectors          | Development, prototyping, small apps
Milvus     | Billions             | Enterprise, massive scale, GPU
pgvector   | ~10M vectors         | Existing Postgres users, SQL workflows
FAISS      | Billions (in-memory) | Research, custom solutions, no persistence needed
Pinecone   | Billions             | Production, zero ops
Qdrant     | 100M+                | Performance, filtering
Weaviate   | 100M+                | Multimodal, built-in vectorization

Key Takeaways

  • ChromaDB - Best for prototyping and development
  • Milvus - Enterprise-grade for billion-scale with GPU acceleration
  • pgvector - Use existing PostgreSQL infrastructure
  • FAISS - Library for custom implementations, not a database
  • Start with ChromaDB for learning, migrate to production DBs later

In the next lesson, we'll build semantic search applications using these vector databases.