Vector Databases & Embeddings

ChromaDB, Milvus, and Other Options

This lesson covers ChromaDB for development, Milvus for enterprise scale, and other vector database options including pgvector and FAISS.

ChromaDB

ChromaDB is an open-source embedding database designed for simplicity and developer experience. It is well suited to prototyping and small-scale applications.

Key Features

  • Embedded - Runs in-process, no server needed
  • Simple API - Minimal boilerplate to get started
  • Built-in Embeddings - Optional automatic embedding
  • LangChain Integration - First-class support

Getting Started

pip install chromadb

import chromadb
from chromadb.utils import embedding_functions

# In-memory (development)
client = chromadb.Client()

# Persistent storage
client = chromadb.PersistentClient(path="./chroma_db")

# With embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)

# Create collection
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}
)

Basic Operations

# Add documents (automatically generates embeddings)
collection.add(
    documents=[
        "Machine learning is a subset of AI...",
        "Deep learning uses neural networks...",
        "Natural language processing enables..."
    ],
    metadatas=[
        {"source": "docs", "category": "ml"},
        {"source": "docs", "category": "dl"},
        {"source": "docs", "category": "nlp"}
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Or add with your own embeddings
collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["text1", "text2"],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=5,
    where={"category": "ml"},
    include=["documents", "metadatas", "distances"]
)

print(results["documents"][0])  # Documents matching the first query
print(results["distances"][0])  # Cosine distances (lower = more similar)
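Note that Chroma returns distances, not similarity scores. With "hnsw:space" set to "cosine", the distance is 1 minus the cosine similarity, so identical directions give 0 and orthogonal vectors give 1. A minimal pure-Python sketch of that metric (the function name is illustrative, not part of Chroma's API):

```python
import math

def cosine_distance(a, b):
    """Cosine distance as used by Chroma's "cosine" space: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

If you need a similarity score, compute 1 - distance on the values Chroma returns.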

# Update
collection.update(
    ids=["doc1"],
    metadatas=[{"category": "updated"}]
)

# Delete
collection.delete(ids=["doc1"])
collection.delete(where={"category": "old"})

ChromaDB with LangChain

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Create from documents
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Search
docs = vectorstore.similarity_search("What is ML?", k=5)

# As retriever for RAG
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20}
)
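The "mmr" search type above fetches fetch_k candidates by similarity, then selects k of them by Maximal Marginal Relevance, trading relevance against redundancy. A minimal sketch of the selection rule using toy precomputed similarities (the function name and data are illustrative, not LangChain's internals):

```python
def mmr_select(query_sim, doc_sim, k, lambda_mult=0.5):
    """Pick k candidate indices by Maximal Marginal Relevance.

    query_sim[i]  : similarity of candidate i to the query
    doc_sim[i][j] : similarity between candidates i and j
    lambda_mult   : 1.0 = pure relevance, 0.0 = pure diversity
    """
    selected = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; MMR takes 0, then skips 1 for the
# distinct candidate 2 even though 1 is more relevant.
query_sim = [0.9, 0.85, 0.7]
doc_sim = [[1.0, 0.95, 0.1],
           [0.95, 1.0, 0.1],
           [0.1, 0.1, 1.0]]
print(mmr_select(query_sim, doc_sim, k=2))  # [0, 2]
```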

Milvus

Milvus is an enterprise-grade vector database designed for billion-scale similarity search.

Key Features

  • Massive Scale - Billions of vectors
  • GPU Acceleration - NVIDIA GPU support
  • Distributed Architecture - Horizontal scaling
  • Multiple Index Types - HNSW, IVF, PQ, and more
  • Attribute Filtering - Complex query conditions
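To make the IVF index type above concrete: an IVF index partitions vectors into nlist clusters and, at query time, scans only the nprobe clusters nearest the query. A toy pure-Python sketch with hand-picked centroids (real IVF trains centroids with k-means; all names here are illustrative):

```python
def l2(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Group vector indices into inverted lists keyed by nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, k=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in lists[c]]
    return sorted(candidates, key=lambda idx: l2(query, vectors[idx]))[:k]

# Two well-separated clusters; with nprobe=1 only the left cluster is scanned.
vectors = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
lists = build_ivf(vectors, centroids)
print(ivf_search([0.02, 0.02], vectors, centroids, lists, nprobe=1, k=1))  # [0]
```

Raising nprobe scans more clusters, improving recall at the cost of speed, which is the same knob exposed by Milvus and FAISS IVF indexes.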

Getting Started

pip install pymilvus

# Run Milvus with Docker (using the official milvus docker-compose.yml)
docker compose up -d

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

# Connect
connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
schema = CollectionSchema(fields, description="Document collection")

# Create collection
collection = Collection("documents", schema)

# Create index
index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 200}
}
collection.create_index("embedding", index_params)

Operations

# Insert
data = [
    ["Machine learning...", "Deep learning..."],  # content
    [embedding1, embedding2]  # embeddings
]
collection.insert(data)

# Load to memory (required before search)
collection.load()

# Search
search_params = {"metric_type": "COSINE", "params": {"ef": 100}}

results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=5,
    output_fields=["content"]
)

for hits in results:
    for hit in hits:
        print(f"ID: {hit.id}, Score: {hit.score}")
        print(f"Content: {hit.entity.get('content')}")

pgvector (PostgreSQL Extension)

Add vector capabilities to your existing PostgreSQL database.

-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

-- Create index
CREATE INDEX ON documents 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Or HNSW index (better performance)
CREATE INDEX ON documents 
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Insert
INSERT INTO documents (content, embedding)
VALUES ('Machine learning...', '[0.1, 0.2, ...]'::vector);

-- Search
SELECT content, 1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS similarity
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;

# Python usage
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(...)
register_vector(conn)
cur = conn.cursor()

# Insert
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
    ("Machine learning...", embedding)
)

# Search
cur.execute("""
    SELECT content, 1 - (embedding <=> %s) AS similarity
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_embedding))

FAISS (Library, Not Database)

FAISS (Facebook AI Similarity Search) is a highly optimized similarity search library. It provides in-memory indexes only: there is no server, persistence layer, or metadata filtering, so it is a building block rather than a database.

import faiss
import numpy as np

# Create index
dimension = 768
index = faiss.IndexFlatL2(dimension)  # Exact search

# Add vectors
embeddings = np.random.rand(10000, dimension).astype('float32')
index.add(embeddings)

# Search
query = np.random.rand(1, dimension).astype('float32')
distances, indices = index.search(query, k=5)

# For approximate search (much faster)
nlist = 100
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
index.train(embeddings)
index.add(embeddings)
index.nprobe = 10  # Search 10 clusters

# Save/Load
faiss.write_index(index, "index.faiss")
index = faiss.read_index("index.faiss")
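Conceptually, IndexFlatL2 is just an exhaustive scan: it compares the query against every stored vector and returns the k nearest by squared L2 distance (FAISS reports squared distances too). A pure-Python sketch of that behavior, for illustration only; the real library does this in optimized C++:

```python
def flat_l2_search(index_vectors, query, k):
    """Brute-force k-NN by squared L2 distance, mirroring IndexFlatL2.search."""
    scored = sorted(
        (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
        for i, vec in enumerate(index_vectors)
    )[:k]
    distances = [d for d, _ in scored]
    indices = [i for _, i in scored]
    return distances, indices

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
dists, ids = flat_l2_search(vectors, [0.9, 0.1], k=2)
print(ids)  # [1, 0]
```

Exhaustive search gives exact results but scales linearly with the number of vectors, which is why the IVF variant above trades some recall for speed.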

Comparison: When to Use What

Solution   | Scale                | Best For
-----------|----------------------|--------------------------------------------------
ChromaDB   | ~1M vectors          | Development, prototyping, small apps
Milvus     | Billions             | Enterprise, massive scale, GPU
pgvector   | ~10M vectors         | Existing Postgres users, SQL workflows
FAISS      | Billions (in-memory) | Research, custom solutions, no persistence needed
Pinecone   | Billions             | Production, zero ops
Qdrant     | 100M+                | Performance, filtering
Weaviate   | 100M+                | Multimodal, built-in vectorization

Key Takeaways

  • ChromaDB - Best for prototyping and development
  • Milvus - Enterprise-grade for billion-scale with GPU acceleration
  • pgvector - Use existing PostgreSQL infrastructure
  • FAISS - Library for custom implementations, not a database
  • Start with ChromaDB for learning, migrate to production DBs later

In the next lesson, we'll build semantic search applications using these vector databases.