ChromaDB Cookbook
ChromaDB is the vector database we use in DS205 from W09 onwards. This guide collects the patterns you will need for Problem Set 2 and your own retrieval projects. Every snippet assumes Python with chromadb installed in your rag conda environment.
Setup: persistent vs in-memory
ChromaDB can run entirely in memory (good for quick tests) or persist to disk (good for anything you want to keep between kernel restarts).
import chromadb
# In-memory (gone when the process ends)
client = chromadb.Client()
# Persistent (survives kernel restarts)
client = chromadb.PersistentClient(path="data/chromadb")
A persistent client stores everything under the path you give it. You will see a chroma.sqlite3 file and one folder per collection.
Collections
A collection is like a table in a relational database, except every row has an embedding vector alongside its text and metadata.
# Create (or open if it already exists)
collection = client.get_or_create_collection(
name="ajinomoto_char_limit",
metadata={"hnsw:space": "cosine"},
)
The hnsw:space parameter controls the distance metric. Use "cosine" when your embeddings are normalised (like those from SentenceTransformer with normalize_embeddings=True). The alternatives are "l2" (Euclidean) and "ip" (inner product).
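To see why the choice matters less for ranking than you might fear: for unit-length vectors, squared Euclidean distance and cosine distance are related by ||a - b||^2 = 2 * (1 - cos(a, b)), so the two metrics order neighbours identically; only the raw numbers differ. A stdlib-only sketch (the vectors here are made up for illustration):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity. Valid shortcut only because a and b are unit vectors,
    so the dot product equals the cosine."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([0.3, -1.2, 0.5])
b = normalize([0.1, -0.9, 0.8])

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
assert abs(squared_l2(a, b) - 2 * cosine_distance(a, b)) < 1e-9
```

Because the metrics are monotonically related for normalised embeddings, nearest-neighbour rankings match either way; setting hnsw:space explicitly mainly keeps the distance values on the scale you expect.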
List and delete
# List all collections
client.list_collections()
# Delete one (irreversible)
client.delete_collection("ajinomoto_char_limit")
Adding documents
Every document needs a unique string ID. You can optionally attach metadata (any flat dict of strings, ints, floats, or bools) and pre-computed embeddings.
collection.upsert(
ids=["chunk_0000", "chunk_0001"],
documents=["First chunk text...", "Second chunk text..."],
embeddings=[[0.12, -0.34, ...], [0.56, 0.78, ...]],
metadatas=[
{"strategy": "char_limit", "pages": "[1, 2]"},
{"strategy": "char_limit", "pages": "[2]"},
],
)
upsert inserts new entries and overwrites existing ones with the same ID. add, by contrast, refuses duplicate IDs: depending on your ChromaDB version, it either raises an error or logs a warning and skips the row.
Batch size
ChromaDB handles large batches fine, but if you are pushing tens of thousands of chunks, split into batches of a few hundred to keep memory use predictable.
batch_size = 256
for i in range(0, len(ids), batch_size):
collection.upsert(
ids=ids[i:i + batch_size],
embeddings=embeddings[i:i + batch_size],
documents=texts[i:i + batch_size],
metadatas=metas[i:i + batch_size],
)
Querying
The core operation: find the nearest neighbours to a query vector.
results = collection.query(
query_embeddings=[query_vector],
n_results=5,
)
The result is a dict with parallel lists under "ids", "distances", "documents", and "metadatas". Each is wrapped in an outer list (one entry per query), so for a single query you access results["ids"][0], results["distances"][0], etc.
for rank, (doc_id, dist, doc) in enumerate(
zip(
results["ids"][0],
results["distances"][0],
results["documents"][0],
),
1,
):
print(f"#{rank} {doc_id} (distance: {dist:.4f})")
print(f" {doc[:100]}...")
print()
Querying with text (auto-embed)
If the collection was created without custom embeddings (i.e. ChromaDB's built-in model handles them), you can pass raw text:
results = collection.query(
query_texts=["What are Scope 1 and 2 emissions targets?"],
n_results=5,
)
In our pipeline, we supply our own SentenceTransformer embeddings, so we use query_embeddings instead.
Filtering with metadata
You can narrow a query to documents matching certain metadata conditions.
results = collection.query(
query_embeddings=[query_vector],
n_results=5,
where={"strategy": "table_prose"},
)
ChromaDB supports the $eq, $ne, $gt, $gte, $lt, $lte, $in, and $nin operators:
where={"char_count": {"$gte": 200}}
And you can combine conditions with $and / $or:
where={"$and": [
{"strategy": "table_prose"},
{"char_count": {"$gte": 200}},
]}
Fetching documents by ID
result = collection.get(ids=["chunk_0833", "chunk_0265"])
# result["documents"] contains the texts
Counting
collection.count()
Putting it together: a retrieval evaluation loop
This pattern runs multiple queries against a collection and measures Recall@k:
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1", device="cuda")
ground_truth = [
{
"query": "What are Scope 1 and 2 emissions targets?",
"relevant_ids": ["chunk_0833", "chunk_0262"],
},
# ... more queries
]
k = 5
recalls = []
for entry in ground_truth:
q_vec = model.encode(entry["query"], normalize_embeddings=True).tolist()
results = collection.query(query_embeddings=[q_vec], n_results=k)
retrieved = set(results["ids"][0])
relevant = set(entry["relevant_ids"])
recall = len(retrieved & relevant) / len(relevant)
recalls.append(recall)
print(f"Mean Recall@{k}: {np.mean(recalls):.2%}")
Common pitfalls
Mismatched distance metric. If you embed with normalize_embeddings=True but create the collection with the default l2 space, distances will still work but won't match the cosine similarity you expect. Always set hnsw:space explicitly.
Forgetting the outer list. results["ids"] is a list of lists (one per query). A single query still returns results["ids"][0], not results["ids"].
String-only metadata values. If you store a list like [1, 2, 3] in metadata, serialise it as a JSON string first: json.dumps([1, 2, 3]). ChromaDB metadata values must be strings, ints, floats, or bools.
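A minimal sketch of that serialise-then-restore round trip, using a made-up pages list:

```python
import json

pages = [1, 2, 3]

# Serialise before storing: ChromaDB metadata values must be scalar
# (str, int, float, or bool), so the list becomes a JSON string.
metadata = {"strategy": "char_limit", "pages": json.dumps(pages)}

# ...store via collection.upsert(..., metadatas=[metadata])...

# Deserialise after retrieval to get the list back
restored = json.loads(metadata["pages"])
assert restored == [1, 2, 3]
```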
Collection name rules. Names must be 3-63 characters, start and end with an alphanumeric character, and may contain hyphens, underscores, or periods (but not consecutive periods). Stick to names like ajinomoto_table_prose and you will be fine.
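If you want to fail fast before hitting the database, a rough stdlib check of those rules is easy to write. This regex is our own approximation, not ChromaDB's actual validator:

```python
import re

# First and last characters alphanumeric; middle may add . _ -
NAME_RE = re.compile(r"^[a-zA-Z0-9](?:[a-zA-Z0-9._-]*[a-zA-Z0-9])?$")

def looks_like_valid_name(name: str) -> bool:
    """Approximate check of ChromaDB's collection-name rules."""
    return (
        3 <= len(name) <= 63
        and ".." not in name          # no consecutive periods
        and NAME_RE.match(name) is not None
    )

assert looks_like_valid_name("ajinomoto_table_prose")
assert not looks_like_valid_name("ab")          # too short
assert not looks_like_valid_name("_leading")    # must start alphanumeric
assert not looks_like_valid_name("bad..dots")   # consecutive periods
```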