Vector Store in Machine Learning

Let's start with the big picture. A vector store is a specialized storage and retrieval system built for one job: persist high-dimensional embedding vectors and answer similarity queries against them at blazing-low latency.

In ML systems, vector stores sit right at the intersection of learned representations and information retrieval. They accept dense vectors produced by encoder models and return the nearest neighbors under a chosen distance metric -- cosine similarity, inner product, or L2 distance.

But why did this become such a big deal? Because retrieval-augmented generation (RAG) proved that coupling LLMs with external knowledge stores dramatically reduces hallucination and keeps responses grounded in actual source material. That insight turned vector stores from a niche research tool into a production necessity practically overnight.

Today, vector stores underpin recommendation engines, semantic search, duplicate detection, and virtually every LLM-powered application that needs access to domain-specific data at inference time. From Flipkart's product search to Swiggy's restaurant recommendations -- if an ML system retrieves anything by meaning rather than keywords, there's almost certainly a vector store under the hood.

Concept Snapshot

What It Is
A purpose-built data store optimized for indexing, persisting, and querying high-dimensional vectors using approximate nearest neighbor (ANN) algorithms.
Category
RAG Pipeline
Complexity
Intermediate
Inputs / Outputs
Inputs: embedding vectors (typically 384-4096 dimensions) with optional metadata. Outputs: ranked list of nearest neighbor vectors with similarity scores.
System Placement
Sits between the embedding model (upstream) and the context assembler or re-ranker (downstream) in a RAG or retrieval pipeline.
Also Known As
vector database, vector index, embedding store, ANN index, similarity search engine
Typical Users
ML engineers, backend engineers, data engineers, search/retrieval specialists
Prerequisites
Embeddings, Distance metrics (cosine, L2, dot product), Basic indexing concepts
Key Terms
ANNHNSWIVFproduct quantizationrecall@kQPSembedding dimensionmetadata filtering

Why This Concept Exists

The Problem with Regular Databases

Standard relational and document databases index data by discrete attributes -- primary keys, B-tree ranges, and inverted text indices. That's great for queries like "find users where age > 25" or "search documents containing the word 'retrieval'." But none of these structures are designed for the geometric proximity queries that dense embedding vectors require.

Let's put a number on this. A brute-force scan over one million 768-dimensional vectors demands roughly 3 billion floating-point comparisons per query. At production QPS, that's simply not feasible.

Two Trends That Made This Urgent

The need for a dedicated abstraction grew alongside two parallel trends:

Trend 1: Representation learning matured. Transformer-based encoders could now compress sentences, images, and structured records into semantically meaningful vectors. We finally had embeddings worth storing.

Trend 2: These representations went from offline analytics to live, user-facing retrieval. Autocomplete, personalized feeds, conversational AI with external knowledge -- all of these need sub-100ms vector lookups.

The Algorithmic Foundation

Approximate nearest neighbor (ANN) algorithms -- locality-sensitive hashing, product quantization, and graph-based methods like HNSW -- provided the building blocks. BUT raw algorithms aren't enough for production. You also need persistence, replication, metadata filtering, and operational tooling.

That's exactly the gap vector stores fill. They package ANN algorithms into managed systems with all the production plumbing that individual index libraries like FAISS or HNSWlib don't provide on their own.

Key Takeaway: Vector stores exist because traditional databases can't do geometric proximity queries, and raw ANN libraries don't provide the persistence and operational features production systems need. Vector stores bridge that gap.

Core Intuition & Mental Model

The Core Promise

Here's the fundamental guarantee: given a query vector, a vector store can return an approximately correct set of nearest neighbors in sub-linear time -- typically O(logn)O(\log n) for graph-based indices -- even when your corpus contains hundreds of millions of vectors.

The word "approximately" is load-bearing. Every ANN index trades exactness for speed, and the recall-throughput curve is the central knob you'll be tuning for your entire career working with these systems.

What a Vector Store Does NOT Do

This is where I see a lot of confusion. A vector store finds geometrically close vectors, not necessarily semantically relevant ones. If the upstream embedding model produces poor representations, the vector store will faithfully return poor results at very high speed.

The boundary of responsibility is clear: the vector store owns geometry, not meaning.

A Useful Mental Model

Think of a vector store as a spatial file system. Just as a file system organizes bytes into a hierarchy optimized for path-based lookups, a vector store organizes floating-point arrays into a structure optimized for proximity-based lookups.

Both abstractions are inert -- they reflect the quality of what you put into them. Garbage embeddings in, garbage retrieval out. That was pretty simple, wasn't it?

Expert Note: When debugging poor retrieval quality, start with the embedding model, not the vector store. Nine times out of ten, the issue is upstream -- bad chunking, wrong model, or mismatched fine-tuning -- not the index itself.

Technical Foundations

The Math (Don't Worry, We'll Build Up to It)

Let's formalize what a vector store actually does. I'll explain the intuition first, then give you the notation.

Intuition: You have a big collection of vectors. Someone hands you a new query vector. You need to find the kk vectors in your collection that are closest to the query. And you need to do it fast.

Formally: A vector store implements an index over a collection V={v1,v2,,vn}V = \{v_1, v_2, \ldots, v_n\} where each viRdv_i \in \mathbb{R}^d. Given a query vector qRdq \in \mathbb{R}^d and a positive integer kk, the store returns a set SVS \subset V with S=k|S| = k that approximates the true kk-nearest neighbors under a distance function d(,)d(\cdot, \cdot).

Measuring Approximation Quality

How do we know if our "approximate" answer is any good? We use recall@k: the fraction of true top-kk neighbors present in the returned set SS.

A recall of 0.95 at k=10k = 10 means that, on average, 9.5 of the 10 returned items are among the actual 10 closest vectors. That's usually good enough for production.

Distance Functions

Vector stores commonly support three distance functions:

  • Cosine similarity: sim(q,v)=qvqv\text{sim}(q, v) = \frac{q \cdot v}{\|q\| \cdot \|v\|}
  • L2 (Euclidean) distance: d(q,v)=qv2d(q, v) = \|q - v\|_2
  • Inner product (dot product): d(q,v)=qvd(q, v) = q \cdot v

Which Metric Should You Use?

This is critical, and I've seen teams get it wrong more often than you'd expect. The choice of metric must match the training objective of the embedding model:

  • Models trained with contrastive losses on normalized vectors? Use cosine similarity.
  • Maximum inner product search (MIPS) problems where vector magnitudes carry information? Use inner product.
  • Models explicitly trained with L2 distance? Use L2.

Warning: Using the wrong distance metric is a silent killer. Your store will happily return results -- they'll just be the wrong results. No error, no warning, just degraded quality.

Internal Architecture

A production vector store typically consists of four subsystems: an ingestion layer that accepts vectors and metadata, an indexing engine that builds and maintains ANN data structures, a query engine that executes search and filtering, and a storage backend that handles persistence and replication. Let's walk through each one.

Key Components

Ingestion Layer

Accepts vectors (and optional payload/metadata) via API or bulk import, validates dimensions, and routes to the indexing engine.

ANN Index Engine

Builds and maintains the core index structure (HNSW graph, IVF clusters, or product quantization codebooks). Handles incremental inserts and deletes.

Query Engine

Receives a query vector and optional metadata filters, traverses the index, and returns top-k results with scores. Manages concurrency and query planning.

Storage Backend

Persists vectors and index structures to disk. Handles write-ahead logging, snapshots, and replication for durability and high availability.

Metadata Filter Engine

Applies pre-filter or post-filter predicates on scalar metadata fields (e.g., category, timestamp) alongside vector similarity search.

Data Flow

Here's how data flows through the system:

Write Path: Embedding vectors arrive via the ingestion API -> validated and assigned internal IDs -> inserted into the ANN index (and WAL) -> persisted to storage.

Read Path: A query vector enters the query engine -> the ANN index is traversed (with optional metadata pre-filtering) -> top-k candidates are scored -> results are returned with similarity scores and payloads.

The write path and read path are largely independent, which is why most vector stores can handle concurrent reads and writes without heavy locking.

A directed flow from 'Embedding Model' -> 'Ingestion API' -> 'ANN Index' <-> 'Storage Backend', with a parallel 'Query API' -> 'ANN Index' -> 'Top-K Results' path.

How to Implement

Two Approaches to Implementation

Implementation patterns fall into two broad categories:

Option A: Embedded index library (FAISS, HNSWlib, ScaNN) -- you run the index directly in your application process. Great for prototyping and when you control the infra.

Option B: Managed vector database (Pinecone, Weaviate, Qdrant, Milvus, pgvector) -- a separate service handles everything. Better for production when you need replication, sharding, metadata filtering, or multi-tenancy.

The right choice depends on scale, operational maturity, and whether you need features beyond basic ANN search. For a startup in Bengaluru building their first RAG prototype, Chroma or pgvector might be perfect. For a platform serving millions of queries per day, you'll likely want Qdrant, Milvus, or Pinecone.

Cost Note: Pinecone's managed service starts around $70/month (~INR 5,900/month) for a basic pod. Self-hosted Qdrant or Milvus on a cloud VM can be cheaper but requires ops investment. FAISS is free but you own the infrastructure entirely.

FAISS — Build an IVF+PQ index and query
import faiss
import numpy as np

d = 768           # embedding dimension
nlist = 100       # number of IVF clusters
m = 64            # PQ sub-quantizers

# Train the index on a sample
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(training_vectors)  # shape (n_train, d)
index.add(corpus_vectors)       # shape (n_corpus, d)

# Search
index.nprobe = 10  # clusters to visit at query time
D, I = index.search(query_vectors, k=10)
# D: distances, I: indices of nearest neighbors

FAISS (Facebook AI Similarity Search) provides composable index types. IVF clusters the space for coarse quantization, then PQ compresses residual vectors for memory efficiency. The nprobe parameter controls the recall-throughput tradeoff: higher values improve recall at the cost of latency.

Qdrant — Upsert and query with metadata filters
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

client = QdrantClient(host="localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Upsert vectors with metadata
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=1, vector=embedding, payload={"source": "arxiv", "year": 2024}),
    ],
)

# Query with metadata filter
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    query_filter=Filter(must=[FieldCondition(key="source", match=MatchValue(value="arxiv"))]),
    limit=10,
)

Qdrant is an open-source vector database written in Rust. This example demonstrates the common pattern of combining vector similarity search with scalar metadata filtering — essential for multi-tenant RAG systems where retrieved documents must belong to the querying user's corpus.

Configuration Example
# Qdrant collection config (YAML equivalent)
collection_name: documents
vectors:
  size: 768
  distance: Cosine
hnsw_config:
  m: 16
  ef_construct: 128
optimizers_config:
  indexing_threshold: 20000
  memmap_threshold: 50000
replication_factor: 2

Common Implementation Mistakes

  • Choosing cosine similarity when the embedding model was trained with dot-product objectives (or vice versa) -- mismatched metrics silently degrade recall. I cannot stress this enough: always check the model card before configuring your distance metric.

  • Skipping index parameter tuning (e.g., HNSW ef_construction, IVF nprobe) and accepting library defaults that may not suit your recall/latency requirements. The defaults are designed for generality, not your specific workload.

  • Storing raw text alongside vectors in the vector store instead of keeping a lightweight ID reference and resolving content from a separate document store -- this bloats memory and slows index operations. Your vector store should be lean.

  • Failing to implement versioned re-indexing when the embedding model changes -- old and new embeddings are not comparable in the same vector space. This one can silently wreck your entire retrieval pipeline.

  • Ignoring metadata filtering performance: post-filtering can return fewer than kk results when the filter is highly selective; pre-filtering requires index support. We'll cover this more in the failure modes section.

When Should You Use This?

Use When

  • Your application requires sub-100ms retrieval over >100K embedding vectors

  • You are building a RAG system that needs to ground LLM responses in domain-specific documents

  • Semantic similarity (rather than keyword matching) is the primary retrieval signal

  • You need metadata-filtered vector search (e.g., per-tenant or per-category retrieval, like serving different restaurant catalogs per city in a Zomato-like app)

  • Your embedding dimensions exceed what full-text search or traditional inverted indices can handle efficiently

Avoid When

  • Your corpus is small (<10K items) and brute-force exact search meets latency requirements -- the added complexity of an ANN index is unnecessary. Seriously, just use a flat index or even NumPy.

  • Keyword or lexical matching is sufficient for your use case (consider BM25 or Elasticsearch instead)

  • You need strong transactional guarantees (ACID) that most vector databases do not yet provide

  • Your retrieval is based on structured attributes (category, price range, date) rather than semantic similarity -- a relational database with standard indices will be simpler and faster

Key Tradeoffs

The Fundamental Tradeoff: Recall vs. Throughput

Higher recall (more accurate nearest neighbors) requires visiting more index nodes, which increases latency. For most production systems, a recall@10 of 0.95-0.99 is the sweet spot. Below 0.90, retrieval quality degrades noticeably in downstream tasks like RAG -- your LLM starts missing critical context.

The Second Axis: Memory vs. Accuracy

Compressed indices (PQ, scalar quantization) use 4-16x less memory but reduce recall compared to flat or HNSW-only indices. For example, a 100M-vector corpus at 768 dimensions takes ~300 GB as a flat index. With PQ (8-byte codes), that drops to ~800 MB -- a 375x reduction. BUT your recall might drop from 0.99 to 0.92.

The right balance depends on your budget and quality requirements. For a cost-sensitive deployment serving a Flipkart product catalog, IVF+PQ might save you $2,000/month (~INR 1.68 lakh/month) in memory costs compared to a full HNSW index.

Rule of Thumb: Start with HNSW (best recall), measure your memory footprint, and only move to compressed indices if the cost is prohibitive.

Alternatives & Comparisons

Semantic search is a higher-level abstraction that includes the embedding model, vector store, and optional re-ranking. A vector store is one component within a semantic search pipeline. If you need end-to-end retrieval rather than just the storage layer, you may want a managed semantic search service. Think of it this way: a vector store is the engine, semantic search is the whole car.

Hybrid search combines dense vector retrieval with sparse lexical retrieval (BM25). When keyword precision matters alongside semantic recall -- for example, matching specific product codes, medical terms, or Indian pin codes -- hybrid search outperforms pure vector search alone. Most production systems at scale end up adopting hybrid search eventually.

Pros, Cons & Tradeoffs

Advantages

  • Sub-linear query time over millions to billions of vectors, enabling real-time similarity search at scale. We're talking 5-10ms queries over 100M vectors -- 1000x faster than brute force.

  • Native support for the geometric operations that dense embeddings require -- no impedance mismatch with SQL or document stores. The data structure matches the workload.

  • Metadata filtering allows multi-tenant, temporally scoped, or attribute-constrained retrieval in a single query. Essential for SaaS applications where each customer's data must be isolated.

  • Mature ecosystem of open-source options (FAISS, Milvus, Qdrant, Weaviate, Chroma) and managed services (Pinecone, Zilliz). You have real choices, not vendor lock-in.

  • Index types (HNSW, IVF, PQ) can be composed to optimize for different memory-recall-latency profiles. You can literally mix and match like building blocks.

Disadvantages

  • Approximate retrieval means some relevant results will be missed -- no guarantee of exact nearest neighbors. You're always operating with a recall < 1.0.

  • Index construction can be expensive: building an HNSW graph over 100M vectors requires significant CPU time (hours) and peak memory that can be 2-3x the final index size.

  • Most vector databases lack mature transaction support, schema evolution, and joins that relational databases provide. Don't try to use them as your primary database.

  • Embedding model upgrades require full re-indexing -- old and new embeddings are incompatible in the same space. For 1B vectors, this can take days and cost hundreds of dollars in compute.

  • Operational complexity increases with scale: sharding, replication, and compaction introduce infrastructure overhead that requires dedicated SRE attention.

Failure Modes & Debugging

Silent recall degradation

Cause

Index parameters (ef_search, nprobe) set too low for the data distribution, or embedding model updated without re-indexing. This is the most insidious failure because nothing breaks -- things just get slightly worse.

Symptoms

Downstream task quality drops (e.g., RAG answers become less accurate) but no errors are raised. Metrics like NDCG@10 or MRR decline gradually. Users start complaining that "search feels off" but you can't reproduce the bug.

Mitigation

Monitor recall against a golden evaluation set. Automate recall regression tests in CI. Track embedding model version alongside index version. Set up alerts when recall@10 drops below your threshold (e.g., 0.93).

Metric mismatch

Cause

Vector store configured with L2 distance but embeddings were trained with cosine similarity (or vice versa). I've personally seen this waste a team two weeks of debugging.

Symptoms

Retrieval results are semantically poor despite vectors being correctly stored. Switching to the correct metric immediately improves quality -- a dead giveaway.

Mitigation

Always verify the training objective of the embedding model and configure the vector store distance metric to match. Document this in your pipeline configuration and add a validation check.

Metadata filter starvation

Cause

Post-filtering applied after ANN retrieval returns fewer than kk results because most candidates are filtered out by the metadata predicate.

Symptoms

Queries with selective filters return fewer results than requested or return irrelevant items. Latency may spike as the system retries with a wider search beam. Common in multi-tenant setups where a small tenant's documents are a tiny fraction of the total corpus.

Mitigation

Use pre-filtering where supported (Qdrant, Weaviate). Alternatively, partition the index by the most selective filter dimension. For multi-tenant RAG, consider per-tenant collections if tenant count is manageable (<1000).

Memory exhaustion on index load

Cause

HNSW or flat indices for large collections loaded entirely into RAM; insufficient memory provisioned. This is especially common when teams underestimate index overhead.

Symptoms

OOM kills, container restarts, or extremely slow queries as the OS pages index data to disk. On Kubernetes, you'll see pods in CrashLoopBackOff.

Mitigation

Use memory-mapped (mmap) storage, enable on-disk indices, or switch to compressed index types (IVF+PQ, scalar quantization). Always calculate expected memory before deploying: vectors ×\times dimensions ×\times 4 bytes ×\times 1.3 (for HNSW overhead).

Stale index after embedding model update

Cause

Embedding model retrained or swapped but corpus not re-embedded and re-indexed. The query vectors now live in a completely different geometric space than the stored vectors.

Symptoms

Query vectors and stored vectors occupy different embedding spaces; recall drops to near-random levels. This is catastrophic, not gradual.

Mitigation

Implement blue-green re-indexing: build a new collection with updated embeddings, validate recall on an evaluation set, swap the alias once validation passes, then delete the old collection. Never do in-place model swaps.

Placement in an ML System

Where Does It Sit in the Pipeline?

In a RAG pipeline, the vector store sits after the embedding model has converted chunks into vectors and before the re-ranker or context assembler that formats retrieved passages for the LLM.

For recommendation systems (think Spotify's Discover Weekly or Flipkart's "Similar Products"), it sits after the user/item encoder and before the scoring/ranking layer.

The vector store is the primary bottleneck for retrieval latency and directly determines the quality ceiling of downstream generation or ranking. If the vector store misses a relevant document, no amount of re-ranking or prompt engineering can recover it.

Key Insight: The vector store is the gatekeeper of your retrieval pipeline. Everything downstream can only work with what it returns. This is why recall@k matters so much.

Pipeline Stage

Retrieval / Serving

Upstream

  • Embedding Model
  • Document Loader
  • Text Chunker

Downstream

  • Re-Ranker
  • Context Assembler
  • LLM (Generator)

Scaling Bottlenecks

Where It Gets Tight

Primary bottlenecks are memory (index size grows linearly with corpus size and embedding dimension) and write throughput during bulk ingestion.

At query time, HNSW scales well but becomes memory-bound. IVF+PQ trades recall for memory efficiency. Sharding across nodes introduces network latency for distributed search -- typically adding 2-5ms per cross-shard hop.

Some concrete numbers: a single Qdrant node can handle ~5,000 QPS for 1M vectors with HNSW. Scaling to 100M vectors usually requires 4-8 shards, bringing per-query latency from ~5ms to ~15-20ms.

Production Case Studies

SpotifyMusic Streaming

Spotify uses approximate nearest neighbor search over user and track embeddings to power its recommendation engine. Embeddings are trained via collaborative filtering and content-based models, then indexed for real-time retrieval of candidate tracks during personalized playlist generation. With 600M+ users and 100M+ tracks, brute-force search was never an option.

Outcome:

ANN-based retrieval reduced candidate generation latency from seconds (exhaustive search) to single-digit milliseconds -- roughly a 1000x speedup -- enabling real-time personalization at scale for 600M+ users.

NotionProductivity Software

Notion built its AI Q&A feature using a RAG architecture with a vector store backing semantic search over user workspace content. Documents are chunked, embedded, and indexed per-workspace. At query time, the user's question is embedded and used to retrieve relevant chunks, which are then fed to an LLM for answer generation. The per-workspace isolation is a textbook example of metadata-filtered multi-tenant retrieval.

Outcome:

The vector store enables sub-200ms retrieval across millions of documents per workspace, with metadata filtering ensuring retrieval is scoped to the user's accessible content. This powers Notion AI's ability to answer questions about your own notes and docs.

FlipkartE-commerce (India)

Flipkart employs vector similarity search for visual product recommendations and duplicate product detection. Product images are encoded into embeddings via a fine-tuned vision transformer, stored in a vector index, and queried to surface visually similar items or flag catalog duplicates. With millions of product listings and thousands of new additions daily, maintaining index freshness is a key engineering challenge.

Outcome:

Reduced duplicate product listings by approximately 30% in high-volume categories, improving catalog quality and reducing customer confusion. This directly impacts GMV by reducing return rates on misidentified products.

Tooling & Ecosystem

FAISS
C++ / PythonOpen Source

Meta's library for efficient similarity search. Supports GPU acceleration, composable index types (IVF, PQ, HNSW, flat). Best for embedding in application code without a separate database service.

Milvus
Go / C++Open Source

Cloud-native vector database with distributed architecture, multiple index types, hybrid search, and multi-tenancy. Backed by Zilliz. Best for large-scale production deployments.

Qdrant
RustOpen Source

High-performance vector database written in Rust. Supports payload-based filtering, quantization, and distributed deployment. Strong filtering capabilities for multi-tenant RAG.

Pinecone
Commercial

Fully managed vector database service. Zero operational overhead — handles indexing, sharding, and replication. Best for teams that want to avoid infrastructure management.

Weaviate
GoOpen Source

Open-source vector database with built-in vectorization modules, GraphQL API, and hybrid (vector + BM25) search. Good for applications that want integrated embedding generation.

pgvector
COpen Source

PostgreSQL extension for vector similarity search. Ideal when you want vector search alongside relational data in a single database without introducing a separate system.

Chroma
Python / RustOpen Source

Lightweight, developer-friendly embedding database. Low setup friction for prototyping and small-to-medium workloads. Wraps HNSWlib under the hood.

Research & References

Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs

Malkov & Yashunin (2018)IEEE TPAMI, Vol. 42, No. 4

Introduced the HNSW algorithm — a multi-layer proximity graph that achieves logarithmic search complexity. Now the dominant index type in most vector databases.

Product Quantization for Nearest Neighbor Search

Jegou, Douze & Schmid (2011)IEEE TPAMI, Vol. 33, No. 1

Decomposed high-dimensional vectors into sub-vector quantization, enabling compact codes and efficient distance estimation that scales to billions of vectors.

Billion-Scale Similarity Search with GPUs

Johnson, Douze & Jegou (2021)IEEE Transactions on Big Data

Described GPU-optimized k-selection and brute-force/IVF search implementations forming the algorithmic basis of the FAISS library.

The Faiss Library

Douze, Guzhva, Deng, Johnson, Szilvasy, Mazare, Lomeli, Hosseini & Jegou (2024)arXiv preprint

Comprehensive systems reference for FAISS covering all index types, composability, and benchmarks.

Milvus: A Purpose-Built Vector Data Management System

Wang, Yi, Guo, Li, Wang, Liu, Zhao, Li & Xu (2021)ACM SIGMOD 2021

Presented Milvus as a cloud-native vector database with heterogeneous computing support and hybrid scalar+vector queries.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Lewis, Perez, Piktus et al. (2020)NeurIPS 2020

Established the RAG paradigm — combining a dense passage retriever backed by a FAISS index with a seq2seq generator — that drives most production LLM+vector-store architectures.

Survey of Vector Database Management Systems

Pan, Wang & Li (2024)The VLDB Journal

Comprehensive survey of 20+ vector DBMSs analyzing indexing, storage, and query processing techniques across commercial and open-source systems.

Accelerating Large-Scale Inference with Anisotropic Vector Quantization

Guo, Sun, Lindgren, Geng, Simcha, Chern & Kumar (2020)ICML 2020

Introduced anisotropic quantization that penalizes the parallel component of residual error, achieving state-of-the-art recall-vs-speed tradeoffs for MIPS; underlies Google's ScaNN library.

Interview & Evaluation Perspective

Common Interview Questions

  • How would you design a vector store for a RAG system serving 10M documents?

  • What is the difference between HNSW and IVF indexing? When would you choose one over the other?

  • How do you handle embedding model upgrades when you have billions of indexed vectors?

  • Explain the recall-latency tradeoff in approximate nearest neighbor search.

  • How would you implement metadata filtering alongside vector search?

Key Points to Mention

  • Recall@k is the primary quality metric -- always quantify the recall you need before choosing an index type. Lead with numbers, not vibes.

  • HNSW provides the best recall-throughput tradeoff for most workloads but is memory-intensive (expect ~1.3x the raw vector size); IVF+PQ trades recall for memory efficiency (as low as 0.01x the raw size).

  • Distance metric must match the embedding model's training objective (cosine, dot product, or L2) -- this is a hard requirement, not a preference.

  • Blue-green re-indexing is the production pattern for embedding model upgrades. Never do in-place model swaps.

  • Pre-filtering vs. post-filtering has major implications for result quality and latency. Pre-filtering is almost always better but requires index support.

Pitfalls to Avoid

  • Claiming vector stores provide exact nearest neighbors -- they are approximate by design. The clue is in the name: Approximate Nearest Neighbor.

  • Ignoring the memory footprint: a naive flat index of 100M x 768-dim vectors requires ~300 GB of RAM. That's a $2,400/month (~INR 2 lakh/month) cloud instance just for the index.

  • Conflating the vector store with the entire retrieval pipeline -- it is one component, not the whole system. Always clarify the boundary.

  • Assuming one index type fits all workloads without discussing scale, latency, and recall requirements. HNSW is great, but it's not always the answer.

Senior-Level Expectation

A senior candidate should be able to discuss the full lifecycle: embedding model selection, chunking strategy, index type selection with quantitative justification (not just "I'd use HNSW"), metadata schema design, re-indexing strategy (blue-green with validation gates), monitoring (recall regression, latency P99, QPS headroom), and capacity planning (memory per vector, storage growth rate, cost projections). The ability to reason about cost-performance tradeoffs -- especially under budget constraints common in Indian startups -- separates senior engineers from mid-level ones.

Summary

Let's recap what we've covered:

  • A vector store is a purpose-built system for indexing, persisting, and querying high-dimensional embedding vectors using approximate nearest neighbor algorithms. It's the retrieval backbone of modern ML systems.

  • The core tradeoff is recall vs. throughput: every ANN index sacrifices exact accuracy for sub-linear query time. Tuning this balance -- not just once, but continuously -- is the primary operational concern.

  • HNSW dominates for its recall-throughput profile (5-10ms queries at 0.95+ recall). IVF+PQ is preferred when memory is constrained (4-16x compression). Both can be composed in systems like FAISS.

  • Always match the distance metric to the embedding model's training objective -- cosine, dot product, and L2 are not interchangeable. Getting this wrong is the #1 silent failure mode.

  • Vector stores own geometry, not semantics: retrieval quality is bounded by the upstream embedding model's representation quality. If your results are bad, look upstream first.

A vector store is the bridge between learned representations and real-time retrieval. It enables ML systems to leverage dense embeddings at production scale by providing the indexing infrastructure that brute-force search simply cannot. Moving on to the next block in the pipeline, remember: the vector store determines the quality ceiling for everything downstream.

ML System Design Reference · Built by QnA Lab