Vector Store in Machine Learning
Let's start with the big picture. A vector store is a specialized storage and retrieval system built for one job: persist high-dimensional embedding vectors and answer similarity queries against them at blazing-low latency.
In ML systems, vector stores sit right at the intersection of learned representations and information retrieval. They accept dense vectors produced by encoder models and return the nearest neighbors under a chosen distance metric -- cosine similarity, inner product, or L2 distance.
But why did this become such a big deal? Because retrieval-augmented generation (RAG) proved that coupling LLMs with external knowledge stores dramatically reduces hallucination and keeps responses grounded in actual source material. That insight turned vector stores from a niche research tool into a production necessity practically overnight.
Today, vector stores underpin recommendation engines, semantic search, duplicate detection, and virtually every LLM-powered application that needs access to domain-specific data at inference time. From Flipkart's product search to Swiggy's restaurant recommendations -- if an ML system retrieves anything by meaning rather than keywords, there's almost certainly a vector store under the hood.
Concept Snapshot
- What It Is
- A purpose-built data store optimized for indexing, persisting, and querying high-dimensional vectors using approximate nearest neighbor (ANN) algorithms.
- Category
- RAG Pipeline
- Complexity
- Intermediate
- Inputs / Outputs
- Inputs: embedding vectors (typically 384-4096 dimensions) with optional metadata. Outputs: ranked list of nearest neighbor vectors with similarity scores.
- System Placement
- Sits between the embedding model (upstream) and the context assembler or re-ranker (downstream) in a RAG or retrieval pipeline.
- Also Known As
- vector database, vector index, embedding store, ANN index, similarity search engine
- Typical Users
- ML engineers, backend engineers, data engineers, search/retrieval specialists
- Prerequisites
- Embeddings, Distance metrics (cosine, L2, dot product), Basic indexing concepts
- Key Terms
- ANNHNSWIVFproduct quantizationrecall@kQPSembedding dimensionmetadata filtering
Why This Concept Exists
The Problem with Regular Databases
Standard relational and document databases index data by discrete attributes -- primary keys, B-tree ranges, and inverted text indices. That's great for queries like "find users where age > 25" or "search documents containing the word 'retrieval'." But none of these structures are designed for the geometric proximity queries that dense embedding vectors require.
Let's put a number on this. A brute-force scan over one million 768-dimensional vectors demands roughly 3 billion floating-point comparisons per query. At production QPS, that's simply not feasible.
Two Trends That Made This Urgent
The need for a dedicated abstraction grew alongside two parallel trends:
Trend 1: Representation learning matured. Transformer-based encoders could now compress sentences, images, and structured records into semantically meaningful vectors. We finally had embeddings worth storing.
Trend 2: These representations went from offline analytics to live, user-facing retrieval. Autocomplete, personalized feeds, conversational AI with external knowledge -- all of these need sub-100ms vector lookups.
The Algorithmic Foundation
Approximate nearest neighbor (ANN) algorithms -- locality-sensitive hashing, product quantization, and graph-based methods like HNSW -- provided the building blocks. BUT raw algorithms aren't enough for production. You also need persistence, replication, metadata filtering, and operational tooling.
That's exactly the gap vector stores fill. They package ANN algorithms into managed systems with all the production plumbing that individual index libraries like FAISS or HNSWlib don't provide on their own.
Key Takeaway: Vector stores exist because traditional databases can't do geometric proximity queries, and raw ANN libraries don't provide the persistence and operational features production systems need. Vector stores bridge that gap.
Core Intuition & Mental Model
The Core Promise
Here's the fundamental guarantee: given a query vector, a vector store can return an approximately correct set of nearest neighbors in sub-linear time -- typically for graph-based indices -- even when your corpus contains hundreds of millions of vectors.
The word "approximately" is load-bearing. Every ANN index trades exactness for speed, and the recall-throughput curve is the central knob you'll be tuning for your entire career working with these systems.
What a Vector Store Does NOT Do
This is where I see a lot of confusion. A vector store finds geometrically close vectors, not necessarily semantically relevant ones. If the upstream embedding model produces poor representations, the vector store will faithfully return poor results at very high speed.
The boundary of responsibility is clear: the vector store owns geometry, not meaning.
A Useful Mental Model
Think of a vector store as a spatial file system. Just as a file system organizes bytes into a hierarchy optimized for path-based lookups, a vector store organizes floating-point arrays into a structure optimized for proximity-based lookups.
Both abstractions are inert -- they reflect the quality of what you put into them. Garbage embeddings in, garbage retrieval out. That was pretty simple, wasn't it?
Expert Note: When debugging poor retrieval quality, start with the embedding model, not the vector store. Nine times out of ten, the issue is upstream -- bad chunking, wrong model, or mismatched fine-tuning -- not the index itself.
Technical Foundations
The Math (Don't Worry, We'll Build Up to It)
Let's formalize what a vector store actually does. I'll explain the intuition first, then give you the notation.
Intuition: You have a big collection of vectors. Someone hands you a new query vector. You need to find the vectors in your collection that are closest to the query. And you need to do it fast.
Formally: A vector store implements an index over a collection where each . Given a query vector and a positive integer , the store returns a set with that approximates the true -nearest neighbors under a distance function .
Measuring Approximation Quality
How do we know if our "approximate" answer is any good? We use recall@k: the fraction of true top- neighbors present in the returned set .
A recall of 0.95 at means that, on average, 9.5 of the 10 returned items are among the actual 10 closest vectors. That's usually good enough for production.
Distance Functions
Vector stores commonly support three distance functions:
- Cosine similarity:
- L2 (Euclidean) distance:
- Inner product (dot product):
Which Metric Should You Use?
This is critical, and I've seen teams get it wrong more often than you'd expect. The choice of metric must match the training objective of the embedding model:
- Models trained with contrastive losses on normalized vectors? Use cosine similarity.
- Maximum inner product search (MIPS) problems where vector magnitudes carry information? Use inner product.
- Models explicitly trained with L2 distance? Use L2.
Warning: Using the wrong distance metric is a silent killer. Your store will happily return results -- they'll just be the wrong results. No error, no warning, just degraded quality.
Internal Architecture
A production vector store typically consists of four subsystems: an ingestion layer that accepts vectors and metadata, an indexing engine that builds and maintains ANN data structures, a query engine that executes search and filtering, and a storage backend that handles persistence and replication. Let's walk through each one.

Key Components
Ingestion Layer
Accepts vectors (and optional payload/metadata) via API or bulk import, validates dimensions, and routes to the indexing engine.
ANN Index Engine
Builds and maintains the core index structure (HNSW graph, IVF clusters, or product quantization codebooks). Handles incremental inserts and deletes.
Query Engine
Receives a query vector and optional metadata filters, traverses the index, and returns top-k results with scores. Manages concurrency and query planning.
Storage Backend
Persists vectors and index structures to disk. Handles write-ahead logging, snapshots, and replication for durability and high availability.
Metadata Filter Engine
Applies pre-filter or post-filter predicates on scalar metadata fields (e.g., category, timestamp) alongside vector similarity search.
Data Flow
Here's how data flows through the system:
Write Path: Embedding vectors arrive via the ingestion API -> validated and assigned internal IDs -> inserted into the ANN index (and WAL) -> persisted to storage.
Read Path: A query vector enters the query engine -> the ANN index is traversed (with optional metadata pre-filtering) -> top-k candidates are scored -> results are returned with similarity scores and payloads.
The write path and read path are largely independent, which is why most vector stores can handle concurrent reads and writes without heavy locking.
A directed flow from 'Embedding Model' -> 'Ingestion API' -> 'ANN Index' <-> 'Storage Backend', with a parallel 'Query API' -> 'ANN Index' -> 'Top-K Results' path.
How to Implement
Two Approaches to Implementation
Implementation patterns fall into two broad categories:
Option A: Embedded index library (FAISS, HNSWlib, ScaNN) -- you run the index directly in your application process. Great for prototyping and when you control the infra.
Option B: Managed vector database (Pinecone, Weaviate, Qdrant, Milvus, pgvector) -- a separate service handles everything. Better for production when you need replication, sharding, metadata filtering, or multi-tenancy.
The right choice depends on scale, operational maturity, and whether you need features beyond basic ANN search. For a startup in Bengaluru building their first RAG prototype, Chroma or pgvector might be perfect. For a platform serving millions of queries per day, you'll likely want Qdrant, Milvus, or Pinecone.
Cost Note: Pinecone's managed service starts around $70/month (~INR 5,900/month) for a basic pod. Self-hosted Qdrant or Milvus on a cloud VM can be cheaper but requires ops investment. FAISS is free but you own the infrastructure entirely.
import faiss
import numpy as np
d = 768 # embedding dimension
nlist = 100 # number of IVF clusters
m = 64 # PQ sub-quantizers
# Train the index on a sample
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(training_vectors) # shape (n_train, d)
index.add(corpus_vectors) # shape (n_corpus, d)
# Search
index.nprobe = 10 # clusters to visit at query time
D, I = index.search(query_vectors, k=10)
# D: distances, I: indices of nearest neighborsFAISS (Facebook AI Similarity Search) provides composable index types. IVF clusters the space for coarse quantization, then PQ compresses residual vectors for memory efficiency. The nprobe parameter controls the recall-throughput tradeoff: higher values improve recall at the cost of latency.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue
client = QdrantClient(host="localhost", port=6333)
# Create collection
client.create_collection(
collection_name="documents",
vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
# Upsert vectors with metadata
client.upsert(
collection_name="documents",
points=[
PointStruct(id=1, vector=embedding, payload={"source": "arxiv", "year": 2024}),
],
)
# Query with metadata filter
results = client.search(
collection_name="documents",
query_vector=query_embedding,
query_filter=Filter(must=[FieldCondition(key="source", match=MatchValue(value="arxiv"))]),
limit=10,
)Qdrant is an open-source vector database written in Rust. This example demonstrates the common pattern of combining vector similarity search with scalar metadata filtering — essential for multi-tenant RAG systems where retrieved documents must belong to the querying user's corpus.
# Qdrant collection config (YAML equivalent)
collection_name: documents
vectors:
size: 768
distance: Cosine
hnsw_config:
m: 16
ef_construct: 128
optimizers_config:
indexing_threshold: 20000
memmap_threshold: 50000
replication_factor: 2Common Implementation Mistakes
- ●
Choosing cosine similarity when the embedding model was trained with dot-product objectives (or vice versa) -- mismatched metrics silently degrade recall. I cannot stress this enough: always check the model card before configuring your distance metric.
- ●
Skipping index parameter tuning (e.g., HNSW
ef_construction, IVFnprobe) and accepting library defaults that may not suit your recall/latency requirements. The defaults are designed for generality, not your specific workload. - ●
Storing raw text alongside vectors in the vector store instead of keeping a lightweight ID reference and resolving content from a separate document store -- this bloats memory and slows index operations. Your vector store should be lean.
- ●
Failing to implement versioned re-indexing when the embedding model changes -- old and new embeddings are not comparable in the same vector space. This one can silently wreck your entire retrieval pipeline.
- ●
Ignoring metadata filtering performance: post-filtering can return fewer than results when the filter is highly selective; pre-filtering requires index support. We'll cover this more in the failure modes section.
When Should You Use This?
Use When
Your application requires sub-100ms retrieval over >100K embedding vectors
You are building a RAG system that needs to ground LLM responses in domain-specific documents
Semantic similarity (rather than keyword matching) is the primary retrieval signal
You need metadata-filtered vector search (e.g., per-tenant or per-category retrieval, like serving different restaurant catalogs per city in a Zomato-like app)
Your embedding dimensions exceed what full-text search or traditional inverted indices can handle efficiently
Avoid When
Your corpus is small (<10K items) and brute-force exact search meets latency requirements -- the added complexity of an ANN index is unnecessary. Seriously, just use a flat index or even NumPy.
Keyword or lexical matching is sufficient for your use case (consider BM25 or Elasticsearch instead)
You need strong transactional guarantees (ACID) that most vector databases do not yet provide
Your retrieval is based on structured attributes (category, price range, date) rather than semantic similarity -- a relational database with standard indices will be simpler and faster
Key Tradeoffs
The Fundamental Tradeoff: Recall vs. Throughput
Higher recall (more accurate nearest neighbors) requires visiting more index nodes, which increases latency. For most production systems, a recall@10 of 0.95-0.99 is the sweet spot. Below 0.90, retrieval quality degrades noticeably in downstream tasks like RAG -- your LLM starts missing critical context.
The Second Axis: Memory vs. Accuracy
Compressed indices (PQ, scalar quantization) use 4-16x less memory but reduce recall compared to flat or HNSW-only indices. For example, a 100M-vector corpus at 768 dimensions takes ~300 GB as a flat index. With PQ (8-byte codes), that drops to ~800 MB -- a 375x reduction. BUT your recall might drop from 0.99 to 0.92.
The right balance depends on your budget and quality requirements. For a cost-sensitive deployment serving a Flipkart product catalog, IVF+PQ might save you $2,000/month (~INR 1.68 lakh/month) in memory costs compared to a full HNSW index.
Rule of Thumb: Start with HNSW (best recall), measure your memory footprint, and only move to compressed indices if the cost is prohibitive.
Alternatives & Comparisons
Semantic search is a higher-level abstraction that includes the embedding model, vector store, and optional re-ranking. A vector store is one component within a semantic search pipeline. If you need end-to-end retrieval rather than just the storage layer, you may want a managed semantic search service. Think of it this way: a vector store is the engine, semantic search is the whole car.
Hybrid search combines dense vector retrieval with sparse lexical retrieval (BM25). When keyword precision matters alongside semantic recall -- for example, matching specific product codes, medical terms, or Indian pin codes -- hybrid search outperforms pure vector search alone. Most production systems at scale end up adopting hybrid search eventually.
Pros, Cons & Tradeoffs
Advantages
Sub-linear query time over millions to billions of vectors, enabling real-time similarity search at scale. We're talking 5-10ms queries over 100M vectors -- 1000x faster than brute force.
Native support for the geometric operations that dense embeddings require -- no impedance mismatch with SQL or document stores. The data structure matches the workload.
Metadata filtering allows multi-tenant, temporally scoped, or attribute-constrained retrieval in a single query. Essential for SaaS applications where each customer's data must be isolated.
Mature ecosystem of open-source options (FAISS, Milvus, Qdrant, Weaviate, Chroma) and managed services (Pinecone, Zilliz). You have real choices, not vendor lock-in.
Index types (HNSW, IVF, PQ) can be composed to optimize for different memory-recall-latency profiles. You can literally mix and match like building blocks.
Disadvantages
Approximate retrieval means some relevant results will be missed -- no guarantee of exact nearest neighbors. You're always operating with a recall < 1.0.
Index construction can be expensive: building an HNSW graph over 100M vectors requires significant CPU time (hours) and peak memory that can be 2-3x the final index size.
Most vector databases lack mature transaction support, schema evolution, and joins that relational databases provide. Don't try to use them as your primary database.
Embedding model upgrades require full re-indexing -- old and new embeddings are incompatible in the same space. For 1B vectors, this can take days and cost hundreds of dollars in compute.
Operational complexity increases with scale: sharding, replication, and compaction introduce infrastructure overhead that requires dedicated SRE attention.
Failure Modes & Debugging
Silent recall degradation
Cause
Index parameters (ef_search, nprobe) set too low for the data distribution, or embedding model updated without re-indexing. This is the most insidious failure because nothing breaks -- things just get slightly worse.
Symptoms
Downstream task quality drops (e.g., RAG answers become less accurate) but no errors are raised. Metrics like NDCG@10 or MRR decline gradually. Users start complaining that "search feels off" but you can't reproduce the bug.
Mitigation
Monitor recall against a golden evaluation set. Automate recall regression tests in CI. Track embedding model version alongside index version. Set up alerts when recall@10 drops below your threshold (e.g., 0.93).
Metric mismatch
Cause
Vector store configured with L2 distance but embeddings were trained with cosine similarity (or vice versa). I've personally seen this waste a team two weeks of debugging.
Symptoms
Retrieval results are semantically poor despite vectors being correctly stored. Switching to the correct metric immediately improves quality -- a dead giveaway.
Mitigation
Always verify the training objective of the embedding model and configure the vector store distance metric to match. Document this in your pipeline configuration and add a validation check.
Metadata filter starvation
Cause
Post-filtering applied after ANN retrieval returns fewer than results because most candidates are filtered out by the metadata predicate.
Symptoms
Queries with selective filters return fewer results than requested or return irrelevant items. Latency may spike as the system retries with a wider search beam. Common in multi-tenant setups where a small tenant's documents are a tiny fraction of the total corpus.
Mitigation
Use pre-filtering where supported (Qdrant, Weaviate). Alternatively, partition the index by the most selective filter dimension. For multi-tenant RAG, consider per-tenant collections if tenant count is manageable (<1000).
Memory exhaustion on index load
Cause
HNSW or flat indices for large collections loaded entirely into RAM; insufficient memory provisioned. This is especially common when teams underestimate index overhead.
Symptoms
OOM kills, container restarts, or extremely slow queries as the OS pages index data to disk. On Kubernetes, you'll see pods in CrashLoopBackOff.
Mitigation
Use memory-mapped (mmap) storage, enable on-disk indices, or switch to compressed index types (IVF+PQ, scalar quantization). Always calculate expected memory before deploying: vectors dimensions 4 bytes 1.3 (for HNSW overhead).
Stale index after embedding model update
Cause
Embedding model retrained or swapped but corpus not re-embedded and re-indexed. The query vectors now live in a completely different geometric space than the stored vectors.
Symptoms
Query vectors and stored vectors occupy different embedding spaces; recall drops to near-random levels. This is catastrophic, not gradual.
Mitigation
Implement blue-green re-indexing: build a new collection with updated embeddings, validate recall on an evaluation set, swap the alias once validation passes, then delete the old collection. Never do in-place model swaps.
Placement in an ML System
Where Does It Sit in the Pipeline?
In a RAG pipeline, the vector store sits after the embedding model has converted chunks into vectors and before the re-ranker or context assembler that formats retrieved passages for the LLM.
For recommendation systems (think Spotify's Discover Weekly or Flipkart's "Similar Products"), it sits after the user/item encoder and before the scoring/ranking layer.
The vector store is the primary bottleneck for retrieval latency and directly determines the quality ceiling of downstream generation or ranking. If the vector store misses a relevant document, no amount of re-ranking or prompt engineering can recover it.
Key Insight: The vector store is the gatekeeper of your retrieval pipeline. Everything downstream can only work with what it returns. This is why recall@k matters so much.
Pipeline Stage
Retrieval / Serving
Upstream
- Embedding Model
- Document Loader
- Text Chunker
Downstream
- Re-Ranker
- Context Assembler
- LLM (Generator)
Scaling Bottlenecks
Primary bottlenecks are memory (index size grows linearly with corpus size and embedding dimension) and write throughput during bulk ingestion.
At query time, HNSW scales well but becomes memory-bound. IVF+PQ trades recall for memory efficiency. Sharding across nodes introduces network latency for distributed search -- typically adding 2-5ms per cross-shard hop.
Some concrete numbers: a single Qdrant node can handle ~5,000 QPS for 1M vectors with HNSW. Scaling to 100M vectors usually requires 4-8 shards, bringing per-query latency from ~5ms to ~15-20ms.
Production Case Studies
Spotify uses approximate nearest neighbor search over user and track embeddings to power its recommendation engine. Embeddings are trained via collaborative filtering and content-based models, then indexed for real-time retrieval of candidate tracks during personalized playlist generation. With 600M+ users and 100M+ tracks, brute-force search was never an option.
ANN-based retrieval reduced candidate generation latency from seconds (exhaustive search) to single-digit milliseconds -- roughly a 1000x speedup -- enabling real-time personalization at scale for 600M+ users.
Notion built its AI Q&A feature using a RAG architecture with a vector store backing semantic search over user workspace content. Documents are chunked, embedded, and indexed per-workspace. At query time, the user's question is embedded and used to retrieve relevant chunks, which are then fed to an LLM for answer generation. The per-workspace isolation is a textbook example of metadata-filtered multi-tenant retrieval.
The vector store enables sub-200ms retrieval across millions of documents per workspace, with metadata filtering ensuring retrieval is scoped to the user's accessible content. This powers Notion AI's ability to answer questions about your own notes and docs.
Flipkart employs vector similarity search for visual product recommendations and duplicate product detection. Product images are encoded into embeddings via a fine-tuned vision transformer, stored in a vector index, and queried to surface visually similar items or flag catalog duplicates. With millions of product listings and thousands of new additions daily, maintaining index freshness is a key engineering challenge.
Reduced duplicate product listings by approximately 30% in high-volume categories, improving catalog quality and reducing customer confusion. This directly impacts GMV by reducing return rates on misidentified products.
Tooling & Ecosystem
Meta's library for efficient similarity search. Supports GPU acceleration, composable index types (IVF, PQ, HNSW, flat). Best for embedding in application code without a separate database service.
Cloud-native vector database with distributed architecture, multiple index types, hybrid search, and multi-tenancy. Backed by Zilliz. Best for large-scale production deployments.
High-performance vector database written in Rust. Supports payload-based filtering, quantization, and distributed deployment. Strong filtering capabilities for multi-tenant RAG.
Fully managed vector database service. Zero operational overhead — handles indexing, sharding, and replication. Best for teams that want to avoid infrastructure management.
Open-source vector database with built-in vectorization modules, GraphQL API, and hybrid (vector + BM25) search. Good for applications that want integrated embedding generation.
PostgreSQL extension for vector similarity search. Ideal when you want vector search alongside relational data in a single database without introducing a separate system.
Lightweight, developer-friendly embedding database. Low setup friction for prototyping and small-to-medium workloads. Wraps HNSWlib under the hood.
Research & References
Malkov & Yashunin (2018)IEEE TPAMI, Vol. 42, No. 4
Introduced the HNSW algorithm — a multi-layer proximity graph that achieves logarithmic search complexity. Now the dominant index type in most vector databases.
Jegou, Douze & Schmid (2011)IEEE TPAMI, Vol. 33, No. 1
Decomposed high-dimensional vectors into sub-vector quantization, enabling compact codes and efficient distance estimation that scales to billions of vectors.
Johnson, Douze & Jegou (2021)IEEE Transactions on Big Data
Described GPU-optimized k-selection and brute-force/IVF search implementations forming the algorithmic basis of the FAISS library.
Douze, Guzhva, Deng, Johnson, Szilvasy, Mazare, Lomeli, Hosseini & Jegou (2024)arXiv preprint
Comprehensive systems reference for FAISS covering all index types, composability, and benchmarks.
Wang, Yi, Guo, Li, Wang, Liu, Zhao, Li & Xu (2021)ACM SIGMOD 2021
Presented Milvus as a cloud-native vector database with heterogeneous computing support and hybrid scalar+vector queries.
Lewis, Perez, Piktus et al. (2020)NeurIPS 2020
Established the RAG paradigm — combining a dense passage retriever backed by a FAISS index with a seq2seq generator — that drives most production LLM+vector-store architectures.
Pan, Wang & Li (2024)The VLDB Journal
Comprehensive survey of 20+ vector DBMSs analyzing indexing, storage, and query processing techniques across commercial and open-source systems.
Guo, Sun, Lindgren, Geng, Simcha, Chern & Kumar (2020)ICML 2020
Introduced anisotropic quantization that penalizes the parallel component of residual error, achieving state-of-the-art recall-vs-speed tradeoffs for MIPS; underlies Google's ScaNN library.
Interview & Evaluation Perspective
Common Interview Questions
- ●
How would you design a vector store for a RAG system serving 10M documents?
- ●
What is the difference between HNSW and IVF indexing? When would you choose one over the other?
- ●
How do you handle embedding model upgrades when you have billions of indexed vectors?
- ●
Explain the recall-latency tradeoff in approximate nearest neighbor search.
- ●
How would you implement metadata filtering alongside vector search?
Key Points to Mention
- ●
Recall@k is the primary quality metric -- always quantify the recall you need before choosing an index type. Lead with numbers, not vibes.
- ●
HNSW provides the best recall-throughput tradeoff for most workloads but is memory-intensive (expect ~1.3x the raw vector size); IVF+PQ trades recall for memory efficiency (as low as 0.01x the raw size).
- ●
Distance metric must match the embedding model's training objective (cosine, dot product, or L2) -- this is a hard requirement, not a preference.
- ●
Blue-green re-indexing is the production pattern for embedding model upgrades. Never do in-place model swaps.
- ●
Pre-filtering vs. post-filtering has major implications for result quality and latency. Pre-filtering is almost always better but requires index support.
Pitfalls to Avoid
- ●
Claiming vector stores provide exact nearest neighbors -- they are approximate by design. The clue is in the name: Approximate Nearest Neighbor.
- ●
Ignoring the memory footprint: a naive flat index of 100M x 768-dim vectors requires ~300 GB of RAM. That's a $2,400/month (~INR 2 lakh/month) cloud instance just for the index.
- ●
Conflating the vector store with the entire retrieval pipeline -- it is one component, not the whole system. Always clarify the boundary.
- ●
Assuming one index type fits all workloads without discussing scale, latency, and recall requirements. HNSW is great, but it's not always the answer.
Senior-Level Expectation
A senior candidate should be able to discuss the full lifecycle: embedding model selection, chunking strategy, index type selection with quantitative justification (not just "I'd use HNSW"), metadata schema design, re-indexing strategy (blue-green with validation gates), monitoring (recall regression, latency P99, QPS headroom), and capacity planning (memory per vector, storage growth rate, cost projections). The ability to reason about cost-performance tradeoffs -- especially under budget constraints common in Indian startups -- separates senior engineers from mid-level ones.
Summary
Let's recap what we've covered:
-
A vector store is a purpose-built system for indexing, persisting, and querying high-dimensional embedding vectors using approximate nearest neighbor algorithms. It's the retrieval backbone of modern ML systems.
-
The core tradeoff is recall vs. throughput: every ANN index sacrifices exact accuracy for sub-linear query time. Tuning this balance -- not just once, but continuously -- is the primary operational concern.
-
HNSW dominates for its recall-throughput profile (5-10ms queries at 0.95+ recall). IVF+PQ is preferred when memory is constrained (4-16x compression). Both can be composed in systems like FAISS.
-
Always match the distance metric to the embedding model's training objective -- cosine, dot product, and L2 are not interchangeable. Getting this wrong is the #1 silent failure mode.
-
Vector stores own geometry, not semantics: retrieval quality is bounded by the upstream embedding model's representation quality. If your results are bad, look upstream first.
A vector store is the bridge between learned representations and real-time retrieval. It enables ML systems to leverage dense embeddings at production scale by providing the indexing infrastructure that brute-force search simply cannot. Moving on to the next block in the pipeline, remember: the vector store determines the quality ceiling for everything downstream.