What is a vector store in simple terms?

A vector store is a specialized database that stores numerical representations (embeddings) of data and lets you find similar items quickly. Instead of searching by keywords, you search by meaning -- the vector store finds items whose embeddings are closest to your query embedding. Think of it like a library where books are organized not by title or author, but by what they're *about*. When you ask a question, it finds the books with the most relevant content, regardless of the exact words used.

Do I always need a dedicated vector database?

Not necessarily! For small datasets (<100K vectors), a library like FAISS or HNSWlib embedded in your application may be more than enough. For PostgreSQL-centric stacks, the pgvector extension adds vector search without a separate service -- this is a great option if you're already running Postgres. Dedicated vector databases become valuable at scale (millions of vectors) or when you need operational features like replication, sharding, and managed backups. If you're building a weekend hackathon project, Chroma is perfect. If you're building Flipkart's product search, you'll want something more robust.

Vector store vs. traditional database -- what is the difference?

Traditional databases index discrete values (strings, numbers) using B-trees or hash maps and answer exact-match or range queries. Vector stores index high-dimensional floating-point arrays using ANN algorithms and answer similarity queries -- finding the closest vectors to a given query. They solve fundamentally different retrieval problems. A SQL query says "give me exactly this." A vector query says "give me things like this." You'll often need both in a production system.

Is a vector store the same as semantic search?

No, and this is a common confusion. A vector store is the **storage and retrieval component**. Semantic search is the **end-to-end pipeline** that includes an embedding model (to produce vectors), a vector store (to index and retrieve them), and optionally a re-ranker. The vector store is one building block within semantic search -- an important one, but not the whole story.

What happens to my vector store when I update my embedding model?

You must re-embed your entire corpus and rebuild the index. Vectors from different model versions exist in incompatible geometric spaces and cannot be meaningfully compared -- it's like trying to look up English words in a French dictionary. The standard production pattern is **blue-green re-indexing**: build a new index with updated embeddings, validate recall on an evaluation set, then swap the active index. For a corpus of 10M documents, budget 4-8 hours for re-embedding and 1-2 hours for re-indexing, depending on your compute setup.

How much memory does a vector store need?

A flat (uncompressed) index requires approximately: $$\text{Memory} = \text{num\_vectors} \times \text{dimensions} \times 4 \text{ bytes}$$ For example, 10 million 768-dimensional vectors need about 30 GB. HNSW adds roughly 20-30% overhead for graph edges, bringing it to ~39 GB. Compressed indices (PQ, scalar quantization) can reduce this by 4-16x at some cost to recall. So that same 10M-vector corpus could fit in ~2 GB with aggressive PQ -- but you'd see recall drop from 0.99 to perhaps 0.90-0.93.

RAG Pipeline

Vector Store in Machine Learning

Q: Vector store vs. traditional database -- what is the difference?

Traditional databases index discrete values (strings, numbers) using B-trees or hash maps and answer exact-match or range queries. Vector stores index high-dimensional floating-point arrays using ANN algorithms and answer similarity queries -- finding the closest vectors to a given query. They solve fundamentally different retrieval problems. A SQL query says "give me exactly this." A vector query says "give me things like this." You'll often need both in a production system.

Q: Is a vector store the same as semantic search?

No, and this is a common confusion. A vector store is the **storage and retrieval component**. Semantic search is the **end-to-end pipeline** that includes an embedding model (to produce vectors), a vector store (to index and retrieve them), and optionally a re-ranker. The vector store is one building block within semantic search -- an important one, but not the whole story.

Q: What happens to my vector store when I update my embedding model?

You must re-embed your entire corpus and rebuild the index. Vectors from different model versions exist in incompatible geometric spaces and cannot be meaningfully compared -- it's like trying to look up English words in a French dictionary. The standard production pattern is **blue-green re-indexing**: build a new index with updated embeddings, validate recall on an evaluation set, then swap the active index. For a corpus of 10M documents, budget 4-8 hours for re-embedding and 1-2 hours for re-indexing, depending on your compute setup.

Q: How much memory does a vector store need?

A flat (uncompressed) index requires approximately: $$\text{Memory} = \text{num\_vectors} \times \text{dimensions} \times 4 \text{ bytes}$$ For example, 10 million 768-dimensional vectors need about 30 GB. HNSW adds roughly 20-30% overhead for graph edges, bringing it to ~39 GB. Compressed indices (PQ, scalar quantization) can reduce this by 4-16x at some cost to recall. So that same 10M-vector corpus could fit in ~2 GB with aggressive PQ -- but you'd see recall drop from 0.99 to perhaps 0.90-0.93.

Let's start with the big picture. A vector store is a specialized storage and retrieval system built for one job: persist high-dimensional embedding vectors and answer similarity queries against them at blazing-low latency.

In ML systems, vector stores sit right at the intersection of learned representations and information retrieval. They accept dense vectors produced by encoder models and return the nearest neighbors under a chosen distance metric -- cosine similarity, inner product, or L2 distance.

But why did this become such a big deal? Because retrieval-augmented generation (RAG) proved that coupling LLMs with external knowledge stores dramatically reduces hallucination and keeps responses grounded in actual source material. That insight turned vector stores from a niche research tool into a production necessity practically overnight.

Today, vector stores underpin recommendation engines, semantic search, duplicate detection, and virtually every LLM-powered application that needs access to domain-specific data at inference time. From Flipkart's product search to Swiggy's restaurant recommendations -- if an ML system retrieves anything by meaning rather than keywords, there's almost certainly a vector store under the hood.

Concept Snapshot

What It Is: A purpose-built data store optimized for indexing, persisting, and querying high-dimensional vectors using approximate nearest neighbor (ANN) algorithms.
Category: RAG Pipeline
Complexity: Intermediate
Inputs / Outputs: Inputs: embedding vectors (typically 384-4096 dimensions) with optional metadata. Outputs: ranked list of nearest neighbor vectors with similarity scores.
System Placement: Sits between the embedding model (upstream) and the context assembler or re-ranker (downstream) in a RAG or retrieval pipeline.
Also Known As: vector database, vector index, embedding store, ANN index, similarity search engine
Typical Users: ML engineers, backend engineers, data engineers, search/retrieval specialists
Prerequisites: Embeddings, Distance metrics (cosine, L2, dot product), Basic indexing concepts
Key Terms: ANNHNSWIVFproduct quantizationrecall@kQPSembedding dimensionmetadata filtering

Why This Concept Exists

The Problem with Regular Databases

Standard relational and document databases index data by discrete attributes -- primary keys, B-tree ranges, and inverted text indices. That's great for queries like "find users where age > 25" or "search documents containing the word 'retrieval'." But none of these structures are designed for the geometric proximity queries that dense embedding vectors require.

Let's put a number on this. A brute-force scan over one million 768-dimensional vectors demands roughly 3 billion floating-point comparisons per query. At production QPS, that's simply not feasible.

Two Trends That Made This Urgent

The need for a dedicated abstraction grew alongside two parallel trends:

Trend 1: Representation learning matured. Transformer-based encoders could now compress sentences, images, and structured records into semantically meaningful vectors. We finally had embeddings worth storing.

Trend 2: These representations went from offline analytics to live, user-facing retrieval. Autocomplete, personalized feeds, conversational AI with external knowledge -- all of these need sub-100ms vector lookups.

The Algorithmic Foundation

Approximate nearest neighbor (ANN) algorithms -- locality-sensitive hashing, product quantization, and graph-based methods like HNSW -- provided the building blocks. BUT raw algorithms aren't enough for production. You also need persistence, replication, metadata filtering, and operational tooling.

That's exactly the gap vector stores fill. They package ANN algorithms into managed systems with all the production plumbing that individual index libraries like FAISS or HNSWlib don't provide on their own.

Key Takeaway: Vector stores exist because traditional databases can't do geometric proximity queries, and raw ANN libraries don't provide the persistence and operational features production systems need. Vector stores bridge that gap.

Core Intuition & Mental Model

The Core Promise

Here's the fundamental guarantee: given a query vector, a vector store can return an approximately correct set of nearest neighbors in sub-linear time -- typically $O(\log n)$ for graph-based indices -- even when your corpus contains hundreds of millions of vectors.

The word "approximately" is load-bearing. Every ANN index trades exactness for speed, and the recall-throughput curve is the central knob you'll be tuning for your entire career working with these systems.

What a Vector Store Does NOT Do

This is where I see a lot of confusion. A vector store finds geometrically close vectors, not necessarily semantically relevant ones. If the upstream embedding model produces poor representations, the vector store will faithfully return poor results at very high speed.

The boundary of responsibility is clear: the vector store owns geometry, not meaning.

A Useful Mental Model

Think of a vector store as a spatial file system. Just as a file system organizes bytes into a hierarchy optimized for path-based lookups, a vector store organizes floating-point arrays into a structure optimized for proximity-based lookups.

Both abstractions are inert -- they reflect the quality of what you put into them. Garbage embeddings in, garbage retrieval out. That was pretty simple, wasn't it?

Expert Note: When debugging poor retrieval quality, start with the embedding model, not the vector store. Nine times out of ten, the issue is upstream -- bad chunking, wrong model, or mismatched fine-tuning -- not the index itself.

Technical Foundations

The Math (Don't Worry, We'll Build Up to It)

Let's formalize what a vector store actually does. I'll explain the intuition first, then give you the notation.

Intuition: You have a big collection of vectors. Someone hands you a new query vector. You need to find the $k$ vectors in your collection that are closest to the query. And you need to do it fast.

Formally: A vector store implements an index over a collection $V = \{v_1, v_2, \ldots, v_n\}$ where each $v_i \in \mathbb{R}^d$ . Given a query vector $q \in \mathbb{R}^d$ and a positive integer $k$ , the store returns a set $S \subset V$ with $|S| = k$ that approximates the true $k$ -nearest neighbors under a distance function $d(\cdot, \cdot)$ .

Measuring Approximation Quality

How do we know if our "approximate" answer is any good? We use recall@k: the fraction of true top- $k$ neighbors present in the returned set $S$ .

A recall of 0.95 at $k = 10$ means that, on average, 9.5 of the 10 returned items are among the actual 10 closest vectors. That's usually good enough for production.

Distance Functions

Vector stores commonly support three distance functions:

Cosine similarity: $\text{sim}(q, v) = \frac{q \cdot v}{\|q\| \cdot \|v\|}$
L2 (Euclidean) distance: $d(q, v) = \|q - v\|_2$
Inner product (dot product): $d(q, v) = q \cdot v$

Which Metric Should You Use?

This is critical, and I've seen teams get it wrong more often than you'd expect. The choice of metric must match the training objective of the embedding model:

Models trained with contrastive losses on normalized vectors? Use cosine similarity.
Maximum inner product search (MIPS) problems where vector magnitudes carry information? Use inner product.
Models explicitly trained with L2 distance? Use L2.

Warning: Using the wrong distance metric is a silent killer. Your store will happily return results -- they'll just be the wrong results. No error, no warning, just degraded quality.

Internal Architecture

A production vector store typically consists of four subsystems: an ingestion layer that accepts vectors and metadata, an indexing engine that builds and maintains ANN data structures, a query engine that executes search and filtering, and a storage backend that handles persistence and replication. Let's walk through each one.

Vector Store in Machine Learning: Architecture, Use Cases & Tradeoffs Architecture — A directed flow from 'Embedding Model' -> 'Ingestion API' -> 'ANN Index' <-> 'Storage Backend', w...

Key Components

Ingestion Layer

Accepts vectors (and optional payload/metadata) via API or bulk import, validates dimensions, and routes to the indexing engine.

ANN Index Engine

Builds and maintains the core index structure (HNSW graph, IVF clusters, or product quantization codebooks). Handles incremental inserts and deletes.

Query Engine

Receives a query vector and optional metadata filters, traverses the index, and returns top-k results with scores. Manages concurrency and query planning.

Storage Backend

Persists vectors and index structures to disk. Handles write-ahead logging, snapshots, and replication for durability and high availability.

Metadata Filter Engine

Applies pre-filter or post-filter predicates on scalar metadata fields (e.g., category, timestamp) alongside vector similarity search.

Data Flow

Here's how data flows through the system:

Write Path: Embedding vectors arrive via the ingestion API -> validated and assigned internal IDs -> inserted into the ANN index (and WAL) -> persisted to storage.

Read Path: A query vector enters the query engine -> the ANN index is traversed (with optional metadata pre-filtering) -> top-k candidates are scored -> results are returned with similarity scores and payloads.

The write path and read path are largely independent, which is why most vector stores can handle concurrent reads and writes without heavy locking.

A directed flow from 'Embedding Model' -> 'Ingestion API' -> 'ANN Index' <-> 'Storage Backend', with a parallel 'Query API' -> 'ANN Index' -> 'Top-K Results' path.

How to Implement

Two Approaches to Implementation

Implementation patterns fall into two broad categories:

Option A: Embedded index library (FAISS, HNSWlib, ScaNN) -- you run the index directly in your application process. Great for prototyping and when you control the infra.

Option B: Managed vector database (Pinecone, Weaviate, Qdrant, Milvus, pgvector) -- a separate service handles everything. Better for production when you need replication, sharding, metadata filtering, or multi-tenancy.

The right choice depends on scale, operational maturity, and whether you need features beyond basic ANN search. For a startup in Bengaluru building their first RAG prototype, Chroma or pgvector might be perfect. For a platform serving millions of queries per day, you'll likely want Qdrant, Milvus, or Pinecone.

Cost Note: Pinecone's managed service starts around $70/month (~INR 5,900/month) for a basic pod. Self-hosted Qdrant or Milvus on a cloud VM can be cheaper but requires ops investment. FAISS is free but you own the infrastructure entirely.

FAISS — Build an IVF+PQ index and query17 lines

import faiss
import numpy as np

d = 768           # embedding dimension
nlist = 100       # number of IVF clusters
m = 64            # PQ sub-quantizers

# Train the index on a sample
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(training_vectors)  # shape (n_train, d)
index.add(corpus_vectors)       # shape (n_corpus, d)

# Search
index.nprobe = 10  # clusters to visit at query time
D, I = index.search(query_vectors, k=10)
# D: distances, I: indices of nearest neighbors

FAISS (Facebook AI Similarity Search) provides composable index types. IVF clusters the space for coarse quantization, then PQ compresses residual vectors for memory efficiency. The nprobe parameter controls the recall-throughput tradeoff: higher values improve recall at the cost of latency.

Qdrant — Upsert and query with metadata filters26 lines

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

client = QdrantClient(host="localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Upsert vectors with metadata
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=1, vector=embedding, payload={"source": "arxiv", "year": 2024}),
    ],
)

# Query with metadata filter
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    query_filter=Filter(must=[FieldCondition(key="source", match=MatchValue(value="arxiv"))]),
    limit=10,
)

Qdrant is an open-source vector database written in Rust. This example demonstrates the common pattern of combining vector similarity search with scalar metadata filtering — essential for multi-tenant RAG systems where retrieved documents must belong to the querying user's corpus.

Configuration Example12 lines

# Qdrant collection config (YAML equivalent)
collection_name: documents
vectors:
  size: 768
  distance: Cosine
hnsw_config:
  m: 16
  ef_construct: 128
optimizers_config:
  indexing_threshold: 20000
  memmap_threshold: 50000
replication_factor: 2

Common Implementation Mistakes

●
Choosing cosine similarity when the embedding model was trained with dot-product objectives (or vice versa) -- mismatched metrics silently degrade recall. I cannot stress this enough: always check the model card before configuring your distance metric.
●
Skipping index parameter tuning (e.g., HNSW ef_construction, IVF nprobe) and accepting library defaults that may not suit your recall/latency requirements. The defaults are designed for generality, not your specific workload.
●
Storing raw text alongside vectors in the vector store instead of keeping a lightweight ID reference and resolving content from a separate document store -- this bloats memory and slows index operations. Your vector store should be lean.
●
Failing to implement versioned re-indexing when the embedding model changes -- old and new embeddings are not comparable in the same vector space. This one can silently wreck your entire retrieval pipeline.
●
Ignoring metadata filtering performance: post-filtering can return fewer than $k$ results when the filter is highly selective; pre-filtering requires index support. We'll cover this more in the failure modes section.

When Should You Use This?

Use When

Your application requires sub-100ms retrieval over >100K embedding vectors
You are building a RAG system that needs to ground LLM responses in domain-specific documents
Semantic similarity (rather than keyword matching) is the primary retrieval signal
You need metadata-filtered vector search (e.g., per-tenant or per-category retrieval, like serving different restaurant catalogs per city in a Zomato-like app)
Your embedding dimensions exceed what full-text search or traditional inverted indices can handle efficiently

Avoid When

Your corpus is small (<10K items) and brute-force exact search meets latency requirements -- the added complexity of an ANN index is unnecessary. Seriously, just use a flat index or even NumPy.
Keyword or lexical matching is sufficient for your use case (consider BM25 or Elasticsearch instead)
You need strong transactional guarantees (ACID) that most vector databases do not yet provide
Your retrieval is based on structured attributes (category, price range, date) rather than semantic similarity -- a relational database with standard indices will be simpler and faster

Key Tradeoffs

The Fundamental Tradeoff: Recall vs. Throughput

Higher recall (more accurate nearest neighbors) requires visiting more index nodes, which increases latency. For most production systems, a recall@10 of 0.95-0.99 is the sweet spot. Below 0.90, retrieval quality degrades noticeably in downstream tasks like RAG -- your LLM starts missing critical context.

The Second Axis: Memory vs. Accuracy

Compressed indices (PQ, scalar quantization) use 4-16x less memory but reduce recall compared to flat or HNSW-only indices. For example, a 100M-vector corpus at 768 dimensions takes ~300 GB as a flat index. With PQ (8-byte codes), that drops to ~800 MB -- a 375x reduction. BUT your recall might drop from 0.99 to 0.92.

The right balance depends on your budget and quality requirements. For a cost-sensitive deployment serving a Flipkart product catalog, IVF+PQ might save you $2,000/month (~INR 1.68 lakh/month) in memory costs compared to a full HNSW index.

Rule of Thumb: Start with HNSW (best recall), measure your memory footprint, and only move to compressed indices if the cost is prohibitive.

Alternatives & Comparisons

Semantic Search (Full Pipeline)

Semantic search is a higher-level abstraction that includes the embedding model, vector store, and optional re-ranking. A vector store is one component within a semantic search pipeline. If you need end-to-end retrieval rather than just the storage layer, you may want a managed semantic search service. Think of it this way: a vector store is the engine, semantic search is the whole car.

Hybrid Search

Hybrid search combines dense vector retrieval with sparse lexical retrieval (BM25). When keyword precision matters alongside semantic recall -- for example, matching specific product codes, medical terms, or Indian pin codes -- hybrid search outperforms pure vector search alone. Most production systems at scale end up adopting hybrid search eventually.

Pros, Cons & Tradeoffs

Advantages

Sub-linear query time over millions to billions of vectors, enabling real-time similarity search at scale. We're talking 5-10ms queries over 100M vectors -- 1000x faster than brute force.
Native support for the geometric operations that dense embeddings require -- no impedance mismatch with SQL or document stores. The data structure matches the workload.
Metadata filtering allows multi-tenant, temporally scoped, or attribute-constrained retrieval in a single query. Essential for SaaS applications where each customer's data must be isolated.
Mature ecosystem of open-source options (FAISS, Milvus, Qdrant, Weaviate, Chroma) and managed services (Pinecone, Zilliz). You have real choices, not vendor lock-in.
Index types (HNSW, IVF, PQ) can be composed to optimize for different memory-recall-latency profiles. You can literally mix and match like building blocks.

Disadvantages

Approximate retrieval means some relevant results will be missed -- no guarantee of exact nearest neighbors. You're always operating with a recall < 1.0.
Index construction can be expensive: building an HNSW graph over 100M vectors requires significant CPU time (hours) and peak memory that can be 2-3x the final index size.
Most vector databases lack mature transaction support, schema evolution, and joins that relational databases provide. Don't try to use them as your primary database.
Embedding model upgrades require full re-indexing -- old and new embeddings are incompatible in the same space. For 1B vectors, this can take days and cost hundreds of dollars in compute.
Operational complexity increases with scale: sharding, replication, and compaction introduce infrastructure overhead that requires dedicated SRE attention.

Implement blue-green re-indexing: build a new collection with updated embeddings, validate recall on an evaluation set, swap the alias once validation passes, then delete the old collection. Never do in-place model swaps.

Placement in an ML System

Where Does It Sit in the Pipeline?

In a RAG pipeline, the vector store sits after the embedding model has converted chunks into vectors and before the re-ranker or context assembler that formats retrieved passages for the LLM.

For recommendation systems (think Spotify's Discover Weekly or Flipkart's "Similar Products"), it sits after the user/item encoder and before the scoring/ranking layer.

The vector store is the primary bottleneck for retrieval latency and directly determines the quality ceiling of downstream generation or ranking. If the vector store misses a relevant document, no amount of re-ranking or prompt engineering can recover it.

Key Insight: The vector store is the gatekeeper of your retrieval pipeline. Everything downstream can only work with what it returns. This is why recall@k matters so much.

Document Loader Text Chunker Embedding Model Vector Store Semantic Search Hybrid Search Re-Ranker Context Assembler

Pipeline Stage

Retrieval / Serving

Upstream

Embedding Model
Document Loader
Text Chunker

Downstream

Re-Ranker
Context Assembler
LLM (Generator)

Scaling Bottlenecks

Where It Gets Tight

Primary bottlenecks are memory (index size grows linearly with corpus size and embedding dimension) and write throughput during bulk ingestion.

At query time, HNSW scales well but becomes memory-bound. IVF+PQ trades recall for memory efficiency. Sharding across nodes introduces network latency for distributed search -- typically adding 2-5ms per cross-shard hop.

Some concrete numbers: a single Qdrant node can handle ~5,000 QPS for 1M vectors with HNSW. Scaling to 100M vectors usually requires 4-8 shards, bringing per-query latency from ~5ms to ~15-20ms.

Production Case Studies

SpotifyMusic Streaming

Spotify uses approximate nearest neighbor search over user and track embeddings to power its recommendation engine. Embeddings are trained via collaborative filtering and content-based models, then indexed for real-time retrieval of candidate tracks during personalized playlist generation. With 600M+ users and 100M+ tracks, brute-force search was never an option.

Outcome:

ANN-based retrieval reduced candidate generation latency from seconds (exhaustive search) to single-digit milliseconds -- roughly a 1000x speedup -- enabling real-time personalization at scale for 600M+ users.

NotionProductivity Software

Notion built its AI Q&A feature using a RAG architecture with a vector store backing semantic search over user workspace content. Documents are chunked, embedded, and indexed per-workspace. At query time, the user's question is embedded and used to retrieve relevant chunks, which are then fed to an LLM for answer generation. The per-workspace isolation is a textbook example of metadata-filtered multi-tenant retrieval.

Outcome:

The vector store enables sub-200ms retrieval across millions of documents per workspace, with metadata filtering ensuring retrieval is scoped to the user's accessible content. This powers Notion AI's ability to answer questions about your own notes and docs.

FlipkartE-commerce (India)

Flipkart employs vector similarity search for visual product recommendations and duplicate product detection. Product images are encoded into embeddings via a fine-tuned vision transformer, stored in a vector index, and queried to surface visually similar items or flag catalog duplicates. With millions of product listings and thousands of new additions daily, maintaining index freshness is a key engineering challenge.

Outcome:

Reduced duplicate product listings by approximately 30% in high-volume categories, improving catalog quality and reducing customer confusion. This directly impacts GMV by reducing return rates on misidentified products.

Tooling & Ecosystem

FAISS

C++ / PythonOpen Source

Meta's library for efficient similarity search. Supports GPU acceleration, composable index types (IVF, PQ, HNSW, flat). Best for embedding in application code without a separate database service.

Milvus

Go / C++Open Source

Cloud-native vector database with distributed architecture, multiple index types, hybrid search, and multi-tenancy. Backed by Zilliz. Best for large-scale production deployments.

Qdrant

RustOpen Source

High-performance vector database written in Rust. Supports payload-based filtering, quantization, and distributed deployment. Strong filtering capabilities for multi-tenant RAG.

Pinecone

Commercial

Fully managed vector database service. Zero operational overhead — handles indexing, sharding, and replication. Best for teams that want to avoid infrastructure management.

Weaviate

GoOpen Source

Open-source vector database with built-in vectorization modules, GraphQL API, and hybrid (vector + BM25) search. Good for applications that want integrated embedding generation.

pgvector

COpen Source

PostgreSQL extension for vector similarity search. Ideal when you want vector search alongside relational data in a single database without introducing a separate system.

Chroma

Python / RustOpen Source

Lightweight, developer-friendly embedding database. Low setup friction for prototyping and small-to-medium workloads. Wraps HNSWlib under the hood.

Research & References

Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs

Malkov & Yashunin (2018)IEEE TPAMI, Vol. 42, No. 4

Introduced the HNSW algorithm — a multi-layer proximity graph that achieves logarithmic search complexity. Now the dominant index type in most vector databases.

Product Quantization for Nearest Neighbor Search

Jegou, Douze & Schmid (2011)IEEE TPAMI, Vol. 33, No. 1

Decomposed high-dimensional vectors into sub-vector quantization, enabling compact codes and efficient distance estimation that scales to billions of vectors.

Billion-Scale Similarity Search with GPUs

Johnson, Douze & Jegou (2021)IEEE Transactions on Big Data

Described GPU-optimized k-selection and brute-force/IVF search implementations forming the algorithmic basis of the FAISS library.

The Faiss Library

Douze, Guzhva, Deng, Johnson, Szilvasy, Mazare, Lomeli, Hosseini & Jegou (2024)arXiv preprint

Comprehensive systems reference for FAISS covering all index types, composability, and benchmarks.

Milvus: A Purpose-Built Vector Data Management System

Wang, Yi, Guo, Li, Wang, Liu, Zhao, Li & Xu (2021)ACM SIGMOD 2021

Presented Milvus as a cloud-native vector database with heterogeneous computing support and hybrid scalar+vector queries.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Lewis, Perez, Piktus et al. (2020)NeurIPS 2020

Established the RAG paradigm — combining a dense passage retriever backed by a FAISS index with a seq2seq generator — that drives most production LLM+vector-store architectures.

Survey of Vector Database Management Systems

Pan, Wang & Li (2024)The VLDB Journal

Comprehensive survey of 20+ vector DBMSs analyzing indexing, storage, and query processing techniques across commercial and open-source systems.

Accelerating Large-Scale Inference with Anisotropic Vector Quantization

Guo, Sun, Lindgren, Geng, Simcha, Chern & Kumar (2020)ICML 2020

Introduced anisotropic quantization that penalizes the parallel component of residual error, achieving state-of-the-art recall-vs-speed tradeoffs for MIPS; underlies Google's ScaNN library.

Interview & Evaluation Perspective

Common Interview Questions

●
How would you design a vector store for a RAG system serving 10M documents?
●
What is the difference between HNSW and IVF indexing? When would you choose one over the other?
●
How do you handle embedding model upgrades when you have billions of indexed vectors?
●
Explain the recall-latency tradeoff in approximate nearest neighbor search.
●
How would you implement metadata filtering alongside vector search?

Key Points to Mention

●
Recall@k is the primary quality metric -- always quantify the recall you need before choosing an index type. Lead with numbers, not vibes.
●
HNSW provides the best recall-throughput tradeoff for most workloads but is memory-intensive (expect ~1.3x the raw vector size); IVF+PQ trades recall for memory efficiency (as low as 0.01x the raw size).
●
Distance metric must match the embedding model's training objective (cosine, dot product, or L2) -- this is a hard requirement, not a preference.
●
Blue-green re-indexing is the production pattern for embedding model upgrades. Never do in-place model swaps.
●
Pre-filtering vs. post-filtering has major implications for result quality and latency. Pre-filtering is almost always better but requires index support.

Pitfalls to Avoid

●
Claiming vector stores provide exact nearest neighbors -- they are approximate by design. The clue is in the name: Approximate Nearest Neighbor.
●
Ignoring the memory footprint: a naive flat index of 100M x 768-dim vectors requires ~300 GB of RAM. That's a $2,400/month (~INR 2 lakh/month) cloud instance just for the index.
●
Conflating the vector store with the entire retrieval pipeline -- it is one component, not the whole system. Always clarify the boundary.
●
Assuming one index type fits all workloads without discussing scale, latency, and recall requirements. HNSW is great, but it's not always the answer.

Senior-Level Expectation

A senior candidate should be able to discuss the full lifecycle: embedding model selection, chunking strategy, index type selection with quantitative justification (not just "I'd use HNSW"), metadata schema design, re-indexing strategy (blue-green with validation gates), monitoring (recall regression, latency P99, QPS headroom), and capacity planning (memory per vector, storage growth rate, cost projections). The ability to reason about cost-performance tradeoffs -- especially under budget constraints common in Indian startups -- separates senior engineers from mid-level ones.

Summary

Let's recap what we've covered:

A vector store is a purpose-built system for indexing, persisting, and querying high-dimensional embedding vectors using approximate nearest neighbor algorithms. It's the retrieval backbone of modern ML systems.
The core tradeoff is recall vs. throughput: every ANN index sacrifices exact accuracy for sub-linear query time. Tuning this balance -- not just once, but continuously -- is the primary operational concern.
HNSW dominates for its recall-throughput profile (5-10ms queries at 0.95+ recall). IVF+PQ is preferred when memory is constrained (4-16x compression). Both can be composed in systems like FAISS.
Always match the distance metric to the embedding model's training objective -- cosine, dot product, and L2 are not interchangeable. Getting this wrong is the #1 silent failure mode.
Vector stores own geometry, not semantics: retrieval quality is bounded by the upstream embedding model's representation quality. If your results are bad, look upstream first.

A vector store is the bridge between learned representations and real-time retrieval. It enables ML systems to leverage dense embeddings at production scale by providing the indexing infrastructure that brute-force search simply cannot. Moving on to the next block in the pipeline, remember: the vector store determines the quality ceiling for everything downstream.

Concept Snapshot

Why This Concept Exists

The Problem with Regular Databases

Two Trends That Made This Urgent

The Algorithmic Foundation

Core Intuition & Mental Model

The Core Promise

What a Vector Store Does NOT Do

A Useful Mental Model

Technical Foundations

The Math (Don't Worry, We'll Build Up to It)

Measuring Approximation Quality

Distance Functions

Which Metric Should You Use?

Internal Architecture

Key Components

Data Flow

How to Implement

Two Approaches to Implementation

Common Implementation Mistakes

When Should You Use This?

Use When

Avoid When

Key Tradeoffs

The Fundamental Tradeoff: Recall vs. Throughput

The Second Axis: Memory vs. Accuracy

Alternatives & Comparisons

Pros, Cons & Tradeoffs

Advantages

Disadvantages

Failure Modes & Debugging

Silent recall degradation

Metric mismatch

Metadata filter starvation

Memory exhaustion on index load

Stale index after embedding model update

Placement in an ML System

Where Does It Sit in the Pipeline?

Pipeline Stage

Upstream

Downstream

Scaling Bottlenecks

Production Case Studies

Tooling & Ecosystem

Research & References

Interview & Evaluation Perspective

Common Interview Questions

Key Points to Mention

Pitfalls to Avoid

Senior-Level Expectation

Summary

Related Blocks & Further Reading

Related ML Blocks

Further Reading