What is model serving in simple terms?

Model serving is the process of making a trained ML model available to applications so they can get predictions on demand. Think of it like this: training a model is like writing a recipe. Model serving is opening a restaurant that cooks that recipe for thousands of customers simultaneously, quickly, reliably, and at a cost that keeps the lights on. In practice, this means loading the model into GPU memory, wrapping it in an API endpoint (REST or gRPC), handling concurrent requests efficiently (batching, queuing), scaling up and down based on traffic, and monitoring everything to ensure predictions are fast and correct. The simplest form of model serving is a Flask app with `model.predict()`. The most sophisticated form involves dedicated inference servers (Triton, vLLM), GPU clusters with auto-scaling, continuous batching for LLMs, and canary deployment strategies. Where you sit on this spectrum depends on your scale, latency requirements, and team expertise.

How do I choose between vLLM, Triton, and TorchServe?

The choice depends primarily on what kind of models you're serving: **vLLM**: Choose this if you're serving **large language models** (7B+ parameters) and need maximum throughput with continuous batching and PagedAttention. It's purpose-built for autoregressive text generation. If you're building a chatbot, code assistant, or any LLM-powered application, vLLM is the default choice in 2025-2026. **NVIDIA Triton**: Choose this if you're serving **multiple models across different frameworks** (PyTorch, TensorFlow, ONNX, XGBoost) from a single infrastructure. Triton excels at multi-model serving, model ensembles, and GPU co-location. If your team maintains 10+ models of varying types, Triton is the industry standard. **TorchServe**: Choose this if you're a **PyTorch-only shop** that wants a simpler, well-supported solution with model versioning and built-in batching. TorchServe has lower operational complexity than Triton but only supports PyTorch models. Many organizations use **vLLM for LLMs and Triton for everything else**. That's a perfectly reasonable architecture.

How much does model serving cost in production?

Costs vary enormously based on model size, traffic volume, and hardware choice. Here are some concrete benchmarks: **Small models (BERT, XGBoost, <1GB)**: An NVIDIA T4 GPU (~INR 22,000-28,000/month or $270-$340/month on cloud) can serve these at 1,000-5,000 QPS. CPU serving on a $50/month VM is also viable for <100 QPS. **Medium models (7-8B LLMs, INT4 quantized)**: A single A100-40GB (~INR 60,000-80,000/month or $720-$960/month) can serve at ~50-80 concurrent requests. With AWQ quantization, an 8B model fits on a T4 for ~INR 25,000/month ($300/month). **Large models (70B+ LLMs)**: Requires 2-4 A100-80GB GPUs in tensor parallel mode. Cost: ~INR 2.5-7 lakh/month ($3,000-$8,400/month). INT4 quantization can halve this. **Managed platforms**: SageMaker adds ~30-50% premium over raw GPU cost but eliminates operational overhead. An ml.g5.xlarge endpoint runs ~INR 86,500/month ($1,030/month). The most impactful cost optimizations are: (1) quantization (2x-4x memory/cost reduction), (2) batch inference for pre-computable predictions (10-100x cheaper), and (3) auto-scaling to avoid paying for idle GPUs.

What is dynamic batching and why does it matter?

Dynamic batching is the technique of accumulating individual inference requests that arrive at different times into a single batch before sending them to the GPU together. It matters because **GPUs are massively parallel processors** -- running inference on a batch of 32 inputs is only marginally slower than running it on a single input, but you've served 32x the requests. Without batching, if your model takes 5ms for a single inference and you're getting 1,000 requests/second, you need 5 seconds of GPU time per second -- impossible. With batch size 32, each batch of 32 takes ~8ms, so you can process 4,000 requests/second on a single GPU. The trade-off is **latency**: the batcher must wait for requests to accumulate, adding a configurable delay (typically 5-50ms). This delay is the price you pay for throughput. For an API with a 200ms latency budget, adding 20ms of batching delay is barely noticeable but can 10x your throughput. For LLMs, **continuous batching** (from the Orca paper) is even more important: it operates at the granularity of individual decode iterations rather than waiting for all sequences in a batch to complete, preventing short sequences from being blocked by long ones.

What is the difference between real-time and batch model serving?

**Real-time (online) serving** processes individual requests synchronously with strict latency requirements. The application sends a request, blocks, and expects a response within milliseconds to seconds. Examples: fraud detection at Razorpay (must decide before the transaction completes), search ranking at Flipkart (must return before the user loses patience), or a chatbot generating responses. **Batch (offline) serving** processes large datasets asynchronously with no latency requirement. A scheduled job runs predictions over millions of records, stores the results, and applications read from the stored results later. Examples: nightly recommendation pre-computation, weekly customer churn scoring, daily content moderation of uploaded images. The key differences: | Aspect | Real-time | Batch | |--------|-----------|-------| | Latency | <100ms to <5s | Minutes to hours | | Cost per prediction | Higher (dedicated GPU) | Lower (spot instances) | | Infrastructure | Always-on endpoints | Ephemeral compute | | Features | Must be fresh | Can be pre-computed | Most production systems use **both**: pre-compute what you can in batch (cheap), personalize at request time with a real-time model (expensive but necessary). This hybrid pattern can reduce serving costs by 50-80%.

How do I auto-scale GPU serving endpoints?

Auto-scaling GPU serving endpoints is different from scaling stateless web services because of two factors: **model loading time** (30-120 seconds) and **GPU-specific metrics** that standard CPU/memory-based scaling doesn't capture. The standard approach on Kubernetes is the **Horizontal Pod Autoscaler (HPA)** with custom GPU metrics exposed via the Prometheus adapter: 1. **Metric selection**: Scale on `queue_depth` (number of waiting requests) or `gpu_utilization`, not CPU or memory. For vLLM, the `vllm_num_requests_waiting` metric is ideal. 2. **Asymmetric scaling**: Scale up aggressively (0-second stabilization, add up to 4 pods per minute) to handle traffic spikes quickly. Scale down conservatively (10-minute stabilization window, remove 1 pod at a time) to avoid thrashing. 3. **Minimum replicas**: Never scale to zero for latency-critical services -- the cold start penalty (30-120 seconds) is unacceptable. Maintain a minimum that handles normal traffic. 4. **Model weight caching**: Use ReadWriteMany (RWX) persistent volumes so new pods share pre-downloaded model weights instead of each downloading them individually (saves 5-10 minutes per pod). 5. **Predictive scaling**: For workloads with known patterns (e.g., Swiggy's lunch and dinner peaks), configure scheduled scaling rules in addition to reactive auto-scaling.

How does quantization affect model serving quality and performance?

Quantization reduces the numerical precision of model weights and activations, trading a small amount of quality for significant performance gains. Here's the landscape: **FP32 -> FP16 (half-precision)**: 2x memory reduction, 1.5-2x throughput improvement, typically INT8**: Additional 2x memory reduction (4x total from FP32), 1.5-2x additional throughput. Post-training quantization (PTQ) achieves INT4 (GPTQ, AWQ)**: 4x memory reduction from FP16. For LLMs, this is the sweet spot for cost-sensitive deployments: a 7B model drops from 14 GB to 3.5 GB, fitting on a single T4 (16 GB). Quality degradation is typically 2-5% depending on the task and quantization method. **The practical impact**: An INT4-quantized Llama 3.1 8B model on a T4 GPU (~INR 25,000/month) can match the throughput of an FP16 model on an A100 (~INR 70,000/month). That's a 2.8x cost reduction. For a startup serving a chatbot, this difference can determine whether the product is economically viable. > Always benchmark quantized models on your specific evaluation set before deploying. Aggregate metrics like perplexity can mask task-specific degradation.

What is PagedAttention and why is it important for LLM serving?

**PagedAttention** is a memory management technique introduced in the vLLM paper (SOSP 2023) that manages the KV cache -- the intermediate attention states that LLMs accumulate during autoregressive generation -- using ideas borrowed from operating system virtual memory management. The problem it solves: in naive LLM serving, the KV cache for each request is allocated as a single contiguous block of GPU memory. Since the final output length is unknown at the start, systems must pre-allocate for the maximum possible length. This causes **internal fragmentation** (wasted space within allocated blocks) and **external fragmentation** (gaps between blocks). Together, these waste approximately 60-80% of GPU memory. PagedAttention divides the KV cache into fixed-size **pages** (typically 16 tokens each) that can be stored anywhere in GPU memory -- they don't need to be contiguous. Pages are allocated on demand as the sequence generates more tokens and freed immediately when the sequence completes. This is exactly analogous to how OS virtual memory maps logical pages to physical frames. The result: near-zero memory fragmentation, which means you can fit **2-4x more concurrent sequences** in the same GPU memory. For a production LLM endpoint, this directly translates to 2-4x higher throughput at the same hardware cost. At scale, this is worth millions of dollars annually in GPU savings.

Deployment

Model Serving in Machine Learning

Q: What is dynamic batching and why does it matter?

Dynamic batching is the technique of accumulating individual inference requests that arrive at different times into a single batch before sending them to the GPU together. It matters because **GPUs are massively parallel processors** -- running inference on a batch of 32 inputs is only marginally slower than running it on a single input, but you've served 32x the requests. Without batching, if your model takes 5ms for a single inference and you're getting 1,000 requests/second, you need 5 seconds of GPU time per second -- impossible. With batch size 32, each batch of 32 takes ~8ms, so you can process 4,000 requests/second on a single GPU. The trade-off is **latency**: the batcher must wait for requests to accumulate, adding a configurable delay (typically 5-50ms). This delay is the price you pay for throughput. For an API with a 200ms latency budget, adding 20ms of batching delay is barely noticeable but can 10x your throughput. For LLMs, **continuous batching** (from the Orca paper) is even more important: it operates at the granularity of individual decode iterations rather than waiting for all sequences in a batch to complete, preventing short sequences from being blocked by long ones.

Q: What is the difference between real-time and batch model serving?

**Real-time (online) serving** processes individual requests synchronously with strict latency requirements. The application sends a request, blocks, and expects a response within milliseconds to seconds. Examples: fraud detection at Razorpay (must decide before the transaction completes), search ranking at Flipkart (must return before the user loses patience), or a chatbot generating responses. **Batch (offline) serving** processes large datasets asynchronously with no latency requirement. A scheduled job runs predictions over millions of records, stores the results, and applications read from the stored results later. Examples: nightly recommendation pre-computation, weekly customer churn scoring, daily content moderation of uploaded images. The key differences: | Aspect | Real-time | Batch | |--------|-----------|-------| | Latency | <100ms to <5s | Minutes to hours | | Cost per prediction | Higher (dedicated GPU) | Lower (spot instances) | | Infrastructure | Always-on endpoints | Ephemeral compute | | Features | Must be fresh | Can be pre-computed | Most production systems use **both**: pre-compute what you can in batch (cheap), personalize at request time with a real-time model (expensive but necessary). This hybrid pattern can reduce serving costs by 50-80%.

Q: How do I auto-scale GPU serving endpoints?

Auto-scaling GPU serving endpoints is different from scaling stateless web services because of two factors: **model loading time** (30-120 seconds) and **GPU-specific metrics** that standard CPU/memory-based scaling doesn't capture. The standard approach on Kubernetes is the **Horizontal Pod Autoscaler (HPA)** with custom GPU metrics exposed via the Prometheus adapter: 1. **Metric selection**: Scale on `queue_depth` (number of waiting requests) or `gpu_utilization`, not CPU or memory. For vLLM, the `vllm_num_requests_waiting` metric is ideal. 2. **Asymmetric scaling**: Scale up aggressively (0-second stabilization, add up to 4 pods per minute) to handle traffic spikes quickly. Scale down conservatively (10-minute stabilization window, remove 1 pod at a time) to avoid thrashing. 3. **Minimum replicas**: Never scale to zero for latency-critical services -- the cold start penalty (30-120 seconds) is unacceptable. Maintain a minimum that handles normal traffic. 4. **Model weight caching**: Use ReadWriteMany (RWX) persistent volumes so new pods share pre-downloaded model weights instead of each downloading them individually (saves 5-10 minutes per pod). 5. **Predictive scaling**: For workloads with known patterns (e.g., Swiggy's lunch and dinner peaks), configure scheduled scaling rules in addition to reactive auto-scaling.

Q: How does quantization affect model serving quality and performance?

Quantization reduces the numerical precision of model weights and activations, trading a small amount of quality for significant performance gains. Here's the landscape: **FP32 -> FP16 (half-precision)**: 2x memory reduction, 1.5-2x throughput improvement, typically INT8**: Additional 2x memory reduction (4x total from FP32), 1.5-2x additional throughput. Post-training quantization (PTQ) achieves INT4 (GPTQ, AWQ)**: 4x memory reduction from FP16. For LLMs, this is the sweet spot for cost-sensitive deployments: a 7B model drops from 14 GB to 3.5 GB, fitting on a single T4 (16 GB). Quality degradation is typically 2-5% depending on the task and quantization method. **The practical impact**: An INT4-quantized Llama 3.1 8B model on a T4 GPU (~INR 25,000/month) can match the throughput of an FP16 model on an A100 (~INR 70,000/month). That's a 2.8x cost reduction. For a startup serving a chatbot, this difference can determine whether the product is economically viable. > Always benchmark quantized models on your specific evaluation set before deploying. Aggregate metrics like perplexity can mask task-specific degradation.

Model serving is the discipline of taking a trained machine learning model and making it available to applications at production scale -- reliably, with low latency, and at a cost that doesn't bankrupt your organization.

It sounds deceptively simple. You have a model, you wrap it in an API, done. But the moment real traffic hits, everything changes. You need to think about batching, GPU memory management, model versioning, auto-scaling, canary rollouts, and the cold reality that a 200ms P99 latency target with a 7B-parameter model on a single GPU is an engineering problem, not a deployment checklist.

Model serving sits at the very end of the ML pipeline, but it's where all the upstream work -- data collection, feature engineering, training, evaluation -- either delivers value or doesn't. A model that can't serve predictions fast enough, or that costs too much to run, might as well not exist. From Flipkart's real-time product recommendations to Swiggy's order-assignment optimization running 5,000+ predictions per second, every user-facing ML application depends on a well-engineered serving layer.

This guide covers the full landscape: from classical serving frameworks like TensorFlow Serving and TorchServe, through GPU-optimized engines like NVIDIA Triton and vLLM, to managed platforms like AWS SageMaker and Google Vertex AI. We'll dive into model optimization techniques (quantization, distillation, pruning), auto-scaling strategies, batch vs. real-time trade-offs, and the failure modes that will bite you at 3 AM.

Concept Snapshot

What It Is: The infrastructure and systems layer responsible for loading trained ML models into memory, accepting inference requests, executing predictions, and returning results at production-grade latency, throughput, and reliability.
Category: Deployment
Complexity: Advanced
Inputs / Outputs: Inputs: trained model artifacts (weights, configs, tokenizers) + inference requests (features, text, images). Outputs: predictions (scores, classifications, generated text) with latency guarantees.
System Placement: Sits after model training and model registry (upstream) and before metrics collection, logging, and downstream application logic (downstream) in the ML pipeline.
Also Known As: inference serving, prediction serving, model deployment, inference server, model endpoint, serving infrastructure
Typical Users: ML Engineers, MLOps Engineers, Platform Engineers, SREs, Backend Engineers
Prerequisites: Model training basics, REST/gRPC APIs, Docker and containerization, Basic GPU concepts, Load balancing fundamentals
Key Terms: inference latencythroughput (QPS)batch inferencedynamic batchingmodel warm-upKV cachequantizationPagedAttentioncontinuous batchingmodel artifact

Why This Concept Exists

The Gap Between Training and Value

Here's a stat that should make every data scientist uncomfortable: according to Gartner, over 85% of ML projects fail to reach production. The single biggest reason? The serving layer. Teams invest months training a model that achieves 0.92 AUC on an evaluation set, then discover that putting it behind an API that handles 10,000 requests per second with sub-100ms latency is an entirely different engineering discipline.

Training is a batch, offline process. You can afford to wait hours. Serving is an online, latency-sensitive process where every millisecond matters to user experience and revenue. A recommendation model at Flipkart that takes 500ms to respond might as well not exist -- the user has already scrolled past.

Why Can't We Just Use Flask?

This is genuinely one of the most common questions from teams deploying their first model. And the answer is: you can, for a prototype. But Flask (or FastAPI, or any generic web framework) doesn't give you:

Dynamic batching: Accumulating individual requests into batches that exploit GPU parallelism. A batch of 32 inferences on a GPU is only marginally slower than a batch of 1, but you've served 32x the requests.
Model versioning and hot-swapping: Loading a new model version without downtime while the old version is still serving traffic.
GPU memory management: Efficiently sharing GPU memory across concurrent requests, especially for LLMs where the KV cache can consume gigabytes per request.
Health checking and graceful degradation: Automatically routing traffic away from unhealthy replicas.
Multi-model serving: Running dozens of models on the same GPU, each with different resource requirements.

Dedicated serving frameworks solve all of these problems. That's why they exist.

The LLM Inflection Point

The explosion of large language models in 2023-2025 fundamentally changed model serving. Traditional serving assumed models were relatively small (a few hundred MB), inference was fast (single-digit milliseconds), and CPU was often sufficient. LLMs shattered all three assumptions:

A 7B-parameter model in FP16 needs ~14 GB of GPU memory just for weights.
Autoregressive generation is inherently sequential -- each token depends on the previous one.
The KV cache grows linearly with sequence length, creating dynamic memory pressure that traditional serving frameworks were never designed to handle.

This is why purpose-built LLM serving engines like vLLM (with PagedAttention) and NVIDIA Triton (with TensorRT-LLM backend) emerged. They solve memory management problems that generic serving frameworks simply don't address.

Key Takeaway: Model serving exists because the gap between "model works in a notebook" and "model serves millions of users reliably" requires specialized infrastructure that generic web frameworks can't provide. The LLM era has only widened this gap.

Core Intuition & Mental Model

The Restaurant Kitchen Analogy

Think of model serving like running a high-volume restaurant kitchen. The trained model is your recipe. The serving infrastructure is everything else: the kitchen equipment, the line cooks, the expeditor who coordinates orders, and the system that decides how to batch similar orders together.

A single customer walks in and orders a pizza? Easy -- one cook, one oven, no coordination needed. That's your Flask prototype. But when 500 customers order simultaneously, you need an expeditor (the serving framework) who batches similar orders (dynamic batching), assigns them to available ovens (GPU scheduling), monitors cook health (health checks), and ensures no customer waits too long (latency SLOs).

The model itself is just the recipe. The serving layer is what makes the restaurant actually work.

The Two Fundamental Modes

Every serving system ultimately operates in one of two modes, and understanding this distinction is crucial:

Real-time (online) serving: The application sends a request and blocks until the response arrives. Latency matters enormously -- typically P99 < 100ms for traditional ML models, P99 < 2-5 seconds for LLM generation. This is what powers recommendation widgets, fraud detection, and chatbots.

Batch (offline) serving: Predictions are computed in bulk over a dataset, results stored for later consumption. Latency doesn't matter much -- you optimize for throughput and cost. This powers daily email recommendations, pre-computed search rankings, and periodic risk scoring.

Many production systems use both. Flipkart might pre-compute category-level recommendations in batch (cheap, done overnight) and then personalize them with a real-time model at request time (expensive, but only for the top candidates).

The Cost-Latency-Quality Triangle

Every serving decision involves three competing objectives:

Lower latency requires more (or better) hardware, which costs more.
Lower cost means fewer GPUs, which forces you to optimize models (quantization, distillation) at some quality loss.
Higher quality means larger models, which need more memory and compute, increasing both latency and cost.

You cannot optimize all three simultaneously. The art of model serving is finding the point on this triangle that matches your business requirements. A Zerodha fraud detection model might sacrifice some quality for ultra-low latency. A PhonePe recommendation model might accept higher latency for better quality. An IRCTC batch scoring job might sacrifice latency entirely for minimal cost.

Technical Foundations

Formalizing Serving Performance

Let's put some math behind the intuition. A model serving system $S$ takes inference requests $r \in R$ and produces predictions $\hat{y}$ subject to performance constraints.

Latency: For a single request, the end-to-end latency is:

$L_{e2e} = L_{network} + L_{queue} + L_{preprocess} + L_{inference} + L_{postprocess}$

where $L_{inference}$ typically dominates for GPU models. For autoregressive LLMs, inference latency for generating $n$ tokens is:

$L_{LLM} = L_{prefill}(|x|) + n \cdot L_{decode}$

where $L_{prefill}$ is the time to process the input prompt of length $|x|$ (compute-bound, parallelizable) and $L_{decode}$ is the per-token generation time (memory-bandwidth-bound, sequential).

Throughput: The maximum sustainable queries per second (QPS) for a serving instance with batch size $B$ is:

$\text{QPS}_{max} = \frac{B}{L_{inference}(B)}$

Note that $L_{inference}(B)$ grows sub-linearly with $B$ on GPUs (due to parallelism), so larger batches improve throughput -- up to the point where GPU memory is exhausted.

Dynamic Batching Efficiency: Given incoming requests with arrival rate $\lambda$ and a batching window $w$ , the expected batch size is:

$E[B] = \lambda \cdot w$

The batching window $w$ introduces additional latency but improves GPU utilization. The trade-off is:

$L_{total} = w + L_{inference}(E[B])$

GPU Memory Budget: For a model with $P$ parameters at precision $b$ bits, the minimum memory required is:

$M_{model} = P \times \frac{b}{8} \text{ bytes}$

For LLMs, the KV cache adds:

$M_{KV} = 2 \times n_{layers} \times d_{model} \times (|x| + |y|) \times \frac{b}{8} \times B$

where the factor 2 accounts for both keys and values, and $B$ is the number of concurrent sequences. This is often the dominant memory consumer -- a Llama 2 70B model with batch size 32 and 2K sequence length requires ~40 GB just for KV cache.

Quantization Compression Ratio: Quantizing from precision $b_1$ to $b_2$ yields:

$\text{Compression Ratio} = \frac{b_1}{b_2}$

For example, FP16 (16 bits) to INT4 (4 bits) gives a 4x compression. The quality degradation depends on the quantization method (PTQ vs. QAT) and can be measured as:

$\Delta_{quality} = \frac{|\text{metric}_{fp16} - \text{metric}_{quantized}|}{\text{metric}_{fp16}}$

Typically, well-calibrated INT8 quantization achieves $\Delta_{quality} < 0.01$ (less than 1% degradation), while INT4 may see $\Delta_{quality} \approx 0.02-0.05$ depending on the model and task.

Practical Note: In interviews and design discussions, being able to quickly estimate GPU memory requirements and throughput bounds using these formulas demonstrates a strong command of the serving space. For example: "A 7B model in FP16 needs 14 GB for weights, plus ~8 GB KV cache for batch size 16 at 2K context, so it fits on a single A100-80GB with room to spare."

Internal Architecture

A production model serving system consists of multiple layers, each handling a distinct concern. At the outermost layer, a load balancer distributes incoming requests across serving replicas. Each replica runs an inference server (Triton, vLLM, TorchServe, etc.) that manages model loading, request batching, GPU scheduling, and response formatting. Underneath, a model store (often backed by a model registry like MLflow or S3) provides versioned model artifacts. An auto-scaler monitors metrics like GPU utilization, queue depth, and latency P99 to dynamically adjust the number of replicas.

For LLM serving specifically, the inference server includes additional components: a KV cache manager that handles memory allocation for autoregressive generation, a continuous batching scheduler that dynamically adds and removes sequences from the running batch, and optionally a speculative decoding module that uses a smaller draft model to accelerate generation.

Model Serving in ML Systems Architecture — A directed architecture diagram showing: Client Application connects to Load Balancer, which fans...

The data flow for a real-time request looks like this: the client sends a request to the load balancer, which routes it to an available replica. The replica's dynamic batcher accumulates the request with others (waiting up to a configurable window, typically 5-50ms). Once a batch is formed, it's sent to the GPU runtime for inference. For LLMs, the KV cache manager allocates memory for the new sequence and the continuous batcher interleaves prefill and decode operations. The result is returned to the client, and metrics (latency, GPU utilization, queue depth) are emitted to the metrics collector, which feeds the auto-scaler.

Key Components

Load Balancer

Distributes incoming inference requests across serving replicas. Can be a simple round-robin (NGINX, Envoy) or an intelligent router that considers GPU utilization, queue depth, or even KV cache locality (vLLM Router). For LLM workloads, prefix-aware routing can yield 3-10x latency improvements by directing requests to replicas that already have relevant KV cache entries.

Inference Server

The core process that loads model artifacts, accepts requests, executes inference on hardware accelerators, and returns predictions. Examples: NVIDIA Triton, vLLM, TorchServe, TensorFlow Serving, BentoML. Handles model warm-up, health checks, and graceful shutdown.

Dynamic Batcher

Accumulates individual requests into batches to exploit GPU parallelism. Configurable parameters include maximum batch size, maximum wait time (batching window), and preferred batch sizes. For LLMs, continuous batching (iteration-level scheduling from the Orca paper) replaces traditional static batching -- sequences are added and removed at the granularity of each decode step rather than waiting for the entire batch to finish.

GPU Runtime / Backend

Executes the actual model computation on GPU (or CPU/TPU). May use optimized runtimes like TensorRT, ONNX Runtime, or custom CUDA kernels. Manages GPU memory allocation, kernel launch, and multi-GPU distribution (tensor parallelism, pipeline parallelism).

KV Cache Manager

Specific to LLM serving. Manages the key-value cache that stores intermediate attention states during autoregressive generation. vLLM's PagedAttention allocates KV cache in non-contiguous blocks (like OS virtual memory pages), reducing fragmentation from ~60% to near-zero and enabling 2-4x higher throughput.

Model Store / Registry

Provides versioned model artifacts (weights, configs, tokenizers) to inference servers. Typically backed by S3, GCS, Azure Blob Storage, or a model registry like MLflow. Supports model versioning, A/B testing, and rollback.

Auto-Scaler

Monitors serving metrics and adjusts replica count. Kubernetes HPA (Horizontal Pod Autoscaler) with custom GPU metrics (utilization, queue depth) is the standard approach. For GPU workloads, scale-up should be aggressive (0s stabilization) and scale-down conservative (10min stabilization) to avoid unnecessary pod churn and cold-start overhead.

Metrics Collector

Captures serving telemetry: latency (P50, P95, P99), throughput (QPS), GPU utilization, memory usage, queue depth, error rates, and model-specific metrics (tokens per second for LLMs). Feeds dashboards (Grafana) and alerting systems.

Data Flow

Real-time path: Client -> Load Balancer -> Inference Server -> Dynamic Batcher (accumulates for up to N ms) -> GPU Runtime -> Post-processing -> Response back to client. Total latency budget: typically 50-500ms for traditional ML, 1-10s for LLM generation.

Batch path: Scheduler triggers batch job -> reads input data from storage -> inference server processes large batches (thousands of items) -> writes predictions to output storage (database, S3, feature store). Optimized for throughput, not latency.

Model update path: New model version registered in Model Store -> Inference Server polls for updates (or receives webhook) -> loads new model alongside old version -> health check passes -> traffic gradually shifted (canary) -> old model unloaded.

Scaling path: Metrics Collector reports high GPU utilization / queue depth -> Auto-Scaler increases replica count -> new pods pull model from Model Store -> warm-up inference -> Load Balancer adds new replicas to rotation.

A directed architecture diagram showing: Client Application connects to Load Balancer, which fans out to N Inference Server Replicas. Each replica contains a GPU Runtime, KV Cache Manager, and Dynamic Batcher. A Model Registry feeds model artifacts to all replicas. An Auto-Scaler reads from a Metrics Collector (which receives data from all replicas) and adjusts the Load Balancer configuration.

How to Implement

Choosing Your Serving Stack

The serving landscape in 2025-2026 breaks into three tiers:

Tier 1: Purpose-built LLM engines -- vLLM, TensorRT-LLM (via Triton), and SGLang. These are optimized specifically for autoregressive generation with features like PagedAttention, continuous batching, and speculative decoding. If you're serving an LLM, start here.

Tier 2: General-purpose inference servers -- NVIDIA Triton, TorchServe, TensorFlow Serving, BentoML. These handle any model type (vision, NLP, tabular, ensembles) with framework-agnostic backends, dynamic batching, and model management. Triton is the industry standard for multi-framework GPU serving.

Tier 3: Managed platforms -- AWS SageMaker, Google Vertex AI, Azure ML. These abstract away infrastructure entirely. You upload a model, configure an endpoint, and the platform handles scaling, monitoring, and hardware. Higher cost, lower operational burden.

For a startup in Bengaluru building a chatbot, vLLM on a single GPU might be perfect -- zero infrastructure overhead, excellent throughput, and you can run it for ~INR 70,000/month ($830/month) on a cloud A100. For an enterprise like Flipkart running 60+ ML models across multiple frameworks, Triton behind a Kubernetes cluster with custom auto-scaling is the way to go.

Model Optimization Before Serving

Before you even think about the serving framework, optimize the model itself. The three pillars of model optimization are:

Quantization: Reduce numerical precision (FP32 -> FP16 -> INT8 -> INT4). INT8 post-training quantization (PTQ) typically preserves >99% accuracy while halving memory and doubling throughput. For LLMs, GPTQ and AWQ methods enable INT4 quantization with minimal quality loss.
Knowledge Distillation: Train a smaller "student" model to mimic a larger "teacher" model. Can achieve 5-50x size reduction while retaining 90-95% of the teacher's performance on the target task. This is becoming the most important technique in production AI -- companies like OpenAI, Anthropic, and Cohere all ship distilled models.
Pruning: Remove redundant weights or attention heads. Structured pruning can reduce model size by 2-10x. Most effective when combined with fine-tuning after pruning.

Cost Insight: On an NVIDIA T4 GPU (INR 22,000-28,000/month or ~ $270-$ 340/month on cloud), an INT8-quantized BERT model can serve ~3,000 QPS at P99 < 10ms. The same model in FP32 on a T4 serves ~800 QPS. That's a 3.75x throughput improvement for free -- no quality loss, no hardware upgrade.

vLLM -- Serve an LLM with OpenAI-compatible API33 lines

from vllm import LLM, SamplingParams
from vllm.entrypoints.openai.api_server import app
import uvicorn

# Option 1: Programmatic usage
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,      # number of GPUs
    max_model_len=4096,
    gpu_memory_utilization=0.90, # reserve 10% for overhead
    quantization="awq",          # use AWQ INT4 quantization
    enforce_eager=False,         # enable CUDA graphs for speed
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

prompts = ["Explain model serving in one paragraph."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

# Option 2: Launch OpenAI-compatible server (CLI)
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.1-8B-Instruct \
#   --tensor-parallel-size 1 \
#   --max-model-len 4096 \
#   --quantization awq \
#   --port 8000

vLLM is the dominant open-source LLM serving engine, powering inference at Meta, Stripe, IBM, and many others. The LLM class handles model loading, PagedAttention-based KV cache management, and continuous batching automatically. The --quantization awq flag enables INT4 quantization, reducing memory from ~16 GB to ~4 GB for an 8B model, allowing it to fit on a single T4 (16 GB). The OpenAI-compatible API server means you can swap in vLLM behind any application already using the OpenAI SDK.

NVIDIA Triton -- Multi-model serving with dynamic batching66 lines

# model_repository/
# └── bert_classifier/
#     ├── config.pbtxt
#     └── 1/
#         └── model.onnx

# config.pbtxt for the BERT classifier model
"""
name: "bert_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 64

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [512]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [512]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [3]
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 50000
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]
"""

# Client code to query Triton
import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(url="localhost:8001")

input_ids = grpcclient.InferInput("input_ids", [1, 512], "INT64")
attention_mask = grpcclient.InferInput("attention_mask", [1, 512], "INT64")

input_ids.set_data_from_numpy(np.ones([1, 512], dtype=np.int64))
attention_mask.set_data_from_numpy(np.ones([1, 512], dtype=np.int64))

result = client.infer(
    model_name="bert_classifier",
    inputs=[input_ids, attention_mask],
)

logits = result.as_numpy("logits")
print(f"Predicted class: {np.argmax(logits)}")

NVIDIA Triton is the Swiss Army knife of model serving -- it supports TensorRT, PyTorch, TensorFlow, ONNX, and Python backends, all managed from a single server process. The config.pbtxt file defines the model's interface and serving behavior. dynamic_batching with max_queue_delay_microseconds: 50000 means Triton will wait up to 50ms to accumulate requests into a batch of up to 64, dramatically improving GPU utilization. The instance_group with count: 2 runs two model instances on GPU 0, enabling concurrent execution. Uber switched their entire Michelangelo platform to Triton for serving deep learning models.

TorchServe -- Package and serve a PyTorch model60 lines

# Step 1: Create a handler (handler.py)
import torch
from ts.torch_handler.base_handler import BaseHandler
import json

class SentimentHandler(BaseHandler):
    def preprocess(self, data):
        """Tokenize incoming text."""
        texts = []
        for row in data:
            input_text = row.get("data") or row.get("body")
            if isinstance(input_text, (bytes, bytearray)):
                input_text = input_text.decode("utf-8")
            texts.append(input_text)
        
        inputs = self.tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=128,
            return_tensors="pt",
        )
        return inputs.to(self.device)

    def inference(self, inputs):
        """Run forward pass."""
        with torch.no_grad():
            outputs = self.model(**inputs)
        return outputs.logits

    def postprocess(self, outputs):
        """Convert logits to predictions."""
        probs = torch.softmax(outputs, dim=-1)
        predictions = []
        for prob in probs:
            pred_class = torch.argmax(prob).item()
            confidence = prob[pred_class].item()
            predictions.append({
                "class": pred_class,
                "confidence": round(confidence, 4)
            })
        return predictions

# Step 2: Package the model
# torch-model-archiver --model-name sentiment \
#   --version 1.0 \
#   --serialized-file model.pt \
#   --handler handler.py \
#   --extra-files "tokenizer/" \
#   --export-path model_store/

# Step 3: Start TorchServe
# torchserve --start --model-store model_store \
#   --models sentiment=sentiment.mar \
#   --ts-config config.properties

# Step 4: Query
# curl -X POST http://localhost:8080/predictions/sentiment \
#   -H "Content-Type: application/json" \
#   -d '{"data": "This product is amazing!"}'

TorchServe is developed jointly by AWS and Meta (PyTorch team). It uses the .mar (Model Archive) format to bundle model weights, handler code, and dependencies into a single deployable artifact. The handler pattern (preprocess -> inference -> postprocess) gives you full control over the serving pipeline. TorchServe supports multi-model serving, request batching, model versioning, and integrates with Kubernetes via KServe. Its main limitation compared to Triton is that it only supports PyTorch models.

BentoML -- Build and deploy a multi-model inference graph42 lines

import bentoml
import numpy as np
from bentoml.io import JSON, NumpyNdarray

# Save a trained model to BentoML's model store
import sklearn.ensemble
model = sklearn.ensemble.RandomForestClassifier()
# ... train model ...
bentoml.sklearn.save_model("fraud_detector", model)

# Define the serving service
@bentoml.service(
    resources={"cpu": "2", "memory": "4Gi"},
    traffic={"timeout": 30, "max_concurrency": 50},
)
class FraudDetectionService:
    fraud_model = bentoml.models.get("fraud_detector:latest")

    def __init__(self):
        self.model = bentoml.sklearn.load_model(self.fraud_model)

    @bentoml.api
    def predict(self, features: np.ndarray) -> dict:
        """Predict fraud probability for a transaction."""
        probabilities = self.model.predict_proba(features)
        predictions = self.model.predict(features)
        return {
            "prediction": int(predictions[0]),
            "fraud_probability": float(probabilities[0][1]),
            "threshold": 0.5,
        }

    @bentoml.api
    def health(self) -> dict:
        return {"status": "healthy", "model_version": str(self.fraud_model.tag)}

# Build and containerize
# bentoml build
# bentoml containerize fraud_detection_service:latest

# Run locally
# bentoml serve fraud_detection_service:latest

BentoML stands out for its developer experience -- it bridges the gap between a Python script and a production container with minimal boilerplate. The @bentoml.service decorator configures resources and traffic handling. The bentoml build command packages everything into a Bento (a standardized OCI-compatible artifact), and bentoml containerize generates a production Docker image. BentoML supports adaptive batching, multi-model inference graphs, GPU scheduling, and scale-to-zero on its managed platform (BentoCloud). Particularly popular with teams that want the simplicity of Flask with production-grade features.

SageMaker -- Deploy a model endpoint with auto-scaling58 lines

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Deploy a HuggingFace model to a SageMaker endpoint
hf_model = HuggingFaceModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role=role,
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
    env={
        "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",
        "HF_TASK": "text-classification",
    },
)

predictor = hf_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",  # ~$1.41/hr (~INR 118/hr)
    endpoint_name="sentiment-endpoint",
)

# Configure auto-scaling
client = boto3.client("application-autoscaling")

client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/sentiment-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=10,
)

client.put_scaling_policy(
    PolicyName="gpu-utilization-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/sentiment-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "CustomizedMetricSpecification": {
            "MetricName": "GPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 60,
    },
)

# Test the endpoint
result = predictor.predict({"inputs": "This product is excellent!"})
print(result)

AWS SageMaker abstracts away the entire serving infrastructure -- no Kubernetes, no Docker builds, no load balancer configuration. You pay per second of endpoint uptime. The auto-scaling policy targets 70% GPU utilization with aggressive scale-out (60s cooldown) and conservative scale-in (600s cooldown). An ml.g5.xlarge instance costs ~ $1.41/hour (~INR 118/hour), which works out to ~$ 1,030/month (~INR 86,500/month) for a single always-on endpoint. SageMaker also offers serverless inference (pay only during inference, scale to zero) and asynchronous inference for long-running predictions.

Configuration Example69 lines

# Kubernetes deployment for vLLM with HPA auto-scaling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
spec:
  replicas: 2
  template:
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - --model=meta-llama/Llama-3.1-8B-Instruct
          - --tensor-parallel-size=1
          - --max-model-len=4096
          - --gpu-memory-utilization=0.90
          - --quantization=awq
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 24Gi
        ports:
          - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-serving
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting
      target:
        type: AverageValue
        averageValue: 5
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120

Common Implementation Mistakes

●
No model warm-up: Loading a model and immediately serving traffic causes the first N requests to suffer 5-50x higher latency (GPU kernel compilation, memory allocation, JIT compilation). Always run warm-up inferences before accepting production traffic. Triton and TorchServe support this natively via configuration.
●
Static batching for LLMs: Using traditional static batching (wait for all sequences in a batch to complete) wastes GPU cycles when sequences have different lengths. A 10-token response and a 500-token response in the same batch means the GPU sits idle for 490 tokens on the short sequence. Use continuous batching (iteration-level scheduling) instead.
●
Ignoring the prefill-decode asymmetry: LLM prefill is compute-bound and fast; decode is memory-bandwidth-bound and slow. Treating them identically leads to suboptimal scheduling. Modern systems like Splitwise and Distserv disaggregate prefill and decode onto different hardware configurations.
●
Over-provisioning GPU memory: Setting gpu_memory_utilization to 1.0 in vLLM leaves no headroom for transient allocations, causing OOM crashes under load. Always reserve 10-15% (use 0.85-0.90). This one will hit you in production at 2 AM.
●
Skipping quantization: Serving a 7B model in FP32 (28 GB) when INT8 (7 GB) gives essentially the same quality is throwing money away. Always benchmark quantized versions against your evaluation set before deploying the full-precision model.
●
Missing health checks and readiness probes: Kubernetes will route traffic to pods that are still loading the model (which can take 30-120 seconds for large models) unless you configure proper readiness probes. This causes a wave of timeouts after every scaling event.
●
Single model instance per GPU: Running one model instance on a GPU that's only 30% utilized wastes 70% of your investment. Use Triton's instance_group or run multiple vLLM workers to co-locate models on the same GPU when they have complementary resource profiles.

When Should You Use This?

Use When

Your ML model needs to serve predictions in real-time (sub-second latency) to end users -- recommendation engines, chatbots, fraud detection, search ranking
You're deploying LLMs and need efficient GPU memory management with continuous batching and PagedAttention
Multiple models with different frameworks (PyTorch, TensorFlow, ONNX) need to be served from the same infrastructure
You need auto-scaling that responds to GPU-specific metrics (utilization, queue depth, KV cache occupancy) rather than just CPU/memory
Model versioning and A/B testing are requirements -- you need to run multiple model versions simultaneously with traffic splitting
Your serving workload requires dynamic batching to maximize GPU utilization under variable load
You need canary deployments to safely roll out model updates without risking full traffic

Avoid When

Your model runs infrequently (< 100 predictions/day) -- a simple serverless function (Lambda, Cloud Functions) is cheaper and simpler. Don't bring a Triton server to a Lambda fight.
Predictions can be pre-computed in batch and served from a cache or database -- batch inference with a cron job is 10-100x cheaper than maintaining always-on GPU endpoints
Your model is tiny (< 100 MB) and CPU inference meets latency requirements -- a FastAPI service with ONNX Runtime on a $20/month VM is perfectly adequate
You're in the prototyping phase and need to iterate quickly -- the overhead of setting up Triton or KServe will slow you down. Start with BentoML or a simple Flask wrapper.
Your team has no Kubernetes experience and no DevOps support -- managed platforms (SageMaker, Vertex AI) are a better fit despite higher per-unit cost
The model is embedded in a mobile or edge device -- model serving is a server-side pattern; for edge, use TensorFlow Lite, Core ML, or ONNX Runtime Mobile

Key Tradeoffs

Self-hosted vs. Managed: The Central Decision

The biggest architectural decision is whether to self-host (Triton, vLLM on Kubernetes) or use a managed platform (SageMaker, Vertex AI).

Factor	Self-hosted	Managed (SageMaker/Vertex)
Cost at scale	2-5x cheaper	Premium pricing
Operational burden	High (SRE needed)	Low (platform handles)
Customization	Full control	Limited to platform APIs
Cold start	You manage warm pools	Platform manages (variable)
GPU efficiency	Can co-locate models	One model per endpoint
Setup time	Days to weeks	Hours

For a Bengaluru startup with 2 ML engineers, SageMaker at ~INR 86,500/month ($1,030/month) per endpoint is probably the right call -- the engineering time saved is worth more than the cost premium. For Flipkart running 60+ models at scale, self-hosted Triton on Kubernetes is 3-5x more cost-effective.

Batch vs. Real-time: Knowing When to Pre-compute

The most cost-effective serving strategy is often a hybrid: pre-compute what you can in batch, personalize in real-time.

Batch: Daily recommendations, periodic risk scores, embedding pre-computation. Cost: ~$0.02-0.05 per 1000 inferences on spot instances.
Real-time: Click-time ranking, fraud detection, conversational AI. Cost: ~$0.10-1.00 per 1000 inferences on dedicated GPU endpoints.

That's a 20-50x cost difference. If 80% of your predictions can be pre-computed, you've just cut your serving bill by 80%.

Quantization: The Free Lunch (Almost)

INT8 quantization is the closest thing to a free lunch in ML serving. For most models:

Memory: 2x reduction (FP16 -> INT8)
Throughput: 1.5-2x improvement
Quality: <1% degradation on most benchmarks

INT4 is more aggressive: 4x memory reduction but 2-5% quality loss. Worth it for cost-sensitive deployments, not for quality-critical applications.

Rule of Thumb: Always try INT8 first. If quality is acceptable, deploy it. If not, try FP16. Only use FP32 if you have a specific numerical stability requirement.

Alternatives & Comparisons

Serverless Inference (Lambda / Cloud Functions)

Serverless inference scales to zero and charges only per invocation, making it ideal for sporadic, low-volume workloads (< 100 QPS). However, cold starts (5-30 seconds for large models), limited memory (10 GB max on Lambda), and no GPU support make it unsuitable for latency-sensitive or GPU-dependent workloads. Choose serverless for lightweight CPU models with bursty traffic; choose dedicated model serving for anything that needs GPUs or consistent latency.

Batch Inference (Spark / MapReduce)

Batch inference processes large datasets offline with no latency requirement, using frameworks like Spark or simple job schedulers. It's 10-100x cheaper than real-time serving for high-volume, non-interactive workloads. Choose batch inference for pre-computing recommendations, periodic scoring, or data enrichment pipelines. Choose real-time serving when the prediction must be generated at request time using fresh features.

Edge Inference (TensorFlow Lite / Core ML / ONNX Mobile)

Edge inference runs models directly on user devices (mobile, IoT, browsers), eliminating network latency and server costs entirely. However, models must be aggressively optimized (quantized, pruned) to fit device constraints, and updates require app deployments. Choose edge for privacy-sensitive or ultra-low-latency applications (face detection, keyboard prediction); choose server-side serving when you need large models, centralized updates, or cross-device consistency.

Feature Store with Pre-computed Predictions

Store pre-computed predictions in a feature store (Redis, DynamoDB) and serve them as simple key-value lookups with sub-1ms latency. This is the cheapest and fastest serving pattern but only works when predictions can be computed ahead of time and the feature space is enumerable. Choose this for user-level recommendations or segment-level scores; choose model serving when predictions depend on real-time context that can't be pre-computed.

Pros, Cons & Tradeoffs

Advantages

GPU utilization optimization: Dynamic batching, continuous batching, and multi-model co-location can push GPU utilization from 10-20% (naive serving) to 70-90%, extracting maximum value from expensive hardware. At $3-4/hour per A100, this is the difference between a viable and a bankrupt serving strategy.
Model-agnostic deployment: Frameworks like Triton and BentoML serve PyTorch, TensorFlow, ONNX, XGBoost, and custom Python models from a single infrastructure, eliminating the need for per-framework deployment pipelines.
Automatic scaling: Integration with Kubernetes HPA or managed platform auto-scaling means your serving layer can handle 10x traffic spikes (Diwali sale on Flipkart, IPL finale on Hotstar) without manual intervention.
Model versioning and canary deployment: Serving frameworks enable A/B testing, shadow mode, and gradual rollout of new model versions, reducing the blast radius of a bad model update from 100% to 1-5% of traffic.
Standardized observability: Built-in Prometheus metrics, OpenTelemetry integration, and health check endpoints provide the monitoring foundation that SRE teams need to maintain reliability SLOs.
LLM-specific optimizations: PagedAttention, speculative decoding, and prefix caching in engines like vLLM deliver 2-4x throughput improvements over naive serving, making LLM deployment economically feasible.

Disadvantages

Infrastructure complexity: Running Triton on Kubernetes with auto-scaling, model versioning, and monitoring requires significant platform engineering expertise. A misconfigured readiness probe can bring down your entire serving cluster.
Cold start latency: Loading a large model into GPU memory takes 30-120 seconds. Scaling from 0 to 1 or from 2 to 3 replicas introduces a window where requests queue or timeout. ReadWriteMany persistent volumes help but add storage complexity.
GPU cost: Always-on GPU endpoints are expensive. A single A100-80GB instance runs ~INR 2.5-3.5 lakh/month ( $3,000-$ 4,200/month). Without careful capacity planning, serving costs can exceed training costs by 10x.
Vendor lock-in with managed platforms: SageMaker endpoints are tightly coupled to the AWS ecosystem. Migrating to GCP or self-hosted requires rewriting deployment pipelines, scaling configs, and monitoring.
Configuration tuning overhead: Optimal batch sizes, batching windows, instance counts, and model parallelism strategies vary by model, hardware, and traffic pattern. There's no one-size-fits-all configuration -- you must benchmark for your specific workload.
Model update coordination: Updating a serving model requires coordinating model registry, serving infrastructure, and monitoring. A model that passes offline evaluation but degrades online metrics (data drift, feature skew) needs rapid rollback capability.

Implement load shedding: reject requests with HTTP 429 when queue depth exceeds a threshold (e.g., 2x normal). Configure client-side retry with exponential backoff and jitter (not immediate retries). Set max_queue_size on the inference server. Use a rate limiter upstream of the serving layer. Monitor queue depth as a leading indicator and alert before the spiral starts.

Placement in an ML System

The Final Mile of the ML Pipeline

Model serving is the final and most visible stage of the ML pipeline. Upstream, the model registry provides versioned model artifacts after training and evaluation. The feature store provides real-time features that the model needs at inference time. Together, these form the inputs to the serving layer.

Downstream, the serving layer feeds into metrics collection (latency, throughput, error rates), which in turn drives the monitoring and alerting system. The load balancer sits in front of the serving layer, distributing traffic across replicas. The rate limiter protects the serving layer from abuse or unexpected traffic spikes. Canary deployment mechanisms control how traffic is shifted between model versions.

The serving layer is also the primary cost center in most production ML systems. While training is a one-time (or periodic) expense, serving is continuous. For companies like Uber (10 million predictions per second at peak) or Swiggy (5,000+ predictions per second per model), the serving infrastructure cost can dwarf training costs by 10-100x over the model's lifetime.

Key Insight: Optimizing serving cost (through quantization, batching, auto-scaling, and spot instances for batch workloads) often has a bigger impact on the total cost of ownership than optimizing training efficiency.

Pipeline Stage

Serving / Deployment

Upstream

model-registry
model-training
feature-store

Downstream

metrics-collector
load-balancer
rate-limiter
canary-deploy

Scaling Bottlenecks

Where It Gets Expensive

The primary bottleneck is GPU memory, not GPU compute. Modern GPUs (A100, H100) have enormous compute capacity but relatively limited memory (40-80 GB). A 70B-parameter LLM in FP16 needs 140 GB of GPU memory just for weights -- already requiring at least two A100-80GB GPUs before you even account for KV cache and activations.

The second bottleneck is GPU memory bandwidth during LLM decode. Autoregressive generation is memory-bandwidth-bound: each token requires loading the entire model weights from GPU HBM. An A100 has 2 TB/s memory bandwidth, which limits decode throughput to ~100-200 tokens/second for a 70B model regardless of batch size.

The third bottleneck is cold start time. Loading a 14 GB model from S3 to GPU memory takes 30-120 seconds depending on network bandwidth. At scale, this means auto-scaling events have a significant warm-up penalty that must be accounted for in capacity planning.

Some concrete numbers: a single A100-80GB can serve a Llama 3.1 8B model (AWQ INT4) at ~2,500 tokens/second with batch size 32, handling roughly 50-80 concurrent requests. Scaling to 1,000 concurrent users requires approximately 15-20 A100 GPUs, costing ~INR 37-70 lakh/month ( $45,000-$ 85,000/month) on cloud.

Production Case Studies

UberTransportation / Delivery

Uber's Michelangelo platform manages over 5,000 models in production, serving 10 million predictions per second at peak. They migrated from their custom Neuropod serving layer to NVIDIA Triton Inference Server for deep learning models, citing Triton's native support for TensorFlow and PyTorch backends. The platform supports both real-time serving (ETA prediction, surge pricing) and batch inference (driver-rider matching optimization). Over 20,000 model training jobs run monthly.

Outcome:

Triton adoption reduced serving latency by ~30% for deep learning models and simplified the deployment pipeline by eliminating the need for framework-specific serving code. The platform now supports the transition from predictive ML to generative AI workloads.

SwiggyFood Delivery (India)

Swiggy's Data Science Platform (DSP) serves ML predictions for order assignment, delivery time estimation, and search ranking. The order-assignment flow involves multiple ML models that must run in a batch for every (order, delivery executive) pair in a city, with city-level batch jobs triggering at configured intervals. A single API call can fan out to hundreds of prediction calls.

Outcome:

Achieved over 5,000 peak predictions per second for a single model with P99 latency of 71ms (batch size 30), a 2x latency improvement from the previous 144ms P99. This optimization directly improved order-assignment quality by allowing more frequent batch jobs.

StripeFintech

Stripe migrated their LLM inference stack to vLLM, processing 50 million daily API calls for fraud detection, document understanding, and merchant support automation. The migration involved replacing their previous HuggingFace-based serving setup with vLLM's PagedAttention and continuous batching.

Outcome:

Achieved a 73% inference cost reduction by running on 1/3 of the previous GPU fleet while maintaining the same throughput and latency targets. This translates to millions of dollars in annual savings at Stripe's scale.

NetflixEntertainment / Streaming

Netflix's ML Platform supports diverse serving patterns across recommendation, content understanding, and experimentation. Their architecture separates heavy computation (model training, batch feature engineering) from low-latency serving through microservices. Each serving environment has tunable "knobs" for model latency, data freshness, caching policies, and execution parallelism. They are currently transitioning from domain-specific serving to a unified, domain-agnostic serving platform.

Outcome:

The unified platform supports both real-time inference and batch processing for 200+ million members, with per-model latency tuning that balances freshness against serving cost.

FlipkartE-commerce (India)

Flipkart runs 60+ ML models on their data platform, serving real-time predictions for product search ranking, personalized recommendations, fraud detection, and image-based product matching. Their platform ingests 10-50 TB of raw data daily and supports near-real-time decision parameters with under 2 minutes latency. ML models are served through a combination of online endpoints (for real-time personalization) and batch inference (for catalog-level scoring).

Outcome:

ML-powered search ranking and recommendations contribute significantly to Flipkart's GMV, with the serving infrastructure handling peak loads during Big Billion Days sale events (10-50x normal traffic).

Tooling & Ecosystem

vLLM

Python / CUDAOpen Source

High-throughput LLM serving engine with PagedAttention for efficient KV cache management and continuous batching. Supports OpenAI-compatible API, tensor parallelism, quantization (GPTQ, AWQ, FP8), speculative decoding, and prefix caching. The dominant open-source LLM serving engine -- powers inference at Meta, Stripe, IBM, and Cohere. The vLLM Production Stack includes a Rust-based router for Kubernetes deployments with 3-10x latency improvements.

NVIDIA Triton Inference Server (Dynamo-Triton)

C++ / PythonOpen Source

Industry-standard multi-framework inference server supporting TensorRT, PyTorch, TensorFlow, ONNX, and Python backends. Features dynamic batching, model ensembles, concurrent model execution, and GPU/CPU scheduling. Recently rebranded to Dynamo-Triton. Used by Uber, LinkedIn, and American Express for production serving.

TorchServe

Java / PythonOpen Source

Official PyTorch model serving framework, developed by AWS and Meta. Supports model archiving (.mar format), multi-model serving, request batching, model versioning, and metrics. Integrates with KServe for Kubernetes deployments. Best for PyTorch-only shops that want a simple, well-supported serving solution.

TensorFlow Serving

C++Open Source

High-performance serving system for TensorFlow models using the SavedModel format. Features request batching with configurable latency controls, model versioning with automatic version management, and gRPC/REST APIs. Mature and battle-tested, but limited to TensorFlow models.

BentoML

PythonOpen Source

Developer-friendly framework for packaging and serving ML models. Packages code, models, and configs into Bentos (OCI-compatible artifacts). Features adaptive batching, multi-model inference graphs, GPU scheduling, and scale-to-zero on BentoCloud. Supports any Python ML framework. Best for teams that want FastAPI simplicity with production-grade features.

KServe

Go / PythonOpen Source

Kubernetes-native model serving platform, now a CNCF incubating project (joined September 2025). Provides a standardized serving interface with auto-scaling, canary rollouts, request logging, and explainability. Supports Triton, TorchServe, and custom containers as backends. The Kubernetes-native answer to SageMaker.

Seldon Core

Python / GoOpen Source

Open-source platform for deploying ML models on Kubernetes with advanced inference graphs, A/B testing, canary deployments, and outlier/drift detection. Supports pre-packaged servers for sklearn, XGBoost, TensorFlow, and custom containers. Strong focus on MLOps integration and model monitoring.

AWS SageMaker Inference

Python (SDK)Commercial

Fully managed model serving with real-time endpoints, serverless inference, batch transform, and asynchronous inference. Auto-scaling, model monitoring, and A/B testing built in. Pricing: ml.g5.xlarge at ~$1.41/hour (~INR 118/hour). Best for teams on AWS who want zero operational overhead. SageMaker Savings Plans can reduce costs by up to 64%.

NVIDIA TensorRT

C++ / PythonOpen Source

High-performance inference optimizer and runtime for NVIDIA GPUs. Converts models from PyTorch, TensorFlow, and ONNX into optimized engines with layer fusion, kernel auto-tuning, INT8/FP16 quantization, and dynamic tensor memory. Typically used as a backend within Triton, not standalone. Can deliver 2-6x speedup over native framework inference.

NVIDIA Model Optimizer (ModelOpt)

PythonOpen Source

Unified library for SOTA model optimization techniques including post-training quantization (PTQ), quantization-aware training (QAT), pruning, knowledge distillation, and speculative decoding. Compresses models for deployment on TensorRT-LLM, TensorRT, and vLLM. Supports FP8, INT8, INT4, and NVFP4 precision formats.

Research & References

Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang & Stoica (2023)SOSP 2023

Introduced PagedAttention, which manages KV cache in non-contiguous memory blocks (like OS virtual memory pages), reducing memory waste from ~60% to near-zero and enabling 2-4x higher throughput. The foundation of vLLM.

Orca: A Distributed Serving System for Transformer-Based Generative Models

Yu, Jeong, Kim, Kim & Chun (2022)OSDI 2022

Proposed iteration-level scheduling (continuous batching) and selective batching for LLM serving, achieving 36.9x throughput improvement over NVIDIA FasterTransformer. These ideas are now standard in every production LLM serving engine.

Clipper: A Low-Latency Online Prediction Serving System

Crankshaw, Wang, Zhou, Franklin, Gonzalez & Stoica (2017)NSDI 2017

Pioneered the modular, framework-agnostic prediction serving architecture with caching, batching, and adaptive model selection. Introduced the concept of a prediction serving abstraction layer between applications and ML frameworks.

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Miao, Oliaro, Zhang, Cheng, Wang, Wong, Chen, Arfeen, Abhyankar & Jia (2024)arXiv preprint

Comprehensive survey covering LLM serving optimizations including speculative decoding, quantization, KV cache management, distributed serving, and scheduling algorithms. An excellent reference for understanding the full optimization landscape.

LLM Inference Serving: Survey of Recent Advances and Opportunities

Bai, Luo, Peng, Li & Zhang (2024)arXiv preprint

Survey of LLM inference serving systems published between January 2023 and June 2024, covering request scheduling, memory management, parallelism strategies, and emerging techniques like disaggregated serving.

Fast Distributed Inference Serving for Large Language Models

Wu, Zhong, Liu, Sun, Liu & Chen (2023)arXiv preprint

Introduced FastServe, a distributed LLM serving system with preemptive scheduling using a skip-join Multi-Level Feedback Queue, enabling request-level preemption to reduce head-of-line blocking and improve tail latency.

Fairness in Serving Large Language Models

Sheng, Cao, Li, Hooper, Lee, Yang, Chou, Zhu, Zheng, Keutzer, Gonzalez & Stoica (2024)arXiv preprint

Proposed the Virtual Token Counter (VTC), the first fair scheduling algorithm for continuous batching in LLM serving, ensuring equitable resource allocation across clients with different request sizes.

Interview & Evaluation Perspective

Common Interview Questions

●
How would you design a model serving system that handles 10,000 QPS with P99 latency < 100ms?
●
What is continuous batching, and why is it critical for LLM serving?
●
How does PagedAttention improve LLM serving throughput compared to static KV cache allocation?
●
Walk me through deploying a new model version to production with zero downtime.
●
How would you choose between self-hosted (Triton/vLLM) and managed (SageMaker) serving?
●
Explain the trade-offs between INT8 and INT4 quantization for production serving.
●
How would you auto-scale GPU-based serving endpoints on Kubernetes?
●
What failure modes should you monitor for in a production serving system?

Key Points to Mention

●
Continuous batching (iteration-level scheduling from Orca) is the single most important optimization for LLM serving -- it eliminates the idle time from static batching where short sequences wait for long ones.
●
PagedAttention (vLLM) manages KV cache in non-contiguous blocks, reducing memory fragmentation from ~60% to near-zero and enabling 2-4x throughput improvement.
●
Quantization is nearly free: INT8 typically preserves >99% quality while halving memory and doubling throughput. Always benchmark quantized models before dismissing them.
●
GPU memory is the bottleneck, not GPU compute. A 7B model in FP16 = 14 GB weights + KV cache. Quick math: $P \times b / 8$ bytes for model, plus KV cache that scales with batch size and sequence length.
●
Canary deployments with statistical significance testing are essential -- never promote a new model to 100% traffic based on offline metrics alone.
●
Batch vs. real-time is often a hybrid decision: pre-compute what you can in batch (10-100x cheaper), personalize in real-time only where necessary.
●
Auto-scaling on GPU metrics (utilization, queue depth), not CPU/memory. Scale-up aggressively, scale-down conservatively.

Pitfalls to Avoid

●
Saying you'd use Flask or FastAPI for production model serving without discussing batching, GPU management, health checks, or scaling. This signals lack of production experience.
●
Ignoring the cold start problem -- claiming you'd scale to zero and back without discussing the 30-120 second model loading time and how it affects user experience.
●
Claiming that quantization always degrades quality significantly. Well-calibrated INT8 is nearly lossless for most models. Show you understand the precision-quality spectrum.
●
Designing a serving system without discussing monitoring and rollback. Production systems fail, and your design must account for detecting and recovering from failures quickly.
●
Conflating training infrastructure with serving infrastructure. Training optimizes for throughput over large datasets; serving optimizes for latency on individual requests. The hardware, software, and cost models are fundamentally different.

Senior-Level Expectation

A senior/staff-level candidate should be able to design the full serving lifecycle: model optimization (quantization strategy with quality benchmarks), infrastructure selection (self-hosted vs. managed with cost analysis), deployment strategy (blue-green, canary with statistical significance), auto-scaling policy (custom GPU metrics, scale-up/down asymmetry), monitoring (P99 latency, throughput, GPU utilization, prediction distribution drift), and capacity planning (GPU memory budget, cost per 1000 inferences, monthly infrastructure cost in both USD and INR). They should discuss trade-offs in concrete terms: "A 70B model on 2x A100 costs ~INR 6 lakh/month but INT4 quantization lets us serve it on a single A100 for half that, with ~3% quality degradation on our eval set." The ability to reason about the cost-latency-quality triangle and make business-justified decisions separates staff-level from senior.

Summary

Model serving is the critical infrastructure layer that transforms trained ML models into production-grade prediction services. It is both the final mile of the ML pipeline and, paradoxically, the stage that most directly determines whether an ML investment generates business value.

The serving landscape has matured rapidly from 2023 to 2026. For LLMs, vLLM with PagedAttention and continuous batching has become the dominant open-source engine, delivering 2-4x throughput improvements over naive serving. For multi-framework deployments, NVIDIA Triton (now Dynamo-Triton) remains the industry standard, used by Uber, LinkedIn, and others to serve thousands of models at scale. Managed platforms like AWS SageMaker and Google Vertex AI trade cost-efficiency for operational simplicity -- the right choice for teams without dedicated platform engineering. The key technical decisions revolve around the cost-latency-quality triangle: quantization (INT8 for nearly-free 2x gains, INT4 for aggressive 4x compression with 2-5% quality trade-off), batching strategy (dynamic for traditional ML, continuous for LLMs), and scaling policy (GPU-metric-driven HPA with asymmetric scale-up/down).

Production serving demands rigorous operational practices: canary deployments with statistical significance testing, GPU memory monitoring as a leading indicator, load shedding to prevent request queuing spirals, and pre-warmed replicas to mitigate cold start cascades. The economics are stark -- a poorly optimized serving system can cost 10x more than a well-tuned one serving the same traffic. For Indian startups, the difference between an FP16 model on an A100 (~INR 70,000/month) and an INT4 quantized model on a T4 (~INR 25,000/month) can determine product viability. Mastering model serving is not optional for any team building production ML systems.

Concept Snapshot

Why This Concept Exists

The Gap Between Training and Value

Why Can't We Just Use Flask?

The LLM Inflection Point

Core Intuition & Mental Model

The Restaurant Kitchen Analogy

The Two Fundamental Modes

The Cost-Latency-Quality Triangle

Technical Foundations

Formalizing Serving Performance

Internal Architecture

Key Components

Data Flow

How to Implement

Choosing Your Serving Stack

Model Optimization Before Serving

Common Implementation Mistakes

When Should You Use This?

Use When

Avoid When

Key Tradeoffs

Self-hosted vs. Managed: The Central Decision

Batch vs. Real-time: Knowing When to Pre-compute

Quantization: The Free Lunch (Almost)

Alternatives & Comparisons

Pros, Cons & Tradeoffs

Advantages

Disadvantages

Failure Modes & Debugging

GPU Out-of-Memory (OOM) under load

Cold start cascade on scale-up

Silent model quality degradation

Batch inference timeout / starvation

Canary deployment gone wrong

Request queuing death spiral

Placement in an ML System

The Final Mile of the ML Pipeline

Pipeline Stage

Upstream

Downstream

Scaling Bottlenecks

Production Case Studies

Tooling & Ecosystem

Research & References

Interview & Evaluation Perspective

Common Interview Questions

Key Points to Mention

Pitfalls to Avoid

Senior-Level Expectation

Summary

Related Blocks & Further Reading

Related ML Blocks

Further Reading