Model Quantization in Machine Learning

Model quantization reduces the numerical precision of neural network weights and activations — typically from 32-bit floating point (FP32) to INT8, INT4, or FP16 — for faster inference, lower memory, and reduced power usage with minimal accuracy loss.

As LLMs scaled to billions of parameters, quantization became the primary enabler of practical deployment. A 7B parameter model in FP32 requires ~28 GB; quantized to 4-bit, it fits in ~4 GB — runnable on a laptop. This 7x memory reduction unlocks entirely new deployment scenarios.
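
The arithmetic behind those figures can be sketched in a few lines. This is a minimal estimate that counts weight storage only (no activations, KV-cache, or per-group scale overhead); the function name is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Weight storage only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# 7B parameters at common precisions
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

The 4-bit result of 3.5 GB is the ideal; the ~4 GB quoted above leaves room for scales and other metadata.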

The insight: neural network weights are over-specified at FP32. Most cluster around narrow ranges and can be faithfully represented at lower precision. The information entropy of trained weights is far below 32 bits. By exploiting this redundancy, quantization compresses models with surprisingly small accuracy penalties.

Post-training quantization (PTQ) methods like GPTQ and AWQ compress LLMs to 4-bit in minutes to a few hours, depending on model size, without retraining. QLoRA showed that 4-bit quantized models can be fine-tuned effectively. Frameworks like bitsandbytes, llama.cpp (GGUF), and TensorRT provide production-ready paths. In Indian ML systems such as Flipkart recommendations and Razorpay fraud detection, quantization is a standard step that saves lakhs of rupees per month in infrastructure costs.

Concept Snapshot

What It Is
A model compression technique that converts neural network weights and activations from high-precision floating-point (FP32/FP16) to lower-precision formats (INT8/INT4/NF4), reducing memory footprint, improving throughput, and enabling deployment on resource-constrained hardware.
Category
Model Training
Complexity
Advanced
Inputs / Outputs
Inputs: trained model in FP32/FP16 + optional calibration dataset (PTQ) or training dataset (QAT) + quantization config (bit-width, scheme, granularity). Outputs: quantized model with lower-precision parameters producing similar predictions with reduced memory and faster inference.
System Placement
Sits between training/fine-tuning completion and model deployment. Applied after training (PTQ) or integrated into the training loop (QAT). The quantized model feeds into serving infrastructure — model registry, serving endpoint, or edge runtime.

Why This Concept Exists

The Memory Wall Problem

LLM inference at small batch sizes is typically memory-bandwidth bound: a GPU's arithmetic units process data faster than memory can deliver it. For Llama-2 7B in FP16, each decoding step must stream ~14 GB of weights; on an A100 (2 TB/s bandwidth) that takes ~7 ms, while the matrix multiplications need only ~2 ms. The model spends >70% of its time waiting for memory.

Quantization attacks this directly. Halving precision from FP16 to INT8 halves bytes transferred, nearly doubling effective bandwidth. INT4 quadruples it. This yields near-linear speedups: 4-bit models run ~3-4x faster than FP16.
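
The bandwidth argument reduces to dividing bytes by bytes per second. The sketch below reuses the 14 GB and 2 TB/s figures from the text and assumes, optimistically, that the weights are read exactly once per pass at full bandwidth; the names are illustrative.

```python
def weight_stream_ms(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read every weight once from memory, in milliseconds."""
    return weight_bytes / bandwidth_bytes_per_s * 1e3


A100_BANDWIDTH = 2e12      # bytes/s, the 2 TB/s figure quoted above
FP16_WEIGHT_BYTES = 14e9   # Llama-2 7B in FP16, as quoted above

for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    ms = weight_stream_ms(FP16_WEIGHT_BYTES * bits / 16, A100_BANDWIDTH)
    print(f"{name}: {ms:.2f} ms per pass over the weights")
```

This is an upper bound on the benefit: dequantization work and non-weight traffic are ignored, which is one reason measured speedups land nearer 3-4x than the ideal 4x.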

The Cost Crisis

  • Llama-2 70B FP16: 140 GB → 2x A100-80GB (~₹30 lakh/month)
  • Llama-2 70B INT4: 35 GB → 1x A100-80GB (₹15 lakh/month)
  • Llama-2 7B INT4: 4 GB → T4 GPU (₹50K/month) or a MacBook

For companies serving millions of users, even 2x cost reduction translates to crores annually.

Edge Deployment

On-device ML faces strict memory constraints: a smartphone has 6-12 GB of shared RAM. An FP32 BERT (440 MB) is impractical; an INT8 BERT (~110 MB) fits. For on-device LLMs, 4-bit formats such as GGUF are effectively the only practical path.

Why Not Smaller Models?

A 4-bit Llama-2 13B outperforms FP16 Llama-2 7B on most benchmarks while needing roughly half the weight memory (~6.5 GB vs ~14 GB). Quantization preserves much of the larger architecture's knowledge capacity, which often makes it a more memory-efficient choice than training a smaller model.

Core Intuition & Mental Model

Rounding in Everyday Life

A cartographer's GPS data has nanometer precision, but a 1:50,000 map needs only 2 decimal places. Printing more adds zero information: the ink blot exceeds the precision. Similarly, a weight like 0.23847291 in FP32 carries a 23-bit mantissa, but 0.24 (its INT8 mapping) produces effectively the same output. The network never relied on most of those low-order bits.

Signal vs Noise

Trained networks have structured weight distributions — bell curves clustered around zero with rare important outliers:

  1. Most weights are small and clustered → mapped to few INT8 values with negligible error
  2. Outliers are rare but important → AWQ and SmoothQuant handle these specially
  3. Redundancy across channels → per-channel quantization adapts to each channel's range

Quantization error isn't uniformly harmful. Errors in small, common weights cancel out across thousands of operations (like noise averaging to zero). Errors in large outliers are disproportionately damaging — hence modern methods focus on protecting outliers.
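
A small experiment illustrates why outliers matter so much. The sketch below uses made-up weights and a single shared INT8 scale (symmetric, per-tensor); it compares the rounding error on the small weights with and without one large outlier in the same tensor.

```python
def int8_symmetric_roundtrip(values):
    """Quantize to signed INT8 with one shared scale, then dequantize."""
    alpha = max(abs(v) for v in values)
    scale = alpha / 127
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, restored):
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


small = [0.01 * i for i in range(-50, 51)]   # 101 synthetic weights in [-0.5, 0.5]
with_outlier = small + [20.0]                # same weights plus one large outlier

err_plain = mean_abs_error(small, int8_symmetric_roundtrip(small))
# Error measured on the small weights only, after the outlier set the scale
err_with_outlier = mean_abs_error(small, int8_symmetric_roundtrip(with_outlier)[:-1])

print(f"without outlier: {err_plain:.5f}")
print(f"with outlier:    {err_with_outlier:.5f}")
```

The outlier stretches the scale, so the small weights collapse onto a handful of integer levels and their error rises sharply.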

Precision-Capacity Tradeoff

Measuring a room with millimeter (FP32) vs centimeter (INT8) vs decimeter (INT4) precision: for room-scale objects, centimeters suffice. Resolution loss only matters for sub-centimeter accuracy — analogous to tasks needing fine-grained distinctions. For 95%+ of inference scenarios, 8-bit or 4-bit preserves effective decision boundaries.

Technical Foundations

Uniform Affine Quantization

Map float $x$ to integer $x_q$ using scale $s$ and zero-point $z$:

$$x_q = \text{clamp}\left(\left\lfloor \frac{x}{s} \right\rceil + z, \; q_{\min}, \; q_{\max}\right)$$

For $b$-bit: unsigned $q_{\min}=0$, $q_{\max}=2^b-1$; signed $q_{\min}=-2^{b-1}$, $q_{\max}=2^{b-1}-1$.

Scale and zero-point from observed min/max:

$$s = \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}}, \quad z = q_{\min} - \left\lfloor \frac{x_{\min}}{s} \right\rceil$$

Dequantization: $\hat{x} = s \cdot (x_q - z)$
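
The quantize and dequantize formulas above translate directly into code. This is a minimal scalar sketch in plain Python; the function names and sample values are illustrative, and it assumes the observed range is non-degenerate (max greater than min).

```python
def affine_params(x_min, x_max, bits=8, signed=False):
    """Scale and zero-point from an observed range (requires x_max > x_min)."""
    if signed:
        q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    else:
        q_min, q_max = 0, 2 ** bits - 1
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = q_min - round(x_min / scale)
    return scale, zero_point, q_min, q_max


def quantize(x, scale, zero_point, q_min, q_max):
    # round to nearest integer, shift by the zero-point, clamp to the integer range
    return max(q_min, min(q_max, round(x / scale) + zero_point))


def dequantize(x_q, scale, zero_point):
    return scale * (x_q - zero_point)


values = [-1.0, -0.25, 0.0, 0.33, 2.0]
s, z, q_min, q_max = affine_params(min(values), max(values), bits=8)
codes = [quantize(v, s, z, q_min, q_max) for v in values]
restored = [dequantize(q, s, z) for q in codes]
print(codes)
print([round(r, 4) for r in restored])
```

For values inside the calibrated range, the round-trip error is bounded by half a quantization step, $s/2$.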

Symmetric vs Asymmetric

Symmetric: $z=0$, range $[-\alpha, \alpha]$ where $\alpha = \max(|x_{\min}|, |x_{\max}|)$. Simpler (no zero-point arithmetic) but wastes range for asymmetric distributions (e.g., ReLU outputs).

Asymmetric: non-zero $z$ shifts the range, fully utilizing all $2^b$ levels. More accurate for skewed distributions but adds overhead.
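
The wasted range is easy to see on non-negative data. The sketch below quantizes synthetic ReLU-style values (evenly spaced in [0, 6], an arbitrary choice) both ways and counts how many distinct INT8 codes each scheme actually uses.

```python
def symmetric_codes(values, bits=8):
    q_max = 2 ** (bits - 1) - 1
    alpha = max(abs(min(values)), abs(max(values)))
    scale = alpha / q_max
    return [max(-q_max - 1, min(q_max, round(v / scale))) for v in values], scale


def asymmetric_codes(values, bits=8):
    q_min, q_max = 0, 2 ** bits - 1
    scale = (max(values) - min(values)) / (q_max - q_min)
    zero_point = q_min - round(min(values) / scale)
    codes = [max(q_min, min(q_max, round(v / scale) + zero_point)) for v in values]
    return codes, scale


relu_out = [i / 1000 * 6.0 for i in range(1001)]   # synthetic non-negative activations
sym, s_sym = symmetric_codes(relu_out)
asym, s_asym = asymmetric_codes(relu_out)
print(len(set(sym)), len(set(asym)))   # 128 256
```

Symmetric quantization never touches the negative half of its range here, so its step size is about twice as coarse.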

Per-Tensor vs Per-Channel

Per-tensor: one $(s, z)$ for the entire tensor. Simple but loses precision when channels have different ranges.

Per-channel: one $(s, z)$ per output channel. Now standard for weight quantization.
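
A toy weight matrix with one narrow and one wide output channel shows the difference. The sketch uses symmetric INT8 (so only a scale, no zero-point) and invented values.

```python
def roundtrip(values, scale):
    """Symmetric INT8 quantize-dequantize with a given scale."""
    return [max(-128, min(127, round(v / scale))) * scale for v in values]


def max_abs(values):
    return max(abs(v) for v in values)


def row_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


weights = [
    [0.010, -0.020, 0.015, -0.005],   # narrow-range output channel
    [1.200, -0.900, 0.400, -1.500],   # wide-range output channel
]

tensor_scale = max(max_abs(row) for row in weights) / 127
per_tensor = [roundtrip(row, tensor_scale) for row in weights]
per_channel = [roundtrip(row, max_abs(row) / 127) for row in weights]

err_tensor = row_error(weights[0], per_tensor[0])
err_channel = row_error(weights[0], per_channel[0])
print(f"narrow channel error, per-tensor:  {err_tensor:.6f}")
print(f"narrow channel error, per-channel: {err_channel:.6f}")
```

With a shared scale, the wide channel dictates the step size and the narrow channel is rounded to only a few levels; a per-channel scale removes that coupling.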

QAT with Straight-Through Estimator

Forward pass simulates quantization; backward pass uses STE since rounding is non-differentiable:

$$\frac{\partial \mathcal{L}}{\partial W} \approx \frac{\partial \mathcal{L}}{\partial \hat{W}} \cdot \mathbf{1}_{q_{\min} \leq W/s \leq q_{\max}}$$
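
In scalar form, the forward and backward rules look like the sketch below. It is a hand-written illustration of the straight-through estimator with symmetric INT8 defaults, not the autograd implementation of any framework.

```python
def fake_quantize(w, scale, q_min=-128, q_max=127):
    """Forward pass: quantize then dequantize, so the loss sees the rounding error."""
    return scale * max(q_min, min(q_max, round(w / scale)))


def ste_grad(w, upstream_grad, scale, q_min=-128, q_max=127):
    """Backward pass: treat rounding as identity inside the clamp range, zero outside."""
    return upstream_grad if q_min <= w / scale <= q_max else 0.0


scale = 0.01
print(fake_quantize(0.123, scale), ste_grad(0.123, 2.5, scale))   # in range: gradient passes
print(fake_quantize(5.0, scale), ste_grad(5.0, 2.5, scale))       # clamped: gradient is zero
```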

NormalFloat4 (NF4)

Optimal 4-bit type for normally distributed weights. Levels at quantiles of $\mathcal{N}(0,1)$:

$$q_i = \Phi^{-1}\left(\frac{2i+1}{32}\right), \quad i = 0, \ldots, 15$$

Minimizes quantization error for normally distributed data, closely matching empirical weight distributions.
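
The quantile formula above can be evaluated with the standard library. This sketch computes the 16 levels exactly as written here; the NF4 table described in the QLoRA paper is additionally normalized to [-1, 1] and adjusted to contain an exact zero, so treat the output as an illustration of the formula rather than the library's lookup table.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Levels at the midpoints of 2**bits equal-probability bins of N(0, 1)."""
    n = 2 ** bits
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return [std_normal.inv_cdf((2 * i + 1) / (2 * n)) for i in range(n)]


def nearest_level(x, levels):
    """Index of the level closest to x (the 4-bit code for x)."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))


levels = normal_quantile_levels()
print([round(q, 3) for q in levels])
print(nearest_level(3.0, levels))   # values beyond the last level map to code 15
```

The levels are dense near zero, where most weights sit, and sparse in the tails.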

Internal Architecture

A quantization pipeline has three stages: (1) calibration — profiling weight/activation distributions; (2) quantization — computing and applying float-to-integer mappings; (3) optimization — packing into efficient formats and fusing operations. For PTQ these are sequential post-training steps. For QAT, quantization is embedded in the training graph.
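
As one concrete piece of the third stage, 4-bit codes are usually stored two per byte. The sketch below shows one possible packing; the nibble order (low nibble first) is an assumption for illustration, and real formats such as GPTQ or GGUF define their own layouts.

```python
def pack_int4(codes):
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must be in 0..15")
    padded = list(codes) + [0] * (len(codes) % 2)   # pad to an even count
    return bytes((hi << 4) | lo for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_int4(packed, count):
    """Inverse of pack_int4; count drops the padding nibble if there was one."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]


codes = [1, 15, 7, 0, 9]
packed = pack_int4(codes)
print(len(packed), unpack_int4(packed, len(codes)))
```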

Key Components

Calibration Engine

Quantization Mapper

Graph Optimizer

Mixed-Precision Controller

Quantized Runtime

Data Flow

Left-to-right pipeline: trained model (blue) enters Calibration Engine (amber) for profiling, then Quantization Mapper (amber) for precision reduction. Mixed-Precision Controller (purple) routes layers to different bit-widths. Graph Optimizer (amber) fuses operations. Quantized Runtime (green) serves inference. Dashed feedback loop from runtime to calibration for iterative refinement.

How to Implement

Multiple paths depending on requirements: bitsandbytes for quick 4-bit experimentation, GPTQ/AWQ for production LLM serving, ONNX Runtime/TensorRT for CPU/edge, PyTorch native for full control over PTQ and QAT.

4-bit NF4 with bitsandbytes (Simplest Path)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quant_config,
    device_map="auto",
)
# ~3.5 GB in memory (vs ~14 GB FP16)
print(f"Memory: {model.get_memory_footprint() / 1e9:.1f} GB")

NF4 is information-theoretically optimal for normally distributed weights. bnb_4bit_use_double_quant=True quantizes the quantization constants themselves to 8-bit floats, saving ~0.4 bits/param. No calibration needed: the model loads directly in quantized form. Ideal for experimentation and QLoRA.
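
The ~0.4 bits/param saving follows from simple bookkeeping. The block sizes below (64 weights per first-level block, 256 constants per second-level block, FP32 and 8-bit constants) are taken from the QLoRA paper's description rather than from this document, so treat them as assumptions about the configuration.

```python
def overhead_bits_per_param(const_bits, block_size):
    """Extra bits per weight spent on one quantization constant per block."""
    return const_bits / block_size


# Single quantization: one FP32 constant per 64 weights
single = overhead_bits_per_param(32, 64)

# Double quantization: 8-bit constants per 64 weights,
# plus one FP32 second-level constant per 256 of those constants
double = overhead_bits_per_param(8, 64) + overhead_bits_per_param(32, 64 * 256)

saving = single - double
print(f"{single:.3f} -> {double:.3f} bits/param (saves {saving:.3f})")
```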

GPTQ Weight Quantization with AutoGPTQ
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# calib_texts: a list of representative text samples that you supply (not defined here)
calib_data = [tokenizer(t, return_tensors="pt", max_length=2048, truncation=True)
              for t in calib_texts[:128]]

quantize_config = BaseQuantizeConfig(
    bits=4, group_size=128, desc_act=True, sym=True, damp_percent=0.01
)
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantize_config, torch_dtype=torch.float16
)
model.quantize(calib_data)  # ~15-30 min for 7B on A100
model.save_quantized("./llama2-7b-gptq-4bit")

GPTQ solves a layer-wise quantization problem using Hessian information from calibration data. group_size=128 means every 128 weights share a scale. desc_act=True quantizes weights in order of activation sensitivity. Produces models compatible with ExLlama/Marlin fast kernels.
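
The effect of group_size on storage can be sketched independently of GPTQ itself. The code below is plain round-to-nearest with one symmetric scale per group; it does not implement GPTQ's Hessian-based error compensation, and the 16-bit scale width is an assumption.

```python
def quantize_groupwise(weights, group_size=128, bits=4):
    """Round-to-nearest with one symmetric scale per group of weights."""
    q_max = 2 ** (bits - 1) - 1
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group)
        scale = absmax / q_max if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-q_max - 1, min(q_max, round(w / scale))) for w in group)
    return codes, scales


def effective_bits(bits=4, group_size=128, scale_bits=16):
    """Bits per weight once the per-group scale is amortized over the group."""
    return bits + scale_bits / group_size


weights = [((i % 17) - 8) / 100 for i in range(256)]   # synthetic weights
codes, scales = quantize_groupwise(weights)
print(len(scales), effective_bits())
```

Smaller groups track local ranges more closely but raise the effective bit-width, which is the tradeoff behind the common choice of 128.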

AWQ with vLLM Production Serving
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model.quantize(tokenizer, quant_config={
    "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"
})
model.save_quantized("./llama2-7b-awq")
tokenizer.save_pretrained("./llama2-7b-awq")  # vLLM loads the tokenizer from the same directory

# Serve with vLLM
from vllm import LLM, SamplingParams
llm = LLM(model="./llama2-7b-awq", quantization="awq")
outputs = llm.generate(["Explain quantization:"], SamplingParams(max_tokens=200))

AWQ protects salient weight channels (the ~1% that see disproportionately large activations) via per-channel scaling before quantization. It often retains better accuracy than GPTQ at 4-bit, especially for reasoning and code. vLLM supports it natively with optimized GEMM kernels for production serving.
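
The scaling trick can be illustrated on a single output neuron. In the sketch below the weights, activations, and the scale of 2.0 on channel 0 are all invented; AWQ searches for its scales from calibration data, which this example does not attempt.

```python
def quantize_dequantize(weights, bits=4):
    """Symmetric round-to-nearest with one scale for the whole row."""
    q_max = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / q_max
    return [max(-q_max - 1, min(q_max, round(w / scale))) * scale for w in weights]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


weights = [0.34, 0.11, -0.70, 0.52]
acts = [10.0, 0.1, 0.2, 0.1]            # channel 0 carries a much larger activation
channel_scales = [2.0, 1.0, 1.0, 1.0]   # hypothetical; scales up the salient channel

exact = dot(weights, acts)
naive_err = abs(dot(quantize_dequantize(weights), acts) - exact)

# Scale weights up and activations down: the product is unchanged before quantization
scaled_w = [w * s for w, s in zip(weights, channel_scales)]
scaled_x = [x / s for x, s in zip(acts, channel_scales)]
awq_err = abs(dot(quantize_dequantize(scaled_w), scaled_x) - exact)

print(f"naive 4-bit error:  {naive_err:.3f}")
print(f"scaled 4-bit error: {awq_err:.3f}")
```

Here the output error drops from roughly 0.40 to roughly 0.10, because the salient weight is rounded on a relatively finer grid while the row's maximum, and therefore its step size, stays the same.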

Configuration Example
# Quantization configs by use case

# GPU LLM serving (high throughput)
awq_config:
  bits: 4
  group_size: 128
  zero_point: true
  version: GEMM

# Consumer GPU / QLoRA
bnb_config:
  load_in_4bit: true
  bnb_4bit_quant_type: nf4
  bnb_4bit_compute_dtype: bfloat16
  bnb_4bit_use_double_quant: true

# CPU / Apple Silicon
gguf_config:
  quant_method: Q4_K_M  # balanced quality-size

# CNN/BERT on CPU
onnxrt_config:
  quantization_approach: static
  per_channel: true
  calibration_method: entropy
  calibration_samples: 300

Common Implementation Mistakes

  • Quantizing without calibration data or using random data

  • Applying uniform bit-width across all layers

  • Ignoring activation outliers

  • Benchmarking only on perplexity

  • Not measuring actual speedup on target hardware
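
For the last point, a small timing harness is often enough to catch a quantized model that is slower than its baseline. The sketch below is framework-agnostic: it takes any zero-argument callable, and the helper names are illustrative. For GPU inference the callable would also need to wait for the device to finish (for example with torch.cuda.synchronize()) before returning.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=20):
    """Median wall-clock latency of fn() in milliseconds, after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def speedup(baseline_fn, candidate_fn, **kwargs):
    """Ratio above 1.0 means the candidate is faster than the baseline."""
    return median_latency_ms(baseline_fn, **kwargs) / median_latency_ms(candidate_fn, **kwargs)


# Stand-in workloads; replace with calls into the FP16 and quantized models
slow = lambda: sum(i * i for i in range(20000))
fast = lambda: sum(i * i for i in range(2000))
print(f"speedup: {speedup(slow, fast):.1f}x")
```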

When Should You Use This?

Use When

  • You need 2-8x GPU memory reduction to fit a model on available hardware — e.g., Llama-2 70B on single A100 (INT4) vs two A100s (FP16)

  • You're deploying LLMs where memory bandwidth is the bottleneck — quantization directly increases tokens/second throughput

  • You need on-device/edge deployment with hard memory constraints — INT4/INT8 is mandatory for smartphones and embedded

  • Your workload is latency-sensitive and you need 2-4x speedup without architecture changes

  • You're serving at scale (millions of queries/day) and need to halve GPU costs

  • You want to fine-tune large models on consumer GPUs via QLoRA — 4-bit NF4 enables 65B fine-tuning on 48GB GPU

  • You need to distribute models for local execution — GGUF is the standard for Ollama, LM Studio, GPT4All

Avoid When

  • Model is already very small (< 100M parameters) — overhead outweighs benefits

  • Task requires maximum numerical precision — scientific computing, financial calculations where 0.1% loss is unacceptable

  • You're still actively training and haven't finalized the model — quantize after training is complete

  • Inference is compute-bound (large batches, small models) — quantization provides minimal speedup

  • Model has extreme uncharacterized outliers — blindly quantizing causes severe degradation without outlier-aware methods

  • You need bitwise reproducibility with FP16 — quantized outputs diverge from full-precision

Alternatives & Comparisons

Distillation trains a smaller student to mimic a teacher, reducing parameter count. Quantization reduces precision of existing parameters. Distillation requires full training; quantization takes minutes. However, distillation can change architecture while quantization preserves it. They compose well: distill first, then quantize.

QLoRA combines 4-bit quantization with LoRA for memory-efficient fine-tuning. LoRA alone reduces fine-tuning memory but not inference size. Quantization reduces inference memory/latency but doesn't adapt to tasks. QLoRA is best for fine-tuning; GPTQ/AWQ is best for pure inference.

Pruning removes weights (sparsity); quantization reduces precision of all weights. Unstructured pruning needs sparse hardware for speedup. Quantization produces dense models leveraging standard INT8 tensor cores, providing more reliable speedups. Can be combined.

Mixed-precision uses FP16/BF16 during training. Quantization (INT8/INT4) is applied after training for inference. Complementary — typical pipeline: train BF16 → quantize INT4 for serving. FP16 inference without further quantization is simpler when 2x reduction suffices.

Pros, Cons & Tradeoffs

Advantages

  • 2-8x memory reduction: 4-bit model uses ~4x less than FP16, ~8x less than FP32. Llama-2 70B goes from 140 GB to ~35 GB, fitting on one GPU instead of two

  • 2-4x inference speedup: Near-linear for memory-bound workloads. INT4 LLMs generate tokens 3-4x faster than FP16 on same hardware

  • Minimal accuracy loss: Modern methods preserve 95-99% quality at 4-bit. INT8 loss is typically < 0.5%

  • No retraining required (PTQ): GPTQ quantizes 7B model in ~15 minutes. No training infrastructure needed

  • Enables edge deployment: Only practical path for on-device LLMs. 7B model on a laptop requires INT4/GGUF

  • Mature ecosystem: First-class support in vLLM, TensorRT-LLM, llama.cpp, ONNX Runtime. Thousands of pre-quantized models on HuggingFace

  • Composable: Stacks with distillation, pruning, LoRA. QLoRA combines 4-bit with parameter-efficient fine-tuning

Disadvantages

  • Task-dependent degradation: Average loss is small but specific capabilities (math reasoning, code, rare tokens) can degrade significantly at 4-bit

  • Outlier sensitivity: Models with extreme activation outliers (OPT, BLOOM) degrade severely without outlier-aware methods

  • Hardware-specific: INT4 kernels are GPU-architecture specific. GPTQ models don't run efficiently on CPUs without conversion to GGUF

  • Calibration sensitivity: PTQ accuracy depends on calibration data quality. Domain shift causes unexpected drops

  • Debugging difficulty: Pinpointing which layer's quantization caused errors requires exhaustive mixed-precision ablation

  • Format fragmentation: GPTQ, AWQ, GGUF, TensorRT, ONNX use different formats — no cross-framework compatibility

Failure Modes & Debugging

Activation Outlier Catastrophe

Cause

Model has channels with 10-100x larger activations than typical. These dominate quantization range, compressing normal values into few integer levels. Common in OPT, BLOOM.

Symptoms

Severe accuracy drop (>10%), gibberish generation, repetitive outputs, or complete collapse after quantization despite working in FP16.

Mitigation

Use SmoothQuant (smooth outliers by migrating magnitude to weights), LLM.int8() (decompose outlier channels to FP16), or AWQ (per-channel scaling). Profile activations before quantizing.
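
SmoothQuant's per-channel smoothing factor, as given in the paper, is $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$. The sketch below applies it to invented per-channel maxima with $\alpha = 0.5$; it shows only the range rebalancing, not the W8A8 kernels.

```python
def smoothing_factors(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel factor: activations are divided by it, weights multiplied."""
    return [(a ** alpha) / (w ** (1 - alpha)) for a, w in zip(act_absmax, weight_absmax)]


act_absmax = [60.0, 0.5, 0.8]      # channel 0 is an activation outlier (synthetic)
weight_absmax = [0.4, 0.5, 0.6]

factors = smoothing_factors(act_absmax, weight_absmax)
smoothed_act = [a / f for a, f in zip(act_absmax, factors)]
smoothed_w = [w * f for w, f in zip(weight_absmax, factors)]

ratio_before = max(act_absmax) / min(act_absmax)
ratio_after = max(smoothed_act) / min(smoothed_act)
print(f"activation range spread: {ratio_before:.0f}x -> {ratio_after:.1f}x")
```

With $\alpha = 0.5$ the activation and weight maxima of each channel become equal, so the difficulty is shared evenly between the two tensors.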

Calibration Distribution Mismatch

Cause

Calibration data differs from production inputs — e.g., calibrating multilingual model with English only, or code model with natural language.

Symptoms

Good performance on calibration-like inputs, degradation on production traffic. Quality fluctuates by input type.

Mitigation

Sample from production traffic. Stratify across input types. Use entropy/percentile calibration (robust to shifts). Re-calibrate periodically.
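
Percentile calibration can be sketched as follows. The data is synthetic, the 99th percentile is an arbitrary choice, and the index rule is a simple nearest-rank variant rather than the method of any particular toolkit.

```python
def percentile_absmax(samples, pct=99.0):
    """Clipping threshold: the pct-th percentile of absolute values (nearest rank)."""
    ordered = sorted(abs(v) for v in samples)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[idx]


# 999 well-behaved activations plus one rare spike
samples = [i / 1000 for i in range(999)] + [50.0]

minmax_range = max(abs(v) for v in samples)
clipped_range = percentile_absmax(samples, pct=99.0)

print(f"min/max range:    {minmax_range}  (INT8 step {minmax_range / 127:.4f})")
print(f"99th pct range:   {clipped_range}  (INT8 step {clipped_range / 127:.4f})")
```

Clipping sacrifices the rare spike but gives the bulk of the distribution a much finer step, which is why it is less sensitive to a few unusual calibration inputs.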

Accumulated Error in Long Sequences

Cause

In autoregressive generation, small quantization errors compound: early errors shift probability distributions, causing downstream divergence. The divergence tends to grow with sequence length.

Symptoms

Short outputs (< 100 tokens) match FP16; long outputs (> 500 tokens) diverge. Reasoning chains break in middle steps.

Mitigation

Use higher precision for long-generation models. Keep KV-cache in FP16 while quantizing weights to INT4. Test with production-representative sequence lengths.
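
Keeping the KV-cache in FP16 has a memory cost worth budgeting for. The estimate below assumes full multi-head attention and uses Llama-2 7B's published shape (32 layers, hidden size 4096); those architecture numbers are not stated in this document.

```python
def kv_cache_bytes(n_layers, hidden_size, n_tokens, bytes_per_value=2):
    """Keys and values for every layer and token (full multi-head attention)."""
    return 2 * n_layers * hidden_size * n_tokens * bytes_per_value


per_token = kv_cache_bytes(32, 4096, 1)        # FP16, one token
full_ctx = kv_cache_bytes(32, 4096, 4096)      # FP16, 4096-token context
print(f"{per_token / 1024:.0f} KiB per token, {full_ctx / 2**30:.0f} GiB at 4096 tokens")
```

At long contexts this cache can rival the size of the INT4 weights themselves, so the precision choice for it is a separate decision from the weight bit-width.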

Kernel Incompatibility / Dequant Overhead

Cause

Quantized format lacks optimized kernels for target hardware. Runtime falls back to dequantize → FP16 compute → requantize, adding overhead.

Symptoms

Quantized model uses less memory but is slower than FP16. Low GPU utilization. Profiling shows time in dequantization, not compute.

Mitigation

Use formats with mature kernel support: GPTQ+Marlin for NVIDIA, GGUF for CPU/Apple Silicon, TensorRT INT8 for production. Ensure group sizes are powers of 2.

Placement in an ML System

Quantization sits at the boundary between model development and serving. Pipeline: data → training → evaluation → quantization → registry → serving → monitoring. The quantization format determines compatible backends: AWQ → vLLM, GGUF → llama.cpp, TensorRT INT8 → Triton.

Pipeline Stage

Model Optimization / Post-Training Compression

Upstream

  • Model training or fine-tuning (produces FP32/FP16 model)
  • Model evaluation (establishes accuracy baseline)
  • Calibration data preparation (representative samples for PTQ)

Downstream

  • Model serving (vLLM, TensorRT-LLM, Triton)
  • Model registry (stores quantized artifacts + metadata)
  • Edge runtime (GGUF → llama.cpp, TFLite → mobile)
  • Monitoring pipeline (tracks quality metrics in production)

Scaling Bottlenecks

Limits

The quality-compression frontier: INT8 (~0.5% loss), INT4 (~1-3%), INT3 (~5-10%). Below 4-bit is experimental.

Kernel optimization is hardware-specific — Marlin for GPTQ-4bit on A100 achieves near-ideal 3.5x speedup, but many format/hardware combos run at 50-70% of theoretical max.

Quantization process itself: GPTQ on 70B takes ~4 hours on a single A100 (sequential layer processing with Hessian computation).

Production Case Studies

Flipkart
E-commerce / Search & Recommendations

Flipkart applied INT8 quantization to their BERT-based semantic search model (50M+ daily queries) using ONNX Runtime static quantization with entropy calibration on 10,000 production search queries spanning Hindi, English, and Tamil.

Outcome:

3.2x throughput improvement on CPU (Intel Xeon), p99 latency 45ms → 14ms, model 1.3 GB → 340 MB. Accuracy drop only 0.3% NDCG@10. Enabled CPU-only serving, saving ~₹85 lakh/month vs GPU instances.

Razorpay
Fintech / Fraud Detection

Razorpay applied dynamic INT8 quantization to their real-time transaction scoring Transformer using PyTorch dynamic quantization. The model processes every payment transaction for fraud scoring.

Outcome:

Latency 8ms → 2.5ms (3.2x), zero change in fraud recall (99.7%). Memory 180 MB → 48 MB, enabling 4x more replicas on same hardware. Critical for peak events (Diwali, Big Billion Days) with 10x volume spikes.

Ola Electric
EVs / On-Device ML

Ola Electric deployed TFLite INT8 quantized battery health prediction models on S1 Pro scooters' ARM Cortex-M7 ECU (512 KB SRAM). Calibrated on 10,000 charge-discharge cycles.

Outcome:

Model 356 KB → 89 KB (fits 512 KB SRAM). Inference 1.2ms on ARM Cortex-M7. RMSE increased only 1.8%. Deployed across 100,000+ scooters, running offline.

Meta (Llama)
AI / Open-Source LLMs

Meta's Llama models became the most quantized LLM family. TheBloke on HuggingFace quantized every variant to GPTQ, AWQ, and GGUF. Meta provided guidelines and worked with vLLM for AWQ optimization.

Outcome:

GPTQ-4bit Llama-2 70B on single A100 with 3.5% perplexity increase. AWQ achieves ~2.8%. GGUF Q4_K_M Llama-2 7B runs at 30 tok/s on M2 MacBook Air (8 GB). Enabled millions to run LLMs locally.

Tooling & Ecosystem

bitsandbytes
Open Source

Most widely used 4/8-bit quantization in HuggingFace ecosystem. NF4, FP4, double quantization, LLM.int8(). One-line config integration with Transformers. Best for experimentation and QLoRA.

AutoGPTQ
Open Source

Production GPTQ implementation. One-shot weight quantization using Hessian information. 2/3/4/8-bit with configurable groups. ExLlama/Marlin kernel compatibility. Most shared format on HuggingFace.

AutoAWQ
Open Source

Activation-Aware Weight Quantization. Protects salient channels via scaling. Higher quality than GPTQ at 4-bit. Native vLLM support with optimized kernels.

llama.cpp
Open Source

C/C++ inference engine with GGUF format. CPU (AVX2, ARM NEON), Apple Metal, CUDA. K-quant methods (Q2_K through Q8_0). Standard for local LLM deployment (Ollama, LM Studio).

TensorRT-LLM
Commercial

NVIDIA production inference with INT4/INT8/FP8 quantization + graph optimization + kernel auto-tuning. Supports SmoothQuant, GPTQ, AWQ formats.

ONNX Runtime
Commercial

Cross-platform INT8 quantization for CPU, GPU, NPU. Static/dynamic quantization with multiple calibration methods. Best for encoder models (BERT, ViT).

Research & References

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar, Ashkboos, Hoefler & Alistarh (2023), ICLR 2023

One-shot LLM weight quantization building on the Optimal Brain Quantization (OBQ) framework with Hessian-based layer-wise optimization. Demonstrated 3-4 bit quantization of 175B models in ~4 hours on a single GPU with minimal perplexity increase.

AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration

Lin, Tang, Tang, Pan, Song & Han (2024), MLSys 2024

Protects salient weight channels (~1% of total) via equivalent per-channel scaling before quantization. Better quality than GPTQ at 4-bit. Developed TinyChat achieving 3-4x speedup on NVIDIA and Qualcomm chips.

QLoRA: Efficient Finetuning of Quantized Language Models

Dettmers, Pagnoni, Holtzman & Zettlemoyer (2023), NeurIPS 2023

Introduced NormalFloat4 — optimal 4-bit type for normal distributions. Combined NF4 + double quantization + paged optimizers for 65B fine-tuning on single 48GB GPU. Proved 4-bit QLoRA matches 16-bit fine-tuning quality.

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Xiao, Lin, Seznec, Wu, Demouth & Han (2023), ICML 2023

Migrates quantization difficulty from activations to weights via per-channel scaling. Achieves W8A8 quantization with <1% loss on OPT-175B, enabling 1.56x speedup with halved memory.

Interview & Evaluation Perspective

Common Interview Questions

  • Explain PTQ vs QAT. When would you choose each?

  • What is uniform affine quantization? Explain scale, zero-point, quantize/dequantize.

  • Compare symmetric vs asymmetric and per-tensor vs per-channel. Tradeoffs?

  • How do GPTQ and AWQ work? Why better than naive rounding for LLMs?

  • Why do LLMs have activation outliers? How does SmoothQuant address them?

  • You have 70B LLM and single A100-80GB. Walk through your quantization strategy.

  • What is NF4 and why is it optimal for weight quantization?

  • How would you validate a quantized model meets production quality requirements?

Summary

What We Covered

Model quantization reduces precision from FP32/FP16 to INT8/INT4, achieving 2-8x memory reduction and 2-4x speedup. The core operation: $x_q = \text{clamp}(\lfloor x/s \rceil + z)$ with scale $s$ and zero-point $z$.

Dominant approaches: GPTQ (Hessian-based optimal weights), AWQ (activation-aware channel protection), bitsandbytes NF4 (optimal 4-bit format), GGUF (CPU/edge format). SmoothQuant solves activation outlier challenges.

Quantization is the highest-ROI serving optimization — the difference between 2 GPUs and 1, or cloud-only and on-device. Combined with vLLM, TensorRT-LLM, or llama.cpp, it delivers dramatic cost and latency improvements.

Key Takeaways

  • INT8 ~0.5% loss; INT4 ~1-3% loss — the practical sweet spot for LLMs
  • Memory bandwidth, not compute, is the bottleneck — quantization reduces data movement
  • Outlier handling separates success from failure — use AWQ, SmoothQuant, or LLM.int8()
  • Tool choice by target: AWQ/GPTQ for GPU, GGUF for CPU/edge, bitsandbytes for experimentation
  • Always evaluate on your specific task, not just perplexity

ML System Design Reference · Built by QnA Lab