ML Building Blocks & Models

Batch Data Source

Historical data from databases, data lakes, or files

Streaming Source

Real-time data from Kafka, Kinesis, or event streams

API Endpoint

REST or GraphQL API for data ingestion

File Upload

Upload files (CSV, JSON, Parquet, images)

Webhook

Receive data via HTTP webhooks

Web Scraper

Extract structured data from websites for ML training and RAG pipelines

Category

Data Processing

5 components

Data Validation

Schema validation, null checks, outlier detection

Data Cleaning

Handle missing values, remove duplicates, normalize

Data Transformation

ETL transformations, aggregations, joins

Normalization

Min-max, Z-score, or decimal scaling
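
The three scaling schemes named here can be sketched in plain Python. This is a minimal illustration, assuming a single numeric column held in a list; the function names are hypothetical and not part of any library.

```python
import math

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]
```

In practice the statistics (min, max, mean, standard deviation) would normally be computed on the training split only and reused for validation and test data.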

Deduplication

Remove duplicate records

Category

Feature Engineering

5 components

Feature Extraction

Extract features from raw data (embeddings, encodings)

Feature Selection

Select relevant features, reduce dimensionality

Feature Store

Centralized repository for feature management

Encoding

One-hot, label, or target encoding
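
As a rough sketch of how the three encodings differ, assuming categories arrive as a list of strings and targets as numbers (all function names are illustrative):

```python
def label_encode(categories):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping

def one_hot_encode(categories):
    """Turn each category into a 0/1 vector with a single 1 at its label index."""
    codes, mapping = label_encode(categories)
    width = len(mapping)
    return [[1 if j == code else 0 for j in range(width)] for code in codes]

def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    return [sums[c] / counts[c] for c in categories]
```

Target encoding computed on the same rows it is applied to leaks label information, so the means are usually fitted on training folds only.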

Feature Scaling

StandardScaler, MinMaxScaler, RobustScaler

Category

Model Training

25 components

Train/Test Split

Split data into training, validation, and test sets

Model Training

Train ML model with hyperparameter tuning

Hyperparameter Tuning

Grid search, random search, or Bayesian optimization

Cross-Validation

K-fold, stratified, or time-series cross-validation
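
The index bookkeeping behind K-fold and time-series splitting might look like the sketch below. It assumes samples are addressed by position, does no shuffling or stratification, and uses hypothetical function names.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous folds; each index is validated once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more sample
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def time_series_splits(n_samples, k):
    """Expanding-window splits: train on everything before each validation block."""
    block = n_samples // (k + 1)
    if block < 1:
        raise ValueError("not enough samples for k splits")
    for i in range(1, k + 1):
        # Trailing samples beyond (k + 1) * block are ignored in this sketch.
        yield list(range(0, i * block)), list(range(i * block, (i + 1) * block))
```

The time-series variant never trains on data that comes after the validation block, which is the property that distinguishes it from plain K-fold.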

Full Fine-tuning

Update all model parameters on task-specific data

LoRA

Low-Rank Adaptation - add small trainable matrices to attention layers
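
A small sketch of the idea, assuming a single linear layer with frozen weight W and trainable factors A (r x d_in) and B (d_out x r), combined as W x + (alpha / r) * B A x. Matrices are plain nested lists and the helper names are illustrative only.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Frozen path W x plus the scaled low-rank update B (A x)."""
    r = len(A)                        # rank = number of rows in A
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_out, d_in, r):
    """Compare trainable parameters: full weight matrix vs. the two LoRA factors."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}
```

With B initialised to zeros the layer starts out identical to the frozen model. For a 4096 x 4096 projection at rank 8, the two factors hold 65,536 trainable values against 16,777,216 for the full matrix.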

QLoRA

Quantized LoRA - 4-bit quantization + LoRA for memory efficiency

Adapter Layers

Insert small trainable modules between frozen transformer layers

Prefix Tuning

Learn continuous soft prompts prepended to each layer

Prompt Tuning

Learn task-specific prompt embeddings (input layer only)

IA³

Infused Adapter by Inhibiting and Amplifying Inner Activations - learn rescaling vectors for activations

Instruction Tuning

Fine-tune on instruction-following datasets (e.g., Alpaca, ShareGPT)

RLHF

Reinforcement Learning from Human Feedback with reward model

Reward Modeling

Train reward model from human preference comparisons

DPO

Direct Preference Optimization - optimize directly on preference pairs, without a separate reward model or RL loop
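
For a single preference pair, the DPO objective can be sketched as below. The inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the trainable policy and a frozen reference model; the function name and the default beta of 0.1 are assumptions for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy matches the reference the loss equals log 2; it falls as the policy raises the chosen response relative to the rejected one.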

ORPO

Odds Ratio Preference Optimization - single-stage SFT + preference

Constitutional AI

Self-improvement via AI feedback based on constitutional principles

Feature Extraction

Freeze base model, train only the classification head

Domain Adaptation

Adapt pretrained model to a new domain (e.g., medical, legal)

Continued Pretraining

Further pretrain on domain-specific corpus before fine-tuning

Knowledge Distillation

Train smaller student model to mimic larger teacher model
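
One common way to make the student mimic the teacher is to match temperature-softened output distributions. The sketch below assumes raw logits for a single example and uses the KL divergence scaled by T squared; names and the default temperature are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer distribution)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature**2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

In a full training loop this term is typically mixed with the ordinary cross-entropy loss on the ground-truth labels.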

Multi-Task Learning

Train on multiple tasks simultaneously with shared representations

Transfer Learning

Reuse pretrained model knowledge for new tasks via fine-tuning or feature extraction

Active Learning

Iteratively select most informative samples for labeling to minimize annotation cost

Model Quantization

Reduce model precision (FP32→INT8/INT4) for faster inference and smaller footprint
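
A minimal sketch of the FP32 to INT8 step, assuming symmetric per-tensor quantization with a single scale factor. Real toolchains add calibration, per-channel scales, and zero points; the function names here are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]
```

The round trip loses at most half a quantization step (scale / 2) per weight, which is the precision traded for the smaller footprint.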

Category

Evaluation

47 components

Accuracy

Overall classification accuracy: (TP + TN) / Total

Precision/Recall/F1

Precision, Recall, F1-Score (per class & macro/micro)
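
For the binary case, accuracy, precision, recall, and F1 all follow from the four confusion counts. A small sketch, assuming labels are given as parallel lists (function names are illustrative):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def classification_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1; ratios with an empty denominator are 0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Macro averaging would call this once per class and average the results, while micro averaging sums the counts across classes before computing the ratios.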

Confusion Matrix

Visualize TP, TN, FP, FN across classes

ROC-AUC Curve

Receiver Operating Characteristic & Area Under Curve

PR Curve

Precision-Recall curve (for imbalanced datasets)

Log Loss

Cross-entropy loss for probabilistic predictions

Cohen's Kappa

Agreement metric accounting for chance
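
Kappa compares observed agreement with the agreement expected if both label sequences were independent. A short sketch, assuming two equal-length lists of labels (the function name is illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """(observed - expected) / (1 - expected) agreement between two label lists."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists use one identical label throughout
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean worse than chance.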

MAE

Mean Absolute Error

MSE / RMSE

Mean Squared Error / Root MSE

R² Score

Coefficient of determination

MAPE

Mean Absolute Percentage Error
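
The regression metrics in this group can be computed together from the same residuals. A minimal sketch, assuming parallel lists of true and predicted values and no zeros among the true values (MAPE is undefined there); the function name is illustrative.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, R^2, and MAPE (as a fraction, not a percentage)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2, "mape": mape}
```

RMSE is in the same units as the target, which often makes it easier to interpret than MSE; R² is unitless and compares the model against always predicting the mean.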

Residual Plot

Visualize prediction residuals

Precision@K

Precision at top K results

Recall@K

Recall at top K results

MAP

Mean Average Precision

MRR

Mean Reciprocal Rank

NDCG

Normalized Discounted Cumulative Gain

Hit Rate

Fraction of queries with at least one relevant result
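
The ranking metrics above share one input shape: a ranked list of item IDs and a set of relevant IDs. The sketch below assumes binary relevance and a single query; function names are illustrative.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Share of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Share of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, item in enumerate(ranked[:k], start=1) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

MRR is the mean of `reciprocal_rank` over all queries, and hit rate is the fraction of queries for which it is non-zero.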

Catalog Coverage

Percentage of items ever recommended

Diversity Score

Intra-list diversity of recommendations

Novelty Score

Average popularity rank of recommended items

Serendipity

Unexpected but relevant recommendations

CTR / Conversion

Click-through rate & conversion metrics

BLEU Score

Bilingual Evaluation Understudy (translation/generation)

ROUGE Score

Recall-Oriented Understudy for Gisting Evaluation (summarization)

BERTScore

Semantic similarity using BERT embeddings

Perplexity

Exponentiated average negative log-likelihood per token (lower is better)
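
Perplexity is the exponential of the average negative log-likelihood per token. A short sketch, assuming the model's natural-log probabilities for each observed token are already available (the function name is illustrative):

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```

A model that assigns probability 0.25 to every token has perplexity 4, as if it were choosing uniformly among four options at each step.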

Faithfulness

Factual consistency with source (RAG)

Answer Relevance

Relevance of generated answer to query (RAG)

IoU / Jaccard

Intersection over Union (detection/segmentation)
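
For axis-aligned bounding boxes, IoU reduces to a few min/max operations. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2; the function name is illustrative.

```python
def box_iou(a, b):
    """Intersection area divided by union area of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero when boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two 2 x 2 boxes offset by one unit in each direction overlap in a 1 x 1 square, giving an IoU of 1/7.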

mAP (Detection)

Mean Average Precision for object detection

Dice Coefficient

Segmentation overlap metric

PSNR / SSIM

Image quality metrics (generation/super-resolution)

FID Score

Fréchet Inception Distance (generative models)

Silhouette Score

Cluster cohesion and separation

Davies-Bouldin Index

Average similarity between each cluster and its most similar cluster (lower is better)

Calinski-Harabasz

Variance ratio criterion

ARI / NMI

Adjusted Rand Index / Normalized Mutual Info

A/B Test Runner

Statistical A/B test framework

Statistical Significance

P-value and confidence interval calculator
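
One way such a calculator could work for conversion-rate experiments is a two-proportion z-test. The sketch below assumes large samples (normal approximation) and a two-sided alternative; the function name and argument layout are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all: nothing to test
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, 100 conversions out of 1000 against 150 out of 1000 gives a z-score above 3, which would be significant at the conventional 0.05 level.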

Uplift Model

Incremental impact measurement

K-Means Clustering

Partition-based clustering algorithm minimizing within-cluster variance

PCA (Principal Component Analysis)

Dimensionality reduction via eigendecomposition of covariance matrix

Elbow Method

Heuristic for selecting optimal K in clustering via inertia curve

Context Recall

RAG evaluation metric measuring retrieval completeness against ground truth

Linear Regression

Fundamental regression model fitting linear relationships with OLS

Gradient Boosting (XGBoost/LightGBM)

Ensemble method building sequential trees on residual errors

Category

Data Generation

31 components

Gaussian Generator

Generate data from Gaussian/Normal distribution

GAN Data Generator

Generative Adversarial Network for synthetic data

VAE Generator

Variational Autoencoder for data generation

Diffusion Generator

Diffusion model for high-quality synthetic data

CTGAN

Conditional Tabular GAN for structured data

TVAE

Tabular VAE for synthetic tabular data

Copula Generator

Copula-based synthetic data preserving correlations

Faker Generator

Rule-based fake data (names, addresses, etc.)

Time Series Generator

Synthetic time series (ARIMA, seasonal patterns)

LLM Data Generator

Use LLMs to generate synthetic training data

SMOTE

Synthetic Minority Over-sampling Technique
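
The core of SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. A simplified sketch, assuming numeric feature vectors as lists and brute-force neighbour search; the function name and defaults are illustrative.

```python
import math
import random

def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic points on segments between minority samples and their neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest other minority samples by Euclidean distance.
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies between two existing minority samples, the method fills in the minority region rather than duplicating rows.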

SMOTE-NC

SMOTE for mixed numerical/categorical data

Borderline-SMOTE

SMOTE focusing on borderline samples

ADASYN

Adaptive Synthetic Sampling

Random Oversampler

Simple random duplication of minority class

Random Undersampler

Random removal from majority class

Tomek Links

Remove majority-class samples that form Tomek links (nearest-neighbor pairs from opposite classes)

Edited Nearest Neighbors

Remove samples misclassified by k-NN

Cluster Centroids

Replace majority class with cluster centroids

NearMiss

Heuristic undersampling based on distance

SMOTE + ENN

SMOTE followed by Edited Nearest Neighbors cleaning

SMOTE + Tomek

SMOTE followed by Tomek links removal

Image Augmentation

Rotate, flip, crop, color jitter, mixup, cutout

Text Augmentation

Synonym replacement, back-translation, EDA

Audio Augmentation

Time stretch, pitch shift, noise injection

Mixup

Convex combination of training examples
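
A convex combination here means blending two examples, and their labels, with the same weight. The sketch below assumes feature vectors and one-hot labels as lists, with the weight drawn from a Beta(alpha, alpha) distribution; names and the default alpha are illustrative.

```python
import random

def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)

def mixup_pair(x1, y1, x2, y2, lam):
    """Blend two examples: lam * first + (1 - lam) * second, for inputs and labels."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

With a small alpha most draws land near 0 or 1, so mixed examples stay close to one of the two originals.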

CutMix

Cut and paste patches between images

Differential Privacy

Add noise for differential privacy guarantees

Federated Synthesis

Generate synthetic data in federated setting

Condensed Nearest Neighbour (CNN)

Undersampling by finding minimal consistent subset preserving 1-NN boundary

Batch Normalization

Normalize layer inputs across mini-batch for stable deep learning training

Category

Deployment

6 components

Model Registry

Version and manage trained models

Model Serving

Deploy model for real-time or batch inference

Load Balancer

Distribute inference requests across replicas

Canary Deploy

Gradual rollout with traffic splitting

Blue-Green Deploy

Zero-downtime deployment with instant switchover

Batch Inference

Run ML predictions on large datasets offline in batch mode

Category

Monitoring

5 components

Metrics Collector

Collect model performance and system metrics

Drift Detection

Detect data drift and model degradation

Alerting System

Send alerts based on thresholds and anomalies

Logging

Structured logging for debugging and audit

APM

Application Performance Monitoring

Category

Storage

4 components

Data Lake

Store raw and processed data at scale

Object Storage

S3, GCS, Azure Blob for unstructured data

Cache Layer

Redis or Memcached for fast access

Time-Series DB

InfluxDB, TimescaleDB for time-series data

Category

Orchestration

5 components

Pipeline Scheduler

Schedule and orchestrate ML pipelines (Airflow, Kubeflow)

CI/CD Pipeline

Automated testing and deployment pipeline

Workflow Engine

DAG-based workflow orchestration

Event Trigger

Trigger pipelines on events

Experiment Tracker

Log ML experiments (hyperparameters, metrics, artifacts) for reproducibility

Category

RAG Pipeline

15 components

Document Loader

Load documents from various sources (PDF, DOCX, web)

Text Chunker

Split documents into chunks (recursive, semantic)
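
Recursive chunking can be sketched as: try the coarsest separator first, pack pieces up to a size limit, and fall back to finer separators only for pieces that are still too long. This is an illustrative implementation with a character-based limit and a hypothetical function name; it does not cover semantic chunking.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate  # keep packing pieces into the current chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > max_chars:
                # Piece is still too long: retry with the finer separators.
                chunks.extend(recursive_split(part, max_chars, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard cut.
    return [text[j:j + max_chars] for j in range(0, len(text), max_chars)]
```

Production chunkers usually measure length in tokens rather than characters and add overlap between neighbouring chunks; both are omitted here for brevity.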

Embedding Model

Generate embeddings from text

Vector Store

Store and retrieve vector embeddings

Semantic Search

Vector similarity search

Hybrid Search

Combine keyword and semantic search
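
One widely used way to combine the two result lists is reciprocal rank fusion (RRF), which only needs ranks, not comparable scores. The source does not say which fusion method this component uses, so the sketch below is just one option; the function name and the constant k = 60 are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

A document that ranks reasonably well in both the keyword list and the semantic list tends to outrank one that is first in only a single list.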

Re-Ranker

Re-rank retrieved results (Cohere, BGE-reranker)

Context Assembler

Assemble context for LLM prompt

BM25 (Lexical Search)

Best Match 25 algorithm for term-frequency based document retrieval
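
BM25 scores a document by summing, over query terms, an IDF weight times a saturating, length-normalised term frequency. The sketch below uses the non-negative IDF variant ln((N - n + 0.5) / (n + 0.5) + 1) and assumes pre-tokenised documents; the function name and the defaults k1 = 1.5, b = 0.75 are illustrative choices.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (each document is a list of tokens)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            n = doc_freq[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            # Term frequency saturates via k1; b controls length normalisation.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Rare terms get a larger IDF, so a document matching an uncommon query word scores higher than one matching only a common word.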

Sparse Retrieval

Term-based retrieval using inverted indexes (TF-IDF, BM25)

Learned Sparse Retrieval (SPLADE)

Neural sparse retrieval with learned term expansion (SPLADE, DeepImpact)

ColBERT (Late Interaction)

Late interaction retrieval with per-token matching via MaxSim
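
The MaxSim operator can be sketched in a few lines: for each query token embedding, take its best dot-product match among the document token embeddings, then sum those maxima. The example assumes embeddings are plain lists of floats (already normalised if cosine similarity is intended); the function name is illustrative.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the maximum dot product with any document token."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

Because matching happens per token rather than on a single pooled vector, a document can score well by covering each query token somewhere in its text.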

Learning to Rank (LTR)

ML-based ranking combining multiple relevance signals (LambdaMART, RankNet)

Self-RAG

Self-reflective RAG with adaptive retrieval and critique tokens

LLM Generator

Language model that generates answers from retrieved context in RAG

Category

LLM Operations

6 components

Prompt Template

Define reusable prompt templates

Guardrails

Input/output validation and safety filters

Output Parser

Parse LLM output to structured format

Token Counter

Count tokens for context management

Rate Limiter

Control API request rate
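
A token bucket is one common way to implement this kind of control: tokens refill at a steady rate up to a cap, and each request spends one. The sketch below is an assumption about how such a component might work, not a description of this one; the class name and the injectable clock are illustrative.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, otherwise False."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would check `bucket.allow()` before each API request and back off or queue the request when it returns False.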

Response Cache

Cache LLM responses for cost savings
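
An exact-match cache is the simplest form of this: hash the model name, prompt, and generation parameters, and reuse the stored response when the same key recurs. This sketch is in-memory only and the class, method, and `call_fn` names are hypothetical; `call_fn` stands in for whatever client function actually calls the model.

```python
import hashlib
import json

class ResponseCache:
    """In-memory exact-match cache keyed by model, prompt, and parameters."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt, params):
        # sort_keys makes the key independent of keyword-argument order.
        payload = json.dumps({"model": model, "prompt": prompt, "params": params},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call_fn, **params):
        """Return the cached response, or call call_fn(model, prompt, **params) and store it."""
        key = self._key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_fn(model, prompt, **params)
        self._store[key] = response
        return response
```

Exact-match caching only saves cost for repeated identical requests, and it is most appropriate when sampling is deterministic (for example, temperature 0).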

Category

Agentic Systems

6 components

Agent Orchestrator

Central controller for agent workflows (ReAct, Plan-Execute)

Tool Executor

Execute agent tools and functions

Memory Store

Short-term and long-term agent memory

Planning Module

Generate and refine action plans

Human-in-Loop

Request human approval or input

ReAct Loop

Reasoning + Acting loop pattern

Category

Multi-Agent

6 components

LangGraph Node

Stateful node in LangGraph workflow

CrewAI Agent

Role-based agent in CrewAI framework

Agent Router

Route tasks to appropriate agents

Shared Memory

Shared context between agents

Task Decomposer

Break complex tasks into subtasks

Agent Supervisor

Coordinate and supervise multiple agents

Category

Vector Databases

6 components

Pinecone

Managed vector database service

Weaviate

Open-source vector search engine

Qdrant

Vector similarity search engine

Milvus

Scalable vector database for AI

ChromaDB

Lightweight embedding database

pgvector

PostgreSQL extension for vector similarity

Category

Computer Vision

17 components

Image Preprocessor

Resize, normalize, and prepare images

Object Detector

Detect and localize objects in images

Image Classifier

Classify images into categories

Segmentation

Semantic or instance segmentation

OCR

Extract text from images

Face Detection

Detect faces in images

YOLO v8 Nano

Ultra-fast object detection

YOLO v8 Medium

Balanced object detection

YOLO v8 XLarge

Most accurate YOLO v8

YOLO v9

State-of-the-art object detection

SAM (Segment Anything)

Meta universal segmentation model

SAM 2

Meta video + image segmentation

CLIP

OpenAI vision-language model

DINOv2

Meta self-supervised vision model

ResNet-50

Classic image classification backbone

EfficientNet-B0

Efficient image classification

ViT-Base

Vision Transformer base model

Category

NLP

5 components

Tokenizer

Tokenize text into words or subwords

NER Extractor

Extract named entities from text

Sentiment Analyzer

Analyze sentiment of text

Text Classifier

Classify text into categories

Summarizer

Generate text summaries

Category

Responsible AI

6 components

Bias Detector

Detect bias in model predictions

Fairness Checker

Check fairness metrics across groups

Explainer (SHAP)

SHAP values for model explanations

Explainer (LIME)

LIME for local explanations

Privacy Filter

Remove or mask PII from data

Content Moderator

Filter harmful or inappropriate content

Category

LLM Models

16 components

GPT-4o

OpenAI flagship multimodal model

GPT-4o Mini

Cost-effective OpenAI model for simple tasks

Claude 3.5 Sonnet

Anthropic balanced model for most tasks

Claude 3 Opus

Anthropic most capable model

Gemini Pro

Google DeepMind multimodal model

Gemini 2.0 Flash

Google fast and efficient model

Llama 3.1 8B

Meta small but capable LLM

Llama 3.1 70B

Meta open-source LLM, excellent reasoning

Llama 3.1 405B

Meta largest open-source LLM

Qwen 2.5 7B

Alibaba small multilingual LLM

Qwen 2.5 72B

Alibaba flagship multilingual LLM

Mistral 7B

Mistral AI efficient small model

Mixtral 8x7B

Mistral AI Mixture of Experts model

DeepSeek V3

DeepSeek flagship MoE model

Phi-3 Mini

Microsoft small language model

Phi-3 Medium

Microsoft medium language model

Category

NLP & Embedding Models

13 components

all-MiniLM-L6-v2

Fast and lightweight sentence embedding

BGE-small-en-v1.5

BAAI small embedding model

BGE-base-en-v1.5

BAAI base embedding model

BGE-large-en-v1.5

BAAI large embedding model, excellent for RAG

E5-small-v2

Microsoft small embedding model

E5-large-v2

Microsoft large embedding model

GTE-large

Alibaba general text embedding

Instructor-XL

Instruction-following embedding model

text-embedding-3-small

OpenAI small embedding model

text-embedding-3-large

OpenAI large embedding model

Cohere Embed v3

Cohere multilingual embedding

Whisper Large v3

OpenAI speech-to-text model, state-of-the-art ASR

TTS-1

OpenAI text-to-speech model

Category

3D Models

6 components

NeRF

Neural Radiance Fields for 3D reconstruction

Instant-NGP

Fast neural graphics primitives

3D Gaussian Splatting

Fast 3D scene reconstruction

Point-E

OpenAI text-to-3D point cloud

Shap-E

OpenAI text-to-3D mesh generation

GET3D

NVIDIA generative 3D model