Streaming Source in Machine Learning
A streaming data source is the entry point for continuous, unbounded data flowing into an ML system in real time. Unlike batch pipelines that process data in discrete chunks on a schedule, streaming sources ingest events as they happen -- user clicks, sensor readings, transactions, GPS pings -- and make them available for downstream processing within milliseconds to seconds.
Why does this matter for ML? Because the world doesn't pause while your nightly batch job runs. Fraud happens in real time. Ride prices need to reflect current demand. Recommendation feeds must adapt to what a user just browsed 10 seconds ago. Streaming data sources are what make real-time ML features possible -- and without real-time features, your model is always making decisions based on stale information.
The dominant technologies in this space are Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub, and Apache Pulsar for message transport, paired with Apache Flink, Spark Structured Streaming, or Kafka Streams for stateful stream processing. Together, they form the ingestion backbone of virtually every production ML system that needs to react to the world as it happens -- from Netflix processing 2 trillion messages per day, to Zomato handling 450+ million messages per minute during peak New Year's Eve traffic.
This guide covers everything: the theory behind streaming semantics, the architecture of production streaming pipelines, implementation patterns with real code, failure modes you will encounter, and how to think about streaming data sources in ML system design interviews.
Concept Snapshot
- What It Is
- A continuously flowing, unbounded data ingestion channel that delivers events to ML pipelines in real time via message brokers or event streams.
- Category
- Data Ingestion
- Complexity
- Intermediate
- Inputs / Outputs
- Inputs: raw events from producers (application logs, user actions, IoT sensors, CDC streams, webhooks). Outputs: ordered, partitioned event streams consumed by downstream processors, feature stores, or model serving layers.
- System Placement
- Sits at the very beginning of the ML pipeline, upstream of data validation, feature engineering, and model training/serving. It is the first component that touches raw data from the outside world.
- Also Known As
- event stream, real-time data source, message stream, event bus, streaming ingestion layer, real-time data pipeline
- Typical Users
- Data Engineers, ML Engineers, Platform Engineers, Backend Engineers, SREs / DevOps
- Prerequisites
- Distributed systems basics (partitioning, replication), Message queues and pub/sub patterns, Basic understanding of event-driven architecture, Serialization formats (Avro, Protobuf, JSON)
- Key Terms
- event time vs processing timeexactly-once semanticsat-least-once deliverywindowingwatermarksbackpressureconsumer grouppartitionoffsetcheckpointschema registry
Why This Concept Exists
The Problem with Batch-Only Pipelines
For years, the default approach to feeding ML systems was batch ingestion: collect a day's worth of data, run an ETL job overnight, update feature tables, retrain models on a schedule. This worked -- until it didn't.
Consider a fraud detection system that only sees yesterday's transactions. A sophisticated attacker can drain an account in minutes. By the time your batch pipeline catches up 24 hours later, the money is gone. The same staleness problem applies to dynamic pricing (Uber's surge, airline tickets), real-time recommendations (Swiggy suggesting restaurants based on what you're craving right now), and anomaly detection on IoT sensor data where a factory machine might overheat in seconds.
The core tension: ML models are only as fresh as the data they see. Batch pipelines introduce a latency floor -- typically 1-24 hours -- that streaming data sources eliminate.
Two Trends That Made Streaming Essential
Trend 1: User expectations shifted to real time. When a user adds an item to their cart on Flipkart, the recommendation engine should immediately adjust. When a rider opens Ola, the ETA should reflect traffic from 30 seconds ago, not 30 minutes ago. Consumers now expect sub-second responsiveness, and batch-derived features simply cannot deliver it.
Trend 2: ML models got good enough to exploit fresh data. With the rise of online learning, real-time feature stores (Tecton, Feast), and streaming feature computation (Flink, Spark Streaming), the infrastructure matured to a point where feeding fresher data into models yields measurably better outcomes. LinkedIn reported that switching from offline (24-48 hour delay) to real-time feature generation significantly improved their ML model performance.
The Evolution of Streaming Technology
The journey started with simple message queues (RabbitMQ, ActiveMQ) in the 2000s, evolved through Apache Kafka (2011, open-sourced by LinkedIn), matured with Apache Flink (2014, true event-time processing), and reached cloud-native managed services like Amazon Kinesis (2013), Google Cloud Pub/Sub, and Azure Event Hubs.
Kafka was the inflection point. By treating the message log as a durable, replayable, partitioned commit log, it unified stream processing and storage in a way that previous message queues couldn't. This insight -- that the log is the fundamental data structure for streaming -- changed how the industry thinks about data integration.
Today, Kafka alone handles trillions of messages daily across organizations like LinkedIn (7 trillion messages in their ecosystem), Netflix (2 trillion/day via their Keystone pipeline), and Uber (petabytes of real-time data). The technology is battle-proven at scales that would have been unthinkable a decade ago.
Core Intuition & Mental Model
The Mental Model: A Never-Ending Conveyor Belt
Think of a streaming data source as a conveyor belt in a factory. Items (events) are placed on the belt continuously by various producers. Downstream workers (consumers) pick items off the belt, process them, and pass the results along. The belt never stops. There is no "end of data" -- only a moving frontier of the latest event.
This is fundamentally different from batch processing, which is more like a delivery truck: it arrives once a day with a full load, you unpack everything at once, and then you wait for the next truck. Streaming is the conveyor belt that runs 24/7.
Three Questions Every Streaming System Must Answer
Here's an intuition that will serve you well. Every streaming data source must answer three questions, formalized beautifully in Google's Dataflow Model paper:
- What results are being computed? (The transformation logic)
- Where in event time are results computed? (Windowing)
- When in processing time are results materialized? (Triggers and watermarks)
- How do refinements of results relate? (Accumulation modes)
If you remember nothing else from this guide, remember these four questions. They are the conceptual foundation of all stream processing.
Why "Exactly-Once" Is So Hard
Here's an analogy that makes this click. Imagine you're counting people entering a stadium through multiple gates. Each gate counter sends a message to a central tally. Now imagine a counter crashes, restarts, and re-sends its last count. Did 500 people enter or 501? Was that last message a duplicate or a new entry?
In distributed systems, messages can be lost (network failure), duplicated (retry after timeout), or reordered (different network paths). Achieving exactly-once processing -- where every event is processed neither more nor less than once -- requires coordinating producers, brokers, and consumers through idempotent writes, transactional protocols, and distributed snapshots. It's one of the hardest problems in distributed computing, and getting it wrong means your ML features are either missing data (under-counting) or double-counting.
Key Insight: Streaming isn't just "batch but faster." It introduces fundamentally new challenges around ordering, completeness, and consistency that don't exist in batch processing. If your team treats streaming as just a speed upgrade over batch, you're going to have a bad time.
Technical Foundations
Formal Model of a Streaming Data Source
A streaming data source produces an unbounded, append-only sequence of events:
where each event consists of a key , a value (payload) , an event time (when the event actually occurred in the real world), and an ingestion time (when the system received it).
The critical insight is that and can differ significantly due to network delays, buffering, and out-of-order arrival. This event-time skew is the fundamental challenge that watermarks address.
Delivery Guarantees
Streaming systems provide one of three delivery semantics:
- At-most-once: Each event is delivered zero or one times. Fast but lossy. Formally: per event.
- At-least-once: Each event is delivered one or more times. No loss but possible duplicates. Formally: per event.
- Exactly-once: Each event is processed exactly one time. The gold standard. Formally: the effect of processing is as if each event was processed exactly once, even under failures.
Exactly-once is typically achieved through a combination of idempotent producers and transactional consumers (Kafka's approach) or distributed snapshots with checkpoint barriers (Flink's approach based on the Chandy-Lamport algorithm).
Windowing
Windowing partitions the unbounded stream into finite chunks for aggregation. The three fundamental window types are:
- Tumbling (fixed) windows: Non-overlapping, fixed-duration intervals. A 5-minute tumbling window groups events into
- Sliding (hopping) windows: Fixed-duration intervals that advance by a slide interval. A 10-minute window sliding every 2 minutes produces overlapping groups.
- Session windows: Dynamic windows that close after a gap of inactivity. Defined by a gap duration : a session window closes when no event arrives for time units.
Formally, for tumbling windows of size , event with event time belongs to window:
Watermarks
A watermark is an assertion by the system that all events with event time have been observed. Formally:
where accounts for expected skew. When advances past a window boundary, it triggers window computation. A watermark that advances too aggressively causes data loss (late events dropped); one that advances too conservatively causes increased latency (results delayed waiting for stragglers).
Throughput and Latency
For a partitioned streaming system with partitions, theoretical maximum throughput is:
where is the per-partition write rate. End-to-end latency is:
Typical values for Kafka: , , , yielding end-to-end latencies of 5-50ms before processing.
Internal Architecture
A production streaming data source pipeline consists of four major layers: producers that emit events, a message broker (Kafka, Kinesis, Pub/Sub) that durably stores and distributes them, a stream processor (Flink, Spark Streaming, Kafka Streams) that transforms and enriches events, and sinks that write processed data to downstream stores (feature stores, data lakes, serving layers).
The broker is the heart of the architecture. In Kafka's case, it's a distributed commit log partitioned by key, replicated across brokers for fault tolerance, and retained for a configurable duration (or indefinitely with compaction). Producers write to topic partitions; consumer groups read from them with offset tracking for exactly-once consumption.
The stream processor sits between the broker and the sinks. It handles the hard parts: stateful computation (aggregations, joins), windowing, watermark tracking, and checkpoint-based fault tolerance. Flink's distributed snapshot algorithm (based on Chandy-Lamport) periodically captures consistent state across all operators, enabling recovery to the last checkpoint without data loss or duplication.

This architecture supports independent scaling of each layer. Producers scale with the number of data sources. The broker scales horizontally by adding partitions and brokers. The stream processor scales by adding task slots or executors. And sinks can be independently optimized for their specific write patterns (append-only for data lakes, upsert for feature stores).
Key Components
Producers / Event Emitters
Generate and publish events to the message broker. These can be application servers emitting user action logs, IoT devices sending telemetry, database CDC (Change Data Capture) connectors like Debezium capturing row-level changes, or webhook receivers translating HTTP callbacks into stream events. Producers handle serialization (typically Avro or Protobuf via a schema registry), partitioning key selection, and batching for throughput optimization.
Message Broker / Event Bus
The durable, distributed backbone that stores and routes events. Apache Kafka implements this as a partitioned, replicated commit log with configurable retention. Amazon Kinesis uses shard-based partitioning with 7-day default retention. Google Cloud Pub/Sub provides a fully managed, serverless pub/sub service. The broker decouples producers from consumers, enabling independent scaling and fault isolation.
Schema Registry
Manages and enforces schemas for events flowing through the broker. Confluent Schema Registry (for Kafka) or AWS Glue Schema Registry (for Kinesis) ensure that producers and consumers agree on data structure, enable schema evolution (adding/removing fields without breaking consumers), and support serialization formats like Avro, Protobuf, and JSON Schema.
Stream Processor
Consumes events from the broker and performs stateful transformations: filtering, enrichment, aggregation, windowed joins, and feature computation. Apache Flink is the gold standard for stateful stream processing with exactly-once guarantees and event-time semantics. Spark Structured Streaming provides micro-batch processing with DataFrame APIs. Kafka Streams offers lightweight, library-based processing embedded in your application.
Checkpoint / State Backend
Persists the internal state of stream processors for fault tolerance. Flink uses RocksDB as its state backend with periodic snapshots to distributed storage (S3, HDFS). On failure, the processor restarts from the last checkpoint, replays events from the broker (using committed offsets), and recovers to a consistent state -- achieving exactly-once end-to-end.
Dead Letter Queue (DLQ)
Captures events that fail processing (deserialization errors, schema mismatches, processing exceptions) rather than blocking the pipeline or losing data. Events in the DLQ can be inspected, fixed, and replayed. This is critical for maintaining pipeline liveness while preserving data for debugging.
Monitoring and Observability Layer
Tracks consumer lag (how far behind consumers are from the latest produced offset), throughput (messages/sec), end-to-end latency, error rates, checkpoint durations, and backpressure signals. Tools include Kafka's built-in JMX metrics, Flink's web UI and metrics reporters, Prometheus + Grafana dashboards, and cloud-native monitoring (CloudWatch for Kinesis, Cloud Monitoring for Pub/Sub).
Data Flow
- A producer (e.g., an application server) serializes an event using the schema from the registry, selects a partition key (e.g.,
user_id), and publishes to a Kafka topic. - The Kafka broker appends the event to the appropriate partition's commit log, replicates to follower brokers (ISR), and acknowledges the producer.
- The event is now durable and available for consumption.
- A Flink consumer (part of a consumer group) reads the event from its assigned partitions, tracking offsets.
- The stream processor applies transformations: parsing, filtering, enriching with lookup data (e.g., joining with a user profile table), computing windowed aggregations (e.g., rolling 5-minute click count).
- Computed features are written to the feature store (e.g., Redis, DynamoDB) for real-time serving, and raw/enriched events are written to the data lake (S3, Delta Lake) for batch training.
- On successful processing, the consumer commits its offset to Kafka, advancing its position in the log.
- Flink periodically takes distributed snapshots (checkpoints) of all operator states and committed offsets.
- On failure, Flink restarts from the last successful checkpoint, rewinds Kafka offsets to the checkpointed position, and replays events.
- Combined with idempotent sinks, this achieves exactly-once end-to-end processing.
A directed flow diagram showing four producer types (App Events, IoT Sensors, CDC Streams, Webhooks) feeding into a central Message Broker (Kafka/Kinesis/Pub-Sub) with a connected Schema Registry. The broker feeds into a Stream Processor (Flink/Spark Streaming) with an associated Checkpoints store. The processor outputs to four sinks: Feature Store, Data Lake, Serving Layer, and Alerting. The diagram emphasizes the fan-in from diverse producers, the central role of the broker, and the fan-out to multiple downstream consumers.
How to Implement
Choosing Your Stack
Implementation begins with a fundamental choice: which broker and which processor?
For the message broker, the decision typically comes down to operational model and cloud provider:
- Apache Kafka (self-managed or via Confluent Cloud / AWS MSK): The industry default. Highest throughput, richest ecosystem, largest community. Choose this unless you have a strong reason not to.
- Amazon Kinesis: Native AWS integration, serverless scaling in on-demand mode. Ideal if your entire stack is AWS and you want zero broker ops. But: lower throughput per shard (1 MB/s in, 2 MB/s out) and less ecosystem support.
- Google Cloud Pub/Sub: Fully serverless, global, no partitioning to manage. Best for GCP-native stacks. But: no ordering guarantees by default (must use ordering keys).
For the stream processor, the choice depends on latency requirements and team expertise:
- Apache Flink: True event-time processing, exactly-once semantics, sub-second latency. The gold standard for stateful streaming. Used by Netflix, Uber, LinkedIn at scale.
- Spark Structured Streaming: Micro-batch with good-enough latency (~100ms-1s). Better if your team already knows Spark and you need unified batch+stream processing.
- Kafka Streams: Lightweight library (no separate cluster needed). Ideal for simple transformations embedded in your application. Not suitable for complex stateful processing at high scale.
Cost Guidance: A 3-broker Kafka cluster on AWS MSK (kafka.m5.large) costs approximately 460/month (~INR 38,600/month) for a basic production setup. Confluent Cloud's Basic cluster starts at ~1,100/month or INR 92,000/month). Amazon Kinesis On-Demand costs 0.04/GB for reads, which can be significantly cheaper for low-throughput workloads but expensive at scale. For a budget-conscious Indian startup processing 100 GB/day, self-managed Kafka on EC2 (3x t3.xlarge) can run as low as ~$300/month (INR 25,000/month) including storage.
Schema Management Is Non-Negotiable
Before writing a single line of processing code, set up a schema registry. This is the single most impactful decision you'll make for long-term pipeline health. Without enforced schemas, you will inevitably face deserialization failures, silent data corruption, and painful producer-consumer coordination. Use Apache Avro for its compact binary format and excellent schema evolution support, or Protobuf if your team is already using gRPC.
from confluent_kafka import Producer
from confluent_kafka.serialization import SerializationContext, MessageField
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
import json
# Schema definition
user_event_schema = """
{
"type": "record",
"name": "UserEvent",
"namespace": "com.example.ml",
"fields": [
{"name": "user_id", "type": "string"},
{"name": "event_type", "type": "string"},
{"name": "item_id", "type": ["null", "string"], "default": null},
{"name": "event_time", "type": "long", "logicalType": "timestamp-millis"},
{"name": "properties", "type": {"type": "map", "values": "string"}}
]
}
"""
# Configure schema registry and serializer
schema_registry_conf = {"url": "http://schema-registry:8081"}
schema_registry_client = SchemaRegistryClient(schema_registry_conf)
avro_serializer = AvroSerializer(
schema_registry_client,
user_event_schema,
lambda obj, ctx: obj # identity transform
)
# Configure producer with idempotence for exactly-once
producer_conf = {
"bootstrap.servers": "kafka-broker:9092",
"enable.idempotence": True, # Prevents duplicate writes
"acks": "all", # Wait for all ISR replicas
"max.in.flight.requests.per.connection": 5,
"retries": 2147483647, # Infinite retries
"linger.ms": 5, # Batch for 5ms for throughput
"compression.type": "lz4", # Compress for network savings
}
producer = Producer(producer_conf)
def publish_event(user_id: str, event_type: str, item_id: str = None,
properties: dict = None):
"""Publish a user event to Kafka with schema validation."""
import time
event = {
"user_id": user_id,
"event_type": event_type,
"item_id": item_id,
"event_time": int(time.time() * 1000),
"properties": properties or {}
}
producer.produce(
topic="user-events",
key=user_id.encode("utf-8"), # Partition by user_id
value=avro_serializer(
event,
SerializationContext("user-events", MessageField.VALUE)
),
on_delivery=lambda err, msg: print(
f"Delivered to {msg.topic()}[{msg.partition()}]@{msg.offset()}"
if not err else f"Delivery failed: {err}"
)
)
producer.flush() # In production, flush periodically, not per-message
# Usage
publish_event(
user_id="user_12345",
event_type="product_view",
item_id="SKU_98765",
properties={"category": "electronics", "source": "search"}
)This producer demonstrates three critical production patterns: (1) Idempotent writes via enable.idempotence=True which ensures that retries don't create duplicates in Kafka -- the broker deduplicates based on producer ID and sequence number. (2) Schema enforcement via Avro + Schema Registry, which prevents producers from sending malformed events. (3) Key-based partitioning using user_id, which ensures all events for the same user land in the same partition, preserving per-user ordering -- essential for session-based feature computation.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
KafkaSource, KafkaOffsetsInitializer
)
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream.window import TumblingEventTimeWindows
from pyflink.datastream.functions import (
AggregateFunction, ProcessWindowFunction
)
from pyflink.common.time import Time
from pyflink.common.watermark_strategy import (
WatermarkStrategy, TimestampAssigner
)
import json
from datetime import timedelta
# Initialize Flink environment
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)
env.enable_checkpointing(60000) # Checkpoint every 60 seconds
# Configure Kafka source
kafka_source = (
KafkaSource.builder()
.set_bootstrap_servers("kafka-broker:9092")
.set_topics("user-events")
.set_group_id("feature-computation-group")
.set_starting_offsets(KafkaOffsetsInitializer.latest())
.set_value_only_deserializer(SimpleStringSchema())
.build()
)
# Define watermark strategy with 10-second tolerance for late events
class EventTimestampAssigner(TimestampAssigner):
def extract_timestamp(self, value, record_timestamp):
event = json.loads(value)
return event["event_time"]
watermark_strategy = (
WatermarkStrategy
.for_bounded_out_of_orderness(timedelta(seconds=10))
.with_timestamp_assigner(EventTimestampAssigner())
)
# Read from Kafka with watermarks
stream = env.from_source(
kafka_source, watermark_strategy, "Kafka User Events"
)
# Compute rolling 5-minute click counts per user
class ClickCountAggregator(AggregateFunction):
def create_accumulator(self):
return 0
def add(self, value, accumulator):
event = json.loads(value)
if event["event_type"] == "product_view":
return accumulator + 1
return accumulator
def get_result(self, accumulator):
return accumulator
def merge(self, acc1, acc2):
return acc1 + acc2
class FeatureEmitter(ProcessWindowFunction):
def process(self, key, context, elements):
count = list(elements)[0]
feature = {
"user_id": key,
"feature_name": "click_count_5min",
"feature_value": count,
"window_start": context.window().start,
"window_end": context.window().end
}
yield json.dumps(feature)
# Build the feature computation pipeline
features = (
stream
.key_by(lambda x: json.loads(x)["user_id"])
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.aggregate(
ClickCountAggregator(),
FeatureEmitter()
)
)
# Sink to feature store (Redis via custom sink or Kafka topic)
features.print() # Replace with Redis/Kafka sink in production
env.execute("Real-Time Feature Computation")This Flink job demonstrates the core pattern for real-time ML feature computation: (1) Watermark strategy with 10-second bounded out-of-orderness tolerance -- events arriving up to 10 seconds late will still be included in the correct window. (2) Key-by user_id to partition computation, ensuring each user's events are processed together. (3) Tumbling 5-minute windows to aggregate click counts, producing one feature value per user per window. (4) Checkpointing every 60 seconds for exactly-once fault tolerance. In production, the output would sink to a feature store like Redis or an online feature serving system like Tecton.
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.common.serialization.Serdes;
import java.time.Duration;
import java.util.Properties;
public class FeatureEnrichmentStream {
public static void main(String[] args) {
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG,
"feature-enrichment-app");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
"kafka-broker:9092");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
Serdes.String().getClass());
// Exactly-once processing
config.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
StreamsConfig.EXACTLY_ONCE_V2);
StreamsBuilder builder = new StreamsBuilder();
// User events stream (clicks, views, purchases)
KStream<String, String> userEvents =
builder.stream("user-events");
// User profiles table (compacted topic as a KTable)
KTable<String, String> userProfiles =
builder.table("user-profiles");
// Enrich events with user profile data via stream-table join
KStream<String, String> enrichedEvents = userEvents.join(
userProfiles,
(event, profile) -> {
// Merge event with profile for enriched feature
return String.format(
"{\"event\": %s, \"profile\": %s}",
event, profile
);
}
);
// Compute session-windowed features
enrichedEvents
.groupByKey()
.windowedBy(
SessionWindows.ofInactivityGapWithNoGrace(
Duration.ofMinutes(30)
)
)
.count()
.toStream()
.to("user-session-features");
KafkaStreams streams = new KafkaStreams(
builder.build(), config
);
streams.start();
// Graceful shutdown
Runtime.getRuntime().addShutdownHook(
new Thread(streams::close)
);
}
}This Kafka Streams example shows two patterns critical for ML pipelines: (1) Stream-table join -- enriching real-time events with slowly-changing user profile data stored as a compacted Kafka topic. This is how you attach user demographics, account age, or segment labels to raw events without hitting an external database on every event. (2) Session windowing -- grouping user actions into sessions defined by 30 minutes of inactivity, then counting events per session. Session-based features (session length, actions per session) are among the most predictive features for engagement and conversion models. The EXACTLY_ONCE_V2 processing guarantee ensures these counts are correct even through failures.
import boto3
import json
import time
from datetime import datetime
# --- Producer: Put records into Kinesis stream ---
kinesis_client = boto3.client(
"kinesis",
region_name="ap-south-1" # Mumbai region for India deployments
)
def put_event_to_kinesis(
stream_name: str,
user_id: str,
event_type: str,
payload: dict
):
"""Write a single event to a Kinesis data stream."""
record = {
"user_id": user_id,
"event_type": event_type,
"event_time": datetime.utcnow().isoformat(),
"payload": payload
}
response = kinesis_client.put_record(
StreamName=stream_name,
Data=json.dumps(record).encode("utf-8"),
PartitionKey=user_id # Ensures ordering per user
)
return response["SequenceNumber"]
def put_batch_to_kinesis(
stream_name: str,
records: list[dict]
):
"""Write up to 500 records in a single batch call."""
kinesis_records = [
{
"Data": json.dumps(record).encode("utf-8"),
"PartitionKey": record["user_id"]
}
for record in records
]
response = kinesis_client.put_records(
StreamName=stream_name,
Records=kinesis_records
)
failed = response["FailedRecordCount"]
if failed > 0:
print(f"Warning: {failed} records failed, retrying...")
# In production: implement exponential backoff retry
return response
# --- Consumer: Read records from Kinesis stream ---
def consume_kinesis_stream(stream_name: str, max_records: int = 100):
"""Read records from all shards of a Kinesis stream."""
response = kinesis_client.describe_stream(
StreamName=stream_name
)
shards = response["StreamDescription"]["Shards"]
for shard in shards:
shard_id = shard["ShardId"]
iterator_response = kinesis_client.get_shard_iterator(
StreamName=stream_name,
ShardId=shard_id,
ShardIteratorType="LATEST"
)
shard_iterator = iterator_response["ShardIterator"]
while shard_iterator:
records_response = kinesis_client.get_records(
ShardIterator=shard_iterator,
Limit=max_records
)
for record in records_response["Records"]:
event = json.loads(
record["Data"].decode("utf-8")
)
yield event
shard_iterator = records_response.get(
"NextShardIterator"
)
time.sleep(0.2) # Respect 5 reads/sec/shard limit
# Usage
seq = put_event_to_kinesis(
stream_name="ml-user-events",
user_id="user_42",
event_type="add_to_cart",
payload={"item_id": "SKU_100", "price_inr": 1499}
)
print(f"Published with sequence: {seq}")This example shows the Kinesis equivalent of Kafka-based ingestion, targeting the ap-south-1 (Mumbai) region for India-based deployments. Key differences from Kafka: (1) Kinesis uses shards instead of partitions, with each shard capped at 1 MB/s ingest and 2 MB/s read -- you'll need to plan shard count based on throughput requirements. (2) The put_records batch API allows up to 500 records per call, which is important for cost optimization since Kinesis charges per PUT payload unit (25 KB each). (3) Consumer-side rate limits (5 reads/sec/shard) require careful pacing, unlike Kafka which allows arbitrary consumer read rates. For production Kinesis consumers, use the Kinesis Client Library (KCL) rather than raw API calls.
# Kafka topic configuration for ML event ingestion
# Create with: kafka-topics.sh --create --topic user-events ...
topic:
name: user-events
partitions: 24 # 24 partitions for parallel consumption
replication-factor: 3 # 3 replicas for durability
configs:
retention.ms: 604800000 # 7 days retention
cleanup.policy: delete # Delete old segments (not compact)
min.insync.replicas: 2 # At least 2 replicas must ACK
compression.type: lz4 # Broker-side compression
max.message.bytes: 1048576 # 1 MB max message size
segment.ms: 3600000 # Roll segments every hour
# Flink job configuration
flink:
parallelism: 8
checkpointing:
interval: 60000 # Checkpoint every 60 seconds
mode: EXACTLY_ONCE
timeout: 300000 # 5 minute checkpoint timeout
min-pause: 30000 # Min 30s between checkpoints
state-backend: rocksdb
state-backend-rocksdb:
incremental: true # Incremental checkpoints save bandwidth
restart-strategy:
type: exponential-delay
initial-backoff: 1s
max-backoff: 60s
backoff-multiplier: 2.0
reset-backoff-threshold: 300sCommon Implementation Mistakes
- ●
Not enforcing schemas from day one: Without a schema registry, producers inevitably ship breaking changes (renamed fields, changed types) that silently corrupt downstream ML features. By the time you notice, weeks of training data may be compromised. Set up Avro + Schema Registry before your first event.
- ●
Using processing time instead of event time for ML features: If you window by processing time (when the system received the event) rather than event time (when the event actually happened), your features will be inconsistent between training (where you backfill from logs with known timestamps) and serving (where processing delays vary). This creates training-serving skew that silently degrades model quality.
- ●
Ignoring consumer lag until it becomes a crisis: Consumer lag -- the gap between the latest produced offset and the latest consumed offset -- is the most important operational metric for streaming pipelines. If it grows faster than you can process, you'll either drop data or blow your SLA. Set up alerts for lag exceeding 5 minutes on critical topics.
- ●
Over-partitioning or under-partitioning Kafka topics: Too few partitions limit parallelism and throughput. Too many partitions increase metadata overhead, checkpoint sizes, and end-to-end latency (Kafka rebalancing slows down). Rule of thumb: start with
max(expected_throughput_MB_s * 3, consumer_parallelism)partitions. For most ML pipelines, 12-64 partitions is the sweet spot. - ●
Treating exactly-once as a magic switch: Enabling
enable.idempotenceon the producer andEXACTLY_ONCE_V2on Kafka Streams is necessary but not sufficient. Your sinks must also be idempotent (e.g., upsert to a feature store keyed by user_id + window_start, not blind appends). Otherwise, you have exactly-once within Kafka but at-least-once end-to-end. - ●
Not planning for schema evolution: Your event schema will change. Plan for backward-compatible changes (adding optional fields, never removing required fields) and test that existing consumers can handle events from both the old and new schema versions simultaneously during rolling deployments.
- ●
Ignoring backpressure signals: When a slow consumer can't keep up, events queue up in the broker. Without proper backpressure handling, this leads to memory exhaustion in the consumer, increased end-to-end latency, and eventually data loss. Flink handles this natively with credit-based flow control; Spark Streaming requires careful
maxOffsetsPerTriggertuning.
When Should You Use This?
Use When
Your ML model needs features computed from events that happened in the last seconds to minutes -- not hours or days. Examples: real-time fraud scoring, dynamic pricing, live recommendation updates.
The business impact of stale features is quantifiable and significant. If a 1-hour delay in feature freshness costs you measurable revenue or increased fraud losses, streaming is justified.
Your data sources are inherently continuous: user clickstreams, IoT sensor telemetry, financial transactions, social media feeds, GPS location updates.
You need to trigger ML inference or alerts in real time based on event patterns -- e.g., detecting anomalous transaction sequences, triggering re-ranking when a user's session behavior shifts.
You have multiple downstream consumers that need the same data at different latencies: a real-time feature store, a data lake for batch training, and a monitoring dashboard all consuming from the same stream.
Your system must handle bursty traffic patterns (e.g., Flipkart Big Billion Days, Zomato NYE peaks) where batch jobs would create unacceptable backlogs.
You need an audit trail or replayability -- the ability to reprocess historical events when you update your feature computation logic or fix a bug.
Avoid When
Your ML models retrain on a daily or weekly schedule and feature freshness measured in hours is perfectly acceptable. Adding streaming complexity for no latency benefit is over-engineering.
Your data arrives in natural batches (e.g., daily CSV exports from a vendor, monthly regulatory reports). Forcing batch data into a streaming paradigm creates unnecessary complexity.
Your team lacks operational experience with distributed systems. Kafka, Flink, and their failure modes require significant expertise. A misconfigured Kafka cluster can lose data or create silent duplicates. Start with batch and graduate to streaming when the business need is clear.
Your data volume is small enough that periodic batch queries (e.g., SQL against a database every 5 minutes) meet latency requirements. For datasets under 1 GB/day with relaxed latency needs, a cron job is simpler, cheaper, and easier to debug.
Budget constraints make the infrastructure costs prohibitive. A minimal Kafka + Flink setup costs ~2,000/month, streaming may not be the right investment yet.
Exactly-once semantics aren't required and data completeness matters more than freshness. Some analytical workloads (e.g., training data for monthly model retrains) benefit from the simplicity of batch ETL with full data validation.
Key Tradeoffs
The Fundamental Tradeoff: Freshness vs. Complexity
Streaming data sources buy you feature freshness at the cost of operational complexity. A batch pipeline has one failure mode: the job fails and you rerun it. A streaming pipeline has a dozen: consumer lag, checkpoint failures, schema evolution breaks, rebalancing storms, backpressure cascades, watermark mis-estimation, and more.
| Dimension | Batch Ingestion | Streaming Ingestion |
|---|---|---|
| Feature freshness | Hours to days | Milliseconds to seconds |
| Operational complexity | Low (cron + retry) | High (broker + processor + monitoring) |
| Infrastructure cost | Pay per job run | Pay continuously (24/7 brokers) |
| Debugging difficulty | Easy (rerun with logs) | Hard (stateful, distributed, temporal) |
| Data completeness | Guaranteed (process all data) | Approximate (late events may be dropped) |
| Team expertise required | SQL + basic ETL | Distributed systems + streaming semantics |
The Second Axis: Cost Structure
Batch pipelines have variable costs: you pay when jobs run. Streaming pipelines have fixed costs: brokers, consumers, and state backends run continuously. For workloads that need processing only during business hours, this 24/7 cost overhead can be significant.
However, for high-throughput continuous workloads (>100 GB/day), streaming can actually be cheaper than batch because you avoid the massive resource spikes of batch processing. Instead of provisioning a cluster large enough to process 24 hours of data in a 2-hour batch window, you process continuously on smaller, steady-state infrastructure.
Practical Guidance: The 80/20 rule applies. Most ML systems should have a batch backbone for training data and a streaming fast path for serving-time features. Don't try to do everything in streaming. Use streaming where freshness has measurable business impact, and batch everywhere else.
Alternatives & Comparisons
Batch data sources process data in discrete, scheduled jobs (daily, hourly). Choose batch when feature freshness measured in hours is acceptable, data arrives in natural batches, or your team lacks streaming expertise. Choose streaming when sub-minute feature freshness has measurable business impact (fraud detection, dynamic pricing, real-time recommendations). Most production systems use both: streaming for serving-time features and batch for training data.
API endpoints provide synchronous, request-response data access -- the consumer pulls data when needed. Streaming is push-based -- the broker delivers data as it arrives. Choose APIs when consumers need data on-demand at low volume. Choose streaming when you need continuous data flow to multiple consumers at high throughput. In ML systems, APIs are often used for on-demand feature lookups while streams feed the feature computation pipeline.
Webhooks are lightweight push notifications triggered by external events (payment completed, PR merged). They're simpler than full streaming but lack durability, ordering, and replay. Choose webhooks for low-volume, external-system notifications. Choose streaming when you need durable, ordered, high-throughput ingestion with exactly-once guarantees. In practice, webhooks are often producers that feed into a streaming pipeline via an adapter service.
Event triggers (AWS EventBridge, Cloud Functions triggers) invoke processing logic in response to discrete events. They're serverless and simpler to operate but limited in stateful processing. Choose event triggers for simple, stateless event routing (e.g., triggering a Lambda on S3 upload). Choose streaming when you need windowed aggregations, stream-table joins, or complex stateful computation across millions of events per second.
Pros, Cons & Tradeoffs
Advantages
Sub-second feature freshness enables ML models to make decisions based on what just happened, not what happened hours ago. This directly improves model accuracy for time-sensitive use cases like fraud detection, dynamic pricing, and real-time personalization.
Decoupled producers and consumers via the message broker allow independent scaling, deployment, and failure isolation. Adding a new downstream consumer (e.g., a new ML model) doesn't require changes to producers.
Replayability -- Kafka retains events for a configurable period (often 7-30 days), allowing consumers to rewind and reprocess historical events when feature logic changes. This is invaluable for backfilling features after bug fixes or model updates.
Natural fit for event-driven architectures where ML inference triggers immediate downstream actions (blocking a fraudulent transaction, updating a recommendation, sending a notification). The event-action loop is native to streaming.
Handles bursty traffic gracefully through buffering in the broker. During Flipkart's Big Billion Days or Zomato's New Year's Eve peaks, the broker absorbs traffic spikes that would overwhelm synchronous systems, letting consumers process at their own pace.
Single source of truth for multiple consumers: the same event stream feeds real-time feature computation, batch training data pipelines, monitoring dashboards, and audit logs. No need to build separate ingestion paths.
Strong ecosystem and tooling maturity -- Kafka has become the de facto standard with widespread adoption. Rich connector ecosystem (Kafka Connect, Debezium for CDC), managed services (Confluent Cloud, AWS MSK, Azure Event Hubs), and deep integration with ML tools (Flink, Feast, Tecton).
Disadvantages
Significant operational complexity -- running Kafka clusters (broker management, partition rebalancing, ISR management, retention policies) and Flink jobs (checkpoint tuning, state backend sizing, parallelism optimization) requires dedicated platform engineering effort.
Higher infrastructure cost compared to batch. A production Kafka + Flink setup costs a minimum of ~$500-1,000/month (INR 42,000-84,000/month) even at modest scale, plus continuous compute costs for stream processors. This is a fixed cost, not variable like batch jobs.
Debugging streaming pipelines is hard. Issues are temporal (happen at specific event-time windows), stateful (depend on accumulated state), and distributed (span multiple operators and machines). Reproducing a bug often requires replaying a specific sequence of events with the exact state snapshot.
Late and out-of-order events require explicit handling via watermarks and allowed-lateness configurations. If watermarks are too aggressive, you lose data. If too conservative, you increase latency. Tuning watermarks correctly requires understanding your data's temporal characteristics.
Training-serving skew risk is amplified. Features computed in streaming (with approximate windowing and potential late-event drops) may differ subtly from the same features computed in batch (with complete data). If you train on batch features but serve with streaming features, model quality can degrade silently.
Schema evolution is harder than in batch. Changing an event schema requires coordinating producers and consumers, potentially running multiple schema versions simultaneously during rolling deployments, and ensuring backward/forward compatibility.
Failure Modes & Debugging
Consumer Lag Spiral
Cause
Consumers process events slower than producers publish them. This can happen due to slow downstream sinks (e.g., a feature store write bottleneck), insufficient consumer parallelism, or unexpectedly increased input throughput (flash sale traffic spikes). Once lag starts growing, it compounds -- the backlog increases memory pressure, which slows processing further.
Symptoms
Consumer group offset falls progressively behind the latest producer offset. End-to-end latency increases from seconds to minutes to hours. Feature freshness degrades. Kafka's records-lag-max metric climbs steadily. In extreme cases, events age out of retention and are lost.
Mitigation
Monitor consumer lag as a primary SLA metric with alerts at multiple thresholds (e.g., warn at 5 minutes, critical at 30 minutes). Scale consumer parallelism by adding partitions and consumer instances. Implement backpressure-aware consumers that signal upstream to throttle. For Flink, ensure checkpoint intervals are shorter than retention periods so you can always recover. For predictable traffic spikes (Diwali sales, IPL matches), pre-scale consumers ahead of time.
Checkpoint Failure Cascade
Cause
Flink checkpoint fails (timeout, state backend full, S3 write failure), causing the job to restart from the last successful checkpoint. If the checkpoint interval is long (e.g., 10 minutes) and throughput is high, the replay volume is massive, causing the next checkpoint to also fail -- creating a restart loop.
Symptoms
Flink job enters repeated restart cycles. Checkpoints consistently fail or timeout. Job manager logs show increasing checkpoint durations. Consumer lag grows with each restart. State backend storage fills up.
Mitigation
Use incremental checkpoints (RocksDB backend) to reduce checkpoint size. Set checkpoint timeouts generously (2-3x the typical checkpoint duration). Configure min-pause-between-checkpoints to prevent checkpoint storms. Monitor checkpoint duration trends -- if they're growing, investigate state size growth. Keep Flink state lean: use TTL on state entries to evict stale data, and avoid storing raw events in state when aggregates suffice.
Schema Poisoning
Cause
A producer publishes events with a schema that is backward-incompatible (e.g., a required field was removed, a field type was changed from string to int). Without strict schema validation at the registry level, these events pass through the broker and cause deserialization failures in every downstream consumer.
Symptoms
Sudden spike in deserialization errors across multiple consumers. Dead letter queue fills up rapidly. Features go stale as valid events get stuck behind the poisoned batch. If the DLQ is not configured, consumers crash and restart repeatedly.
Mitigation
Configure the schema registry with BACKWARD or FULL compatibility mode to reject incompatible schema changes at registration time, before they reach the broker. Implement canary deployments for producer schema changes: deploy to a single producer instance first, verify consumer compatibility, then roll out broadly. Always have a DLQ configured to capture un-processable events without blocking the pipeline.
Partition Skew (Hot Partitions)
Cause
Poor partition key selection results in uneven event distribution. For example, using country as the partition key when 80% of traffic comes from India causes one partition to handle 80% of the load while others sit idle. Similarly, a viral user or popular item can create a "celebrity key" problem.
Symptoms
One or a few partitions accumulate significantly more lag than others. The consumers assigned to hot partitions become bottlenecks while other consumers are underutilized. End-to-end latency varies widely across partition keys.
Mitigation
Choose partition keys with high cardinality and uniform distribution (e.g., user_id or session_id rather than country or category). For known hot keys, use a salted partition key (append a random suffix to spread the hot key across multiple partitions, then reaggregate downstream). Monitor per-partition lag, not just aggregate lag.
Watermark Stall
Cause
An idle partition (no events arriving) prevents the watermark from advancing because Flink's watermark tracks the minimum across all partitions. If one partition goes silent, the global watermark freezes, and no windows close -- even if all other partitions have plenty of data.
Symptoms
Windowed aggregations stop producing output despite active event streams on most partitions. Watermark metrics show the global watermark stuck at a fixed timestamp. Downstream feature store stops receiving updates.
Mitigation
Configure idle source timeout in Flink's watermark strategy: WatermarkStrategy.withIdleness(Duration.ofMinutes(2)). This tells Flink to exclude idle partitions from watermark computation after 2 minutes of inactivity. Also consider enabling Kafka's log.message.timestamp.type=LogAppendTime as a fallback timestamp source for partitions with sparse producers.
Rebalancing Storm
Cause
Frequent consumer group rebalancing triggered by consumer failures, scaling events, or configuration issues (e.g., session.timeout.ms set too low). During rebalancing, consumption pauses on all partitions, causing a lag spike and processing hiccup.
Symptoms
Periodic spikes in consumer lag correlated with rebalancing events. Consumers repeatedly log "revoked partitions" and "assigned partitions" messages. End-to-end latency oscillates with a regular period matching the rebalance frequency.
Mitigation
Use cooperative incremental rebalancing (Kafka 2.4+) instead of eager rebalancing to minimize disruption. Increase session.timeout.ms to 30-45 seconds to tolerate brief consumer pauses without triggering rebalance. Set max.poll.interval.ms appropriately for your processing time. For Flink consumers, ensure sufficient TaskManager slots so scaling events don't cause restarts.
Placement in an ML System
The First Mile of Real-Time ML
The streaming data source is the first component in any real-time ML pipeline. It sits before everything: data validation, feature computation, model serving, and monitoring. If the streaming source fails or falls behind, every downstream component operates on stale or missing data.
In a typical ML system architecture, the streaming source feeds two parallel paths:
-
The fast path (serving): Events flow through Flink for real-time feature computation, writing to an online feature store (Redis, DynamoDB, Tecton). When a prediction request arrives, the serving layer reads these fresh features and feeds them to the model.
-
The slow path (training): The same events are simultaneously written to a data lake (S3, Delta Lake, Apache Iceberg) where they're available for offline batch processing, model training, and feature backfilling.
This Lambda architecture (or its simpler Kappa variant where everything goes through the stream) ensures that training and serving features are derived from the same source events, reducing training-serving skew.
Critical Insight: The streaming data source determines the freshness ceiling of your entire ML system. No downstream component can produce features fresher than what the streaming source delivers. If the source has 5 seconds of latency, your fastest possible feature update is 5+ seconds. Invest in getting this layer right.
Pipeline Stage
Data Ingestion / Feature Engineering
Upstream
- Application servers
- IoT devices
- CDC connectors (Debezium)
- Webhook receivers
- Third-party API adapters
Downstream
- data-validation
- Feature Store (Feast, Tecton)
- Data Lake (S3, Delta Lake)
- Model Serving Layer
- batch-data-source (for training data)
Scaling Bottlenecks
The primary scaling bottleneck for Kafka is partition count -- each partition is an ordered log served by a single broker leader, so write throughput scales linearly with partitions up to the broker's disk I/O or network capacity. A single Kafka broker can typically handle 100-200 MB/s of writes across all partitions.
For the stream processor (Flink), bottlenecks shift to state size and checkpoint overhead. As stateful computations accumulate data (e.g., session windows for millions of users), the state backend (RocksDB) grows, and checkpoints take longer. At Netflix scale (tens of thousands of Flink jobs), checkpoint coordination becomes a system-level concern.
For Kinesis, the bottleneck is more rigid: each shard caps at 1 MB/s ingest, 2 MB/s read, and 1,000 records/s. Scaling requires resharding, which is a managed but non-instant operation. At 1,000 shards (the soft limit), you're at 1 GB/s ingest -- enough for most workloads, but heavy-hitters like ad-tech or telemetry can exceed this.
Some concrete numbers for capacity planning: a 6-broker Kafka cluster with 100 partitions can sustain ~500 MB/s write throughput. At 1 KB per event, that's 500,000 events/second -- enough for a mid-size Indian e-commerce platform. Uber and Netflix operate clusters handling millions of events per second across thousands of topics.
Production Case Studies
Netflix's Keystone pipeline processes 2 trillion messages per day (~3 PB ingested, 7 PB output daily) using Apache Kafka and Flink. Their Real-Time Distributed Graph (RDG) ingests member actions published to Kafka topics (each generating up to ~1 million messages per second), with Flink jobs consuming, enriching, and materializing data into graph storage. Events are encoded in Apache Avro with schemas managed via an internal centralized schema registry.
The real-time streaming pipeline enables Netflix to react to member behavior within seconds, powering personalization, content discovery, and operational dashboards for 260M+ subscribers. The move from batch to streaming reduced feature freshness from hours to sub-second, directly improving recommendation relevance.
Uber uses Kafka and Apache Flink to build scalable streaming pipelines for real-time ML feature generation. These pipelines compute features like surge pricing multipliers, driver-rider matching scores, and fraud risk signals from streaming ride requests and driver availability data. The pipeline handles high volume, intensive computation, and large states, with a target of near-real-time latency under 5 minutes.
Streaming feature pipelines replaced batch ETL for critical ML models, reducing feature staleness from 24 hours to under 5 minutes. This enabled more accurate surge pricing (better supply-demand matching), improved fraud detection (catching fraudulent patterns in real time), and faster driver-rider matching.
LinkedIn processes 4 trillion events daily through their streaming infrastructure, which originated with the creation of Apache Kafka itself. They built a managed stream processing platform using Apache Beam that generates ML features in real time by filtering, processing, and aggregating events emitted to Kafka. The platform reduced the time-to-production for new streaming pipelines from months to days.
Transitioning from offline ML feature generation (24-48 hour delay) to real-time streaming reduced end-to-end pipeline latency to milliseconds/seconds. This significantly improved ML model performance across feed ranking, job recommendations, and ad targeting.
Zomato's event-driven architecture handles 450+ million messages per minute through Kafka clusters achieving a total peak throughput of 5.6 GBps, particularly during New Year's Eve peak traffic. Real-time features for their ML models (ETA prediction, restaurant ranking, fraud detection) are computed via event streams published to Kafka and processed by Apache Flink, as described in their scalable ML engineering blog.
The streaming infrastructure allowed Zomato to maintain sub-second feature freshness even during 10x traffic spikes on New Year's Eve 2023, ensuring accurate ETAs and restaurant recommendations under extreme load -- critical for customer satisfaction during peak ordering times.
Razorpay built a real-time data highway using Kafka streaming for their payment processing platform. A Maxwell daemon detects change data capture (CDC) from databases and pushes updates to Kafka topics, with Spark consumers continuously reading from the stream and batching updates. This powers real-time fraud detection models that score every transaction as it flows through the payment gateway.
The streaming CDC pipeline reduced the latency of fraud feature updates from batch-hourly to near-real-time, enabling Razorpay to block fraudulent transactions within seconds rather than detecting them retroactively. This is critical for a platform processing payments for 10M+ businesses across India.
Tooling & Ecosystem
The de facto standard distributed event streaming platform. Provides a durable, partitioned, replicated commit log with exactly-once semantics via idempotent producers and transactional consumers. Supports millions of events/second per cluster. Kafka 4.0 (2025) runs in KRaft mode by default, eliminating the ZooKeeper dependency.
The gold standard for stateful stream processing. Provides true event-time processing, exactly-once semantics via distributed snapshots (Chandy-Lamport), sophisticated windowing (tumbling, sliding, session), and the richest watermark model in the ecosystem. Used by Netflix, Uber, LinkedIn, and Alibaba at massive scale.
AWS-native managed streaming service with on-demand and provisioned capacity modes. Integrates with Lambda, Flink (via Managed Service for Apache Flink), Firehose, and SageMaker. Each shard provides 1 MB/s in, 2 MB/s out. Best for AWS-native stacks where operational simplicity is prioritized over raw throughput.
Fully managed Kafka service with additional enterprise features: Confluent Schema Registry (Avro, Protobuf, JSON Schema), ksqlDB for SQL-based stream processing, managed connectors, and Stream Governance. Reduces Kafka operational overhead by ~71% compared to self-managed deployments according to Confluent's TCO analysis.
Micro-batch stream processing built on Spark's DataFrame API. Provides exactly-once guarantees, event-time windowing, and watermark support. Best for teams already using Spark for batch who want a unified batch+stream processing framework. Latency is typically 100ms-1s (higher than Flink's true streaming).
Fully managed, serverless event ingestion and delivery system on GCP. Provides at-least-once delivery by default with exactly-once processing available via Dataflow. No partition management required -- scales automatically. Pairs with Google Cloud Dataflow (managed Apache Beam) for stream processing.
Distributed messaging and streaming platform with a unique architecture separating serving (brokers) from storage (Apache BookKeeper). Provides multi-tenancy, geo-replication, and tiered storage natively. Used by Flipkart (as an alternative to Kafka for certain workloads), Tencent, and Yahoo Japan.
Open-source CDC (Change Data Capture) platform built on Kafka Connect. Captures row-level database changes from MySQL, PostgreSQL, MongoDB, and others, and streams them to Kafka topics. Essential for turning database updates into real-time event streams without modifying application code.
Kafka-compatible streaming data platform written in C++ with no JVM, no ZooKeeper, and no garbage collection pauses. Provides P99 latencies under 10ms and simplified operations. Drop-in replacement for Kafka with lower tail latency, gaining traction for latency-sensitive ML workloads.
Research & References
Akidau, Bradshaw, Chambers, Chernyak, Fernandez-Moctezuma, Lax, McVeety, Mills, Perry, Schmidt & Whittle (2015)VLDB Endowment, Vol 8, No 12
The foundational paper on modern stream processing semantics. Introduced the What/Where/When/How framework for reasoning about stream computation, formalized windowing, triggers, and accumulation modes. Underlies Apache Beam and Google Cloud Dataflow.
Carbone, Katsifodimos, Ewen, Markl, Haridi & Tzoumas (2015)IEEE Data Engineering Bulletin, Vol 38, No 4
Describes Flink's unified architecture for stream and batch processing, including its pipelined execution engine, event-time processing, and iterative computation support. Established Flink as the first true stream-first processing engine.
Carbone, Fora, Ewen, Haridi & Tzoumas (2015)arXiv preprint (implemented in Apache Flink)
Extends the Chandy-Lamport distributed snapshot algorithm for streaming dataflows. Introduces checkpoint barriers that flow through the data stream to capture consistent operator state without stopping processing. This is the foundation of Flink's exactly-once fault tolerance.
Kreps, Narkhede & Rao (2011)NetDB Workshop
The original Kafka paper from LinkedIn describing the distributed commit log architecture. Introduced the key insight that a persistent, partitioned log can serve as both a messaging system and a storage system, unifying stream processing and data integration.
Wang, Hsu, Sax, Nishizawa, Dillinger, Laber, Yavuz & Guozhang (2021)ACM SIGMOD 2021
Describes Kafka Streams' approach to exactly-once stream processing using read-process-write cycles translated as transactional log appends. Shows how Kafka's log-centric design simplifies distributed stream processing compared to separate processing frameworks.
Begoli, Akidau, Edmon, Srivastava & Barga (2021)VLDB Endowment, Vol 14, No 12
Provides the first formal definition and comparative analysis of watermarks across two major streaming systems. Critical reading for understanding how event-time completeness is tracked and how watermark implementations differ in practice.
Hesse, Vogel, Weidlich & Markl (2024)arXiv preprint
Benchmarks fault recovery across Kafka Streams, Flink, Spark Streaming, and Spark Structured Streaming. Quantifies recovery time, throughput degradation during recovery, and exactly-once overhead -- essential data for choosing between frameworks.
Interview & Evaluation Perspective
Common Interview Questions
- ●
Design a real-time feature pipeline for a fraud detection system. How would you ingest transaction events and compute features with sub-second latency?
- ●
Explain the difference between event time and processing time. Why does it matter for ML feature computation?
- ●
How does Apache Kafka achieve exactly-once semantics? Walk through the mechanism.
- ●
What are watermarks in stream processing? How do you handle late-arriving events?
- ●
You notice consumer lag growing on a critical Kafka topic feeding your ML feature store. Walk me through your debugging and resolution process.
- ●
Compare Kafka, Kinesis, and Pub/Sub for a streaming ML pipeline. How would you choose?
- ●
How would you design a streaming pipeline that feeds both a real-time feature store and a batch training data lake from the same event source?
- ●
What is backpressure in streaming systems and how do you handle it?
Key Points to Mention
- ●
Always distinguish event time from processing time. ML features must be computed on event time to maintain consistency between training (batch, using logged timestamps) and serving (streaming, using live events). This is the #1 source of training-serving skew.
- ●
Exactly-once is achieved via two complementary mechanisms: idempotent producers (Kafka's producer ID + sequence number deduplication) and transactional consumers (atomic read-process-write cycles in Kafka Streams) or distributed snapshots (Flink's checkpoint barriers based on Chandy-Lamport).
- ●
Watermarks are the system's best estimate of event-time completeness. They trigger window computations. The key tradeoff: aggressive watermarks cause data loss (late events dropped), conservative watermarks increase latency (waiting for stragglers).
- ●
Consumer lag is the primary health metric. Know how to calculate it, alert on it, and resolve it (add partitions + consumers, optimize processing, fix slow sinks).
- ●
For ML specifically, mention the dual-path architecture: streaming for real-time serving features, batch for training data. The event stream serves as the single source of truth for both paths, reducing training-serving skew.
- ●
Discuss partition key selection -- it affects both parallelism (more partitions = more consumers) and data locality (events with the same key are ordered and co-located). For ML, keying by
user_idis the most common pattern.
Pitfalls to Avoid
- ●
Claiming streaming is always better than batch. A senior candidate acknowledges that streaming adds significant complexity and cost, and should only be adopted when feature freshness has measurable business impact. Most ML systems use both.
- ●
Conflating exactly-once within Kafka with exactly-once end-to-end. The sink must also be idempotent (e.g., upsert with deduplication keys, not blind appends) for true end-to-end exactly-once.
- ●
Ignoring the cost dimension. When asked to design a streaming pipeline for a startup, don't propose a 50-broker Kafka cluster. Show that you can reason about cost-effective architectures -- maybe Kinesis On-Demand for an early-stage product, or a 3-broker Kafka cluster on EC2 for a budget-conscious team.
- ●
Forgetting about schema management. A system design without a schema registry is incomplete. Mention Avro/Protobuf and schema evolution from the start.
- ●
Not addressing how you'd handle the transition from batch to streaming. Real-world systems don't flip a switch. Discuss dual-write during migration, shadow mode for validation, and gradual cutover.
Senior-Level Expectation
A senior/staff-level candidate should be able to design the full streaming ML feature pipeline end-to-end: from producer schema design and partition key selection, through broker topology and retention policy, to stream processor configuration (parallelism, checkpointing, watermark strategy, state TTL), all the way to sink design (idempotent writes to feature store, append-only writes to data lake). They should quantify throughput requirements, estimate infrastructure costs (in both USD and INR for India-context interviews), discuss failure modes proactively (consumer lag, checkpoint failures, schema poisoning, partition skew), and explain monitoring strategy (consumer lag alerts, checkpoint duration trending, end-to-end latency SLOs). The ability to discuss training-serving skew mitigation -- how streaming features during serving can diverge from batch features during training, and how to detect and minimize this -- separates staff-level thinking from senior-level. Bonus points for discussing Kappa architecture (single streaming path for both batch and real-time) vs. Lambda architecture (dual path) and when each is appropriate.
Summary
Wrapping It All Up
A streaming data source is the real-time ingestion backbone of modern ML systems. It replaces the batch paradigm's inherent staleness with continuous, sub-second data flow -- enabling ML models to make decisions based on what is happening right now, not what happened hours ago. The core technologies are Apache Kafka (the dominant message broker, processing trillions of messages daily at companies like LinkedIn and Netflix), Apache Flink (the gold standard for stateful stream processing with exactly-once guarantees), and managed alternatives like Amazon Kinesis and Google Cloud Pub/Sub for teams that prefer operational simplicity.
The key concepts every ML engineer must internalize are: event time vs. processing time (always compute features on event time to avoid training-serving skew), exactly-once semantics (achieved via idempotent producers + transactional consumers in Kafka, or distributed snapshots in Flink), watermarks (the system's assertion of event-time completeness, critical for triggering windowed aggregations), and backpressure (the mechanism that prevents fast producers from overwhelming slow consumers). Getting these right determines whether your streaming pipeline delivers fresh, accurate features or silently corrupts your ML models.
In practice, most production ML systems use a dual-path architecture: streaming for real-time serving features (sub-second freshness) and batch for training data (completeness over speed), with the event stream serving as the single source of truth for both paths. The streaming data source sits at the very beginning of this architecture -- it determines the freshness ceiling for everything downstream. For Indian companies, a production setup (Kafka + Flink + Redis feature store) starts at approximately INR 67,000-1.26 lakh/month ($800-1,500/month), making it accessible to mid-stage startups with clear real-time requirements. The investment is justified when feature freshness has measurable business impact: faster fraud detection, more accurate dynamic pricing, more relevant real-time recommendations.