Shared Memory in Machine Learning
When multiple AI agents collaborate on a task, they need a way to share context, intermediate results, and evolving state. Shared memory is the architectural pattern that makes this possible -- it provides a centralized or distributed data structure that all agents in a multi-agent system can read from and write to during execution.
Think about a team of software engineers working on a complex feature. They don't each work in isolation -- they share a codebase, leave comments for each other, update tickets, and maintain a shared understanding of the project's current state. Shared memory serves the same purpose for AI agents: it is the common ground where agents coordinate, exchange partial results, and build toward a collective solution.
This pattern has deep roots in classical AI. The blackboard architecture, first formalized in the 1970s for the HEARSAY-II speech recognition system, established the foundational idea: a shared workspace (the "blackboard") where independent specialist modules post and consume information, coordinated by a control mechanism. Today, that same principle powers LangGraph's StateGraph, CrewAI's shared context, and Redis-backed state stores in production multi-agent deployments.
Shared memory is arguably the most critical infrastructure decision in any multi-agent system. Get it wrong, and your agents will talk past each other, overwrite each other's work, or drown in stale context. Get it right, and you unlock emergent collaboration that no single agent could achieve alone.
Concept Snapshot
- What It Is
- A centralized or distributed data store that enables multiple AI agents to read, write, and synchronize shared state during collaborative task execution.
- Category
- Multi-Agent Systems
- Complexity
- Advanced
- Inputs / Outputs
- Inputs: agent observations, intermediate results, tool outputs, conversation history. Outputs: synchronized state accessible by all agents, conflict-resolved updates, checkpointed snapshots.
- System Placement
- Sits at the center of a multi-agent orchestration layer, connecting agent nodes, the orchestrator/router, and external tools or knowledge stores.
- Also Known As
- blackboard, shared state, common workspace, agent scratchpad, collaborative memory, shared context window, global state store
- Typical Users
- ML Engineers, AI Platform Engineers, Backend Engineers, Solutions Architects, AI Agent Developers
- Prerequisites
- Multi-agent system fundamentals, Distributed systems concepts (consistency, concurrency), State management patterns, Basic understanding of LLM agents, Key-value stores (Redis, Memcached)
- Key Terms
- blackboard architecturestate graphcheckpointerconflict resolutionCRDTvector clockevent sourcingoptimistic concurrencyreducer functionthread-level persistence
Why This Concept Exists
The Fundamental Problem: Agent Isolation
A single AI agent operates in its own context window -- it sees its prompt, its conversation history, and its tool outputs. That's sufficient for isolated tasks. But the moment you orchestrate multiple agents to collaborate -- a researcher agent gathering data, an analyst agent interpreting it, a writer agent composing a report -- you hit a wall: how does agent B know what agent A discovered?
Without shared memory, the only option is to pass everything through the orchestrator, which becomes a bottleneck. The orchestrator must serialize all communication, maintain the complete history, and decide what each agent needs to see. For two agents, this is manageable. For five agents working in parallel on a complex task, it quickly becomes intractable.
The Blackboard Architecture: Where It All Started
The blackboard architecture emerged from the HEARSAY-II speech recognition project at Carnegie Mellon University in the 1970s. The core insight was elegant: instead of having specialist modules (knowledge sources) communicate directly with each other in a tangled web of point-to-point connections, give them a shared workspace -- the blackboard -- where they can post partial solutions and react to each other's contributions.
The architecture has three components:
- The blackboard: a global, hierarchical data structure holding the evolving solution state
- Knowledge sources: independent specialist modules that monitor the blackboard and contribute partial solutions
- A control mechanism: decides which knowledge source to activate next based on the current blackboard state
This pattern maps remarkably well to modern LLM-based multi-agent systems. Knowledge sources become specialized agents (researcher, coder, reviewer), the blackboard becomes a shared state object (LangGraph's StateGraph, a Redis store, or an in-memory dictionary), and the control mechanism becomes the orchestrator or router agent.
Why It Became Critical in 2024-2025
Three developments made shared memory essential rather than optional:
1. Multi-agent frameworks matured. LangGraph, CrewAI, AutoGen, and OpenAI's Swarm moved from research curiosities to production tools. Each required a principled way to manage shared state across agent boundaries.
2. Agent tasks grew more complex. Simple chain-of-thought prompting gave way to multi-step workflows involving tool use, web search, code execution, and human-in-the-loop reviews. Each step generates state that downstream agents need.
3. Long-running agent sessions became common. Agents that run for minutes or hours -- processing large datasets, iterating on code, or conducting multi-turn negotiations -- need persistent state that survives failures and can be checkpointed for recovery.
Key Insight: Shared memory is not just a convenience -- it is the mechanism that transforms a collection of independent agents into a coherent collaborative system. Without it, multi-agent systems are just multiple single agents running in parallel with no coordination.
Core Intuition & Mental Model
The Whiteboard in the War Room
Imagine a crisis response center. Multiple specialists -- logistics, medical, communications, intelligence -- are working the same incident. In the center of the room stands a large whiteboard. Anyone can walk up and write new information, update existing entries, or read what others have posted. The whiteboard is not owned by any single specialist -- it belongs to the room, to the mission.
That whiteboard is shared memory. Each specialist (agent) has their own expertise and their own local notes, but the whiteboard is where coordination happens. When the intelligence analyst writes "bridge at grid reference X is destroyed," the logistics planner immediately sees it and re-routes supply convoys without anyone explicitly telling them to. The information propagates through the shared medium.
Now, there are rules. You don't erase someone else's entry without good reason. If two people try to update the same section simultaneously, they talk it out (conflict resolution). And periodically, someone photographs the whiteboard (checkpointing) so they can reconstruct the state if something goes wrong.
What Shared Memory Is NOT
Shared memory is not the same as message passing. In message-passing architectures, agents send discrete messages to each other -- think of it like email. The sender decides what to share and who to share it with. In shared memory, all agents have access to the same global state -- think of it like a shared Google Doc where everyone can see everything (with appropriate access controls).
Both patterns are valid, and production systems often combine them. But the distinction matters because shared memory provides implicit coordination -- agents can react to state changes they didn't explicitly ask for -- while message passing requires explicit coordination where every piece of information must be deliberately sent.
The Fundamental Tension
Here is the tension that makes shared memory design interesting: you want agents to have access to everything they need while not being overwhelmed by everything that exists. A shared memory store with 50,000 tokens of accumulated state is useless if an agent's context window is 8,000 tokens. The art of shared memory design lies in maintaining a rich global state while providing each agent with a relevant, focused view of that state.
Technical Foundations
Formal Model of Shared Memory in Multi-Agent Systems
Let us define a multi-agent system where:
- is a set of agents
- is the shared memory state space
- is the set of possible transitions (read/write operations)
- is the conflict resolution policy
State Representation
The shared memory at time is a structured object composed of typed fields:
where is a key (field name), is the value (which may be a scalar, list, or nested structure), and is a logical timestamp or version vector.
State Transitions
Each agent can perform two operations on shared memory:
Read: -- agent reads the current value of key
Write: -- agent updates key to value , producing a new state
Reducer Functions (LangGraph Model)
In frameworks like LangGraph, state updates are mediated by reducer functions for each key :
where is the update proposed by an agent. Common reducers include:
- Last-writer-wins:
- Append: (concatenation for lists)
- Merge: (set union for dictionaries)
- Custom: Application-specific logic (e.g., summarize-and-replace for conversation history)
Conflict Detection with Vector Clocks
For concurrent writes by agents and to the same key , we use vector clocks where is agent 's logical counter. Two writes conflict if and only if:
meaning neither write causally precedes the other. The conflict resolution policy then determines the outcome -- last-writer-wins, semantic merge, or escalation to the orchestrator.
Consistency Models
Shared memory systems choose among consistency guarantees:
- Strong consistency: All agents see the same state at all times. Cost: high latency due to synchronization barriers.
- Eventual consistency: Agents may temporarily see different states, but all converge. Cost: requires conflict resolution.
- Causal consistency: If agent 's write depends on agent 's write, all agents see them in that order. A practical middle ground.
For most LLM-based multi-agent systems, causal consistency provides the best balance -- agents see causally related updates in order while allowing independent updates to propagate asynchronously.
Internal Architecture
A production shared memory system for multi-agent AI consists of five interacting layers: the state schema defining the structure of shared data, the access layer providing read/write APIs to agents, the conflict resolution engine handling concurrent updates, the persistence backend storing state durably, and the view layer projecting relevant subsets of state to individual agents.
The architecture follows a hub-and-spoke model where shared memory sits at the center and agents connect as spokes. The orchestrator typically mediates access, though agents can also read/write directly in decentralized designs (the blackboard pattern).

The View Projector is a critical component that is often overlooked. When shared state grows large -- which it inevitably does in long-running multi-agent workflows -- you cannot dump the entire state into every agent's context window. The view projector selects, summarizes, or retrieves the most relevant portions of state for each agent based on its role, current task, and context window budget. This is where shared memory intersects with RAG: the view projector may use vector similarity search over the state store to find the most relevant entries for a given agent query.
Key Components
State Schema / State Graph
Defines the typed structure of shared state -- which fields exist, their types, and their reducer functions. In LangGraph, this is the TypedDict or Pydantic model passed to StateGraph. It acts as the contract between all agents: everyone agrees on what the shared memory looks like.
Access Layer (Read/Write API)
Provides the interface through which agents interact with shared memory. Supports atomic reads, atomic writes, and batch operations. May enforce access control policies -- e.g., agent A can write to research_findings but only read from final_report. In Redis-backed systems, this maps to GET/SET/HSET operations with optional Lua scripting for atomicity.
Conflict Resolution Engine
Handles concurrent writes to the same key by multiple agents. Implements strategies like last-writer-wins (simple but lossy), append-only (preserves all contributions), CRDT-based merge (mathematically guaranteed convergence), or semantic merge (uses an LLM to intelligently combine conflicting updates). The choice of strategy depends on the data type and the tolerance for information loss.
Checkpointer / Persistence Backend
Periodically snapshots the shared state for durability and fault recovery. LangGraph's RedisSaver, PostgresSaver, and SqliteSaver are canonical implementations. Enables time-travel debugging -- the ability to roll back to any previous state and replay execution from that point. Also supports human-in-the-loop workflows by persisting state while waiting for human approval.
View Projector / Context Assembler
Projects a relevant subset of shared state into each agent's context window. Uses strategies like recency filtering (last N entries), relevance filtering (vector similarity), role-based filtering (only show fields relevant to this agent's role), and summarization (compress older state into summaries). Prevents context window overflow in long-running sessions.
Event Bus / Notification Layer
Notifies agents when relevant portions of shared state change. Implements publish-subscribe patterns so agents can react to state updates without polling. In Redis, this maps to SUBSCRIBE/PUBLISH or Redis Streams. Enables reactive multi-agent patterns where agents wake up and act when relevant state changes occur.
Data Flow
- Agent completes a step and produces an update (e.g.,
{"research_findings": [new_finding]}) - The update passes through the access layer, which validates types and permissions
- The conflict resolution engine checks for concurrent writes using version vectors or timestamps
- The reducer function for the target key merges the update with existing state (e.g., append to list)
- The checkpointer persists the new state snapshot to the backend (Redis, PostgreSQL, etc.)
- The event bus notifies subscribed agents of the state change
- Agent requests state (either full state or a specific key)
- The view projector determines what subset of state is relevant for this agent
- If the full state exceeds the agent's context budget, the projector applies summarization or vector-based retrieval
- The filtered view is returned to the agent as part of its next prompt
- On failure or timeout, the orchestrator retrieves the last checkpoint from the persistence backend
- The state is restored to the checkpointed version
- The failed agent (or a replacement) is re-invoked with the restored state
- Execution continues from the checkpoint rather than restarting from scratch
A hub-and-spoke architecture where three agents (Researcher, Analyst, Writer) connect to a central Shared Memory Layer containing a State Store, Conflict Resolver, Checkpointer, and View Projector. The Shared Memory Layer connects to persistence backends (Redis, PostgreSQL, Vector Store). An Orchestrator/Router also connects to the State Store. The View Projector sends filtered views back to each agent.
How to Implement
Implementation Spectrum
Shared memory implementations range from simple in-process dictionaries to distributed, persistent state stores. The right choice depends on three factors:
- Number of agents and concurrency level: Two sequential agents? A Python dictionary suffices. Ten parallel agents? You need proper concurrency control.
- Session duration: Sub-minute workflows can use ephemeral memory. Hour-long sessions need persistence and checkpointing.
- Failure tolerance: If losing state mid-execution is acceptable (prototyping), in-memory is fine. If agents process expensive operations (API calls costing money), you need durable checkpointing.
Framework-Native vs. Custom
Most teams start with their framework's built-in shared state -- LangGraph's StateGraph with reducers, CrewAI's shared context, or AutoGen's conversation history. These work well for standard patterns. However, production systems frequently outgrow framework defaults and require custom solutions: Redis for sub-millisecond access, PostgreSQL for durable state with SQL queryability, or vector stores for semantic retrieval over accumulated state.
Cost Note: A Redis instance on AWS ElastiCache with 6.1 GB memory (cache.r6g.large) costs approximately 0.50-2.00 (~INR 42-168) in API calls alone. Checkpointing to Redis pays for itself after preventing a few hundred failures per month.
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.redis import RedisSaver
import operator
# Define shared state schema with reducer functions
class SharedMemory(TypedDict):
# append reducer: each agent's findings are accumulated
research_findings: Annotated[list[str], operator.add]
# append reducer: conversation history grows
messages: Annotated[list[dict], operator.add]
# last-writer-wins: only the latest summary matters
current_summary: str
# append reducer: action items from all agents
action_items: Annotated[list[str], operator.add]
# metadata for coordination
iteration_count: int
# Agent functions that read/write shared state
def researcher_agent(state: SharedMemory) -> dict:
"""Reads existing findings, adds new ones."""
existing = state.get("research_findings", [])
# Agent uses existing findings to avoid duplicates
new_finding = f"Finding #{len(existing) + 1}: Market size is $4.2B"
return {
"research_findings": [new_finding],
"messages": [{"role": "researcher", "content": new_finding}],
}
def analyst_agent(state: SharedMemory) -> dict:
"""Reads research findings, produces analysis."""
findings = state.get("research_findings", [])
analysis = f"Analysis of {len(findings)} findings: growth trend detected"
return {
"current_summary": analysis,
"messages": [{"role": "analyst", "content": analysis}],
"action_items": ["Validate growth trend with secondary sources"],
}
def writer_agent(state: SharedMemory) -> dict:
"""Reads summary and findings, writes final output."""
summary = state.get("current_summary", "")
findings = state.get("research_findings", [])
report = f"Report: Based on {len(findings)} findings. {summary}"
return {
"messages": [{"role": "writer", "content": report}],
"iteration_count": state.get("iteration_count", 0) + 1,
}
# Build the graph with shared memory
graph = StateGraph(SharedMemory)
graph.add_node("researcher", researcher_agent)
graph.add_node("analyst", analyst_agent)
graph.add_node("writer", writer_agent)
graph.add_edge("researcher", "analyst")
graph.add_edge("analyst", "writer")
graph.add_edge("writer", END)
graph.set_entry_point("researcher")
# Compile with Redis checkpointer for persistence
checkpointer = RedisSaver(redis_url="redis://localhost:6379")
app = graph.compile(checkpointer=checkpointer)
# Run with thread-level persistence
config = {"configurable": {"thread_id": "session-001"}}
result = app.invoke(
{"research_findings": [], "messages": [], "action_items": [], "iteration_count": 0},
config=config,
)
print(result["current_summary"])
print(f"Total action items: {len(result['action_items'])}")This example demonstrates LangGraph's shared memory pattern. The SharedMemory TypedDict defines the schema, and Annotated types with operator.add specify append reducers -- when multiple agents write to research_findings, their contributions are concatenated rather than overwritten. The current_summary field uses last-writer-wins (no reducer annotation). The RedisSaver checkpointer persists state after each node execution, enabling recovery if any agent fails. The thread_id in the config ensures each conversation session has its own isolated state.
import redis
import json
import time
import hashlib
from typing import Any, Optional
from dataclasses import dataclass, field
@dataclass
class StateEntry:
value: Any
version: int
author: str
timestamp: float
checksum: str = ""
class SharedMemoryStore:
"""Production shared memory with optimistic concurrency control."""
def __init__(self, redis_url: str = "redis://localhost:6379", namespace: str = "agent"):
self.redis = redis.from_url(redis_url, decode_responses=True)
self.ns = namespace
def _key(self, session_id: str, field: str) -> str:
return f"{self.ns}:{session_id}:{field}"
def read(self, session_id: str, field: str) -> Optional[StateEntry]:
"""Read a field from shared memory."""
raw = self.redis.get(self._key(session_id, field))
if raw is None:
return None
data = json.loads(raw)
return StateEntry(**data)
def write(
self,
session_id: str,
field: str,
value: Any,
author: str,
expected_version: Optional[int] = None,
) -> bool:
"""Write with optimistic concurrency control.
Returns True if write succeeded, False if version conflict.
Uses Redis WATCH/MULTI for atomic check-and-set.
"""
key = self._key(session_id, field)
with self.redis.pipeline() as pipe:
try:
pipe.watch(key)
current_raw = pipe.get(key)
current_version = 0
if current_raw:
current = json.loads(current_raw)
current_version = current["version"]
# Optimistic concurrency check
if expected_version is not None and current_version != expected_version:
return False # Conflict detected
new_entry = StateEntry(
value=value,
version=current_version + 1,
author=author,
timestamp=time.time(),
checksum=hashlib.sha256(
json.dumps(value, sort_keys=True).encode()
).hexdigest()[:16],
)
pipe.multi()
pipe.set(key, json.dumps(new_entry.__dict__))
pipe.execute()
# Publish state change event for reactive agents
self.redis.publish(
f"{self.ns}:events:{session_id}",
json.dumps({"field": field, "author": author, "version": new_entry.version}),
)
return True
except redis.WatchError:
return False # Another agent modified the key
def append_to_list(
self, session_id: str, field: str, items: list, author: str
) -> int:
"""Atomic append to a list field. Returns new list length."""
key = self._key(session_id, field)
# Use Lua script for atomic read-modify-write
lua_script = """
local current = redis.call('GET', KEYS[1])
local data
if current then
data = cjson.decode(current)
else
data = {value={}, version=0, author='', timestamp=0, checksum=''}
end
local new_items = cjson.decode(ARGV[1])
for _, item in ipairs(new_items) do
table.insert(data.value, item)
end
data.version = data.version + 1
data.author = ARGV[2]
data.timestamp = tonumber(ARGV[3])
redis.call('SET', KEYS[1], cjson.encode(data))
return #data.value
"""
return self.redis.eval(
lua_script, 1, key, json.dumps(items), author, str(time.time())
)
def checkpoint(self, session_id: str) -> str:
"""Snapshot all fields for a session. Returns checkpoint ID."""
pattern = f"{self.ns}:{session_id}:*"
keys = list(self.redis.scan_iter(match=pattern))
snapshot = {}
for key in keys:
field_name = key.split(":")[-1]
snapshot[field_name] = self.redis.get(key)
checkpoint_id = f"cp:{session_id}:{int(time.time())}"
self.redis.set(checkpoint_id, json.dumps(snapshot), ex=86400) # 24h TTL
return checkpoint_id
def restore(self, checkpoint_id: str) -> bool:
"""Restore shared memory from a checkpoint."""
raw = self.redis.get(checkpoint_id)
if not raw:
return False
snapshot = json.loads(raw)
# Extract session_id from checkpoint_id
session_id = checkpoint_id.split(":")[1]
for field_name, value in snapshot.items():
key = self._key(session_id, field_name)
self.redis.set(key, value)
return True
# Usage example
store = SharedMemoryStore(redis_url="redis://localhost:6379")
# Agent 1 writes research findings
store.append_to_list("session-42", "findings", ["GDP growth at 7.2%"], "researcher")
# Agent 2 reads and writes analysis (with version check)
entry = store.read("session-42", "findings")
if entry:
store.write("session-42", "analysis",
f"Analyzed {len(entry.value)} findings",
author="analyst", expected_version=None)
# Checkpoint before risky operation
cp_id = store.checkpoint("session-42")
print(f"Checkpointed: {cp_id}")This production-grade implementation uses Redis WATCH/MULTI for optimistic concurrency control -- if two agents try to write the same key simultaneously, only one succeeds and the other gets a False return indicating a conflict. The append_to_list method uses a Lua script for atomic read-modify-write, ensuring list appends are safe under concurrent access. The checkpoint and restore methods enable fault recovery by snapshotting all session state. The publish/subscribe pattern on state changes allows reactive agents to wake up when relevant state is modified.
from typing import Any, Callable
from dataclasses import dataclass, field
import threading
import time
@dataclass
class BlackboardEntry:
key: str
value: Any
source_agent: str
confidence: float
timestamp: float = field(default_factory=time.time)
class Blackboard:
"""Classical blackboard architecture for multi-agent coordination."""
def __init__(self):
self._state: dict[str, BlackboardEntry] = {}
self._subscribers: dict[str, list[Callable]] = {}
self._lock = threading.RLock()
self._history: list[tuple[str, BlackboardEntry]] = []
def write(self, key: str, value: Any, source: str, confidence: float = 1.0):
"""Post information to the blackboard."""
with self._lock:
entry = BlackboardEntry(
key=key, value=value, source_agent=source,
confidence=confidence, timestamp=time.time(),
)
# Conflict resolution: higher confidence wins
if key in self._state:
existing = self._state[key]
if existing.confidence > confidence:
return # Existing entry has higher confidence
self._state[key] = entry
self._history.append(("write", entry))
# Notify subscribers outside the lock
for callback in self._subscribers.get(key, []):
callback(entry)
def read(self, key: str) -> BlackboardEntry | None:
"""Read from the blackboard."""
with self._lock:
return self._state.get(key)
def read_all(self, prefix: str = "") -> dict[str, BlackboardEntry]:
"""Read all entries matching a prefix."""
with self._lock:
return {
k: v for k, v in self._state.items()
if k.startswith(prefix)
}
def subscribe(self, key: str, callback: Callable):
"""Subscribe to changes on a specific key."""
self._subscribers.setdefault(key, []).append(callback)
def get_history(self, last_n: int = 10) -> list[tuple[str, BlackboardEntry]]:
"""Get recent history for debugging / time-travel."""
return self._history[-last_n:]
# Define knowledge sources (agents)
class KnowledgeSource:
def __init__(self, name: str, blackboard: Blackboard):
self.name = name
self.bb = blackboard
class ResearchAgent(KnowledgeSource):
def act(self, query: str):
# Simulate research
result = f"Research on '{query}': Found 3 relevant papers"
self.bb.write(f"research:{query}", result, self.name, confidence=0.85)
self.bb.write("status", f"{self.name} completed research", self.name)
class AnalysisAgent(KnowledgeSource):
def act(self):
# Read all research entries from the blackboard
research = self.bb.read_all(prefix="research:")
analysis = f"Synthesized {len(research)} research entries"
self.bb.write("analysis:summary", analysis, self.name, confidence=0.90)
class QualityAgent(KnowledgeSource):
def act(self):
# Review analysis quality
analysis = self.bb.read("analysis:summary")
if analysis and analysis.confidence < 0.95:
self.bb.write(
"review:needed", True, self.name, confidence=1.0
)
# Orchestrate via the blackboard
bb = Blackboard()
researcher = ResearchAgent("researcher-1", bb)
analyst = AnalysisAgent("analyst-1", bb)
reviewer = QualityAgent("reviewer-1", bb)
# Agents interact through the shared blackboard
researcher.act("transformer architectures")
researcher.act("multi-agent memory")
analyst.act()
reviewer.act()
# Inspect the blackboard state
for key, entry in bb._state.items():
print(f"[{entry.source_agent}] {key}: {entry.value} (conf: {entry.confidence})")This implements the classical blackboard architecture where agents (knowledge sources) interact exclusively through a shared workspace. The confidence-based conflict resolution means that when two agents write to the same key, the higher-confidence write wins. The subscribe mechanism enables reactive behavior -- agents can register callbacks that fire when specific blackboard keys change, eliminating the need for polling. The history log supports time-travel debugging. This pattern is especially powerful when agents have overlapping capabilities and the system needs to select the best contribution.
from mem0 import Memory
# Initialize Mem0 with shared memory configuration
config = {
"llm": {
"provider": "openai",
"config": {"model": "gpt-4o-mini", "temperature": 0.1},
},
"embedder": {
"provider": "openai",
"config": {"model": "text-embedding-3-small"},
},
"vector_store": {
"provider": "qdrant",
"config": {"collection_name": "agent_memory", "host": "localhost", "port": 6333},
},
}
memory = Memory.from_config(config)
# Agent 1: Customer research agent stores findings
memory.add(
"Customer prefers budget smartphones under INR 15,000 with good camera",
user_id="customer-123",
agent_id="research-agent",
metadata={"category": "preference", "confidence": 0.92},
)
memory.add(
"Customer previously purchased Redmi Note 12 in March 2025",
user_id="customer-123",
agent_id="research-agent",
metadata={"category": "history", "confidence": 0.99},
)
# Agent 2: Recommendation agent reads shared memory
relevant_memories = memory.search(
"What phones should I recommend to this customer?",
user_id="customer-123",
limit=5,
)
for mem in relevant_memories["results"]:
print(f"Memory: {mem['memory']} (score: {mem['score']:.2f})")
# Agent 3: Follow-up agent adds new information
memory.add(
"Customer mentioned interest in 5G connectivity for Jio network",
user_id="customer-123",
agent_id="followup-agent",
metadata={"category": "preference", "confidence": 0.88},
)
# All agents now have access to the combined memory
all_memories = memory.get_all(user_id="customer-123")
print(f"Total shared memories for customer: {len(all_memories['results'])}")This example uses Mem0 for cross-session shared memory -- memories persist across different agent invocations and even different sessions. Unlike the stateful approaches above, Mem0 provides semantic retrieval over accumulated memories using vector search. Each agent contributes memories tagged with its agent_id, and any agent can search the shared memory pool using natural language queries. This is particularly powerful for customer-facing multi-agent systems (like an e-commerce assistant at Flipkart or Swiggy) where different specialized agents need access to a unified customer memory across interactions.
# LangGraph + Redis shared memory configuration
langgraph:
state_type: SharedMemory
checkpointer:
backend: redis
redis_url: redis://localhost:6379
key_prefix: "agent:state:"
ttl_seconds: 86400 # 24 hours
reducers:
messages: append # list concatenation
findings: append # list concatenation
summary: replace # last-writer-wins
metadata: merge # dict merge
view_projection:
max_tokens_per_agent: 4000
strategy: recency_and_relevance
summarize_after: 20 # summarize state after 20 entries
conflict_resolution:
strategy: optimistic_concurrency
max_retries: 3
retry_backoff_ms: 50Common Implementation Mistakes
- ●
Dumping entire state into every agent's prompt: When shared memory accumulates thousands of tokens, naively including all of it in each agent's context window wastes tokens and degrades performance. Use view projection or summarization to give each agent only the state it needs.
- ●
Using last-writer-wins for all fields: This is the default in many frameworks, but it silently discards concurrent updates. For list-type data (findings, action items, messages), always use an append reducer. Reserve last-writer-wins for fields where only the latest value matters (current status, final summary).
- ●
No checkpointing for long-running workflows: If your multi-agent workflow takes more than 30 seconds and makes expensive API calls, running without checkpointing means any failure forces a full restart. A single Redis checkpoint adds <1ms of latency but can save minutes of re-computation and dollars in API costs.
- ●
Ignoring state schema evolution: As your multi-agent system evolves, the shared memory schema changes -- new fields are added, old ones become obsolete. Without versioning, loading a checkpoint from an older schema version will crash or produce silent data corruption. Always version your state schema.
- ●
Treating shared memory as a message queue: Shared memory is for state, not for events. If agents need to send discrete messages to each other ("please review this"), use a proper message/event system alongside shared memory. Conflating the two leads to complex, brittle state management.
When Should You Use This?
Use When
Multiple agents need to collaborate on a shared task and must see each other's intermediate results -- this is the primary use case
Your multi-agent workflow runs for more than a few seconds and needs fault recovery via checkpointing
Agents have different specializations but need a unified view of the evolving solution (the classic blackboard pattern)
You need human-in-the-loop approval steps where the system pauses, persists state, and resumes after human input
Cross-session memory is required -- e.g., a customer support system where different agents serve the same customer across multiple conversations
You are building a system where agents work in parallel and their results must be merged (e.g., parallel research agents each investigating different sources)
The cost of re-running a failed multi-agent workflow exceeds the cost of maintaining shared state infrastructure
Avoid When
Your system has a single agent -- shared memory adds unnecessary complexity when there is nothing to share
Agents are fully independent and their outputs do not depend on each other's results -- use simple parallel execution with result aggregation instead
The workflow is short-lived (< 5 seconds) and cheap to re-run -- the overhead of persistence and conflict resolution is not justified
You need strict message ordering guarantees between agents -- use message queues (Kafka, RabbitMQ) instead of shared state
Your agents operate on completely different data domains with no overlap -- separate state stores per agent are simpler and avoid unnecessary coupling
Regulatory requirements mandate that certain agents cannot access each other's data (e.g., PII isolation) -- shared memory makes isolation harder to enforce
Key Tradeoffs
Consistency vs. Performance
Strong consistency (every agent always sees the latest state) requires synchronization barriers that add latency. In a system with 5 agents writing concurrently, strong consistency can add 5-15ms per write due to locking or consensus protocols. Eventual consistency allows agents to proceed without waiting but introduces the risk of stale reads. For most LLM-based multi-agent systems, causal consistency is the sweet spot -- agents see causally related updates in order but don't block on unrelated updates.
| Consistency Model | Latency Overhead | Data Freshness | Complexity | Best For |
|---|---|---|---|---|
| Strong | High (5-15ms/write) | Always current | Medium | Financial, safety-critical |
| Causal | Medium (1-3ms/write) | Causally ordered | High | Most multi-agent workflows |
| Eventual | Low (<1ms/write) | May be stale | Low | High-throughput, loss-tolerant |
Centralized vs. Distributed State
Centralized shared memory (single Redis instance) is simpler to reason about but creates a single point of failure and a potential bottleneck. Distributed shared memory (Redis Cluster, CRDTs) provides higher availability and throughput but introduces complexity in conflict resolution and network partitioning. For most multi-agent AI systems with fewer than 50 concurrent agents, a single Redis instance with replication is sufficient and dramatically simpler.
Memory Footprint vs. Context Quality
Keeping all historical state gives agents maximum context but can overwhelm context windows and increase costs. A 100-step multi-agent workflow might accumulate 50,000+ tokens of state. Aggressive pruning saves tokens but risks losing critical context. The practical approach is tiered memory: keep the last N steps in full detail, summarize older steps, and use vector retrieval for long-term memory.
Cost Comparison for an Indian Startup: Running a Redis-backed shared memory system for a 10-agent customer support bot costs approximately INR 10,900/month (500-1,000) in wasted API calls if even 500 workflows fail per month. The math strongly favors investing in proper shared memory infrastructure.
Alternatives & Comparisons
A per-agent memory store gives each agent its own isolated memory -- useful when agents don't need to coordinate or when privacy boundaries must be enforced. Choose per-agent memory when agents are independent; choose shared memory when agents must collaborate on a common task. In practice, many systems use both: per-agent memory for private reasoning and shared memory for coordination.
LangGraph's edge-based message passing sends data explicitly from one node to another, while shared memory (StateGraph) provides implicit access to all state. Message passing gives tighter control over what each agent sees but requires the graph designer to anticipate all information flows. Shared memory is more flexible -- agents can discover relevant state they weren't explicitly given.
A cache layer (Redis, Memcached) stores computed results for fast retrieval, typically with TTL-based expiration. Shared memory is semantically richer -- it maintains structured, versioned state with conflict resolution and checkpointing. However, many shared memory implementations use a cache layer as their backend. Think of it this way: a cache is a building material, shared memory is the building.
Vector stores enable semantic retrieval over embedded content, while shared memory provides structured state management. They are complementary: shared memory holds the current working state (structured, keyed), while a vector store can serve as the long-term memory backend for semantic search over historical state. Systems like Mem0 combine both patterns.
Pros, Cons & Tradeoffs
Advantages
Enables emergent collaboration -- agents can react to state changes they weren't explicitly told about, leading to coordination patterns that emerge from the shared medium rather than being hard-coded in the orchestrator
Fault tolerance through checkpointing -- persisting shared state to Redis or PostgreSQL means multi-agent workflows can recover from failures without restarting from scratch, saving both time and money on API calls
Decouples agent communication -- agents don't need to know about each other's existence; they interact through the shared state. This makes it easy to add, remove, or replace agents without changing the communication topology
Supports human-in-the-loop workflows -- checkpointed state allows the system to pause for human review, persist state indefinitely, and resume exactly where it left off. Critical for enterprise AI deployments with approval requirements
Reduces token consumption -- instead of passing the full conversation history to every agent, shared memory with view projection sends only relevant state, cutting prompt token usage by 30-60% in typical multi-agent workflows
Enables time-travel debugging -- checkpointed state history lets developers inspect the exact state at any point in execution, replay from earlier states, and diagnose issues in complex multi-agent interactions
Disadvantages
Concurrency complexity -- multiple agents writing to shared state simultaneously requires careful conflict resolution. Getting this wrong leads to lost updates, stale reads, or deadlocks. This is the #1 source of bugs in shared memory systems
State schema rigidity -- once agents depend on a shared state schema, changing it requires coordinating all agents and migrating checkpointed state. Schema evolution is an under-discussed operational burden
Context window overflow risk -- shared state grows over time. Without active pruning, summarization, or view projection, agents eventually receive more context than they can process, degrading output quality
Single point of failure (centralized) -- a centralized shared memory store (single Redis instance) is a SPOF. Outages halt all agents simultaneously. Mitigate with replication, but this adds operational complexity
Debugging difficulty -- when five agents write to the same state concurrently, tracing the source of a bad state update requires detailed logging and version tracking. Without this instrumentation, shared memory bugs are notoriously hard to reproduce
Latency overhead -- every state read/write adds network latency when using an external store (1-5ms for Redis). For latency-sensitive applications, this overhead multiplied by many agent steps can be significant
Failure Modes & Debugging
Lost Update (Write-Write Conflict)
Cause
Two agents read the same field, compute updates independently, and write back -- the second write overwrites the first without incorporating its changes. This is the classic lost update problem from database theory, and it is the most common failure in shared memory systems.
Symptoms
Agent A's contributions silently disappear from shared state. The final output is missing information. Quality degrades intermittently -- only when specific agents happen to write concurrently.
Mitigation
Use optimistic concurrency control with version checks (as shown in the Redis implementation above). For list-type fields, use append-only reducers instead of overwrite. For complex fields, implement CRDT-based merge or semantic merge with an LLM arbiter.
Context Window Overflow
Cause
Shared state grows unboundedly over a long multi-agent session. Eventually, serializing the full state exceeds individual agents' context window limits (8K-128K tokens depending on the model).
Symptoms
Agents start producing low-quality outputs because critical context is truncated. Token costs spike. In extreme cases, API calls fail with context_length_exceeded errors.
Mitigation
Implement a tiered memory strategy: (1) Keep the last N state updates in full detail, (2) Summarize older updates using an LLM, (3) Store historical state in a vector store for semantic retrieval. Set hard token budgets per agent in the view projector. Monitor total state size and alert when it exceeds 70% of the smallest agent's context window.
Stale Read Leading to Incorrect Decision
Cause
Under eventual consistency, an agent reads a state field that has been updated by another agent but the update hasn't propagated yet. The agent makes a decision based on outdated information.
Symptoms
Agents produce contradictory outputs. The final result contains inconsistencies. For example, a writer agent drafts a conclusion based on outdated analysis while the analyst has already revised its findings.
Mitigation
For critical fields, use read-after-write consistency -- after an agent writes, ensure subsequent reads by dependent agents see the update. In Redis, this means reading from the primary (not replicas) for state fields in the critical path. Alternatively, use causal dependency tracking so the orchestrator doesn't invoke agent B until agent A's writes are visible.
Checkpoint Corruption or Missing Checkpoint
Cause
The checkpointer fails silently (disk full, Redis OOM, network partition) and the system continues without a valid checkpoint. When a failure occurs, there is nothing to recover from.
Symptoms
After a failure, the system restarts from the beginning instead of the last checkpoint. Expensive API calls are wasted. In the worst case, the system enters an infinite restart loop if the failure is deterministic and no checkpoint exists before the failure point.
Mitigation
Implement checkpoint verification -- after writing a checkpoint, read it back and validate its integrity (checksum). Set up monitoring on checkpoint write success rate. Maintain at least 3 recent checkpoints (not just the latest) so corruption of one doesn't lose all recovery points. Alert on checkpoint age exceeding expected intervals.
Memory Leak from Unbounded History
Cause
Every state update is appended to history without cleanup. Over days or weeks of continuous operation, the state store's memory usage grows without bound.
Symptoms
Redis memory usage grows linearly over time. Eventually, Redis hits its maxmemory limit and begins evicting keys (potentially including active session state). Performance degrades as serialization/deserialization of large state objects slows down.
Mitigation
Implement TTL-based expiration on session state (e.g., 24 hours for inactive sessions). Use compaction to periodically merge old state entries into summaries. Set Redis maxmemory-policy to volatile-lru so only keys with TTL are evicted, never active session state. Monitor memory growth rate and set alerts at 70% capacity.
Deadlock from Circular Dependencies
Cause
Agent A waits for agent B's state update before proceeding, while agent B waits for agent A's update. Neither can make progress. This is more common in systems with fine-grained locking on shared state fields.
Symptoms
The multi-agent workflow hangs indefinitely. No agents produce output. CPU usage is low but wall-clock time grows. Timeout mechanisms eventually kill the session, losing all progress.
Mitigation
Use lock-free data structures (CRDTs, append-only logs) instead of locks. If locks are necessary, enforce a global lock ordering -- all agents acquire locks in the same order. Implement deadlock detection with a timeout: if an agent hasn't progressed in N seconds, forcibly release its locks and retry. The LangGraph pattern of sequential node execution with shared state naturally avoids deadlocks because only one node writes at a time.
Placement in an ML System
Position in the ML System
Shared memory sits at the orchestration layer of a multi-agent ML system -- between the agent orchestrator (which decides which agent runs next) and the individual agents (which read state, perform reasoning, and write results back).
In a LangGraph-based system, shared memory is the StateGraph's state object that flows through every node. The orchestrator controls the graph traversal; shared memory carries the accumulated context.
In a CrewAI-based system, shared memory is the implicit shared context that the crew maintains across task executions. Each task can read previous task outputs and write its own results.
For enterprise deployments in India -- think Razorpay's agentic payment flows, Swiggy's multi-agent order management, or Flipkart's AI shopping assistant -- shared memory typically connects to a Redis cluster for low-latency state access, with PostgreSQL as the durable checkpoint store and a vector store (Qdrant, Pinecone) for long-term semantic memory.
Architectural Principle: Shared memory is the connective tissue of multi-agent systems. It sits at the center, touching everything. This centrality is both its power (everything can coordinate through it) and its risk (a failure here affects everything). Design accordingly.
Pipeline Stage
Orchestration / Serving
Upstream
- agent-orchestrator
- langgraph-node
- tool-executor
Downstream
- memory-store
- vector-store
- cache-layer
- output-formatter
Scaling Bottlenecks
The primary bottleneck is write throughput under concurrent access. A single Redis instance handles approximately 100,000 operations/second, which comfortably supports 50-100 concurrent agent sessions with frequent state updates. Beyond that, you need Redis Cluster or a distributed state store.
The second bottleneck is state serialization size. As shared state grows, serializing and deserializing it for each agent invocation becomes CPU-intensive. A 50KB state object serialized and deserialized 10 times per agent step (for 10 agents) means 5MB of JSON processing per step -- measurable at scale.
The third bottleneck is checkpoint storage growth. If you checkpoint every agent step across thousands of concurrent sessions, storage can grow rapidly. At 50KB per checkpoint, 10 steps per session, and 10,000 sessions/day, that's ~5 GB/day of checkpoint data. Budget for TTL-based cleanup.
Some concrete numbers: a production multi-agent customer support system serving 1,000 concurrent sessions, each with 5 agents and 10 steps, generates approximately 50,000 state operations per minute. A single cache.r6g.large Redis instance on AWS (INR 10,900/month) handles this with ~50% headroom.
Production Case Studies
Google DeepMind's research on scaling agent systems evaluated multi-agent configurations across GPT, Gemini, and Claude model families. Their findings revealed that shared state management is a critical determinant of whether multi-agent systems boost or degrade performance. Configurations where agents could effectively share intermediate reasoning through a common workspace consistently outperformed isolated agent pipelines.
Multi-agent systems with properly designed shared state achieved significant improvements on complex reasoning tasks, while poorly configured shared memory led to unexpected performance degradation -- reinforcing that shared memory design, not just agent count, determines system effectiveness.
Razorpay partnered with NPCI and OpenAI to build agentic payments on ChatGPT, enabling AI-assisted UPI transactions. The system uses multiple specialized agents (intent recognition, payment validation, fraud detection) that share state through a common memory layer. The shared memory maintains transaction context, user preferences, and security constraints that all agents must respect across the payment flow.
The agentic payments system processes UPI transactions within preset spending limits while maintaining shared security context across all participating agents. The shared memory architecture ensures that the fraud detection agent's risk assessments are immediately visible to the payment execution agent.
Swiggy launched MCP (Model Context Protocol) integrations across Food, Instamart (grocery), and Dineout, enabling AI agents from ChatGPT, Claude, and Gemini to place orders on behalf of users. The multi-agent architecture uses shared context to maintain user preferences, dietary restrictions, past order history, and delivery location across different service verticals -- a food ordering agent and a grocery agent both need access to the user's address and dietary preferences.
The shared context architecture enables seamless cross-vertical experiences -- a user can ask an AI agent to order dinner and groceries in a single conversation, with shared memory ensuring consistency across the food and grocery agents.
Redis developed the langgraph-checkpoint-redis package providing production-grade shared memory for LangGraph-based multi-agent systems. The integration offers both thread-level persistence (RedisSaver for within-conversation state) and cross-thread memory (RedisStore for state that persists across conversations). This became the reference architecture for shared memory in LangGraph deployments.
Sub-millisecond read/write latency for agent state, enabling real-time multi-agent coordination. The checkpoint system enables time-travel debugging and human-in-the-loop workflows. Adopted as the recommended production backend for LangGraph shared memory.
Tooling & Ecosystem
LangGraph's StateGraph is the most widely used shared memory implementation for LLM-based multi-agent systems. Provides typed state schemas, reducer functions for conflict-free updates, and pluggable checkpointers (Redis, PostgreSQL, SQLite). The Annotated type with reducers elegantly solves the concurrent write problem for common data types.
Official Redis checkpointer and store for LangGraph. Provides RedisSaver for thread-level persistence and RedisStore for cross-thread shared memory. Sub-millisecond latency for state operations. The recommended production backend for LangGraph shared memory.
Universal memory layer for AI agents with 41,000+ GitHub stars. Provides cross-session shared memory with automatic memory extraction, conflict resolution (forgets outdated information), and semantic retrieval via vector search. Integrates natively with CrewAI, LangGraph, and Flowise. Backed by Y Combinator and Peak XV Partners (India).
Context engineering platform with temporal knowledge graph memory. Zep's Graphiti engine maintains relationships between memories over time, enabling agents to understand how shared knowledge evolves. Achieves 18.5% accuracy improvement over baselines on the LongMemEval benchmark while reducing latency by 90%.
The de facto standard for low-latency shared state. Provides sub-millisecond read/write, pub/sub for reactive agents, Lua scripting for atomic operations, and built-in data structures (lists, sets, hashes, streams) that map naturally to multi-agent state patterns. Redis Streams is particularly useful for event sourcing in agent systems.
Multi-agent orchestration framework with built-in shared context. Provides short-term, long-term, and entity memory with Chroma/Qdrant backends. Supports up to 100+ concurrent agent workflows through optimized task scheduling. Shared context flows implicitly through the crew's task pipeline.
Framework for multi-agent coordination using Anthropic's Model Context Protocol. Enables shared memory across specialized agents working in parallel on different aspects of a project. Uses MCP as the universal interface between agents and shared state.
Research & References
Hang Gao, Yongfeng Zhang (2024)arXiv preprint
Proposed a framework integrating real-time memory filtering, storage, and retrieval to enable memory sharing among multiple LLM agents. Demonstrated that shared memory significantly enhances in-context learning and reduces redundant computation across agent teams.
Zhuoran Jin, Pengfei Cao, et al. (2025)arXiv preprint
Introduced a framework for multi-user, multi-agent environments with asymmetric access controls maintaining two memory tiers: private memory and shared memory. Addresses the critical challenge of information asymmetry when agents with different permission levels share a common memory space.
Kai Yan, Jiatong Li, et al. (2025)arXiv preprint
Formalized the LbMAS (LLM-based Multi-Agent Blackboard System) where a central shared blackboard stores all agent-generated messages and intermediate inferences. Using a single public blackboard reduces per-agent prompt lengths and alleviates the memory bottleneck.
Yu Huang, Huan Yee Koh, et al. (2025)arXiv preprint
Demonstrated that the blackboard architecture consistently outperforms strong baselines including RAG and master-slave multi-agent frameworks, achieving 13-57% relative improvement in end-to-end data science problem solving. Autonomous agents volunteer to respond to shared blackboard postings based on their capabilities.
Kolby Nottingham, Bodhisattwa Prasad Majumder, et al. (2025)arXiv preprint
Proposed LatentMAS, an end-to-end latent collaboration framework where LLM agents share information through latent working memory stored in KV-caches rather than natural language text. Agents transfer hidden states layer-wise, enabling more efficient information sharing than explicit text-based communication.
Deshraj Yadav, Taranjeet Singh, et al. (2025)arXiv preprint
Presented Mem0's scalable memory architecture that dynamically extracts, consolidates, and retrieves information across multi-session agent dialogues. The graph-enhanced variant captures complex relational structures, delivering 26% accuracy boost and 91% lower p95 latency compared to full-context baselines.
Daniel Chalef, Preston Rasmussen, et al. (2025)arXiv preprint
Introduced Zep's temporal knowledge graph engine (Graphiti) for agent memory that dynamically synthesizes unstructured conversational data and structured business data while maintaining historical relationships. Achieves 18.5% accuracy improvement on LongMemEval while reducing latency by 90%.
Interview & Evaluation Perspective
Common Interview Questions
- ●
How would you design a shared memory system for a multi-agent customer support bot with 10 specialized agents?
- ●
What are the tradeoffs between shared memory and message passing in multi-agent systems?
- ●
How do you handle concurrent writes from multiple agents to the same state field?
- ●
Explain the blackboard architecture and how it applies to modern LLM-based multi-agent systems.
- ●
How would you implement checkpointing for a long-running multi-agent workflow that costs $5 per run in API calls?
- ●
What happens when shared state grows beyond what fits in an agent's context window? How do you solve this?
- ●
How would you design shared memory for a system where some agents should not see certain state fields (access control)?
Key Points to Mention
- ●
The blackboard architecture (1970s HEARSAY-II) is the foundational pattern -- shared workspace + knowledge sources + control mechanism. This maps directly to modern LLM multi-agent systems (shared state + specialized agents + orchestrator).
- ●
Reducer functions (from LangGraph) solve the concurrent write problem elegantly: append for lists, replace for scalars, merge for dicts. Always choose the right reducer for each field type -- last-writer-wins on a list field is almost always a bug.
- ●
View projection is critical for production systems -- you cannot dump 50K tokens of shared state into every agent's 8K context window. Use recency filtering, role-based filtering, and vector-based retrieval to give each agent a focused view.
- ●
Checkpointing with Redis or PostgreSQL enables fault recovery, time-travel debugging, and human-in-the-loop workflows. The cost of checkpointing (<1ms per step) is negligible compared to the cost of re-running failed workflows.
- ●
Optimistic concurrency control (version checks + retry on conflict) is preferred over pessimistic locking for multi-agent systems because agents spend most of their time reasoning (LLM calls), not writing state.
Pitfalls to Avoid
- ●
Describing shared memory as just 'a dictionary that agents share' -- interviewers want to hear about concurrency control, consistency models, conflict resolution, and persistence strategies
- ●
Ignoring the context window constraint -- a shared memory design that doesn't address how state is projected into individual agent prompts is incomplete
- ●
Proposing locks for every write operation -- pessimistic locking creates unnecessary contention in agent systems where writes are infrequent relative to reasoning time
- ●
Forgetting fault tolerance -- if your design loses all state on a single failure, it's not production-ready
- ●
Not discussing cost -- quantify the tradeoff between checkpointing cost and re-run cost. This is what separates senior candidates from mid-level ones.
Senior-Level Expectation
A senior/staff-level candidate should demonstrate end-to-end design thinking: (1) State schema design with typed fields and appropriate reducers for each field, (2) Consistency model selection with justification (strong vs. causal vs. eventual), (3) Conflict resolution strategy appropriate to the data types, (4) View projection strategy that respects context window budgets, (5) Persistence and checkpointing design with specific backend choices and cost analysis, (6) Monitoring and observability -- how to track state growth, conflict rates, checkpoint success rates, and agent coordination health. The candidate should be able to sketch the architecture on a whiteboard, estimate costs in INR/USD for an Indian deployment on AWS, and discuss how the design evolves from 2 agents to 20 agents. Bonus points for discussing CRDTs, event sourcing, or the CAP theorem as they apply to multi-agent shared state.
Summary
Shared memory is the architectural pattern that transforms a collection of independent AI agents into a coordinated team. At its core, it provides a centralized or distributed data structure -- the modern equivalent of the 1970s blackboard architecture -- where multiple agents read, write, and react to a shared evolving state.
The key engineering challenges are concurrent write handling (solved via reducer functions, optimistic concurrency control, or CRDTs), context window management (solved via tiered memory with view projection and summarization), and fault tolerance (solved via checkpointing to Redis or PostgreSQL). Production systems like LangGraph's StateGraph with RedisSaver provide batteries-included implementations, while tools like Mem0 and Zep extend shared memory with cross-session persistence and temporal knowledge graphs.
For teams building multi-agent systems in production -- whether it's Razorpay's agentic payments, Swiggy's AI-powered ordering, or an internal enterprise workflow -- the shared memory layer is the most critical infrastructure decision after the choice of LLM itself. The costs are modest (INR 10,000-45,000/month for a complete Redis + PostgreSQL setup), the failure modes are well-understood (lost updates, context overflow, checkpoint corruption), and the ROI is clear: proper shared memory with checkpointing prevents costly re-runs, enables human-in-the-loop workflows, and unlocks the emergent coordination that makes multi-agent systems genuinely more capable than single agents. Design your shared memory carefully -- it is the foundation everything else rests on.