What is shared memory in multi-agent AI systems?

Shared memory is a centralized or distributed data structure that allows multiple AI agents to read and write common state during collaborative task execution. Think of it as a shared whiteboard in a conference room -- every agent can see what others have written, add their own contributions, and react to changes. In practice, shared memory takes different forms depending on the framework: LangGraph uses a typed `StateGraph` object with reducer functions, CrewAI maintains implicit shared context across tasks, and custom systems often use Redis or PostgreSQL as the backing store. The key distinction from message passing is that shared memory provides **implicit coordination** -- agents can discover relevant state without explicitly being told about it. The concept traces back to the **blackboard architecture** from the 1970s, where independent specialist modules (knowledge sources) posted partial solutions to a shared workspace. Modern LLM-based multi-agent systems implement the same pattern, just with LLMs as the knowledge sources and Redis/PostgreSQL as the blackboard.

How do you handle concurrent writes from multiple agents?

Concurrent writes are the central challenge in shared memory design. There are several strategies, each with different tradeoffs: **1. Reducer Functions (LangGraph approach)**: Define a merge strategy per field. Lists use append (contributions are concatenated), scalars use last-writer-wins, dicts use merge. This is the simplest and most common approach because LangGraph serializes node execution within a single step. **2. Optimistic Concurrency Control**: Each state entry has a version number. An agent reads the current version, computes its update, and writes only if the version hasn't changed. If it has, the agent retries. This works well when conflicts are rare -- which they usually are in agent systems because agents spend most of their time on LLM calls, not writing state. **3. CRDTs (Conflict-free Replicated Data Types)**: Mathematical structures that guarantee convergence without coordination. G-Counters for incrementing counts, OR-Sets for set membership, LWW-Registers for simple values. These are powerful but complex and have memory overhead from storing history. **4. Semantic Merge**: Use an LLM to intelligently merge conflicting writes. For example, if two agents update a summary, an arbiter LLM can synthesize both summaries into a coherent whole. This is expensive but produces the best results for natural language state.

How does LangGraph's StateGraph implement shared memory?

LangGraph's `StateGraph` is the most popular shared memory implementation for LLM agents. Here's how it works: You define a state schema as a `TypedDict` (or Pydantic model) where each field represents a piece of shared state. Fields can be annotated with **reducer functions** that determine how concurrent updates are merged -- `operator.add` for list concatenation, custom functions for complex merges, or no annotation for last-writer-wins. Each node (agent) in the graph receives the current state as input and returns a partial update dictionary. LangGraph applies the reducer to merge the update into the existing state. After each node executes, the updated state is passed to the next node(s) in the graph. For persistence, you compile the graph with a **checkpointer** (`RedisSaver`, `PostgresSaver`, `SqliteSaver`). The checkpointer saves the complete state after every node execution, enabling fault recovery, time-travel debugging, and human-in-the-loop workflows. The `thread_id` in the runtime config isolates state across different conversations/sessions. The beauty of this design is that agents don't need to know about each other -- they only need to know the state schema. Adding a new agent means adding a new node that reads and writes state, without changing existing agents.

What is the blackboard architecture and why is it relevant to modern AI?

The **blackboard architecture** was invented at Carnegie Mellon University in the 1970s for the HEARSAY-II speech recognition system. It has three components: 1. **The blackboard**: A hierarchical shared workspace holding the current solution state 2. **Knowledge sources**: Independent specialist modules that monitor the blackboard and contribute partial solutions 3. **A control mechanism**: Opportunistically decides which knowledge source to activate next This is remarkably relevant to modern multi-agent LLM systems because it maps directly: knowledge sources become specialized LLM agents (researcher, coder, reviewer), the blackboard becomes shared state (LangGraph's StateGraph, Redis), and the control mechanism becomes the orchestrator/router. Recent research (2025) has formally validated this connection. The **LbMAS framework** demonstrated that using a shared blackboard reduces per-agent prompt lengths (by avoiding redundant context) and achieved 13-57% improvement over master-slave architectures on data science tasks. The blackboard pattern is particularly powerful when agents have overlapping capabilities -- instead of the orchestrator trying to assign tasks to the "right" agent, agents can self-select based on what they see on the blackboard.

How much does shared memory infrastructure cost for a production multi-agent system in India?

Let's break down realistic costs for a production deployment on AWS Mumbai (ap-south-1): **Redis (AWS ElastiCache)**: - `cache.t3.small` (1.5 GB): ~INR 2,500/month ($30) -- suitable for prototyping with <100 concurrent sessions - `cache.r6g.large` (6.1 GB): ~INR 10,900/month ($130) -- handles 1,000+ concurrent sessions with sub-millisecond latency - `cache.r6g.xlarge` (13 GB) with replica: ~INR 27,000/month ($320) -- production-grade with HA **PostgreSQL (for durable checkpoints, AWS RDS)**: - `db.t3.medium`: ~INR 5,800/month ($69) -- sufficient for checkpoint storage for most workloads **Vector Store (for semantic memory, Qdrant Cloud)**: - 1 GB storage tier: ~INR 5,000/month ($60) **Total production setup**: INR 20,000-45,000/month ($240-540) depending on scale. Compare this to the cost of NOT having shared memory: if a 10-step multi-agent workflow using GPT-4o costs INR 42-84 ($0.50-1.00) per run, and you have 500 failures per month that require full re-runs, that's INR 21,000-42,000 ($250-500) in wasted API calls. Shared memory with checkpointing pays for itself purely in failure recovery savings, not counting the improved agent coordination quality.

How do I prevent shared memory from growing unbounded and overwhelming agent context windows?

This is one of the most important practical challenges. Here's a tiered approach: **Tier 1 -- Hot State (in full detail)**: Keep the last N state updates (typically 3-5 per agent) in their complete form. These are the most recent and likely most relevant entries. Size budget: 2,000-4,000 tokens. **Tier 2 -- Warm State (summarized)**: State older than the hot window gets summarized by an LLM into a concise recap. "The researcher found 12 papers on transformer architectures, focusing on attention mechanisms and efficiency improvements" is much cheaper than 12 individual finding entries. Size budget: 500-1,000 tokens. **Tier 3 -- Cold State (vector-retrievable)**: Historical state is embedded and stored in a vector store. When an agent needs specific historical context, the view projector performs semantic search to retrieve the most relevant entries. Size budget: 0 tokens in the default prompt, retrieved on demand. Implementation-wise, set a `max_tokens_per_agent` budget in your view projector. When the total state exceeds this budget, automatically trigger Tier 2 summarization. Use token counting (via `tiktoken`) to enforce budgets precisely. Some teams also implement role-based filtering -- a writer agent doesn't need to see raw research data, just the analysis summary.

Shared memory vs. message passing: when should I use each?

**Use shared memory when**: - Agents need access to the evolving global state, not just direct messages from other agents - The coordination pattern is implicit -- agents react to state changes rather than receiving explicit instructions - You need fault recovery (checkpointing shared state is simpler than replaying message history) - The number of agents may change dynamically -- shared memory decouples agents from each other **Use message passing when**: - Communication is point-to-point with clear sender and receiver (e.g., "reviewer sends feedback to coder") - You need strict ordering guarantees for messages - Agents process events/messages in a stream-like fashion - You want to minimize coupling -- each agent sees only the messages directed to it **In practice, combine both**: Use shared memory for the evolving solution state and message passing for discrete coordination signals ("please review this", "approved", "needs revision"). LangGraph's design reflects this -- the `StateGraph` provides shared memory, while edges provide the message passing (directing control flow between agents). A common anti-pattern is trying to use shared memory as a message queue (checking a "messages" field in a loop) or trying to use message passing to reconstruct global state (forwarding accumulated context through a chain of messages). Each pattern has its sweet spot -- use both.

How do tools like Mem0 and Zep differ from framework-level shared memory (LangGraph StateGraph)?

They operate at different abstraction levels and serve different purposes: **LangGraph StateGraph** provides **within-session shared memory** -- state that exists for the duration of a single workflow execution (or persists via checkpointing). It's structured (typed fields with reducers), synchronous, and tightly integrated with the execution graph. Think of it as working memory -- what agents need right now. **Mem0** provides **cross-session semantic memory** -- memories that persist across different conversations and can be retrieved via natural language search. Mem0 automatically extracts salient information from conversations, resolves conflicts (forgets outdated information), and uses vector search for retrieval. Think of it as long-term memory -- what agents learned from past interactions. **Zep** adds **temporal knowledge graphs** -- not just storing memories but maintaining relationships between them over time. Zep's Graphiti engine tracks how knowledge evolves, enabling queries like "what did we know about this customer before their last purchase?" Think of it as episodic memory -- understanding how knowledge changed. Production systems often layer all three: LangGraph StateGraph for hot coordination state, Mem0 for persistent user/entity memory, and a vector store for long-term retrieval. This mirrors how human teams work -- you have your current whiteboard (LangGraph), your team wiki (Mem0), and your searchable archive (vector store).

Multi-Agent

Shared Memory in Machine Learning

Q: How do I prevent shared memory from growing unbounded and overwhelming agent context windows?

This is one of the most important practical challenges. Here's a tiered approach: **Tier 1 -- Hot State (in full detail)**: Keep the last N state updates (typically 3-5 per agent) in their complete form. These are the most recent and likely most relevant entries. Size budget: 2,000-4,000 tokens. **Tier 2 -- Warm State (summarized)**: State older than the hot window gets summarized by an LLM into a concise recap. "The researcher found 12 papers on transformer architectures, focusing on attention mechanisms and efficiency improvements" is much cheaper than 12 individual finding entries. Size budget: 500-1,000 tokens. **Tier 3 -- Cold State (vector-retrievable)**: Historical state is embedded and stored in a vector store. When an agent needs specific historical context, the view projector performs semantic search to retrieve the most relevant entries. Size budget: 0 tokens in the default prompt, retrieved on demand. Implementation-wise, set a `max_tokens_per_agent` budget in your view projector. When the total state exceeds this budget, automatically trigger Tier 2 summarization. Use token counting (via `tiktoken`) to enforce budgets precisely. Some teams also implement role-based filtering -- a writer agent doesn't need to see raw research data, just the analysis summary.

Q: Shared memory vs. message passing: when should I use each?

**Use shared memory when**: - Agents need access to the evolving global state, not just direct messages from other agents - The coordination pattern is implicit -- agents react to state changes rather than receiving explicit instructions - You need fault recovery (checkpointing shared state is simpler than replaying message history) - The number of agents may change dynamically -- shared memory decouples agents from each other **Use message passing when**: - Communication is point-to-point with clear sender and receiver (e.g., "reviewer sends feedback to coder") - You need strict ordering guarantees for messages - Agents process events/messages in a stream-like fashion - You want to minimize coupling -- each agent sees only the messages directed to it **In practice, combine both**: Use shared memory for the evolving solution state and message passing for discrete coordination signals ("please review this", "approved", "needs revision"). LangGraph's design reflects this -- the `StateGraph` provides shared memory, while edges provide the message passing (directing control flow between agents). A common anti-pattern is trying to use shared memory as a message queue (checking a "messages" field in a loop) or trying to use message passing to reconstruct global state (forwarding accumulated context through a chain of messages). Each pattern has its sweet spot -- use both.

Q: How do tools like Mem0 and Zep differ from framework-level shared memory (LangGraph StateGraph)?

They operate at different abstraction levels and serve different purposes: **LangGraph StateGraph** provides **within-session shared memory** -- state that exists for the duration of a single workflow execution (or persists via checkpointing). It's structured (typed fields with reducers), synchronous, and tightly integrated with the execution graph. Think of it as working memory -- what agents need right now. **Mem0** provides **cross-session semantic memory** -- memories that persist across different conversations and can be retrieved via natural language search. Mem0 automatically extracts salient information from conversations, resolves conflicts (forgets outdated information), and uses vector search for retrieval. Think of it as long-term memory -- what agents learned from past interactions. **Zep** adds **temporal knowledge graphs** -- not just storing memories but maintaining relationships between them over time. Zep's Graphiti engine tracks how knowledge evolves, enabling queries like "what did we know about this customer before their last purchase?" Think of it as episodic memory -- understanding how knowledge changed. Production systems often layer all three: LangGraph StateGraph for hot coordination state, Mem0 for persistent user/entity memory, and a vector store for long-term retrieval. This mirrors how human teams work -- you have your current whiteboard (LangGraph), your team wiki (Mem0), and your searchable archive (vector store).

When multiple AI agents collaborate on a task, they need a way to share context, intermediate results, and evolving state. Shared memory is the architectural pattern that makes this possible -- it provides a centralized or distributed data structure that all agents in a multi-agent system can read from and write to during execution.

Think about a team of software engineers working on a complex feature. They don't each work in isolation -- they share a codebase, leave comments for each other, update tickets, and maintain a shared understanding of the project's current state. Shared memory serves the same purpose for AI agents: it is the common ground where agents coordinate, exchange partial results, and build toward a collective solution.

This pattern has deep roots in classical AI. The blackboard architecture, first formalized in the 1970s for the HEARSAY-II speech recognition system, established the foundational idea: a shared workspace (the "blackboard") where independent specialist modules post and consume information, coordinated by a control mechanism. Today, that same principle powers LangGraph's StateGraph, CrewAI's shared context, and Redis-backed state stores in production multi-agent deployments.

Shared memory is arguably the most critical infrastructure decision in any multi-agent system. Get it wrong, and your agents will talk past each other, overwrite each other's work, or drown in stale context. Get it right, and you unlock emergent collaboration that no single agent could achieve alone.

Concept Snapshot

What It Is: A centralized or distributed data store that enables multiple AI agents to read, write, and synchronize shared state during collaborative task execution.
Category: Multi-Agent Systems
Complexity: Advanced
Inputs / Outputs: Inputs: agent observations, intermediate results, tool outputs, conversation history. Outputs: synchronized state accessible by all agents, conflict-resolved updates, checkpointed snapshots.
System Placement: Sits at the center of a multi-agent orchestration layer, connecting agent nodes, the orchestrator/router, and external tools or knowledge stores.
Also Known As: blackboard, shared state, common workspace, agent scratchpad, collaborative memory, shared context window, global state store
Typical Users: ML Engineers, AI Platform Engineers, Backend Engineers, Solutions Architects, AI Agent Developers
Prerequisites: Multi-agent system fundamentals, Distributed systems concepts (consistency, concurrency), State management patterns, Basic understanding of LLM agents, Key-value stores (Redis, Memcached)
Key Terms: blackboard architecturestate graphcheckpointerconflict resolutionCRDTvector clockevent sourcingoptimistic concurrencyreducer functionthread-level persistence

Why This Concept Exists

The Fundamental Problem: Agent Isolation

A single AI agent operates in its own context window -- it sees its prompt, its conversation history, and its tool outputs. That's sufficient for isolated tasks. But the moment you orchestrate multiple agents to collaborate -- a researcher agent gathering data, an analyst agent interpreting it, a writer agent composing a report -- you hit a wall: how does agent B know what agent A discovered?

Without shared memory, the only option is to pass everything through the orchestrator, which becomes a bottleneck. The orchestrator must serialize all communication, maintain the complete history, and decide what each agent needs to see. For two agents, this is manageable. For five agents working in parallel on a complex task, it quickly becomes intractable.

The Blackboard Architecture: Where It All Started

The blackboard architecture emerged from the HEARSAY-II speech recognition project at Carnegie Mellon University in the 1970s. The core insight was elegant: instead of having specialist modules (knowledge sources) communicate directly with each other in a tangled web of point-to-point connections, give them a shared workspace -- the blackboard -- where they can post partial solutions and react to each other's contributions.

The architecture has three components:

The blackboard: a global, hierarchical data structure holding the evolving solution state
Knowledge sources: independent specialist modules that monitor the blackboard and contribute partial solutions
A control mechanism: decides which knowledge source to activate next based on the current blackboard state

This pattern maps remarkably well to modern LLM-based multi-agent systems. Knowledge sources become specialized agents (researcher, coder, reviewer), the blackboard becomes a shared state object (LangGraph's StateGraph, a Redis store, or an in-memory dictionary), and the control mechanism becomes the orchestrator or router agent.

Why It Became Critical in 2024-2025

Three developments made shared memory essential rather than optional:

1. Multi-agent frameworks matured. LangGraph, CrewAI, AutoGen, and OpenAI's Swarm moved from research curiosities to production tools. Each required a principled way to manage shared state across agent boundaries.

2. Agent tasks grew more complex. Simple chain-of-thought prompting gave way to multi-step workflows involving tool use, web search, code execution, and human-in-the-loop reviews. Each step generates state that downstream agents need.

3. Long-running agent sessions became common. Agents that run for minutes or hours -- processing large datasets, iterating on code, or conducting multi-turn negotiations -- need persistent state that survives failures and can be checkpointed for recovery.

Key Insight: Shared memory is not just a convenience -- it is the mechanism that transforms a collection of independent agents into a coherent collaborative system. Without it, multi-agent systems are just multiple single agents running in parallel with no coordination.

Core Intuition & Mental Model

The Whiteboard in the War Room

Imagine a crisis response center. Multiple specialists -- logistics, medical, communications, intelligence -- are working the same incident. In the center of the room stands a large whiteboard. Anyone can walk up and write new information, update existing entries, or read what others have posted. The whiteboard is not owned by any single specialist -- it belongs to the room, to the mission.

That whiteboard is shared memory. Each specialist (agent) has their own expertise and their own local notes, but the whiteboard is where coordination happens. When the intelligence analyst writes "bridge at grid reference X is destroyed," the logistics planner immediately sees it and re-routes supply convoys without anyone explicitly telling them to. The information propagates through the shared medium.

Now, there are rules. You don't erase someone else's entry without good reason. If two people try to update the same section simultaneously, they talk it out (conflict resolution). And periodically, someone photographs the whiteboard (checkpointing) so they can reconstruct the state if something goes wrong.

What Shared Memory Is NOT

Shared memory is not the same as message passing. In message-passing architectures, agents send discrete messages to each other -- think of it like email. The sender decides what to share and who to share it with. In shared memory, all agents have access to the same global state -- think of it like a shared Google Doc where everyone can see everything (with appropriate access controls).

Both patterns are valid, and production systems often combine them. But the distinction matters because shared memory provides implicit coordination -- agents can react to state changes they didn't explicitly ask for -- while message passing requires explicit coordination where every piece of information must be deliberately sent.

The Fundamental Tension

Here is the tension that makes shared memory design interesting: you want agents to have access to everything they need while not being overwhelmed by everything that exists. A shared memory store with 50,000 tokens of accumulated state is useless if an agent's context window is 8,000 tokens. The art of shared memory design lies in maintaining a rich global state while providing each agent with a relevant, focused view of that state.

Technical Foundations

Formal Model of Shared Memory in Multi-Agent Systems

Let us define a multi-agent system $\mathcal{M} = (\mathcal{A}, \mathcal{S}, \mathcal{T}, \mathcal{C})$ where:

$\mathcal{A} = \{a_1, a_2, \ldots, a_n\}$ is a set of $n$ agents
$\mathcal{S}$ is the shared memory state space
$\mathcal{T}$ is the set of possible transitions (read/write operations)
$\mathcal{C}$ is the conflict resolution policy

State Representation

The shared memory at time $t$ is a structured object $S_t \in \mathcal{S}$ composed of typed fields:

$S_t = \{(k_1, v_1, \tau_1), (k_2, v_2, \tau_2), \ldots, (k_m, v_m, \tau_m)\}$

where $k_i$ is a key (field name), $v_i$ is the value (which may be a scalar, list, or nested structure), and $\tau_i$ is a logical timestamp or version vector.

State Transitions

Each agent $a_j$ can perform two operations on shared memory:

Read: $\text{read}(a_j, k_i) \rightarrow v_i$ -- agent $a_j$ reads the current value of key $k_i$

Write: $\text{write}(a_j, k_i, v_i') \rightarrow S_{t+1}$ -- agent $a_j$ updates key $k_i$ to value $v_i'$ , producing a new state $S_{t+1}$

Reducer Functions (LangGraph Model)

In frameworks like LangGraph, state updates are mediated by reducer functions $R_k$ for each key $k$ :

$v_k^{\text{new}} = R_k(v_k^{\text{current}}, \Delta_k)$

where $\Delta_k$ is the update proposed by an agent. Common reducers include:

Last-writer-wins: $R(v, \Delta) = \Delta$
Append: $R(v, \Delta) = v \| \Delta$ (concatenation for lists)
Merge: $R(v, \Delta) = v \cup \Delta$ (set union for dictionaries)
Custom: Application-specific logic (e.g., summarize-and-replace for conversation history)

Conflict Detection with Vector Clocks

For concurrent writes by agents $a_i$ and $a_j$ to the same key $k$ , we use vector clocks $VC = [c_1, c_2, \ldots, c_n]$ where $c_i$ is agent $a_i$ 's logical counter. Two writes conflict if and only if:

$VC_i \not\leq VC_j \text{ and } VC_j \not\leq VC_i$

meaning neither write causally precedes the other. The conflict resolution policy $\mathcal{C}$ then determines the outcome -- last-writer-wins, semantic merge, or escalation to the orchestrator.

Consistency Models

Shared memory systems choose among consistency guarantees:

Strong consistency: All agents see the same state at all times. Cost: high latency due to synchronization barriers.
Eventual consistency: Agents may temporarily see different states, but all converge. Cost: requires conflict resolution.
Causal consistency: If agent $a_i$ 's write depends on agent $a_j$ 's write, all agents see them in that order. A practical middle ground.

For most LLM-based multi-agent systems, causal consistency provides the best balance -- agents see causally related updates in order while allowing independent updates to propagate asynchronously.

Internal Architecture

A production shared memory system for multi-agent AI consists of five interacting layers: the state schema defining the structure of shared data, the access layer providing read/write APIs to agents, the conflict resolution engine handling concurrent updates, the persistence backend storing state durably, and the view layer projecting relevant subsets of state to individual agents.

The architecture follows a hub-and-spoke model where shared memory sits at the center and agents connect as spokes. The orchestrator typically mediates access, though agents can also read/write directly in decentralized designs (the blackboard pattern).

Shared Memory in Multi-Agent ML Systems Architecture — A hub-and-spoke architecture where three agents (Researcher, Analyst, Writer) connect to a centra...

The View Projector is a critical component that is often overlooked. When shared state grows large -- which it inevitably does in long-running multi-agent workflows -- you cannot dump the entire state into every agent's context window. The view projector selects, summarizes, or retrieves the most relevant portions of state for each agent based on its role, current task, and context window budget. This is where shared memory intersects with RAG: the view projector may use vector similarity search over the state store to find the most relevant entries for a given agent query.

Key Components

State Schema / State Graph

Defines the typed structure of shared state -- which fields exist, their types, and their reducer functions. In LangGraph, this is the TypedDict or Pydantic model passed to StateGraph. It acts as the contract between all agents: everyone agrees on what the shared memory looks like.

Access Layer (Read/Write API)

Provides the interface through which agents interact with shared memory. Supports atomic reads, atomic writes, and batch operations. May enforce access control policies -- e.g., agent A can write to research_findings but only read from final_report. In Redis-backed systems, this maps to GET/SET/HSET operations with optional Lua scripting for atomicity.

Conflict Resolution Engine

Handles concurrent writes to the same key by multiple agents. Implements strategies like last-writer-wins (simple but lossy), append-only (preserves all contributions), CRDT-based merge (mathematically guaranteed convergence), or semantic merge (uses an LLM to intelligently combine conflicting updates). The choice of strategy depends on the data type and the tolerance for information loss.

Checkpointer / Persistence Backend

Periodically snapshots the shared state for durability and fault recovery. LangGraph's RedisSaver, PostgresSaver, and SqliteSaver are canonical implementations. Enables time-travel debugging -- the ability to roll back to any previous state and replay execution from that point. Also supports human-in-the-loop workflows by persisting state while waiting for human approval.

View Projector / Context Assembler

Projects a relevant subset of shared state into each agent's context window. Uses strategies like recency filtering (last N entries), relevance filtering (vector similarity), role-based filtering (only show fields relevant to this agent's role), and summarization (compress older state into summaries). Prevents context window overflow in long-running sessions.

Event Bus / Notification Layer

Notifies agents when relevant portions of shared state change. Implements publish-subscribe patterns so agents can react to state updates without polling. In Redis, this maps to SUBSCRIBE/PUBLISH or Redis Streams. Enables reactive multi-agent patterns where agents wake up and act when relevant state changes occur.

Data Flow

Write Path

Agent completes a step and produces an update (e.g., {"research_findings": [new_finding]})
The update passes through the access layer, which validates types and permissions
The conflict resolution engine checks for concurrent writes using version vectors or timestamps
The reducer function for the target key merges the update with existing state (e.g., append to list)
The checkpointer persists the new state snapshot to the backend (Redis, PostgreSQL, etc.)
The event bus notifies subscribed agents of the state change

Read Path

Agent requests state (either full state or a specific key)
The view projector determines what subset of state is relevant for this agent
If the full state exceeds the agent's context budget, the projector applies summarization or vector-based retrieval
The filtered view is returned to the agent as part of its next prompt

Checkpoint/Recovery Path

On failure or timeout, the orchestrator retrieves the last checkpoint from the persistence backend
The state is restored to the checkpointed version
The failed agent (or a replacement) is re-invoked with the restored state
Execution continues from the checkpoint rather than restarting from scratch

A hub-and-spoke architecture where three agents (Researcher, Analyst, Writer) connect to a central Shared Memory Layer containing a State Store, Conflict Resolver, Checkpointer, and View Projector. The Shared Memory Layer connects to persistence backends (Redis, PostgreSQL, Vector Store). An Orchestrator/Router also connects to the State Store. The View Projector sends filtered views back to each agent.

How to Implement

Implementation Spectrum

Shared memory implementations range from simple in-process dictionaries to distributed, persistent state stores. The right choice depends on three factors:

Number of agents and concurrency level: Two sequential agents? A Python dictionary suffices. Ten parallel agents? You need proper concurrency control.
Session duration: Sub-minute workflows can use ephemeral memory. Hour-long sessions need persistence and checkpointing.
Failure tolerance: If losing state mid-execution is acceptable (prototyping), in-memory is fine. If agents process expensive operations (API calls costing money), you need durable checkpointing.

Framework-Native vs. Custom

Most teams start with their framework's built-in shared state -- LangGraph's StateGraph with reducers, CrewAI's shared context, or AutoGen's conversation history. These work well for standard patterns. However, production systems frequently outgrow framework defaults and require custom solutions: Redis for sub-millisecond access, PostgreSQL for durable state with SQL queryability, or vector stores for semantic retrieval over accumulated state.

Cost Note: A Redis instance on AWS ElastiCache with 6.1 GB memory (cache.r6g.large) costs approximately $130/month (~INR 10,900/month). For many multi-agent workloads, this single instance can handle thousands of concurrent agent sessions. Compare this to the cost of re-running failed agent workflows -- a single GPT-4 Turbo session with 10 agent steps can cost$ 0.50-2.00 (~INR 42-168) in API calls alone. Checkpointing to Redis pays for itself after preventing a few hundred failures per month.

LangGraph StateGraph with Shared Memory and Reducers71 lines

from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.redis import RedisSaver
import operator

# Define shared state schema with reducer functions
class SharedMemory(TypedDict):
    # append reducer: each agent's findings are accumulated
    research_findings: Annotated[list[str], operator.add]
    # append reducer: conversation history grows
    messages: Annotated[list[dict], operator.add]
    # last-writer-wins: only the latest summary matters
    current_summary: str
    # append reducer: action items from all agents
    action_items: Annotated[list[str], operator.add]
    # metadata for coordination
    iteration_count: int

# Agent functions that read/write shared state
def researcher_agent(state: SharedMemory) -> dict:
    """Reads existing findings, adds new ones."""
    existing = state.get("research_findings", [])
    # Agent uses existing findings to avoid duplicates
    new_finding = f"Finding #{len(existing) + 1}: Market size is $4.2B"
    return {
        "research_findings": [new_finding],
        "messages": [{"role": "researcher", "content": new_finding}],
    }

def analyst_agent(state: SharedMemory) -> dict:
    """Reads research findings, produces analysis."""
    findings = state.get("research_findings", [])
    analysis = f"Analysis of {len(findings)} findings: growth trend detected"
    return {
        "current_summary": analysis,
        "messages": [{"role": "analyst", "content": analysis}],
        "action_items": ["Validate growth trend with secondary sources"],
    }

def writer_agent(state: SharedMemory) -> dict:
    """Reads summary and findings, writes final output."""
    summary = state.get("current_summary", "")
    findings = state.get("research_findings", [])
    report = f"Report: Based on {len(findings)} findings. {summary}"
    return {
        "messages": [{"role": "writer", "content": report}],
        "iteration_count": state.get("iteration_count", 0) + 1,
    }

# Build the graph with shared memory
graph = StateGraph(SharedMemory)
graph.add_node("researcher", researcher_agent)
graph.add_node("analyst", analyst_agent)
graph.add_node("writer", writer_agent)
graph.add_edge("researcher", "analyst")
graph.add_edge("analyst", "writer")
graph.add_edge("writer", END)
graph.set_entry_point("researcher")

# Compile with Redis checkpointer for persistence
checkpointer = RedisSaver(redis_url="redis://localhost:6379")
app = graph.compile(checkpointer=checkpointer)

# Run with thread-level persistence
config = {"configurable": {"thread_id": "session-001"}}
result = app.invoke(
    {"research_findings": [], "messages": [], "action_items": [], "iteration_count": 0},
    config=config,
)
print(result["current_summary"])
print(f"Total action items: {len(result['action_items'])}")

This example demonstrates LangGraph's shared memory pattern. The SharedMemory TypedDict defines the schema, and Annotated types with operator.add specify append reducers -- when multiple agents write to research_findings, their contributions are concatenated rather than overwritten. The current_summary field uses last-writer-wins (no reducer annotation). The RedisSaver checkpointer persists state after each node execution, enabling recovery if any agent fails. The thread_id in the config ensures each conversation session has its own isolated state.

Redis-Backed Shared Memory with Conflict Resolution151 lines

import redis
import json
import time
import hashlib
from typing import Any, Optional
from dataclasses import dataclass, field

@dataclass
class StateEntry:
    value: Any
    version: int
    author: str
    timestamp: float
    checksum: str = ""

class SharedMemoryStore:
    """Production shared memory with optimistic concurrency control."""

    def __init__(self, redis_url: str = "redis://localhost:6379", namespace: str = "agent"):
        self.redis = redis.from_url(redis_url, decode_responses=True)
        self.ns = namespace

    def _key(self, session_id: str, field: str) -> str:
        return f"{self.ns}:{session_id}:{field}"

    def read(self, session_id: str, field: str) -> Optional[StateEntry]:
        """Read a field from shared memory."""
        raw = self.redis.get(self._key(session_id, field))
        if raw is None:
            return None
        data = json.loads(raw)
        return StateEntry(**data)

    def write(
        self,
        session_id: str,
        field: str,
        value: Any,
        author: str,
        expected_version: Optional[int] = None,
    ) -> bool:
        """Write with optimistic concurrency control.
        
        Returns True if write succeeded, False if version conflict.
        Uses Redis WATCH/MULTI for atomic check-and-set.
        """
        key = self._key(session_id, field)
        with self.redis.pipeline() as pipe:
            try:
                pipe.watch(key)
                current_raw = pipe.get(key)
                current_version = 0
                if current_raw:
                    current = json.loads(current_raw)
                    current_version = current["version"]
                
                # Optimistic concurrency check
                if expected_version is not None and current_version != expected_version:
                    return False  # Conflict detected
                
                new_entry = StateEntry(
                    value=value,
                    version=current_version + 1,
                    author=author,
                    timestamp=time.time(),
                    checksum=hashlib.sha256(
                        json.dumps(value, sort_keys=True).encode()
                    ).hexdigest()[:16],
                )
                pipe.multi()
                pipe.set(key, json.dumps(new_entry.__dict__))
                pipe.execute()
                # Publish state change event for reactive agents
                self.redis.publish(
                    f"{self.ns}:events:{session_id}",
                    json.dumps({"field": field, "author": author, "version": new_entry.version}),
                )
                return True
            except redis.WatchError:
                return False  # Another agent modified the key

    def append_to_list(
        self, session_id: str, field: str, items: list, author: str
    ) -> int:
        """Atomic append to a list field. Returns new list length."""
        key = self._key(session_id, field)
        # Use Lua script for atomic read-modify-write
        lua_script = """
        local current = redis.call('GET', KEYS[1])
        local data
        if current then
            data = cjson.decode(current)
        else
            data = {value={}, version=0, author='', timestamp=0, checksum=''}
        end
        local new_items = cjson.decode(ARGV[1])
        for _, item in ipairs(new_items) do
            table.insert(data.value, item)
        end
        data.version = data.version + 1
        data.author = ARGV[2]
        data.timestamp = tonumber(ARGV[3])
        redis.call('SET', KEYS[1], cjson.encode(data))
        return #data.value
        """
        return self.redis.eval(
            lua_script, 1, key, json.dumps(items), author, str(time.time())
        )

    def checkpoint(self, session_id: str) -> str:
        """Snapshot all fields for a session. Returns checkpoint ID."""
        pattern = f"{self.ns}:{session_id}:*"
        keys = list(self.redis.scan_iter(match=pattern))
        snapshot = {}
        for key in keys:
            field_name = key.split(":")[-1]
            snapshot[field_name] = self.redis.get(key)
        checkpoint_id = f"cp:{session_id}:{int(time.time())}"
        self.redis.set(checkpoint_id, json.dumps(snapshot), ex=86400)  # 24h TTL
        return checkpoint_id

    def restore(self, checkpoint_id: str) -> bool:
        """Restore shared memory from a checkpoint."""
        raw = self.redis.get(checkpoint_id)
        if not raw:
            return False
        snapshot = json.loads(raw)
        # Extract session_id from checkpoint_id
        session_id = checkpoint_id.split(":")[1]
        for field_name, value in snapshot.items():
            key = self._key(session_id, field_name)
            self.redis.set(key, value)
        return True


# Usage example
store = SharedMemoryStore(redis_url="redis://localhost:6379")

# Agent 1 writes research findings
store.append_to_list("session-42", "findings", ["GDP growth at 7.2%"], "researcher")

# Agent 2 reads and writes analysis (with version check)
entry = store.read("session-42", "findings")
if entry:
    store.write("session-42", "analysis", 
                f"Analyzed {len(entry.value)} findings", 
                author="analyst", expected_version=None)

# Checkpoint before risky operation
cp_id = store.checkpoint("session-42")
print(f"Checkpointed: {cp_id}")

This production-grade implementation uses Redis WATCH/MULTI for optimistic concurrency control -- if two agents try to write the same key simultaneously, only one succeeds and the other gets a False return indicating a conflict. The append_to_list method uses a Lua script for atomic read-modify-write, ensuring list appends are safe under concurrent access. The checkpoint and restore methods enable fault recovery by snapshotting all session state. The publish/subscribe pattern on state changes allows reactive agents to wake up when relevant state is modified.

Blackboard Pattern with Multiple Agent Types107 lines

from typing import Any, Callable
from dataclasses import dataclass, field
import threading
import time

@dataclass
class BlackboardEntry:
    key: str
    value: Any
    source_agent: str
    confidence: float
    timestamp: float = field(default_factory=time.time)

class Blackboard:
    """Classical blackboard architecture for multi-agent coordination."""

    def __init__(self):
        self._state: dict[str, BlackboardEntry] = {}
        self._subscribers: dict[str, list[Callable]] = {}
        self._lock = threading.RLock()
        self._history: list[tuple[str, BlackboardEntry]] = []

    def write(self, key: str, value: Any, source: str, confidence: float = 1.0):
        """Post information to the blackboard."""
        with self._lock:
            entry = BlackboardEntry(
                key=key, value=value, source_agent=source,
                confidence=confidence, timestamp=time.time(),
            )
            # Conflict resolution: higher confidence wins
            if key in self._state:
                existing = self._state[key]
                if existing.confidence > confidence:
                    return  # Existing entry has higher confidence
            self._state[key] = entry
            self._history.append(("write", entry))
        # Notify subscribers outside the lock
        for callback in self._subscribers.get(key, []):
            callback(entry)

    def read(self, key: str) -> BlackboardEntry | None:
        """Read from the blackboard."""
        with self._lock:
            return self._state.get(key)

    def read_all(self, prefix: str = "") -> dict[str, BlackboardEntry]:
        """Read all entries matching a prefix."""
        with self._lock:
            return {
                k: v for k, v in self._state.items()
                if k.startswith(prefix)
            }

    def subscribe(self, key: str, callback: Callable):
        """Subscribe to changes on a specific key."""
        self._subscribers.setdefault(key, []).append(callback)

    def get_history(self, last_n: int = 10) -> list[tuple[str, BlackboardEntry]]:
        """Get recent history for debugging / time-travel."""
        return self._history[-last_n:]


# Define knowledge sources (agents)
class KnowledgeSource:
    def __init__(self, name: str, blackboard: Blackboard):
        self.name = name
        self.bb = blackboard

class ResearchAgent(KnowledgeSource):
    def act(self, query: str):
        # Simulate research
        result = f"Research on '{query}': Found 3 relevant papers"
        self.bb.write(f"research:{query}", result, self.name, confidence=0.85)
        self.bb.write("status", f"{self.name} completed research", self.name)

class AnalysisAgent(KnowledgeSource):
    def act(self):
        # Read all research entries from the blackboard
        research = self.bb.read_all(prefix="research:")
        analysis = f"Synthesized {len(research)} research entries"
        self.bb.write("analysis:summary", analysis, self.name, confidence=0.90)

class QualityAgent(KnowledgeSource):
    def act(self):
        # Review analysis quality
        analysis = self.bb.read("analysis:summary")
        if analysis and analysis.confidence < 0.95:
            self.bb.write(
                "review:needed", True, self.name, confidence=1.0
            )


# Orchestrate via the blackboard
bb = Blackboard()
researcher = ResearchAgent("researcher-1", bb)
analyst = AnalysisAgent("analyst-1", bb)
reviewer = QualityAgent("reviewer-1", bb)

# Agents interact through the shared blackboard
researcher.act("transformer architectures")
researcher.act("multi-agent memory")
analyst.act()
reviewer.act()

# Inspect the blackboard state
for key, entry in bb._state.items():
    print(f"[{entry.source_agent}] {key}: {entry.value} (conf: {entry.confidence})")

This implements the classical blackboard architecture where agents (knowledge sources) interact exclusively through a shared workspace. The confidence-based conflict resolution means that when two agents write to the same key, the higher-confidence write wins. The subscribe mechanism enables reactive behavior -- agents can register callbacks that fire when specific blackboard keys change, eliminating the need for polling. The history log supports time-travel debugging. This pattern is especially powerful when agents have overlapping capabilities and the system needs to select the best contribution.

Mem0 Integration for Cross-Session Shared Memory56 lines

from mem0 import Memory

# Initialize Mem0 with shared memory configuration
config = {
    "llm": {
        "provider": "openai",
        "config": {"model": "gpt-4o-mini", "temperature": 0.1},
    },
    "embedder": {
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"},
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {"collection_name": "agent_memory", "host": "localhost", "port": 6333},
    },
}

memory = Memory.from_config(config)

# Agent 1: Customer research agent stores findings
memory.add(
    "Customer prefers budget smartphones under INR 15,000 with good camera",
    user_id="customer-123",
    agent_id="research-agent",
    metadata={"category": "preference", "confidence": 0.92},
)

memory.add(
    "Customer previously purchased Redmi Note 12 in March 2025",
    user_id="customer-123",
    agent_id="research-agent",
    metadata={"category": "history", "confidence": 0.99},
)

# Agent 2: Recommendation agent reads shared memory
relevant_memories = memory.search(
    "What phones should I recommend to this customer?",
    user_id="customer-123",
    limit=5,
)

for mem in relevant_memories["results"]:
    print(f"Memory: {mem['memory']} (score: {mem['score']:.2f})")

# Agent 3: Follow-up agent adds new information
memory.add(
    "Customer mentioned interest in 5G connectivity for Jio network",
    user_id="customer-123",
    agent_id="followup-agent",
    metadata={"category": "preference", "confidence": 0.88},
)

# All agents now have access to the combined memory
all_memories = memory.get_all(user_id="customer-123")
print(f"Total shared memories for customer: {len(all_memories['results'])}")

This example uses Mem0 for cross-session shared memory -- memories persist across different agent invocations and even different sessions. Unlike the stateful approaches above, Mem0 provides semantic retrieval over accumulated memories using vector search. Each agent contributes memories tagged with its agent_id, and any agent can search the shared memory pool using natural language queries. This is particularly powerful for customer-facing multi-agent systems (like an e-commerce assistant at Flipkart or Swiggy) where different specialized agents need access to a unified customer memory across interactions.

Configuration Example21 lines

# LangGraph + Redis shared memory configuration
langgraph:
  state_type: SharedMemory
  checkpointer:
    backend: redis
    redis_url: redis://localhost:6379
    key_prefix: "agent:state:"
    ttl_seconds: 86400  # 24 hours
  reducers:
    messages: append      # list concatenation
    findings: append      # list concatenation  
    summary: replace      # last-writer-wins
    metadata: merge       # dict merge
  view_projection:
    max_tokens_per_agent: 4000
    strategy: recency_and_relevance
    summarize_after: 20   # summarize state after 20 entries
  conflict_resolution:
    strategy: optimistic_concurrency
    max_retries: 3
    retry_backoff_ms: 50

Common Implementation Mistakes

●
Dumping entire state into every agent's prompt: When shared memory accumulates thousands of tokens, naively including all of it in each agent's context window wastes tokens and degrades performance. Use view projection or summarization to give each agent only the state it needs.
●
Using last-writer-wins for all fields: This is the default in many frameworks, but it silently discards concurrent updates. For list-type data (findings, action items, messages), always use an append reducer. Reserve last-writer-wins for fields where only the latest value matters (current status, final summary).
●
No checkpointing for long-running workflows: If your multi-agent workflow takes more than 30 seconds and makes expensive API calls, running without checkpointing means any failure forces a full restart. A single Redis checkpoint adds <1ms of latency but can save minutes of re-computation and dollars in API costs.
●
Ignoring state schema evolution: As your multi-agent system evolves, the shared memory schema changes -- new fields are added, old ones become obsolete. Without versioning, loading a checkpoint from an older schema version will crash or produce silent data corruption. Always version your state schema.
●
Treating shared memory as a message queue: Shared memory is for state, not for events. If agents need to send discrete messages to each other ("please review this"), use a proper message/event system alongside shared memory. Conflating the two leads to complex, brittle state management.

When Should You Use This?

Use When

Multiple agents need to collaborate on a shared task and must see each other's intermediate results -- this is the primary use case
Your multi-agent workflow runs for more than a few seconds and needs fault recovery via checkpointing
Agents have different specializations but need a unified view of the evolving solution (the classic blackboard pattern)
You need human-in-the-loop approval steps where the system pauses, persists state, and resumes after human input
Cross-session memory is required -- e.g., a customer support system where different agents serve the same customer across multiple conversations
You are building a system where agents work in parallel and their results must be merged (e.g., parallel research agents each investigating different sources)
The cost of re-running a failed multi-agent workflow exceeds the cost of maintaining shared state infrastructure

Avoid When

Your system has a single agent -- shared memory adds unnecessary complexity when there is nothing to share
Agents are fully independent and their outputs do not depend on each other's results -- use simple parallel execution with result aggregation instead
The workflow is short-lived (< 5 seconds) and cheap to re-run -- the overhead of persistence and conflict resolution is not justified
You need strict message ordering guarantees between agents -- use message queues (Kafka, RabbitMQ) instead of shared state
Your agents operate on completely different data domains with no overlap -- separate state stores per agent are simpler and avoid unnecessary coupling
Regulatory requirements mandate that certain agents cannot access each other's data (e.g., PII isolation) -- shared memory makes isolation harder to enforce

Key Tradeoffs

Consistency vs. Performance

Strong consistency (every agent always sees the latest state) requires synchronization barriers that add latency. In a system with 5 agents writing concurrently, strong consistency can add 5-15ms per write due to locking or consensus protocols. Eventual consistency allows agents to proceed without waiting but introduces the risk of stale reads. For most LLM-based multi-agent systems, causal consistency is the sweet spot -- agents see causally related updates in order but don't block on unrelated updates.

Consistency Model	Latency Overhead	Data Freshness	Complexity	Best For
Strong	High (5-15ms/write)	Always current	Medium	Financial, safety-critical
Causal	Medium (1-3ms/write)	Causally ordered	High	Most multi-agent workflows
Eventual	Low (<1ms/write)	May be stale	Low	High-throughput, loss-tolerant

Centralized vs. Distributed State

Centralized shared memory (single Redis instance) is simpler to reason about but creates a single point of failure and a potential bottleneck. Distributed shared memory (Redis Cluster, CRDTs) provides higher availability and throughput but introduces complexity in conflict resolution and network partitioning. For most multi-agent AI systems with fewer than 50 concurrent agents, a single Redis instance with replication is sufficient and dramatically simpler.

Memory Footprint vs. Context Quality

Keeping all historical state gives agents maximum context but can overwhelm context windows and increase costs. A 100-step multi-agent workflow might accumulate 50,000+ tokens of state. Aggressive pruning saves tokens but risks losing critical context. The practical approach is tiered memory: keep the last N steps in full detail, summarize older steps, and use vector retrieval for long-term memory.

Cost Comparison for an Indian Startup: Running a Redis-backed shared memory system for a 10-agent customer support bot costs approximately INR 10,900/month ( $130) for the Redis instance. Without checkpointing, re-running failed 10-step agent workflows (at ~INR 84-168 per run with GPT-4) could cost INR 42,000-84,000/month ($ 500-1,000) in wasted API calls if even 500 workflows fail per month. The math strongly favors investing in proper shared memory infrastructure.

Alternatives & Comparisons

Memory Store (Per-Agent)

A per-agent memory store gives each agent its own isolated memory -- useful when agents don't need to coordinate or when privacy boundaries must be enforced. Choose per-agent memory when agents are independent; choose shared memory when agents must collaborate on a common task. In practice, many systems use both: per-agent memory for private reasoning and shared memory for coordination.

LangGraph Node (Message Passing)

LangGraph's edge-based message passing sends data explicitly from one node to another, while shared memory (StateGraph) provides implicit access to all state. Message passing gives tighter control over what each agent sees but requires the graph designer to anticipate all information flows. Shared memory is more flexible -- agents can discover relevant state they weren't explicitly given.

Cache Layer

A cache layer (Redis, Memcached) stores computed results for fast retrieval, typically with TTL-based expiration. Shared memory is semantically richer -- it maintains structured, versioned state with conflict resolution and checkpointing. However, many shared memory implementations use a cache layer as their backend. Think of it this way: a cache is a building material, shared memory is the building.

Vector Store

Vector stores enable semantic retrieval over embedded content, while shared memory provides structured state management. They are complementary: shared memory holds the current working state (structured, keyed), while a vector store can serve as the long-term memory backend for semantic search over historical state. Systems like Mem0 combine both patterns.

Pros, Cons & Tradeoffs

Advantages

Enables emergent collaboration -- agents can react to state changes they weren't explicitly told about, leading to coordination patterns that emerge from the shared medium rather than being hard-coded in the orchestrator
Fault tolerance through checkpointing -- persisting shared state to Redis or PostgreSQL means multi-agent workflows can recover from failures without restarting from scratch, saving both time and money on API calls
Decouples agent communication -- agents don't need to know about each other's existence; they interact through the shared state. This makes it easy to add, remove, or replace agents without changing the communication topology
Supports human-in-the-loop workflows -- checkpointed state allows the system to pause for human review, persist state indefinitely, and resume exactly where it left off. Critical for enterprise AI deployments with approval requirements
Reduces token consumption -- instead of passing the full conversation history to every agent, shared memory with view projection sends only relevant state, cutting prompt token usage by 30-60% in typical multi-agent workflows
Enables time-travel debugging -- checkpointed state history lets developers inspect the exact state at any point in execution, replay from earlier states, and diagnose issues in complex multi-agent interactions

Disadvantages

Concurrency complexity -- multiple agents writing to shared state simultaneously requires careful conflict resolution. Getting this wrong leads to lost updates, stale reads, or deadlocks. This is the #1 source of bugs in shared memory systems
State schema rigidity -- once agents depend on a shared state schema, changing it requires coordinating all agents and migrating checkpointed state. Schema evolution is an under-discussed operational burden
Context window overflow risk -- shared state grows over time. Without active pruning, summarization, or view projection, agents eventually receive more context than they can process, degrading output quality
Single point of failure (centralized) -- a centralized shared memory store (single Redis instance) is a SPOF. Outages halt all agents simultaneously. Mitigate with replication, but this adds operational complexity
Debugging difficulty -- when five agents write to the same state concurrently, tracing the source of a bad state update requires detailed logging and version tracking. Without this instrumentation, shared memory bugs are notoriously hard to reproduce
Latency overhead -- every state read/write adds network latency when using an external store (1-5ms for Redis). For latency-sensitive applications, this overhead multiplied by many agent steps can be significant

Use lock-free data structures (CRDTs, append-only logs) instead of locks. If locks are necessary, enforce a global lock ordering -- all agents acquire locks in the same order. Implement deadlock detection with a timeout: if an agent hasn't progressed in N seconds, forcibly release its locks and retry. The LangGraph pattern of sequential node execution with shared state naturally avoids deadlocks because only one node writes at a time.

Placement in an ML System

Position in the ML System

Shared memory sits at the orchestration layer of a multi-agent ML system -- between the agent orchestrator (which decides which agent runs next) and the individual agents (which read state, perform reasoning, and write results back).

In a LangGraph-based system, shared memory is the StateGraph's state object that flows through every node. The orchestrator controls the graph traversal; shared memory carries the accumulated context.

In a CrewAI-based system, shared memory is the implicit shared context that the crew maintains across task executions. Each task can read previous task outputs and write its own results.

For enterprise deployments in India -- think Razorpay's agentic payment flows, Swiggy's multi-agent order management, or Flipkart's AI shopping assistant -- shared memory typically connects to a Redis cluster for low-latency state access, with PostgreSQL as the durable checkpoint store and a vector store (Qdrant, Pinecone) for long-term semantic memory.

Architectural Principle: Shared memory is the connective tissue of multi-agent systems. It sits at the center, touching everything. This centrality is both its power (everything can coordinate through it) and its risk (a failure here affects everything). Design accordingly.

Pipeline Stage

Orchestration / Serving

Upstream

agent-orchestrator
langgraph-node
tool-executor

Downstream

memory-store
vector-store
cache-layer
output-formatter

Scaling Bottlenecks

Where Shared Memory Gets Tight

The primary bottleneck is write throughput under concurrent access. A single Redis instance handles approximately 100,000 operations/second, which comfortably supports 50-100 concurrent agent sessions with frequent state updates. Beyond that, you need Redis Cluster or a distributed state store.

The second bottleneck is state serialization size. As shared state grows, serializing and deserializing it for each agent invocation becomes CPU-intensive. A 50KB state object serialized and deserialized 10 times per agent step (for 10 agents) means 5MB of JSON processing per step -- measurable at scale.

The third bottleneck is checkpoint storage growth. If you checkpoint every agent step across thousands of concurrent sessions, storage can grow rapidly. At 50KB per checkpoint, 10 steps per session, and 10,000 sessions/day, that's ~5 GB/day of checkpoint data. Budget for TTL-based cleanup.

Some concrete numbers: a production multi-agent customer support system serving 1,000 concurrent sessions, each with 5 agents and 10 steps, generates approximately 50,000 state operations per minute. A single cache.r6g.large Redis instance on AWS (INR 10,900/month) handles this with ~50% headroom.

Production Case Studies

Google DeepMindAI Research

Google DeepMind's research on scaling agent systems evaluated multi-agent configurations across GPT, Gemini, and Claude model families. Their findings revealed that shared state management is a critical determinant of whether multi-agent systems boost or degrade performance. Configurations where agents could effectively share intermediate reasoning through a common workspace consistently outperformed isolated agent pipelines.

Outcome:

Multi-agent systems with properly designed shared state achieved significant improvements on complex reasoning tasks, while poorly configured shared memory led to unexpected performance degradation -- reinforcing that shared memory design, not just agent count, determines system effectiveness.

RazorpayFintech (India)

Razorpay partnered with NPCI and OpenAI to build agentic payments on ChatGPT, enabling AI-assisted UPI transactions. The system uses multiple specialized agents (intent recognition, payment validation, fraud detection) that share state through a common memory layer. The shared memory maintains transaction context, user preferences, and security constraints that all agents must respect across the payment flow.

Outcome:

The agentic payments system processes UPI transactions within preset spending limits while maintaining shared security context across all participating agents. The shared memory architecture ensures that the fraud detection agent's risk assessments are immediately visible to the payment execution agent.

SwiggyFood Delivery (India)

Swiggy launched MCP (Model Context Protocol) integrations across Food, Instamart (grocery), and Dineout, enabling AI agents from ChatGPT, Claude, and Gemini to place orders on behalf of users. The multi-agent architecture uses shared context to maintain user preferences, dietary restrictions, past order history, and delivery location across different service verticals -- a food ordering agent and a grocery agent both need access to the user's address and dietary preferences.

Outcome:

The shared context architecture enables seamless cross-vertical experiences -- a user can ask an AI agent to order dinner and groceries in a single conversation, with shared memory ensuring consistency across the food and grocery agents.

Redis + LangChainDeveloper Tools / Infrastructure

Redis developed the langgraph-checkpoint-redis package providing production-grade shared memory for LangGraph-based multi-agent systems. The integration offers both thread-level persistence (RedisSaver for within-conversation state) and cross-thread memory (RedisStore for state that persists across conversations). This became the reference architecture for shared memory in LangGraph deployments.

Outcome:

Sub-millisecond read/write latency for agent state, enabling real-time multi-agent coordination. The checkpoint system enables time-travel debugging and human-in-the-loop workflows. Adopted as the recommended production backend for LangGraph shared memory.

Tooling & Ecosystem

LangGraph (StateGraph)

PythonOpen Source

LangGraph's StateGraph is the most widely used shared memory implementation for LLM-based multi-agent systems. Provides typed state schemas, reducer functions for conflict-free updates, and pluggable checkpointers (Redis, PostgreSQL, SQLite). The Annotated type with reducers elegantly solves the concurrent write problem for common data types.

langgraph-checkpoint-redis

PythonOpen Source

Official Redis checkpointer and store for LangGraph. Provides RedisSaver for thread-level persistence and RedisStore for cross-thread shared memory. Sub-millisecond latency for state operations. The recommended production backend for LangGraph shared memory.

Mem0

PythonOpen Source

Universal memory layer for AI agents with 41,000+ GitHub stars. Provides cross-session shared memory with automatic memory extraction, conflict resolution (forgets outdated information), and semantic retrieval via vector search. Integrates natively with CrewAI, LangGraph, and Flowise. Backed by Y Combinator and Peak XV Partners (India).

Zep

Python / TypeScriptOpen Source

Context engineering platform with temporal knowledge graph memory. Zep's Graphiti engine maintains relationships between memories over time, enabling agents to understand how shared knowledge evolves. Achieves 18.5% accuracy improvement over baselines on the LongMemEval benchmark while reducing latency by 90%.

Redis

COpen Source

The de facto standard for low-latency shared state. Provides sub-millisecond read/write, pub/sub for reactive agents, Lua scripting for atomic operations, and built-in data structures (lists, sets, hashes, streams) that map naturally to multi-agent state patterns. Redis Streams is particularly useful for event sourcing in agent systems.

CrewAI

PythonOpen Source

Multi-agent orchestration framework with built-in shared context. Provides short-term, long-term, and entity memory with Chroma/Qdrant backends. Supports up to 100+ concurrent agent workflows through optimized task scheduling. Shared context flows implicitly through the crew's task pipeline.

Agent-MCP (Model Context Protocol)

TypeScript / PythonOpen Source

Framework for multi-agent coordination using Anthropic's Model Context Protocol. Enables shared memory across specialized agents working in parallel on different aspects of a project. Uses MCP as the universal interface between agents and shared state.

Research & References

Memory Sharing for Large Language Model based Agents

Hang Gao, Yongfeng Zhang (2024)arXiv preprint

Proposed a framework integrating real-time memory filtering, storage, and retrieval to enable memory sharing among multiple LLM agents. Demonstrated that shared memory significantly enhances in-context learning and reduces redundant computation across agent teams.

Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control

Zhuoran Jin, Pengfei Cao, et al. (2025)arXiv preprint

Introduced a framework for multi-user, multi-agent environments with asymmetric access controls maintaining two memory tiers: private memory and shared memory. Addresses the critical challenge of information asymmetry when agents with different permission levels share a common memory space.

Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture

Kai Yan, Jiatong Li, et al. (2025)arXiv preprint

Formalized the LbMAS (LLM-based Multi-Agent Blackboard System) where a central shared blackboard stores all agent-generated messages and intermediate inferences. Using a single public blackboard reduces per-agent prompt lengths and alleviates the memory bottleneck.

LLM-Based Multi-Agent Blackboard System for Information Discovery in Data Science

Yu Huang, Huan Yee Koh, et al. (2025)arXiv preprint

Demonstrated that the blackboard architecture consistently outperforms strong baselines including RAG and master-slave multi-agent frameworks, achieving 13-57% relative improvement in end-to-end data science problem solving. Autonomous agents volunteer to respond to shared blackboard postings based on their capabilities.

Latent Collaboration in Multi-Agent Systems

Kolby Nottingham, Bodhisattwa Prasad Majumder, et al. (2025)arXiv preprint

Proposed LatentMAS, an end-to-end latent collaboration framework where LLM agents share information through latent working memory stored in KV-caches rather than natural language text. Agents transfer hidden states layer-wise, enabling more efficient information sharing than explicit text-based communication.

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Deshraj Yadav, Taranjeet Singh, et al. (2025)arXiv preprint

Presented Mem0's scalable memory architecture that dynamically extracts, consolidates, and retrieves information across multi-session agent dialogues. The graph-enhanced variant captures complex relational structures, delivering 26% accuracy boost and 91% lower p95 latency compared to full-context baselines.

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Daniel Chalef, Preston Rasmussen, et al. (2025)arXiv preprint

Introduced Zep's temporal knowledge graph engine (Graphiti) for agent memory that dynamically synthesizes unstructured conversational data and structured business data while maintaining historical relationships. Achieves 18.5% accuracy improvement on LongMemEval while reducing latency by 90%.

Interview & Evaluation Perspective

Common Interview Questions

●
How would you design a shared memory system for a multi-agent customer support bot with 10 specialized agents?
●
What are the tradeoffs between shared memory and message passing in multi-agent systems?
●
How do you handle concurrent writes from multiple agents to the same state field?
●
Explain the blackboard architecture and how it applies to modern LLM-based multi-agent systems.
●
How would you implement checkpointing for a long-running multi-agent workflow that costs $5 per run in API calls?
●
What happens when shared state grows beyond what fits in an agent's context window? How do you solve this?
●
How would you design shared memory for a system where some agents should not see certain state fields (access control)?

Key Points to Mention

●
The blackboard architecture (1970s HEARSAY-II) is the foundational pattern -- shared workspace + knowledge sources + control mechanism. This maps directly to modern LLM multi-agent systems (shared state + specialized agents + orchestrator).
●
Reducer functions (from LangGraph) solve the concurrent write problem elegantly: append for lists, replace for scalars, merge for dicts. Always choose the right reducer for each field type -- last-writer-wins on a list field is almost always a bug.
●
View projection is critical for production systems -- you cannot dump 50K tokens of shared state into every agent's 8K context window. Use recency filtering, role-based filtering, and vector-based retrieval to give each agent a focused view.
●
Checkpointing with Redis or PostgreSQL enables fault recovery, time-travel debugging, and human-in-the-loop workflows. The cost of checkpointing (<1ms per step) is negligible compared to the cost of re-running failed workflows.
●
Optimistic concurrency control (version checks + retry on conflict) is preferred over pessimistic locking for multi-agent systems because agents spend most of their time reasoning (LLM calls), not writing state.

Pitfalls to Avoid

●
Describing shared memory as just 'a dictionary that agents share' -- interviewers want to hear about concurrency control, consistency models, conflict resolution, and persistence strategies
●
Ignoring the context window constraint -- a shared memory design that doesn't address how state is projected into individual agent prompts is incomplete
●
Proposing locks for every write operation -- pessimistic locking creates unnecessary contention in agent systems where writes are infrequent relative to reasoning time
●
Forgetting fault tolerance -- if your design loses all state on a single failure, it's not production-ready
●
Not discussing cost -- quantify the tradeoff between checkpointing cost and re-run cost. This is what separates senior candidates from mid-level ones.

Senior-Level Expectation

A senior/staff-level candidate should demonstrate end-to-end design thinking: (1) State schema design with typed fields and appropriate reducers for each field, (2) Consistency model selection with justification (strong vs. causal vs. eventual), (3) Conflict resolution strategy appropriate to the data types, (4) View projection strategy that respects context window budgets, (5) Persistence and checkpointing design with specific backend choices and cost analysis, (6) Monitoring and observability -- how to track state growth, conflict rates, checkpoint success rates, and agent coordination health. The candidate should be able to sketch the architecture on a whiteboard, estimate costs in INR/USD for an Indian deployment on AWS, and discuss how the design evolves from 2 agents to 20 agents. Bonus points for discussing CRDTs, event sourcing, or the CAP theorem as they apply to multi-agent shared state.

Summary

Shared memory is the architectural pattern that transforms a collection of independent AI agents into a coordinated team. At its core, it provides a centralized or distributed data structure -- the modern equivalent of the 1970s blackboard architecture -- where multiple agents read, write, and react to a shared evolving state.

The key engineering challenges are concurrent write handling (solved via reducer functions, optimistic concurrency control, or CRDTs), context window management (solved via tiered memory with view projection and summarization), and fault tolerance (solved via checkpointing to Redis or PostgreSQL). Production systems like LangGraph's StateGraph with RedisSaver provide batteries-included implementations, while tools like Mem0 and Zep extend shared memory with cross-session persistence and temporal knowledge graphs.

For teams building multi-agent systems in production -- whether it's Razorpay's agentic payments, Swiggy's AI-powered ordering, or an internal enterprise workflow -- the shared memory layer is the most critical infrastructure decision after the choice of LLM itself. The costs are modest (INR 10,000-45,000/month for a complete Redis + PostgreSQL setup), the failure modes are well-understood (lost updates, context overflow, checkpoint corruption), and the ROI is clear: proper shared memory with checkpointing prevents costly re-runs, enables human-in-the-loop workflows, and unlocks the emergent coordination that makes multi-agent systems genuinely more capable than single agents. Design your shared memory carefully -- it is the foundation everything else rests on.

Concept Snapshot

Why This Concept Exists

The Fundamental Problem: Agent Isolation

The Blackboard Architecture: Where It All Started

Why It Became Critical in 2024-2025

Core Intuition & Mental Model

The Whiteboard in the War Room

What Shared Memory Is NOT

The Fundamental Tension

Technical Foundations

Formal Model of Shared Memory in Multi-Agent Systems

State Representation

State Transitions

Reducer Functions (LangGraph Model)

Conflict Detection with Vector Clocks

Consistency Models

Internal Architecture

Key Components

Data Flow

How to Implement

Implementation Spectrum

Framework-Native vs. Custom

Common Implementation Mistakes

When Should You Use This?

Use When

Avoid When

Key Tradeoffs

Consistency vs. Performance

Centralized vs. Distributed State

Memory Footprint vs. Context Quality

Alternatives & Comparisons

Pros, Cons & Tradeoffs

Advantages

Disadvantages

Failure Modes & Debugging

Lost Update (Write-Write Conflict)

Context Window Overflow

Stale Read Leading to Incorrect Decision

Checkpoint Corruption or Missing Checkpoint

Memory Leak from Unbounded History

Deadlock from Circular Dependencies

Placement in an ML System

Position in the ML System

Pipeline Stage

Upstream

Downstream

Scaling Bottlenecks

Production Case Studies

Tooling & Ecosystem

Research & References

Interview & Evaluation Perspective

Common Interview Questions

Key Points to Mention

Pitfalls to Avoid

Senior-Level Expectation

Summary

Related Blocks & Further Reading

Related ML Blocks

Further Reading