AI Agent Architecture: Tool Use, Memory Systems, and Multi-Step Reasoning

Excerpt: AI agents that combine language models with tool use, persistent memory, and multi-step reasoning represent a new paradigm in software architecture. This article covers the ReAct reasoning pattern, structured function calling, the four-layer memory architecture (sensory, working, episodic, semantic), planning strategies including chain-of-thought and tree-of-thought, and practical evaluation frameworks for production agent deployments.

Introduction

The shift from language models as text completers to language models as reasoning agents is one of the most significant architectural changes in modern software. An agent is not just a model that generates responses — it is a system that perceives its environment, reasons about goals, selects tools, executes actions, and updates its understanding based on results. Building agents that work reliably in production requires understanding each of these components at an engineering level, not just a conceptual one.

This article builds up the architecture of a production-grade AI agent: the reasoning loop, tool use interfaces, memory system design, planning strategies, and the evaluation infrastructure needed to know whether your agent is working correctly.

The ReAct Pattern: Reasoning and Acting in Interleaved Steps

The foundational reasoning pattern for tool-using agents is ReAct (Reason + Act), introduced in a 2022 paper by Yao et al. The insight is simple but powerful: interleave reasoning traces with actions rather than separating them. This allows the model to explain its reasoning, observe action results, and update its reasoning before the next action.

A ReAct trace looks like:

Task: What is the current price of copper per pound, and how does it compare to last month?

Thought: I need to find the current copper price and the price from one month ago.
         I should use the commodity_price tool to get both values.

Action: commodity_price(commodity="copper", date="today")
Observation: {"price": 4.23, "unit": "USD/lb", "date": "2026-04-07"}

Thought: I have the current price. Now I need the price from one month ago.

Action: commodity_price(commodity="copper", date="2026-03-07")
Observation: {"price": 4.01, "unit": "USD/lb", "date": "2026-03-07"}

Thought: Current price is $4.23/lb, last month was $4.01/lb.
         Change = (4.23 - 4.01) / 4.01 = +5.49% increase.

Answer: Copper is currently $4.23/lb, up 5.49% from $4.01/lb one month ago.

The interleaved reasoning is critical. Without it, models tend to hallucinate results rather than actually invoking tools, or fail to incorporate observation results into subsequent reasoning steps.

Function Calling: The Tool Use Interface

Modern LLM APIs provide structured function calling (also called tool use) that goes beyond prompting the model to output JSON. The model receives a structured schema describing available tools, and the API enforces that tool invocations are valid JSON matching the schema.

import anthropic
import json

client = anthropic.Anthropic()

# Define tools with strict JSON Schema
tools = [
    {
        "name": "search_codebase",
        "description": "Search for files or code patterns in the project codebase",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query or regex pattern"
                },
                "file_type": {
                    "type": "string",
                    "enum": ["typescript", "python", "go", "any"],
                    "description": "File type to search within"
                },
                "include_tests": {
                    "type": "boolean",
                    "description": "Whether to include test files",
                    "default": False
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "run_tests",
        "description": "Execute the test suite for a specific package or module",
        "input_schema": {
            "type": "object",
            "properties": {
                "package": {
                    "type": "string",
                    "description": "Package path relative to project root"
                },
                "filter": {
                    "type": "string",
                    "description": "Test name filter pattern"
                }
            },
            "required": ["package"]
        }
    }
]

def run_agent_loop(task: str, max_iterations: int = 10):
    messages = [{"role": "user", "content": task}]
    iteration = 0

    while iteration < max_iterations:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        # Add assistant response to message history
        messages.append({"role": "assistant", "content": response.content})

        # Stop on end_turn (or any other non-tool stop reason, such as
        # max_tokens) so the loop never re-calls the API without new input
        if response.stop_reason != "tool_use":
            break

        # Execute each requested tool and collect the results
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                # execute_tool is the application-defined dispatcher
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)
                })

        messages.append({"role": "user", "content": tool_results})
        iteration += 1

    return messages
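
The loop above relies on an `execute_tool` helper that the application supplies. A minimal sketch of such a dispatcher (the registry and handler names here are hypothetical, and the search handler is a placeholder) might look like:

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to handler functions
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {}

def register_tool(name: str):
    """Decorator that registers a handler under a tool name."""
    def wrap(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return wrap

@register_tool("search_codebase")
def search_codebase(query: str, file_type: str = "any", include_tests: bool = False):
    # Placeholder: a real implementation would shell out to ripgrep or similar
    return {"matches": [], "query": query, "file_type": file_type}

def execute_tool(name: str, tool_input: dict) -> dict:
    """Dispatch a model-requested tool call. Exceptions are caught and
    returned as error payloads, since the error text itself is useful
    feedback for the model's next reasoning step."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return {"error": f"Unknown tool: {name}"}
    try:
        return handler(**tool_input)
    except Exception as exc:
        return {"error": f"{type(exc).__name__}: {exc}"}
```

Returning errors as data rather than raising keeps the loop alive: the model sees the failure in the tool_result block and can retry with corrected arguments.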

The Four-Layer Memory Architecture

Human memory has distinct systems for different time horizons and purposes. Effective agent architectures mirror this structure across four layers:

Layer 1: Sensory Memory (Current Context Window)

The immediate context window is the agent’s sensory memory — everything currently visible to the model. This includes the current task, conversation history, tool results from recent steps, and any injected context. It is fast to access but limited in size and ephemeral.

Context window management is a critical skill in agent engineering. Strategies include: summarizing older conversation turns when the window fills, using hierarchical summarization to compress long tool outputs, and selectively retrieving relevant memory from persistent storage rather than loading everything upfront.
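
As an illustration, the simplest of these strategies, dropping old turns to fit a token budget, can be sketched as follows (the four-characters-per-token estimate is a rough heuristic; production code would use the provider's token counter and summarize dropped turns rather than discard them):

```python
def estimate_tokens(message: dict) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, len(str(message.get("content", ""))) // 4)

def trim_history(messages: list, budget: int) -> list:
    """Keep the first (task) message plus as many recent turns as fit
    within the token budget, preserving chronological order."""
    if not messages:
        return []
    head, tail = messages[0], messages[1:]
    kept, used = [], estimate_tokens(head)
    for msg in reversed(tail):  # walk backwards from the newest turn
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [head] + list(reversed(kept))
```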

Layer 2: Working Memory (In-Execution State)

Working memory holds the agent’s active task state — what it is currently doing, what it has determined so far, and what remains to be done. This is typically implemented as an explicit structured object passed between reasoning steps:

from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional

@dataclass
class AgentWorkingMemory:
    """Structured working memory for a single agent execution."""
    task: str
    goal_decomposition: List[str] = field(default_factory=list)
    completed_steps: List[Dict[str, Any]] = field(default_factory=list)
    pending_steps: List[str] = field(default_factory=list)
    findings: Dict[str, Any] = field(default_factory=dict)
    errors: List[str] = field(default_factory=list)
    confidence: float = 1.0

    def to_context_string(self) -> str:
        """Render working memory as context for the next reasoning step."""
        lines = [f"Current task: {self.task}"]
        if self.completed_steps:
            lines.append(f"Completed: {len(self.completed_steps)} steps")
            lines.append(f"Last result: {self.completed_steps[-1].get('summary', 'N/A')}")
        if self.pending_steps:
            lines.append(f"Next step: {self.pending_steps[0]}")
        if self.errors:
            lines.append(f"Errors so far: {'; '.join(self.errors[-3:])}")
        return "\n".join(lines)

Layer 3: Episodic Memory (Execution History Store)

Episodic memory stores records of past agent executions — what task was performed, what tools were used, what the outcome was, and any lessons learned. This layer enables the agent to avoid repeating mistakes and to apply patterns from similar past tasks.

Implementation typically uses a vector database (pgvector, Chroma, Qdrant) to enable semantic similarity search over past episodes:

import os
import psycopg2
import psycopg2.extras
import voyageai

# Embeddings come from Voyage AI (the Anthropic SDK does not expose an
# embeddings endpoint); the connection string is read from the environment
vo = voyageai.Client()

def embed_task(task: str) -> str:
    """Embed a task and render it in pgvector's text input format."""
    embedding = vo.embed([task], model="voyage-3").embeddings[0]
    return "[" + ",".join(str(x) for x in embedding) + "]"

def store_episode(task: str, execution_trace: list, outcome: str, success: bool):
    """Store a completed agent execution in episodic memory."""
    embedding = embed_task(task)
    conn = psycopg2.connect(os.environ["AGENT_MEMORY_DSN"])
    try:
        with conn.cursor() as cur:
            cur.execute("""
                INSERT INTO agent_episodes
                    (task, execution_trace, outcome, success, task_embedding, created_at)
                VALUES (%s, %s, %s, %s, %s::vector, NOW())
            """, (task, psycopg2.extras.Json(execution_trace),
                  outcome, success, embedding))
        conn.commit()
    finally:
        conn.close()

def retrieve_similar_episodes(task: str, limit: int = 3) -> list:
    """Retrieve the most similar past episodes for a given task."""
    embedding = embed_task(task)
    conn = psycopg2.connect(os.environ["AGENT_MEMORY_DSN"])
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT task, outcome, success,
                       1 - (task_embedding <=> %s::vector) AS similarity
                FROM agent_episodes
                ORDER BY task_embedding <=> %s::vector
                LIMIT %s
            """, (embedding, embedding, limit))
            return cur.fetchall()
    finally:
        conn.close()

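The code assumes an agent_episodes table with a pgvector column. A plausible schema, given as a sketch (the 1024 dimension matches voyage-3's output size; index choice is an assumption), would be:

```sql
-- Hypothetical schema for the episodic memory store (requires pgvector)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE agent_episodes (
    id              BIGSERIAL PRIMARY KEY,
    task            TEXT NOT NULL,
    execution_trace JSONB NOT NULL,
    outcome         TEXT,
    success         BOOLEAN NOT NULL,
    task_embedding  VECTOR(1024),  -- voyage-3 embeddings are 1024-dimensional
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Approximate nearest-neighbor index for the cosine distance operator (<=>)
CREATE INDEX ON agent_episodes
    USING hnsw (task_embedding vector_cosine_ops);
```
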
Layer 4: Semantic Memory (Knowledge Base)

Semantic memory is the agent’s general knowledge store — documentation, code references, domain knowledge, and organizational context that the agent may need to consult. This is typically implemented as a RAG (Retrieval-Augmented Generation) pipeline that retrieves relevant chunks from a vector store based on the current query.
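
A stripped-down sketch of that retrieval step, using an in-memory store of (chunk, embedding) pairs rather than a real vector database (the prompt template and function names are illustrative):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve_chunks(query_embedding: list, store: list, k: int = 3) -> list:
    """store is a list of (chunk_text, embedding) pairs; return the
    top-k chunks by cosine similarity, ready for prompt injection."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(task: str, chunks: list) -> str:
    """Inject retrieved knowledge ahead of the task statement."""
    context = "\n\n".join(f"<doc>\n{c}\n</doc>" for c in chunks)
    return f"Relevant documentation:\n{context}\n\nTask: {task}"
```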

Planning Strategies

Different tasks require different planning depths. Three strategies cover most production scenarios:

Chain-of-Thought (CoT) is the simplest: prompt the model to reason step-by-step before acting. Effective for linear tasks where the next action is clear from the current state. The key is to make reasoning explicit in the prompt: “Think through what you need to do before taking any action.”

Plan-and-Execute separates planning from execution. The model first generates a complete task decomposition, then executes each step. This is more reliable for complex tasks because the plan can be reviewed (and corrected) before execution begins. The trade-off is reduced adaptability — if step 3 produces unexpected results, the remainder of the pre-generated plan may be invalid.
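
The plan-and-execute control flow reduces to a small skeleton. In this sketch, `plan` and `execute_step` stand in for LLM calls (both are hypothetical placeholders, injected so the structure is visible):

```python
from typing import Callable, List, Tuple

def plan_and_execute(task: str,
                     plan: Callable[[str], List[str]],
                     execute_step: Callable[[str, list], Tuple[bool, str]]) -> dict:
    """Generate a full plan up front, then execute each step in order.
    execute_step returns (ok, result); a failed step aborts, since the
    remainder of the pre-generated plan may no longer be valid."""
    steps = plan(task)
    results = []
    for step in steps:
        ok, result = execute_step(step, results)
        results.append(result)
        if not ok:
            return {"status": "failed_at", "step": step, "results": results}
    return {"status": "done", "results": results}
```

The early abort is the design choice that distinguishes this from ReAct: rather than adapting mid-flight, the loop surfaces the failure so a new plan can be generated.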

Tree-of-Thought (ToT) explores multiple reasoning paths simultaneously, evaluating each path’s promise before committing. Effective for tasks with multiple valid approaches where the best path is not immediately obvious. Computationally more expensive — typically reserved for high-value tasks where the cost is justified.
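
At its core, tree-of-thought is a search over partial reasoning paths. A minimal beam-search sketch, with `propose` and `score` as stand-ins for the model calls that generate and evaluate candidate thoughts:

```python
from typing import Callable, List

def tree_of_thought(task: str,
                    propose: Callable[[str, List[str]], List[str]],
                    score: Callable[[str, List[str]], float],
                    depth: int = 3,
                    beam_width: int = 2) -> List[str]:
    """Expand each surviving path with candidate next thoughts, keep only
    the beam_width highest-scoring paths, and return the best full path."""
    beam: List[List[str]] = [[]]
    for _ in range(depth):
        candidates = [path + [thought]
                      for path in beam
                      for thought in propose(task, path)]
        if not candidates:
            break
        candidates.sort(key=lambda p: score(task, p), reverse=True)
        beam = candidates[:beam_width]
    return beam[0]
```

The cost structure is visible here: each level requires beam_width × branching-factor model calls for proposal plus as many again for scoring, which is why ToT is reserved for high-value tasks.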

Evaluation Framework for Production Agents

Agents are harder to evaluate than static models because their behavior depends on the sequence of tool results, which may be non-deterministic. Build evaluation at multiple levels:

  • Unit tests for tools — verify each tool function independently with mocked inputs
  • Trajectory evaluation — given a task and a fixed sequence of tool results, does the agent arrive at the correct conclusion? Use golden trajectories from human evaluations
  • End-to-end task success rate — run the agent against a benchmark of tasks with known correct answers. Track success rate, average step count, and token cost per task
  • Failure mode analysis — categorize failures (hallucination, tool misuse, planning failure, context window overflow) to identify which layer needs improvement
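
The trajectory-evaluation level can be sketched as a replay harness: feed the agent the recorded observations in order and compare its chosen action at each step against the golden one (the trajectory record format here is an assumption, and `agent_step` stands in for a model call):

```python
from typing import Callable, Dict, List

def evaluate_trajectory(golden: List[Dict],
                        agent_step: Callable[[list], Dict]) -> Dict:
    """golden is a list of {"expected_action": ..., "observation": ...}
    records; agent_step maps the history so far to the agent's next
    action. Returns per-step matches and an overall pass flag."""
    history, matches = [], []
    for record in golden:
        chosen = agent_step(history)
        matches.append(chosen == record["expected_action"])
        # Replay the recorded observation regardless of the agent's choice,
        # so every step is evaluated against the same fixed state
        history.append({"action": record["expected_action"],
                        "observation": record["observation"]})
    return {"step_matches": matches, "passed": all(matches)}
```

Because the tool results are fixed, this level is deterministic and cheap to run in CI, unlike end-to-end task success measurement.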

Conclusion

AI agent architecture is a discipline that borrows heavily from software engineering and cognitive science. The ReAct pattern provides the reasoning foundation; structured function calling provides reliable tool use; the four-layer memory architecture enables context, state, history, and knowledge to work together; and planning strategies match the approach to the task complexity. Rigorous evaluation turns an agent from a demo into a production system.

The field is evolving rapidly, but these foundational patterns are stable. Engineers who understand them deeply will be well-positioned to build reliable agents as the ecosystem matures.
