The promise of autonomous AI agents — systems that can decompose complex goals, select appropriate tools, recover from failures, and execute multi-step plans without human hand-holding — is becoming practical engineering rather than research aspiration. But building agents that work reliably in production requires careful attention to architecture. This guide examines the core patterns for multi-step task execution: the agent loop, tool use design, memory systems, error recovery, and the critical decision points where human oversight should intervene.
The Agent Loop
Every autonomous agent, regardless of the underlying model, implements some variant of the observe-think-act loop, usually with an explicit evaluation step that decides whether to go around again:
- Observe: Gather current state — tool outputs, memory contents, conversation history, external data
- Think: Invoke the language model with the current context to generate a plan or the next action
- Act: Execute the selected action (call a tool, write to memory, return a result)
- Evaluate: Assess whether the goal has been achieved; if not, loop
The implementation is deceptively simple:
async def agent_loop(goal: str, tools: list[Tool], max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": goal}]
    for step in range(max_steps):
        response = await llm.complete(
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        if response.stop_reason == "end_turn":
            return response.content
        if response.stop_reason == "tool_use":
            tool_results = await execute_tools(response.tool_calls)
            messages.append(response.as_message())
            messages.append(tool_results_message(tool_results))
            continue
        raise AgentError(f"Unexpected stop reason: {response.stop_reason}")
    raise AgentError(f"Max steps ({max_steps}) exceeded without completing goal")
The max_steps guard is not optional. Without it, a confused or manipulated agent can loop indefinitely, consuming tokens and potentially taking damaging repeated actions.
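To see the guard in action without a real model, here is a minimal, self-contained sketch. FakeResponse, ScriptedLLM, and run_bounded are hypothetical stand-ins invented for this illustration; they mirror only the fields the loop above relies on and are not any provider's API.

```python
import asyncio
from dataclasses import dataclass, field


class AgentError(Exception):
    """Raised when the loop cannot finish the goal."""


@dataclass
class FakeResponse:
    # Hypothetical response shape with only the fields the loop reads.
    stop_reason: str
    content: str = ""
    tool_calls: list = field(default_factory=list)


class ScriptedLLM:
    """Stand-in model that replays a fixed script, repeating the last entry."""

    def __init__(self, script):
        self.script = script
        self.calls = 0

    async def complete(self, messages):
        response = self.script[min(self.calls, len(self.script) - 1)]
        self.calls += 1
        return response


async def run_bounded(llm, goal: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        response = await llm.complete(messages)
        if response.stop_reason == "end_turn":
            return response.content
        # Pretend every requested tool ran and record a placeholder result.
        messages.append({"role": "user", "content": "tool result: ok"})
    raise AgentError(f"Max steps ({max_steps}) exceeded without completing goal")


# A model that never stops asking for tools is cut off after max_steps calls.
stuck = ScriptedLLM([FakeResponse("tool_use", tool_calls=["search"])])
try:
    asyncio.run(run_bounded(stuck, "find the answer", max_steps=5))
    outcome = "finished"
except AgentError:
    outcome = "aborted"
```

With max_steps=5 the scripted model is called exactly five times before the loop gives up, which puts a hard ceiling on both token spend and the number of tool invocations.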
Tool Design Principles
Tools are the agent’s interface to the world. Poor tool design is the most common source of agent failures. Follow these principles:
Narrow, Composable Tools
A tool that does one thing well is more reliable and easier to reason about than a tool that does many things. Prefer read_file(path) and write_file(path, content) over manage_file(operation, path, content). The agent can compose narrow tools to achieve complex outcomes; broad tools introduce ambiguity about which operation will be performed.
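As a rough illustration, the two narrow tools named above might look like the sketch below. The signatures and return values are assumptions made for this example, and append_line is a hypothetical helper showing that a richer behaviour can be composed from the two primitives instead of being added as a third mode on a broad tool.

```python
from pathlib import Path


def read_file(path: str) -> str:
    """Return the text content of the file at path."""
    return Path(path).read_text(encoding="utf-8")


def write_file(path: str, content: str) -> int:
    """Overwrite the file at path with content; return characters written."""
    return Path(path).write_text(content, encoding="utf-8")


def append_line(path: str, line: str) -> int:
    """Composite behaviour built purely from the two narrow tools."""
    existing = read_file(path) if Path(path).exists() else ""
    return write_file(path, existing + line + "\n")
```

In practice the agent would do this composition itself by calling read_file and then write_file; the helper only makes the sequence explicit.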
Idempotent Where Possible
Where possible, design tools so that calling them several times with the same arguments has the same effect as calling them once. Idempotent tools make retry logic safe. A tool that sends an email is not idempotent; one that upserts a database record generally is.
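The contrast can be sketched with in-memory stand-ins. Everything here is hypothetical: records, outbox, and seen_keys replace a real database and mail service, and the idempotency_key parameter is one common way (an addition for this example, not something the text above prescribes) to make a non-idempotent action safe to retry.

```python
records: dict[int, dict] = {}
outbox: list[str] = []
seen_keys: set[str] = set()


def upsert_record(record_id: int, fields: dict) -> dict:
    """Idempotent: repeating the call leaves the store in the same state."""
    records[record_id] = {**records.get(record_id, {}), **fields}
    return records[record_id]


def send_email(to: str, body: str, idempotency_key: str) -> bool:
    """Not naturally idempotent, so deduplicate on a caller-supplied key.

    Returns True if the message was sent, False if the key was already used.
    """
    if idempotency_key in seen_keys:
        return False
    seen_keys.add(idempotency_key)
    outbox.append(f"{to}: {body}")
    return True
```

A retry wrapper can call upsert_record blindly, while a retried send_email with the same key becomes a no-op instead of a duplicate message.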
Rich Return Values
Return structured data with enough context for the agent to reason about what happened. A tool that returns {"success": true} forces the agent to call additional tools to verify the outcome. A tool that returns {"success": true, "records_modified": 3, "record_ids": [101, 102, 103]} lets the agent proceed with confidence or raise a concern if the count is unexpected.
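A minimal sketch of a tool that reports what it changed, using the same result fields as the example above. The deactivate_users function and its in-memory users mapping are hypothetical.

```python
def deactivate_users(users: dict[int, dict], inactive_days: int) -> dict:
    """Deactivate users idle for at least inactive_days and report what changed."""
    changed = [
        uid for uid, user in users.items()
        if user["active"] and user["idle_days"] >= inactive_days
    ]
    for uid in changed:
        users[uid]["active"] = False
    return {
        "success": True,
        "records_modified": len(changed),
        "record_ids": sorted(changed),
    }
```

Because the result names the affected records, the agent can compare the count against its expectation without issuing a follow-up query.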
Explicit Error Messages
When a tool fails, return a structured error that tells the agent what went wrong and, if possible, how to recover:
{
    "error": "PERMISSION_DENIED",
    "message": "Cannot write to /etc/hosts: insufficient permissions",
    "suggestion": "Request elevated permissions or use an alternative path under /tmp"
}
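One way a tool could produce errors in that shape is to translate exceptions at the tool boundary. The sketch below is an assumption-laden example: write_file_tool, the NOT_FOUND code, and the success payload are all invented for illustration.

```python
def write_file_tool(path: str, content: str) -> dict:
    """Write content to path, translating exceptions into structured errors."""
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
    except PermissionError:
        return {
            "error": "PERMISSION_DENIED",
            "message": f"Cannot write to {path}: insufficient permissions",
            "suggestion": "Request elevated permissions or choose a writable path",
        }
    except FileNotFoundError:
        return {
            "error": "NOT_FOUND",
            "message": f"Cannot write to {path}: parent directory does not exist",
            "suggestion": "Create the parent directory first or choose an existing one",
        }
    return {"success": True, "path": path, "bytes_written": len(content.encode("utf-8"))}
```

The agent never sees a raw traceback, only a stable error code it can branch on plus a hint about what to try next.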
Memory Systems
The context window is finite and expensive. Effective agents maintain multiple memory layers with different characteristics:
In-Context Memory (Working Memory)
The current message history. Cheap to read, expensive in tokens, lost when the context is cleared. Use for the active task state and recent tool results. Compress aggressively — summarize completed sub-tasks rather than keeping full tool output histories.
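As a small sketch of that compression, the helper below swaps the messages belonging to a finished sub-task for a single note. collapse_subtask and the "[Completed sub-task]" label are hypothetical, and the summary text is assumed to be supplied by the caller (for example, written by the model when it finishes the sub-task).

```python
def collapse_subtask(messages: list[dict], start: int, end: int, summary: str) -> list[dict]:
    """Return a copy of messages with messages[start:end] replaced by one summary note."""
    note = {"role": "assistant", "content": f"[Completed sub-task] {summary}"}
    return messages[:start] + [note] + messages[end:]
```

Some chat APIs enforce role alternation or require tool calls to be paired with their results, so the slice boundaries may need to respect those rules.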
External Short-Term Memory
A key-value store (Redis, or a plain in-process dict) that persists across agent loop iterations but is scoped to the current task. The agent reads and writes explicitly via memory tools:
tools = [
    Tool("remember", lambda key, value: memory.set(key, value)),
    Tool("recall", lambda key: memory.get(key)),
    Tool("list_memories", lambda: memory.keys()),
]
This lets the agent track intermediate results, decisions made, and sub-task completions without filling the context window with repetition.
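The memory object behind those tools can be very small. This sketch assumes an in-process dict rather than Redis; TaskMemory and the memory_tools mapping are hypothetical names chosen for the example.

```python
from typing import Optional


class TaskMemory:
    """In-process key-value store scoped to a single task."""

    def __init__(self, task_id: str):
        self.task_id = task_id
        self._store: dict[str, str] = {}

    def set(self, key: str, value: str) -> str:
        self._store[key] = value
        return f"stored {key}"

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def keys(self) -> list[str]:
        return sorted(self._store)


# One memory instance per task keeps state from leaking between tasks.
memory = TaskMemory("task-42")
memory_tools = {
    "remember": memory.set,
    "recall": memory.get,
    "list_memories": memory.keys,
}
```

Swapping the dict for a Redis client later would only change the three method bodies, not the tool surface the agent sees.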
Long-Term Semantic Memory
A vector database (pgvector, Qdrant) containing past task outcomes, learned facts, and domain knowledge. The agent queries it with natural language at the start of a task to retrieve relevant prior experience. This is the mechanism that enables an agent to improve over time — past successes and failures inform future decisions.
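The retrieval step can be illustrated without a database. In this toy sketch, embed is a bag-of-words counter standing in for a real embedding model, and SemanticMemory is a hypothetical in-memory replacement for a vector store; only the shape of the add and search operations carries over to a production setup.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticMemory:
    def __init__(self):
        self.entries: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.entries.append((embed(text), text))

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda entry: cosine(q, entry[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

At the start of a task the agent would call search with the goal text and place the top results in its context.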
Error Recovery Patterns
Production agents encounter failures: network timeouts, malformed tool outputs, rate limits, unexpected data formats. Build recovery into the agent loop:
Retry with Exponential Backoff
For transient failures (network errors, rate limits), retry with exponential backoff and jitter. Wrap tool execution:
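import asyncio
import random

# TransientError: application-defined exception for retryable failures
async def execute_with_retry(tool_call, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await tool_call.execute()
        except TransientError:
            # Out of attempts: surface the error to the caller
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            wait = (2 ** attempt) + random.random()
            await asyncio.sleep(wait)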
Reflection on Failure
When a tool returns an error, add the error to the message history and let the model reason about what went wrong and how to adapt. Most capable models will adjust their approach when given clear error context. If the same failure repeats three times, escalate to a human checkpoint rather than continuing to loop.
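The three-strikes rule can be tracked with a small counter. FailureTracker is a hypothetical helper, and keying failures on a (tool name, error code) pair is an assumption about what counts as "the same failure".

```python
from collections import defaultdict


class FailureTracker:
    """Counts repeated identical failures per tool and flags when to escalate."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.counts: dict[tuple[str, str], int] = defaultdict(int)

    def record_failure(self, tool_name: str, error_code: str) -> bool:
        """Return True once the same failure has occurred `limit` times."""
        self.counts[(tool_name, error_code)] += 1
        return self.counts[(tool_name, error_code)] >= self.limit

    def record_success(self, tool_name: str) -> None:
        """A success resets every failure count for that tool."""
        for key in [k for k in self.counts if k[0] == tool_name]:
            del self.counts[key]
```

The agent loop would call record_failure after each tool error and hand control to a human checkpoint as soon as it returns True.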
Rollback Capabilities
For agents that modify state (databases, filesystems, external APIs), implement rollback tools that the agent can invoke when it detects it has made a mistake:
Tool("create_checkpoint", lambda: state.checkpoint()),
Tool("rollback_to_checkpoint", lambda checkpoint_id: state.rollback(checkpoint_id))
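For purely in-memory state, the object behind those two tools could be as simple as the sketch below. CheckpointedState is a hypothetical class; snapshotting with deep copies is an assumption that fits small state only, and changes already pushed to external systems would need their own compensating actions.

```python
import copy
import itertools


class CheckpointedState:
    """Mutable state with numbered snapshots that can be restored later."""

    def __init__(self, data: dict):
        self.data = data
        self._snapshots: dict[int, dict] = {}
        self._ids = itertools.count(1)

    def checkpoint(self) -> int:
        """Snapshot the current state and return its checkpoint id."""
        checkpoint_id = next(self._ids)
        self._snapshots[checkpoint_id] = copy.deepcopy(self.data)
        return checkpoint_id

    def rollback(self, checkpoint_id: int) -> None:
        """Restore the state captured at checkpoint_id."""
        if checkpoint_id not in self._snapshots:
            raise KeyError(f"Unknown checkpoint: {checkpoint_id}")
        # Copy again so the stored snapshot stays pristine for later rollbacks.
        self.data = copy.deepcopy(self._snapshots[checkpoint_id])
```

Returning the checkpoint id to the agent lets it name the exact restore point in a later rollback call.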
Human-in-the-Loop Checkpoints
Full autonomy is not always appropriate. Define explicit conditions under which the agent must pause and request human approval before proceeding:
- High-impact irreversible actions: Deleting records, sending external communications, executing financial transactions
- Confidence below threshold: When the model’s reasoning shows significant uncertainty about the correct action
- Repeated failures: Three consecutive failures on the same sub-task indicate the agent needs human guidance
- Scope expansion: When completing the goal would require accessing systems or data not mentioned in the original request
- Anomaly detection: Tool outputs that differ dramatically from expected patterns
async def checkpoint(reason: str, proposed_action: str) -> bool:
    """Returns True if human approves, False to abort."""
    await notify_operator(
        f"Agent checkpoint: {reason}\nProposed action: {proposed_action}\n"
        "Approve? Reply YES/NO within 300 seconds."
    )
    # Assumes wait_for_response returns None on timeout; silence counts as a rejection
    response = await wait_for_response(timeout=300)
    return response is not None and response.strip().upper() == "YES"
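The pause conditions listed above can be encoded as a small policy function that runs before every action. This is a sketch under stated assumptions: ProposedAction, the tool names in IRREVERSIBLE_ACTIONS, and the 0.7 confidence threshold are all hypothetical, the confidence score is assumed to be supplied by the caller, and anomaly detection is left out because it depends on the specific tools involved.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tool names treated as high-impact and irreversible.
IRREVERSIBLE_ACTIONS = {"delete_records", "send_email", "transfer_funds"}


@dataclass
class ProposedAction:
    tool_name: str
    target_system: str
    confidence: float  # assumed to be supplied by the caller, 0.0 to 1.0
    consecutive_failures: int


def checkpoint_reason(
    action: ProposedAction,
    allowed_systems: set[str],
    min_confidence: float = 0.7,
) -> Optional[str]:
    """Return why a human must approve this action, or None if it may proceed."""
    if action.tool_name in IRREVERSIBLE_ACTIONS:
        return "high-impact irreversible action"
    if action.confidence < min_confidence:
        return "confidence below threshold"
    if action.consecutive_failures >= 3:
        return "repeated failures"
    if action.target_system not in allowed_systems:
        return "scope expansion"
    return None
```

A non-None result would be passed as the reason argument when requesting approval.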
Context Management
Long-running agents accumulate context that approaches token limits. Implement progressive summarization:
async def compress_history(messages: list, keep_recent: int = 10) -> list:
    # Nothing to do while under budget, or when there is no older span to summarize
    if token_count(messages) < CONTEXT_THRESHOLD or len(messages) <= keep_recent + 1:
        return messages
    # Keep the first message (system prompt or original goal) and the last N verbatim
    recent = messages[-keep_recent:]
    older = messages[1:-keep_recent]
    response = await llm.complete(messages=[{
        "role": "user",
        "content": f"Summarize the following agent task history concisely, "
                   f"preserving all key decisions, tool results, and current state:\n\n"
                   f"{format_messages(older)}"
    }])
    summary = {"role": "assistant", "content": f"[History summary]: {response.content}"}
    return [messages[0], summary, *recent]
Observability
Agents are opaque by default — a sequence of LLM calls with tool invocations between them. Add structured logging at every step:
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AgentStep:
    step_number: int
    timestamp: datetime
    model_input_tokens: int
    model_output_tokens: int
    tool_calls: list[ToolCall]
    tool_results: list[ToolResult]
    reasoning: str  # Model's chain-of-thought if visible
Emit these as structured events to your observability platform. Traces that show the full decision tree for a completed task are invaluable for debugging failures and optimizing prompt design.
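One simple way to emit such events is as JSON lines, which most log pipelines can ingest. In this sketch StepEvent is a simplified, hypothetical stand-in for the AgentStep dataclass above (tool calls are plain dicts so the example stays self-contained), and to_log_line is an illustrative helper, not part of any observability SDK.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class StepEvent:
    # Simplified stand-in for AgentStep: tool calls are plain dicts here.
    step_number: int
    timestamp: datetime
    model_input_tokens: int
    model_output_tokens: int
    tool_calls: list = field(default_factory=list)


def to_log_line(event: StepEvent) -> str:
    """Serialize one step as a single JSON line with an ISO 8601 timestamp."""
    payload = asdict(event)
    payload["timestamp"] = event.timestamp.isoformat()
    return json.dumps(payload, sort_keys=True)


event = StepEvent(
    step_number=1,
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    model_input_tokens=812,
    model_output_tokens=96,
    tool_calls=[{"name": "read_file", "args": {"path": "README.md"}}],
)
line = to_log_line(event)
```

Writing one line per step keeps events easy to filter by step_number and to stitch back into a full trace.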
Conclusion
Building reliable autonomous agents requires disciplined engineering at every layer: a bounded agent loop, well-designed narrow tools, a multi-tier memory architecture, robust error recovery, and principled human-in-the-loop checkpoints. The models themselves are increasingly capable; the limiting factor is usually the scaffolding around them. Invest in observability from day one — agents that cannot be debugged cannot be improved.
