Building AI Agents from Scratch

15 min read

Build an AI agent with tool use, memory, and error recovery using OpenAI's function calling API. No frameworks—just the patterns that matter.

ChatGPT is a chatbot. An agent is a chatbot that can do things. That distinction is worth billions of dollars right now.

A chatbot generates text. An agent generates text and then acts on it—calling APIs, querying databases, writing files, making decisions in loops. The shift from “generate a response” to “accomplish a goal” is the difference between a toy and a tool.

But most agent tutorials are either toy demos (“look, the LLM can call a calculator!”) or impenetrable framework code (LangChain’s 47-layer abstraction for calling a function). The core concepts are simple. The frameworks obscure them.

In this post, we’ll build an agent from scratch. No LangChain. No CrewAI. No AutoGen. Just Python and OpenAI’s function calling API.

What we’ll build:

  1. The agent loop — the think → act → observe cycle that powers every agent
  2. Tool use — defining and calling tools with structured inputs/outputs
  3. Memory — short-term (conversation) and working memory (scratchpad)
  4. Error recovery — what happens when tools fail, and how to handle it gracefully
  5. Guardrails — keeping the agent from doing things it shouldn’t

The goal isn’t to build a production framework. It’s to understand the patterns well enough that you can build exactly the agent you need—or evaluate the frameworks that claim to do it for you.


The Agent Loop

Every agent, from the simplest to the most complex, runs the same loop:

while goal not achieved:
    1. THINK  — Look at the current state and decide what to do
    2. ACT    — Call a tool or take an action
    3. OBSERVE — See the result
    4. UPDATE — Add the result to memory, decide if we're done

This is the ReAct pattern (Reason + Act). The LLM doesn’t just generate a response—it reasons about what to do, does it, observes the outcome, and reasons again. It’s a feedback loop, and feedback loops are what make agents powerful.

The loop terminates when the agent either:

  • Achieves the goal (returns a final answer)
  • Gives up (too many iterations, unrecoverable error)
  • Hits a guardrail (tries to do something forbidden)

Let’s build it.


Setup

import json
import time
import httpx
from datetime import datetime
from dataclasses import dataclass, field
from typing import Any, Callable
 
from openai import OpenAI
 
client = OpenAI()  # Uses OPENAI_API_KEY env var
MODEL = "gpt-4o"

Defining Tools

A tool is just a function with a structured interface. The LLM sees the tool’s name, description, and parameter schema (as JSON Schema). Based on that, it decides whether and how to call the tool.

The quality of your tool descriptions is the single biggest factor in agent performance. The LLM can’t read your source code—it only sees what you tell it. Vague descriptions produce vague tool calls.
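
To make that concrete, here are two schemas for the same hypothetical tool. Both are illustrative (the tool and database don't appear elsewhere in this post): the first forces the model to guess, the second states the purpose, when to use it, and the expected input format.

```python
# Vague: the model has to guess when this applies and what "q" should contain.
vague = {
    "name": "query",
    "description": "Run a query.",
    "parameters": {
        "type": "object",
        "properties": {"q": {"type": "string"}},
        "required": ["q"],
    },
}

# Specific: purpose, when to use it, and the expected input format, with an example.
specific = {
    "name": "query_sales_db",
    "description": (
        "Run a read-only SQL query against the sales database. "
        "Use this for questions about orders, revenue, or customers. "
        "Input must be a single SELECT statement, e.g. "
        '"SELECT SUM(total) FROM orders WHERE year = 2023".'
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "sql": {
                "type": "string",
                "description": "A single SELECT statement. No INSERT/UPDATE/DELETE.",
            },
        },
        "required": ["sql"],
    },
}
```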

Let’s define a set of tools for a data analysis agent:

@dataclass
class Tool:
    """A tool that the agent can use."""
    name: str
    description: str
    parameters: dict  # JSON Schema for the tool's parameters
    function: Callable[..., str]  # The actual function to call
    requires_confirmation: bool = False  # If True, ask user before executing
 
 
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression safely."""
    # Allowlist restricts eval() to arithmetic characters only
    allowed_chars = set("0123456789+-*/.() ")
    if not all(c in allowed_chars for c in expression):
        return "Error: Expression contains unsafe characters. Only math operators allowed."
    try:
        # Restricted to arithmetic by the allowlist above; note that
        # pathological inputs like '9**9**9' can still be slow to evaluate
        result = eval(expression)
        return str(result)
    except Exception as e:
        return f"Error: {str(e)}"
 
 
def search_web(query: str) -> str:
    """Search the web and return top results."""
    # In production, use a real search API (Tavily, Brave, SerpAPI, etc.)
    # For demo purposes, return simulated results
    simulated_results = {
        "iowa liquor sales": "The Iowa ABD reported $1.9 billion in total liquor sales for fiscal year 2023, up 3.2% from 2022. The top-selling category was American Vodkas.",
        "polars vs pandas": "Polars is a DataFrame library written in Rust that offers 5-20x speedups over pandas through lazy evaluation, predicate pushdown, and automatic parallelization.",
        "feature store": "A feature store is a centralized repository for storing, managing, and serving ML features. Popular options include Feast (open source) and Tecton (managed).",
    }
    # Fuzzy match
    query_lower = query.lower()
    for key, result in simulated_results.items():
        if any(word in query_lower for word in key.split()):
            return result
    return f"No results found for: {query}"
 
 
def get_current_time() -> str:
    """Get the current date and time."""
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
 
 
def write_note(title: str, content: str) -> str:
    """Save a note to the agent's working memory."""
    return f"Note saved: '{title}'"
 
 
def read_file(filepath: str) -> str:
    """Read the contents of a local file."""
    # Guardrail: restrict to safe paths
    allowed_prefixes = ["/tmp/", "./data/"]
    if not any(filepath.startswith(p) for p in allowed_prefixes):
        return f"Error: Access denied. Can only read files in {allowed_prefixes}"
    try:
        with open(filepath) as f:
            return f.read()[:5000]  # Limit output size
    except FileNotFoundError:
        return f"Error: File not found: {filepath}"
    except Exception as e:
        return f"Error reading file: {str(e)}"
 
 
# Define the tool registry
TOOLS = [
    Tool(
        name="calculate",
        description="Evaluate a mathematical expression. Use this for any arithmetic, percentages, or unit conversions. Input should be a valid Python math expression like '(100 * 1.05) - 50'.",
        parameters={
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "A mathematical expression to evaluate (e.g., '2 + 2', '100 * 0.15')",
                },
            },
            "required": ["expression"],
        },
        function=lambda expression: calculate(expression),
    ),
    Tool(
        name="search_web",
        description="Search the web for current information. Use this when you need facts, statistics, or information you don't have in your training data.",
        parameters={
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query",
                },
            },
            "required": ["query"],
        },
        function=lambda query: search_web(query),
    ),
    Tool(
        name="get_current_time",
        description="Get the current date and time. Use this when the user asks about the current date or time.",
        parameters={"type": "object", "properties": {}},
        function=lambda: get_current_time(),
    ),
    Tool(
        name="write_note",
        description="Save a note for later reference. Use this to store intermediate results, plans, or observations during multi-step tasks.",
        parameters={
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "A short title for the note"},
                "content": {"type": "string", "description": "The note content"},
            },
            "required": ["title", "content"],
        },
        function=lambda title, content: write_note(title, content),
    ),
    Tool(
        name="read_file",
        description="Read the contents of a local file. Restricted to /tmp/ and ./data/ directories for security.",
        parameters={
            "type": "object",
            "properties": {
                "filepath": {"type": "string", "description": "Path to the file to read"},
            },
            "required": ["filepath"],
        },
        function=lambda filepath: read_file(filepath),
        requires_confirmation=True,
    ),
]
 
# Build the lookup map
TOOL_MAP = {tool.name: tool for tool in TOOLS}
 
# Convert to OpenAI function calling format
TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description,
            "parameters": tool.parameters,
        },
    }
    for tool in TOOLS
]
 
print(f"Registered {len(TOOLS)} tools:")
for tool in TOOLS:
    print(f"  • {tool.name}: {tool.description[:60]}...")

Memory

An agent without memory is just an LLM with extra steps. Memory is what makes agents useful—they can build on previous results, remember context, and work through multi-step problems.

We implement two types of memory:

  1. Conversation memory: The full message history (what was said and done). This is passed to the LLM on every turn so it has full context.
  2. Working memory: A scratchpad for notes, intermediate results, and observations. The agent can explicitly save and retrieve information here.

There’s a practical limit: the context window. When conversation memory gets too long, we need to summarize older turns to stay within the token budget. This is where most naive agent implementations break—they just truncate history, losing critical context.

@dataclass
class AgentMemory:
    """Agent memory management."""
    messages: list[dict] = field(default_factory=list)
    working_memory: dict[str, str] = field(default_factory=dict)  # Notes scratchpad
    max_messages: int = 50  # Before summarization kicks in
 
    def add_system(self, content: str):
        self.messages.append({"role": "system", "content": content})
 
    def add_user(self, content: str):
        self.messages.append({"role": "user", "content": content})
 
    def add_assistant(self, content: str | None = None, tool_calls: list | None = None):
        msg = {"role": "assistant"}
        if content:
            msg["content"] = content
        if tool_calls:
            msg["tool_calls"] = tool_calls
        self.messages.append(msg)
 
    def add_tool_result(self, tool_call_id: str, content: str):
        self.messages.append({
            "role": "tool",
            "tool_call_id": tool_call_id,
            "content": content,
        })
 
    def save_note(self, title: str, content: str):
        self.working_memory[title] = content
 
    def get_context_messages(self) -> list[dict]:
        """Get messages to send to the LLM, including working memory context."""
        msgs = self.messages.copy()
 
        # Inject working memory as a system message if it has content
        if self.working_memory:
            notes = "\n".join(f"- **{k}**: {v}" for k, v in self.working_memory.items())
            msgs.insert(1, {
                "role": "system",
                "content": f"Your working memory (notes from previous steps):\n{notes}",
            })
 
        return msgs
 
    def should_summarize(self) -> bool:
        return len(self.messages) > self.max_messages
 
    def summarize(self, client: OpenAI):
        """Compress older messages into a summary to free up context."""
        if not self.should_summarize():
            return
 
        # Keep system prompt + last 10 messages, summarize the rest
        system_msgs = [m for m in self.messages if m["role"] == "system"]
        recent = self.messages[-10:]
        to_summarize = self.messages[len(system_msgs):-10]
 
        if not to_summarize:
            return
 
        summary_prompt = (
            "Summarize this conversation history concisely. "
            "Focus on: key decisions, tool results, and important findings. "
            "Keep all factual information.\n\n"
            + "\n".join(str(m) for m in to_summarize)
        )
 
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=500,
        )
 
        summary = response.choices[0].message.content
        self.messages = system_msgs + [
            {"role": "system", "content": f"Summary of earlier conversation:\n{summary}"},
        ] + recent
 
        print(f"  [Memory] Summarized {len(to_summarize)} messages into 1")
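
Counting messages is a crude proxy for the real constraint, which is tokens. A cheap heuristic (roughly 4 characters per token for English text; an approximation, not a tokenizer) can trigger summarization more reliably. For exact counts you'd use a tokenizer like tiktoken; this sketch trades accuracy for zero dependencies.

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    total_chars = sum(len(str(m.get("content") or "")) for m in messages)
    return total_chars // 4


def should_summarize_by_tokens(messages: list[dict], budget: int = 100_000) -> bool:
    # Leave headroom: summarize once ~75% of the budget is used
    return estimate_tokens(messages) > int(budget * 0.75)
```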

The Agent

Now we assemble the pieces. The Agent class implements the think-act-observe loop, delegates tool calls to the right functions, handles errors, and manages memory.

Key design decisions:

  • Max iterations: Prevent infinite loops. If the agent hasn’t finished in N steps, it gives up.
  • Error recovery: When a tool call fails, the error is fed back as an observation so the agent can try a different approach.
  • Confirmation prompts: Sensitive tools (file access, API calls) can require user confirmation before executing.

class Agent:
    """An AI agent with tool use, memory, and error recovery."""
 
    def __init__(
        self,
        system_prompt: str,
        tools: list[Tool],
        model: str = "gpt-4o",
        max_iterations: int = 10,
        verbose: bool = True,
    ):
        self.model = model
        self.max_iterations = max_iterations
        self.verbose = verbose
        self.tool_map = {t.name: t for t in tools}
        self.tool_schemas = [
            {
                "type": "function",
                "function": {
                    "name": t.name,
                    "description": t.description,
                    "parameters": t.parameters,
                },
            }
            for t in tools
        ]
        self.memory = AgentMemory()
        self.memory.add_system(system_prompt)
 
    def _log(self, msg: str):
        if self.verbose:
            print(msg)
 
    def _execute_tool(self, name: str, arguments: dict) -> str:
        """Execute a tool call with error handling."""
        tool = self.tool_map.get(name)
        if not tool:
            return f"Error: Unknown tool '{name}'. Available tools: {list(self.tool_map.keys())}"
 
        try:
            result = tool.function(**arguments)
 
            # Handle write_note specially—save to working memory
            if name == "write_note":
                self.memory.save_note(
                    arguments.get("title", "untitled"),
                    arguments.get("content", ""),
                )
 
            return result
 
        except TypeError as e:
            return f"Error: Wrong arguments for tool '{name}': {str(e)}"
        except Exception as e:
            return f"Error executing '{name}': {type(e).__name__}: {str(e)}"
 
    def run(self, user_message: str) -> str:
        """Run the agent loop until a final answer is produced or max iterations hit."""
        self.memory.add_user(user_message)
        self._log(f"\nUser: {user_message}")
        self._log("=" * 60)
 
        for iteration in range(self.max_iterations):
            self._log(f"\n--- Iteration {iteration + 1}/{self.max_iterations} ---")
 
            # Summarize if memory is getting large
            if self.memory.should_summarize():
                self.memory.summarize(client)
 
            # THINK: Ask the LLM what to do
            response = client.chat.completions.create(
                model=self.model,
                messages=self.memory.get_context_messages(),
                tools=self.tool_schemas,  # omit this argument entirely for a tool-less agent; the API rejects null
                temperature=0.1,
            )
 
            message = response.choices[0].message
 
            # Case 1: LLM wants to call tools (ACT)
            if message.tool_calls:
                # Store the assistant message with tool calls
                self.memory.add_assistant(
                    content=message.content,
                    tool_calls=[
                        {
                            "id": tc.id,
                            "type": "function",
                            "function": {
                                "name": tc.function.name,
                                "arguments": tc.function.arguments,
                            },
                        }
                        for tc in message.tool_calls
                    ],
                )
 
                # Execute each tool call
                for tool_call in message.tool_calls:
                    name = tool_call.function.name
                    args = json.loads(tool_call.function.arguments)
                    self._log(f"  Tool: {name}({json.dumps(args, indent=None)})")
 
                    # OBSERVE: Execute and capture result
                    result = self._execute_tool(name, args)
                    self._log(f"  Result: {result[:200]}")
 
                    # Store the tool result
                    self.memory.add_tool_result(tool_call.id, result)
 
            # Case 2: LLM produced a final answer (no tool calls)
            else:
                final_answer = message.content
                self.memory.add_assistant(content=final_answer)
                self._log(f"\nAgent: {final_answer}")
                return final_answer
 
        # Max iterations reached: surface whatever made it into working memory
        notes = "\n".join(f"- {k}: {v}" for k, v in self.memory.working_memory.items())
        fallback = (
            "I wasn't able to complete the task within the allowed number of steps. "
            "Here's what I found so far:\n" + (notes or "(no notes were saved)")
        )
        self.memory.add_assistant(content=fallback)
        return fallback

Demo: The Agent in Action

Let’s see the agent work through some tasks of increasing complexity. Watch the think-act-observe loop unfold:

# Create an agent for data analysis tasks
agent = Agent(
    system_prompt=(
        "You are a helpful data analysis assistant. You can search for information, "
        "perform calculations, and take notes. When working on multi-step problems, "
        "save intermediate results as notes so you can reference them later. "
        "Always show your reasoning."
    ),
    tools=TOOLS,
    verbose=True,
)
 
# Task 1: Simple tool use
result = agent.run(
    "What were Iowa's total liquor sales in 2023, and what would a 5% increase look like?"
)
# Task 2: Multi-step reasoning (same agent, so it has memory of the first task)
result = agent.run(
    "Based on what you found about Iowa liquor sales, if each of the state's 99 counties "
    "contributed equally, how much would each county contribute? And what if the top "
    "5 counties contributed 40% of the total?"
)
# Task 3: Handling unknowns and edge cases
fresh_agent = Agent(
    system_prompt=(
        "You are a research assistant. Use the available tools to find information. "
        "If you can't find something, say so honestly rather than making things up. "
        "Save important findings as notes."
    ),
    tools=TOOLS,
    verbose=True,
)
 
result = fresh_agent.run(
    "Compare Polars and Pandas for data processing. What are the key advantages of each? "
    "Then calculate: if Polars is 10x faster and a Pandas job takes 45 minutes, "
    "how much time would you save per day if you run that job 8 times?"
)

Guardrails

An agent without guardrails is a liability. The LLM might try to:

  • Access files it shouldn’t (we handled this with path restrictions)
  • Make too many API calls (we handled this with max_iterations)
  • Execute dangerous operations (we use requires_confirmation)
  • Leak sensitive information in its responses

Here are additional guardrails worth implementing:

@dataclass
class Guardrails:
    """Safety constraints for agent behavior."""
    max_tool_calls_per_turn: int = 5
    max_total_tool_calls: int = 20
    blocked_tool_sequences: list[list[str]] = field(default_factory=list)
    output_filters: list[Callable[[str], str]] = field(default_factory=list)
 
    _tool_call_count: int = field(default=0, init=False)
    _turn_tool_count: int = field(default=0, init=False)
    _recent_tools: list[str] = field(default_factory=list, init=False)
 
    def check_tool_call(self, tool_name: str) -> str | None:
        """Check if a tool call is allowed. Returns error message if blocked."""
        self._tool_call_count += 1
        self._turn_tool_count += 1
        self._recent_tools.append(tool_name)
 
        if self._tool_call_count > self.max_total_tool_calls:
            return f"Guardrail: Maximum total tool calls ({self.max_total_tool_calls}) exceeded."
 
        if self._turn_tool_count > self.max_tool_calls_per_turn:
            return f"Guardrail: Maximum tool calls per turn ({self.max_tool_calls_per_turn}) exceeded."
 
        # Check for blocked sequences (e.g., read_file → read_file → read_file = exfiltration attempt)
        for seq in self.blocked_tool_sequences:
            if self._recent_tools[-len(seq):] == seq:
                return f"Guardrail: Blocked tool sequence detected: {seq}"
 
        return None  # Allowed
 
    def reset_turn(self):
        self._turn_tool_count = 0
 
    def filter_output(self, text: str) -> str:
        """Apply output filters (PII removal, etc.)."""
        for f in self.output_filters:
            text = f(text)
        return text
 
 
# Example: Create guardrails that prevent excessive file reading
def redact_emails(text: str) -> str:
    """Simple PII filter for email addresses."""
    import re
    return re.sub(r'[\w.+-]+@[\w-]+\.[\w.-]+', '[REDACTED_EMAIL]', text)
 
 
guardrails = Guardrails(
    max_tool_calls_per_turn=3,
    max_total_tool_calls=15,
    blocked_tool_sequences=[
        ["read_file", "read_file", "read_file"],  # Prevent bulk file exfiltration
    ],
    output_filters=[redact_emails],
)
 
# Test the guardrails
print(guardrails.check_tool_call("search_web"))  # None (allowed)
print(guardrails.check_tool_call("calculate"))    # None (allowed)
print(guardrails.check_tool_call("calculate"))    # None (allowed)
print(guardrails.check_tool_call("calculate"))    # Blocked (4th call in a turn)
 
print()
print(guardrails.filter_output("Contact [email protected] for details"))  # email gets redacted

Error Recovery Patterns

Agents fail. Tools return errors. APIs time out. The LLM hallucinates a tool name that doesn’t exist. What matters is how the agent recovers.

Our agent handles errors by feeding them back as observations. This gives the LLM a chance to:

  1. Retry with different arguments (e.g., fix a malformed math expression)
  2. Try a different tool (e.g., if search fails, try calculating from known data)
  3. Give up gracefully (tell the user what went wrong and what it tried)
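
There's also a pattern worth applying before the model ever sees an error: retry transient failures (timeouts, rate limits) with exponential backoff, and only surface the error as an observation once retries are exhausted. A minimal sketch:

```python
import time


def call_with_retries(fn, *args, retries: int = 3, base_delay: float = 0.5, **kwargs) -> str:
    """Retry a tool call on exceptions, with exponential backoff.

    Only after the last attempt fails is the error returned as a string,
    so the agent sees it as an observation and can change course.
    """
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            if attempt == retries - 1:
                return f"Error after {retries} attempts: {type(e).__name__}: {e}"
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```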

Let’s see error recovery in action:

# Create an agent and deliberately cause tool errors
error_agent = Agent(
    system_prompt=(
        "You are a helpful assistant. When a tool fails, try an alternative approach. "
        "Be transparent about what worked and what didn't."
    ),
    tools=TOOLS,
    verbose=True,
)
 
# This will cause a file read error (path not allowed),
# and the agent should handle it gracefully
result = error_agent.run(
    "Read the file at /etc/passwd and tell me what's in it. "
    "If you can't access it, search for information about Linux user management instead."
)

Patterns Worth Knowing

Now that you understand the core loop, here are patterns you’ll encounter as you build more sophisticated agents:

Planning Before Acting

For complex tasks, have the agent create a plan before executing. Add a think tool that doesn’t do anything except let the agent reason out loud:

Tool(
    name="think",
    description="Think through a problem step by step before acting. Use this for complex multi-step tasks.",
    parameters={"type": "object", "properties": {"thoughts": {"type": "string"}}},
    function=lambda thoughts: "Thought recorded. Proceed with your plan.",
)

Parallel Tool Calls

OpenAI’s API supports multiple tool calls in a single response. When the agent needs data from two independent sources, it can call both tools simultaneously. Our implementation already handles this—look at how we loop over message.tool_calls.
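
That loop executes the calls sequentially, though. For slow, independent tools (two web searches, say) you can run them concurrently. A sketch using a thread pool, the simplest option for I/O-bound tools; note that `tool_map` here maps names directly to callables, a simplification of the Tool dataclass used above:

```python
from concurrent.futures import ThreadPoolExecutor


def execute_tools_concurrently(calls: list[tuple[str, dict]], tool_map: dict) -> list[str]:
    """Execute independent tool calls in parallel threads.

    `calls` is a list of (tool_name, arguments) pairs; results come back
    in the same order, so they can be matched to their tool_call ids.
    """
    def run_one(call):
        name, args = call
        fn = tool_map.get(name)
        if fn is None:
            return f"Error: Unknown tool '{name}'"
        try:
            return fn(**args)
        except Exception as e:
            return f"Error executing '{name}': {e}"

    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_one, calls))
```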

Human-in-the-Loop

For high-stakes actions, pause and ask the user:

if tool.requires_confirmation:
    user_input = input(f"Agent wants to call {name}({args}). Allow? [y/n]: ")
    if user_input.lower() != "y":
        result = "Tool call denied by user."  # fed back as the tool observation
    else:
        result = tool.function(**args)

Agent Handoff

When a task requires multiple specialized capabilities, use a router agent that delegates to specialized sub-agents. The router decides which specialist to call based on the task type.
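
A minimal router can be sketched with keyword matching; in practice you'd often use an LLM classification call instead, but keywords make the idea concrete and serve as a cheap fast path. The specialist names and trigger words below are illustrative:

```python
def route_task(task: str, specialists: dict[str, list[str]]) -> str:
    """Pick a specialist agent by matching task keywords.

    `specialists` maps an agent name to its trigger keywords; the name
    with the most matches wins, falling back to a generalist.
    """
    task_lower = task.lower()
    best_name, best_hits = "generalist", 0
    for name, keywords in specialists.items():
        hits = sum(1 for kw in keywords if kw in task_lower)
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name


# Illustrative routing table: each name would map to a configured Agent
SPECIALISTS = {
    "data_analyst": ["calculate", "average", "statistics", "csv", "sales"],
    "researcher": ["search", "find", "look up", "compare"],
    "writer": ["draft", "summarize", "rewrite", "email"],
}
```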


How This Compares to Frameworks

Aspect          | Our Implementation        | LangChain                      | CrewAI
----------------|---------------------------|--------------------------------|--------------------------
Lines of code   | ~200                      | Thousands (+ dependencies)     | Hundreds (+ dependencies)
Debuggability   | Full visibility           | Abstraction layers hide logic  | Moderate
Tool definition | Dataclass + function      | @tool decorator + schema       | @tool decorator
Memory          | Explicit management       | Multiple memory backends       | Shared/private memory
Multi-agent     | Build it yourself         | Chains and graphs              | Built-in crew concept
Vendor lock-in  | OpenAI (easily swappable) | Multiple providers             | Multiple providers

When to use a framework:

  • You need multi-agent orchestration with complex graphs
  • You want pre-built integrations (vector stores, APIs, databases)
  • Your team already knows the framework

When to build from scratch:

  • You need full control over the agent loop
  • Debuggability is critical (production systems)
  • You want to minimize dependencies
  • You’re learning how agents work (you are here)

Conclusion

An AI agent is three things:

  1. A loop that alternates between thinking and acting
  2. Tools that let it interact with the world
  3. Memory that lets it learn from its actions

Everything else—guardrails, error recovery, planning, multi-agent handoff—is refinement on these three ideas. If you understand the loop, you can build any agent.

The implementation we built is ~200 lines of Python with no dependencies beyond the OpenAI SDK. It handles multi-step reasoning, error recovery, memory management, and basic safety guardrails. It’s not production-ready (you’d want proper logging, async execution, rate limiting, and more robust error handling), but it’s a foundation you own and understand.

Where to Go From Here

  • Add more tools: Database queries, API calls, code execution, file writing
  • Implement structured output: Use Pydantic models for tool return types
  • Add streaming: Stream the agent’s thinking and tool calls in real-time
  • Build a router: Create a meta-agent that delegates to specialized sub-agents
  • Connect to the RAG post: Give your agent a knowledge base via retrieval
  • Evaluate with the LLM Evaluation post: Systematically test agent behavior