AI Agent Error Recovery Patterns for Production Systems

Why agent error recovery is not optional in production

Demo agents feel magical because failures are rare and cheap. Production agents fail constantly: rate limits spike at 2 a.m., a vendor changes an API schema without notice, a tool times out while your database holds a row lock, and the model returns valid JSON that does not match your Zod schema. None of these are edge cases. They are Tuesday.

If your agent stack treats every failure as fatal, users see opaque errors, support tickets pile up, and your on-call engineer restarts pods hoping for luck. If you blanket-retry everything, you burn token budgets, amplify outages, and occasionally double-charge a customer because the first tool call actually succeeded but the response never made it back to the model.

Production-grade AI agent error recovery sits between those extremes. You classify failures, retry only what is safe to retry, fall back to alternate tools or models when the primary path is unhealthy, and trip circuit breakers before a flaky dependency takes down your whole fleet. This article walks through those patterns with Node.js and TypeScript examples you can drop into an agent orchestration layer today.

The patterns here apply whether you are building a customer-facing support agent, an internal code assistant, or a background workflow that chains ten tool calls per request. The primitives are the same: bounded retries, explicit fallback chains, and breakers with clear half-open behavior.

The failure taxonomy every agent team should map first

Before writing retry logic, draw a simple failure map. Agent systems fail in four overlapping layers, and each layer needs a different response.

Layer	Typical failures	Default response
Model API	429 rate limit, 503 overload, timeout, content filter	Retry with backoff; fallback model
Tool execution	HTTP 5xx, validation error, auth expiry, timeout	Retry if idempotent; alternate tool
Orchestration	Max steps exceeded, malformed tool JSON, schema mismatch	Repair prompt or abort with user message
Downstream services	DB deadlock, queue full, third-party outage	Circuit breaker; degrade gracefully

Most teams over-invest in model retries and under-invest in tool and orchestration failures. That imbalance shows up in dashboards where LLM latency looks fine but end-to-end success rate sits at eighty-three percent because search_inventory fails silently twice a session.

Document each tool's idempotency contract. A create_invoice tool is not safe to retry blindly; a get_order_status tool usually is. Your recovery layer should read those contracts from configuration, not from comments buried in handler code.
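
One way to keep those contracts in configuration is a small typed registry that the recovery layer consults before any retry. The sketch below is illustrative only: the ToolContract shape, the toolContracts map, and canAutoRetry are hypothetical names, and the two tool names are the examples from the paragraph above.

```typescript
// Hypothetical contract shape; the field names are assumptions for illustration.
type ToolContract = {
  idempotent: boolean; // safe to repeat with the same input?
  requiresIdempotencyKey: boolean; // retry allowed only when the caller supplies a key
};

// In practice this map could be loaded from a JSON or YAML file at startup.
const toolContracts: Record<string, ToolContract> = {
  get_order_status: { idempotent: true, requiresIdempotencyKey: false },
  create_invoice: { idempotent: false, requiresIdempotencyKey: true },
};

// Unknown tools default to "do not retry", the safe choice for an undocumented side effect.
function canAutoRetry(toolName: string, idempotencyKey?: string): boolean {
  const contract = toolContracts[toolName];
  if (!contract) return false;
  if (contract.idempotent) return true;
  return contract.requiresIdempotencyKey && idempotencyKey !== undefined;
}
```

Defaulting unknown tools to "no retry" means a newly added tool is never retried until someone writes down its contract.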

Transient versus permanent failures

Transient failures might succeed on a second attempt: network blips, temporary rate limits, cold starts. Permanent failures will not: invalid API key, unknown tool name, user lacks permission. Retrying permanent failures wastes money and makes users wait longer for an error that was inevitable from the first attempt.

Return structured error codes from tools so the orchestrator can decide without parsing English prose. A pattern like { "code": "RATE_LIMITED", "retryable": true, "retryAfterMs": 2000 } beats "Something went wrong, please try again."
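
A minimal sketch of how an orchestrator might consume that shape, assuming tools throw or return plain objects. Only RATE_LIMITED comes from the example above; the other codes and the helper names isToolError and retryDelayFor are assumptions.

```typescript
// Codes other than RATE_LIMITED are assumptions added for illustration.
type ToolErrorCode = "RATE_LIMITED" | "TIMEOUT" | "AUTH_EXPIRED" | "VALIDATION_FAILED";

type ToolError = {
  code: ToolErrorCode;
  retryable: boolean;
  retryAfterMs?: number;
};

// Narrow an unknown value to ToolError. This checks only the fields the
// orchestrator reads; it does not verify that code is one of the known values.
function isToolError(value: unknown): value is ToolError {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.code === "string" && typeof v.retryable === "boolean";
}

// Return how long to wait before retrying, or null when the error should not be retried.
function retryDelayFor(err: unknown, defaultDelayMs: number): number | null {
  if (!isToolError(err) || !err.retryable) return null;
  return err.retryAfterMs ?? defaultDelayMs;
}
```

Anything that is not a structured error, including a bare Error with an English message, is treated as non-retryable here.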

Retry logic that works with LLMs, not against them

Naive retry loops are dangerous around LLMs. Identical prompts after a partial tool failure can produce different tool selections, duplicate side effects, or runaway token spend. Good retry logic is bounded, jittered, and aware of which step failed.

Cap attempts per request and per user session.
Use exponential backoff with full jitter on rate limits.
Retry the failed step only, not the entire agent loop from scratch.
Respect Retry-After headers from model providers when present.
Track a per-request token budget so retries cannot silently double your spend.
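
As a sketch of the Retry-After rule in the list above: the header value is either a number of seconds or an HTTP date, so a helper has to handle both. The function names and the policy of giving up when the server's hint exceeds your cap are assumptions, not a provider requirement.

```typescript
// Parse a Retry-After header value into milliseconds from now.
// The value is either delay-seconds ("2") or an HTTP date.
function parseRetryAfterMs(header: string | null | undefined, nowMs: number): number | null {
  if (!header) return null;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}

// Choose the next delay. Without a hint, use the jittered backoff (capped).
// With a hint, wait exactly that long, or return null to signal "do not retry
// within this request" when the hint is longer than we are willing to wait.
function nextDelayMs(
  header: string | null | undefined,
  jitteredBackoffMs: number,
  maxDelayMs: number,
  nowMs: number
): number | null {
  const hinted = parseRetryAfterMs(header, nowMs);
  if (hinted === null) return Math.min(maxDelayMs, jitteredBackoffMs);
  return hinted <= maxDelayMs ? hinted : null;
}
```

Returning null instead of clamping avoids retrying earlier than the provider asked, which tends to extend a rate limit rather than clear it.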

Think of retries as surgery, not rebooting the machine. If step three of five failed, resume at step three with context intact. Re-running steps one and two often re-executes tools that already mutated state.
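
Resuming at the failed step can be sketched with a run state that records how far the loop got. This is a simplified, in-memory illustration; StepFn, RunState, and runFrom are hypothetical names, and a production runner would persist the state somewhere durable.

```typescript
type StepFn = (context: Record<string, unknown>) => Promise<Record<string, unknown>>;

type RunState = {
  nextStep: number; // index of the first step that has not completed
  context: Record<string, unknown>; // accumulated outputs of completed steps
};

// Run steps starting at state.nextStep. If a step throws, the state still
// records the last completed step, so calling runFrom again resumes there
// instead of re-executing earlier steps and their side effects.
async function runFrom(steps: StepFn[], state: RunState): Promise<RunState> {
  for (let i = state.nextStep; i < steps.length; i++) {
    const output = await steps[i](state.context);
    state.context = { ...state.context, ...output };
    state.nextStep = i + 1;
  }
  return state;
}
```

The state is only advanced after a step succeeds, so a failure in step three leaves nextStep pointing at step three.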

Classifying errors before you retry

Centralize classification in one module. Every outbound call (OpenAI, Anthropic, your internal REST API) maps HTTP status, SDK error types, and domain codes into a small set of classes: RETRYABLE, NON_RETRYABLE, RETRY_WITH_FALLBACK.

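// Three outcomes: retry as-is, give up, or retry with a changed request.
type ErrorClass = "RETRYABLE" | "NON_RETRYABLE" | "RETRY_WITH_FALLBACK";

// The is* predicates are placeholders: implement them against the error
// types and status codes your provider SDKs actually throw.
export function classifyLlmError(err: unknown): ErrorClass {
  if (isRateLimitError(err)) return "RETRYABLE";
  if (isServerOverload(err)) return "RETRYABLE";
  if (isInvalidApiKey(err)) return "NON_RETRYABLE";
  if (isContextLengthExceeded(err)) return "RETRY_WITH_FALLBACK";
  // Unknown errors fail fast instead of burning retry budget.
  return "NON_RETRYABLE";
}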

RETRY_WITH_FALLBACK is the interesting case. Context length exceeded might clear if you summarize history and retry, or route to a model with a larger context window. Invalid JSON from the model might succeed after a repair pass with a stricter system prompt. Classify those separately from a generic 503.
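
A self-contained variant of the classifier, assuming errors carry a numeric status and an optional string code. That shape, the context_length_exceeded code string, and the name classifyByStatus are assumptions; check what your SDK actually throws before relying on them.

```typescript
type ErrorClass = "RETRYABLE" | "NON_RETRYABLE" | "RETRY_WITH_FALLBACK";

// Assumed error shape: many HTTP clients attach these fields, but not all do.
type HttpLikeError = { status?: number; code?: string };

function asHttpLike(err: unknown): HttpLikeError {
  return typeof err === "object" && err !== null ? (err as HttpLikeError) : {};
}

function classifyByStatus(err: unknown): ErrorClass {
  const { status, code } = asHttpLike(err);
  // Needs a changed request (shorter history or another model), not a plain retry.
  if (code === "context_length_exceeded") return "RETRY_WITH_FALLBACK";
  // Rate limits and overloads are the classic transient failures.
  if (status === 429 || status === 503) return "RETRYABLE";
  // 401, 403, and anything unrecognized: fail fast.
  return "NON_RETRYABLE";
}
```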

A TypeScript retry helper with jitter and budgets

Here is a compact retry wrapper suitable for model calls and idempotent tools:

type RetryOptions = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  classify: (err: unknown) => ErrorClass;
};

function backoffDelay(attempt: number, base: number, max: number): number {
  const exp = Math.min(max, base * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only plain RETRYABLE errors are worth repeating unchanged.
      // RETRY_WITH_FALLBACK needs a different request (shorter context,
      // repair prompt, other model), so hand it back to the caller.
      if (opts.classify(err) !== "RETRYABLE") throw err;
      if (attempt === opts.maxAttempts - 1) break;
      const delay = backoffDelay(attempt, opts.baseDelayMs, opts.maxDelayMs);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw lastError;
}

Wire this at the transport boundary, not inside every tool handler. Tools throw domain errors; the agent runner decides whether to retry, fall back, or surface a message to the user.

For multi-tenant SaaS, add per-tenant retry budgets stored in Redis. When a tenant exhausts their budget during an incident, fail fast with a clear status page message instead of competing with healthy tenants for the same broken endpoint.
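
The budget logic itself is small. The sketch below uses an in-memory map as a stand-in for Redis so it stays runnable; TenantRetryBudget and tryConsume are hypothetical names, and the fixed-window policy is one assumption among several reasonable choices.

```typescript
// In-memory stand-in for a shared store such as Redis.
class TenantRetryBudget {
  private readonly used = new Map<string, { count: number; windowStart: number }>();
  private readonly maxRetriesPerWindow: number;
  private readonly windowMs: number;

  constructor(maxRetriesPerWindow: number, windowMs: number) {
    this.maxRetriesPerWindow = maxRetriesPerWindow;
    this.windowMs = windowMs;
  }

  // Returns true and consumes one unit if the tenant may retry.
  // False means the budget is spent: fail fast instead of retrying.
  tryConsume(tenantId: string, nowMs: number): boolean {
    let entry = this.used.get(tenantId);
    if (!entry || nowMs - entry.windowStart >= this.windowMs) {
      // First retry for this tenant, or the previous window has expired.
      entry = { count: 0, windowStart: nowMs };
      this.used.set(tenantId, entry);
    }
    if (entry.count >= this.maxRetriesPerWindow) return false;
    entry.count += 1;
    return true;
  }
}
```

Call tryConsume before each retry attempt, not before the first attempt, so healthy traffic never touches the budget.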

Tool fallbacks when the primary path breaks

Retries fix transient blips. Fallbacks fix structural problems: primary search index is down, geocoding vendor changed pricing, the model keeps choosing a deprecated tool name. A fallback chain is an ordered list of capabilities that achieve the same user intent through different implementations.

Example: an agent needs current weather for a warehouse routing decision. Primary tool weather_api_v2 might fall back to weather_api_v1, then to a cached NOAA feed, then to asking the user to confirm a default assumption. Each step should return metadata the model can cite: "Used cached weather from 45 minutes ago because live API unavailable."
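
A sketch of what that citable metadata could look like for the cached step. The WeatherReading and ToolOutput shapes and the fromCache helper are hypothetical; the note text mirrors the example sentence above.

```typescript
type WeatherReading = { tempC: number; observedAtMs: number };

type ToolOutput<T> = {
  data: T;
  source: string;
  degraded: boolean;
  note?: string; // short explanation the model can quote to the user
};

// Wrap a cached reading so the model sees exactly how stale it is.
function fromCache(reading: WeatherReading, nowMs: number): ToolOutput<WeatherReading> {
  const ageMinutes = Math.round((nowMs - reading.observedAtMs) / 60_000);
  return {
    data: reading,
    source: "cache",
    degraded: true,
    note: `Used cached weather from ${ageMinutes} minutes ago because live API unavailable.`,
  };
}
```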

Designing fallback chains agents can reason about

Fallbacks fail when they are invisible to the model. Return explicit degraded: true flags and short explanations in tool results. The model will hallucinate confidence if your fallback silently returns stale data without saying so.

type ToolFallback<T> = {
  name: string;
  run: (input: unknown) => Promise<T>;
  isHealthy: () => boolean;
};

export async function runWithFallbacks<T>(
  chain: ToolFallback<T>[],
  input: unknown
): Promise<{ result: T; source: string; degraded: boolean }> {
  const errors: string[] = [];
  for (const step of chain) {
    if (!step.isHealthy()) continue;
    try {
      const result = await step.run(input);
      return {
        result,
        source: step.name,
        degraded: step.name !== chain[0].name,
      };
    } catch (e) {
      errors.push(`${step.name}: ${String(e)}`);
    }
  }
  throw new Error(`All fallbacks failed: ${errors.join("; ")}`);
}

Register fallback chains in configuration, not hard-coded if-else in handlers. Product teams change vendors; your agent should not need a deploy to swap step two for step three.
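
One possible shape for that configuration: an ordered list of tool names per intent, resolved against registered implementations at startup. The names fallbackConfig, resolveChain, and weather_cache are assumptions for illustration.

```typescript
type ToolImpl = (input: unknown) => Promise<unknown>;

// Ordered tool names per intent. This could be loaded from JSON or a config service.
const fallbackConfig: Record<string, string[]> = {
  current_weather: ["weather_api_v2", "weather_api_v1", "weather_cache"],
};

// Resolve a configured chain against the registered implementations,
// failing loudly if the config names a tool that does not exist.
function resolveChain(
  intent: string,
  config: Record<string, string[]>,
  registry: Map<string, ToolImpl>
): Array<{ name: string; run: ToolImpl }> {
  const toolNames = config[intent];
  if (!toolNames) throw new Error(`No fallback chain configured for ${intent}`);
  return toolNames.map((toolName) => {
    const run = registry.get(toolName);
    if (!run) throw new Error(`Unknown tool in chain for ${intent}: ${toolName}`);
    return { name: toolName, run };
  });
}
```

Resolving chains at startup turns a typo in configuration into a boot failure instead of a surprise during an incident.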

Model-level fallbacks mirror tool fallbacks. If gpt-4.1 times out twice, route the same messages to a faster model with a system note: "Prior model unavailable; answer conservatively and avoid tool calls requiring fine-grained reasoning." Document quality trade-offs so product owners accept them before an incident.
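
A minimal sketch of that routing, assuming a generic call function rather than any specific provider SDK. ModelCall, completeWithModelFallback, and the message shape are hypothetical; the system note is the one quoted above.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type ModelCall = (model: string, messages: ChatMessage[]) => Promise<string>;

const FALLBACK_NOTE =
  "Prior model unavailable; answer conservatively and avoid tool calls requiring fine-grained reasoning.";

// Try the primary model up to maxPrimaryAttempts times, then route the same
// messages to the fallback model with an extra system note in front.
async function completeWithModelFallback(
  call: ModelCall,
  primary: string,
  fallback: string,
  messages: ChatMessage[],
  maxPrimaryAttempts = 2
): Promise<{ text: string; model: string; degraded: boolean }> {
  for (let attempt = 0; attempt < maxPrimaryAttempts; attempt++) {
    try {
      return { text: await call(primary, messages), model: primary, degraded: false };
    } catch {
      // A real runner would classify the error here, rethrow non-retryable
      // ones, and wait with backoff between attempts.
    }
  }
  const withNote: ChatMessage[] = [{ role: "system", content: FALLBACK_NOTE }, ...messages];
  return { text: await call(fallback, withNote), model: fallback, degraded: true };
}
```

Returning the model name and a degraded flag lets the caller log which path answered and surface the trade-off if needed.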

Circuit breakers for models, tools, and downstream APIs

Circuit breakers stop your agent from hammering a dependency that is already failing. Without them, a dead payment API turns into ten thousand retrying tool calls per minute across your Node.js replicas, each holding an open LLM context window while waiting.

A breaker has three states. Closed: traffic flows normally. Open: failures exceeded a threshold; calls fail fast without hitting the dependency. Half-open: a probe request tests recovery; success closes the breaker, failure reopens it.

State	Behavior	Agent impact
Closed	Normal calls	Full capability
Open	Fail fast	Skip tool; use fallback or user message
Half-open	Single probe	Limited retry; watch error rate

Use separate breakers per dependency, not one global breaker for "the agent." Your CRM can be down while your vector database is healthy. Granular breakers preserve partial functionality instead of blanketing every session with "service unavailable."
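
Keeping breakers granular is mostly a bookkeeping problem. The sketch below lazily creates one breaker per dependency name; BreakerRegistry is a hypothetical helper, and SimpleBreaker is a placeholder for whatever breaker class you actually use.

```typescript
// Placeholder breaker shape; swap in your real breaker class.
type SimpleBreaker = { dependency: string; failures: number };

class BreakerRegistry<B> {
  private readonly breakers = new Map<string, B>();
  private readonly create: (dependency: string) => B;

  constructor(create: (dependency: string) => B) {
    this.create = create;
  }

  // Return the breaker for this dependency, creating it on first use.
  get(dependency: string): B {
    let breaker = this.breakers.get(dependency);
    if (breaker === undefined) {
      breaker = this.create(dependency);
      this.breakers.set(dependency, breaker);
    }
    return breaker;
  }
}

const breakerRegistry = new BreakerRegistry<SimpleBreaker>((dependency) => ({
  dependency,
  failures: 0,
}));
```

With this in place, breakerRegistry.get("crm") and breakerRegistry.get("vector_db") track failures independently.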

Implementing a circuit breaker in Node.js

The following TypeScript class is intentionally small. Production teams often adopt libraries like opossum, but understanding the state machine helps you tune thresholds correctly.

type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

export class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private nextAttempt = 0;

  constructor(
    private readonly name: string,
    private readonly failureThreshold: number,
    private readonly cooldownMs: number
  ) {}

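  async exec<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "HALF_OPEN") {
      // A probe is already in flight; let it decide before admitting more traffic.
      throw new Error(`Breaker half-open for ${this.name}`);
    }
    if (this.state === "OPEN") {
      if (Date.now() < this.nextAttempt) {
        throw new Error(`Breaker open for ${this.name}`);
      }
      // Cooldown elapsed: this call becomes the single probe.
      this.state = "HALF_OPEN";
    }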
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }

  private onFailure(): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.nextAttempt = Date.now() + this.cooldownMs;
    }
  }

  isOpen(): boolean {
    return this.state === "OPEN" && Date.now() < this.nextAttempt;
  }
}

Wrap each external integration: LLM provider, embedding service, payment gateway, internal GraphQL gateway. Export breaker state to metrics. Alert when any breaker stays open longer than five minutes.

Coordinate breakers across replicas with a shared counter in Redis when local failure counts are not enough. A single pod might not see enough traffic to trip open during a gradual degradation; cluster-wide error rates tell the truth faster.
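
The shared counter can be sketched behind a small interface so the windowing logic is testable without a live Redis. Everything here is an assumption for illustration: the SharedCounter interface, the key format, and the fixed-window policy. With Redis, increment could map to an atomic increment plus a key expiry.

```typescript
// Minimal counter interface, not a specific client API.
interface SharedCounter {
  increment(key: string, ttlMs: number, nowMs: number): Promise<number>;
}

// In-memory implementation so the sketch runs on its own.
class InMemoryCounter implements SharedCounter {
  private readonly entries = new Map<string, { value: number; expiresAt: number }>();

  async increment(key: string, ttlMs: number, nowMs: number): Promise<number> {
    const current = this.entries.get(key);
    if (!current || current.expiresAt <= nowMs) {
      this.entries.set(key, { value: 1, expiresAt: nowMs + ttlMs });
      return 1;
    }
    current.value += 1;
    return current.value;
  }
}

// Record one failure and report whether the cluster-wide count for the
// current window has reached the threshold.
async function recordFailure(
  counter: SharedCounter,
  dependency: string,
  threshold: number,
  windowMs: number,
  nowMs: number
): Promise<boolean> {
  const windowId = Math.floor(nowMs / windowMs);
  const count = await counter.increment(`breaker:${dependency}:${windowId}`, windowMs, nowMs);
  return count >= threshold;
}
```

Each replica calls recordFailure on its own errors and opens its local breaker when the function returns true.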

Composing retry, fallback, and breaker into one policy

Individually, retries, fallbacks, and breakers are straightforward. Complexity arrives when you stack them. A sensible default order for tool calls:

Check breaker. If open, skip primary and jump to fallback chain immediately.
Attempt primary tool inside retry wrapper with idempotency guard.
On exhausted retries, walk fallback chain with its own per-step breakers.
If all paths fail, return a structured error the model can translate for the user.

export async function resilientToolCall<T>(
  primary: () => Promise<T>,
  fallbacks: Array<() => Promise<T>>,
  breaker: CircuitBreaker
): Promise<T> {
  if (!breaker.isOpen()) {
    try {
      // Run the primary through the breaker so its failures are counted;
      // one exhausted retry sequence counts as one breaker failure.
      return await breaker.exec(() =>
        withRetry(primary, {
          maxAttempts: 3,
          baseDelayMs: 200,
          maxDelayMs: 4_000,
          classify: classifyToolError,
        })
      );
    } catch {
      // fall through to fallbacks
    }
  }
  for (const fb of fallbacks) {
    try {
      return await fb();
    } catch {
      continue;
    }
  }
  throw new Error("TOOL_UNAVAILABLE");
}

Expose policy knobs per tool criticality. Payment capture might allow zero fallbacks and no retries. A FAQ lookup tool might allow five retries and three fallbacks because stale data is worse than a delay but better than a hard error.
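
Those knobs can live in a plain policy table. In this sketch the tool names capture_payment and faq_lookup, the default values, and the helper names are all assumptions; maxAttempts counts the initial call, so five retries means six attempts.

```typescript
type ToolPolicy = {
  maxAttempts: number; // total attempts, including the first call
  maxFallbacks: number; // how many fallback steps may be tried
};

const toolPolicies: Record<string, ToolPolicy> = {
  capture_payment: { maxAttempts: 1, maxFallbacks: 0 }, // no retries, no fallbacks
  faq_lookup: { maxAttempts: 6, maxFallbacks: 3 }, // first call plus five retries
};

// Conservative default for tools nobody has classified yet.
const DEFAULT_TOOL_POLICY: ToolPolicy = { maxAttempts: 2, maxFallbacks: 1 };

function policyFor(toolName: string): ToolPolicy {
  return toolPolicies[toolName] ?? DEFAULT_TOOL_POLICY;
}

// Trim a fallback chain to what the policy allows.
function limitFallbacks<T>(fallbacks: T[], policy: ToolPolicy): T[] {
  return fallbacks.slice(0, policy.maxFallbacks);
}
```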

Keep the agent loop aware of policy outcomes. If you silently swallow failures, the model invents answers. Pass TOOL_UNAVAILABLE into the conversation as a system-visible tool result so the model can apologize, offer alternatives, or escalate to a human.
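
A sketch of turning that failure into something the model can read. The ToolResultMessage shape is hypothetical, since each provider defines its own tool-result format; adapt the field names to the one you use.

```typescript
// Hypothetical message shape; real providers each define their own.
type ToolResultMessage = { role: "tool"; toolCallId: string; content: string };

// Build a tool result that tells the model the call failed and what to do next.
function toolUnavailableResult(
  toolCallId: string,
  toolName: string,
  retryable: boolean
): ToolResultMessage {
  return {
    role: "tool",
    toolCallId,
    content: JSON.stringify({
      ok: false,
      code: "TOOL_UNAVAILABLE",
      tool: toolName,
      retryable,
      guidance: "Do not guess the result. Apologize, offer an alternative, or escalate to a human.",
    }),
  };
}
```

Keeping the payload as structured JSON means the same result can drive both the model's reply and your failure metrics.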

Observability: what to log when agents fail

You cannot improve recovery you cannot see. Log these fields on every failure path: requestId, tenantId, agentStepIndex, toolName, errorClass, retryAttempt, breakerState, fallbackUsed, latencyMs, tokenCostSoFar.
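
Typing the log record keeps those fields from drifting between call sites. The field names below come from the list above; the value types, the event name, and formatFailureLog are assumptions.

```typescript
type FailureLog = {
  requestId: string;
  tenantId: string;
  agentStepIndex: number;
  toolName: string | null; // null when the failure is a model call, not a tool
  errorClass: "RETRYABLE" | "NON_RETRYABLE" | "RETRY_WITH_FALLBACK";
  retryAttempt: number;
  breakerState: "CLOSED" | "OPEN" | "HALF_OPEN";
  fallbackUsed: string | null; // name of the fallback that served the call, if any
  latencyMs: number;
  tokenCostSoFar: number;
};

// Serialize one failure as a single JSON line for a log pipeline.
function formatFailureLog(entry: FailureLog): string {
  return JSON.stringify({ level: "error", event: "agent_failure", ...entry });
}
```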

Build dashboards that separate model errors from tool errors. A spike in 429 from your LLM provider needs different runbook steps than a spike in TOOL_UNAVAILABLE from your inventory service. SLOs should track end-to-end task completion, not just model HTTP success.

Trace across the full agent loop with OpenTelemetry. Parent spans for the user request, child spans for each model completion and tool invocation. When a user reports "the agent lied about stock levels," you want a trace that shows the fallback cache was forty minutes stale, not a mystery.

Testing failure paths before users hit them

Chaos testing for agents is underrated. In staging, inject latency and errors into tool mocks. Verify breakers open, fallbacks activate, and user-facing copy stays honest. Snapshot the structured errors your orchestrator returns so refactors do not drop fields the model relies on.
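
Fault injection for tool mocks does not need a framework. The wrapper below is a deterministic sketch that fails the first N calls and then passes through; FaultPlan and withInjectedFaults are hypothetical names, and a fuller version might add injected latency or probabilistic failures.

```typescript
type FaultPlan = {
  failFirstN: number; // how many initial calls should fail
  error: Error; // the error to throw for those calls
};

// Wrap a tool so its first N calls throw the planned error, then behave normally.
function withInjectedFaults<A, R>(
  tool: (arg: A) => Promise<R>,
  plan: FaultPlan
): (arg: A) => Promise<R> {
  let calls = 0;
  return async (arg: A) => {
    calls += 1;
    if (calls <= plan.failFirstN) throw plan.error;
    return tool(arg);
  };
}
```

Deterministic failures make assertions exact: with a breaker threshold of two and failFirstN set to two, the test knows precisely when the breaker should open.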

Contract test tool schemas separately from happy-path integration tests. A handler that changes its return shape without updating JSON Schema will fail at orchestration time with schema mismatch errors—permanent failures that retries will never fix.
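
A contract test can be as small as comparing a handler's output against the declared field types. This sketch uses a hand-rolled checker to stay dependency-free; OutputContract and contractViolations are hypothetical, and a real suite would more likely validate against the tool's JSON Schema or a schema library.

```typescript
type FieldType = "string" | "number" | "boolean";
type OutputContract = Record<string, FieldType>;

// List every field whose runtime type differs from the declared contract.
function contractViolations(
  output: Record<string, unknown>,
  contract: OutputContract
): string[] {
  const problems: string[] = [];
  for (const [field, expected] of Object.entries(contract)) {
    const actual = typeof output[field];
    if (actual !== expected) {
      problems.push(`${field}: expected ${expected}, got ${actual}`);
    }
  }
  return problems;
}
```

An empty array means the handler still matches its contract; anything else should fail the build before the orchestrator ever sees the mismatch.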

Summary

Production AI agents fail at the model layer, the tool layer, the orchestration layer, and the downstream service layer. Treating those failures the same guarantees either wasted tokens or angry users. Classify errors first, then apply the right pattern: bounded retries with jitter for transient faults, explicit fallback chains for degraded capability, and circuit breakers to protect flaky dependencies from retry storms.

Implement recovery in TypeScript at the orchestration boundary so tool handlers stay focused on domain logic. Make degradation visible to the model, cap spend per request, and instrument every failure path with traces and metrics that distinguish LLM issues from tool issues. The goal is not zero failures—it is predictable behavior when failures happen, so your agent fails fast, fails honestly, and recovers without taking the rest of your system with it.

FAQ

Should agents retry tool calls automatically?

Only when the tool is idempotent and the error is classified retryable. Never auto-retry creates, charges, or sends without an idempotency key. Retries belong in the orchestrator; do not rely on the model deciding on its own to call the same tool again.

How many retries are safe for LLM API calls?

Two to three attempts with exponential backoff and jitter is a common default for rate limits and 503 responses. Pair that with a per-request token budget so a retry loop cannot double your inference cost unnoticed.

When should I fail open versus fail closed?

Fail closed for auth, billing, and data mutation. Fail open only for non-critical read paths where stale data is acceptable and you disclose degradation to the user. Most customer-facing agents should fail closed on uncertainty.

Do circuit breakers apply to streaming responses?

Yes. Track stream errors and mid-stream timeouts as failures. If a provider drops connections repeatedly, open the breaker on that provider even when HTTP status started as 200. Streaming adds complexity; probe with short non-streaming health checks in half-open state.

How do I prevent retry storms across replicas?

Use jittered backoff, per-dependency circuit breakers, and shared failure counters in Redis for cluster-wide open states. Rate-limit retries per tenant and coordinate breaker state so one replica opening the breaker informs the others.

What metrics prove your recovery layer is working?

Track fallback activation rate, breaker open duration, retry success rate by error class, end-to-end task completion SLO, and token cost per completed task. Improving recovery shows up as higher completion rate without proportional cost increase.