AI Agent Testing: Failure Taxonomies That Actually Work
Six failure categories for AI agent evaluation with grading rubrics and automated eval patterns. Practical testing for production agents.
TL;DR
- Most AI agent testing treats all failures as equivalent — “the agent gave a wrong answer.” This prevents systematic improvement because different failure types require different fixes
- Six failure categories cover the vast majority of production agent failures: instruction drift, context mismanagement, tool misuse, hallucinated actions, goal misalignment, and recovery failure
- Each category has distinct root causes, detection methods, and remediation patterns
- This post provides grading rubrics and automated eval code for each category
AI agent testing is stuck in a pre-taxonomy era. When a traditional software system fails, engineers can classify the failure immediately: null pointer exception, timeout, authentication error, race condition. Each classification points to a specific debugging strategy and a specific fix.
When an AI agent fails, the failure report usually says something like “gave wrong answer” or “didn’t do what I wanted.” This is the equivalent of logging every software error as “broken.” It prevents pattern recognition, makes regression testing meaningless, and ensures that the same failure modes recur indefinitely.
The failure taxonomy in this post comes from triaging hundreds of production agent failures across different models, frameworks, and domains. The six categories are not exhaustive, but they cover the vast majority of failures we encounter in production and — more importantly — each category has distinct root causes that require distinct fixes.
The Six Failure Categories
Category 1: Instruction Drift
What it looks like: The agent starts following instructions correctly, then gradually deviates as the conversation or task chain gets longer. By step 8, the agent is doing something only loosely related to the original instructions.
Root cause: Context window pressure. As the conversation grows, early instructions compete with recent context for the model’s attention. System prompts get diluted by accumulated conversation history. Multi-step chains compound this because each step adds context that pushes the original instructions further from the model’s attention.
Detection pattern: Run the same instruction set with increasing chain lengths (1 step, 5 steps, 10 steps, 20 steps). Measure instruction adherence at each length. If adherence degrades with chain length, you have instruction drift.
```typescript
// Measure adherence degradation over chain length
interface DriftEval {
  instruction: string;
  chainLengths: number[]; // [1, 5, 10, 20]
  expectedBehavior: string;
}

async function measureDrift(spec: DriftEval): Promise<DriftResult> {
  const results = await Promise.all(
    spec.chainLengths.map(len => runChain(spec.instruction, len))
  );
  return {
    adherenceByLength: results.map(r => scoreAdherence(r, spec.expectedBehavior)),
    driftRate: calculateDriftSlope(results), // Negative slope = drift
    firstDriftStep: findFirstDivergence(results),
  };
}
```
Grading rubric:
- A (No drift): Adherence stays above 90% at all chain lengths
- B (Mild drift): Adherence above 80% at chain length 10, above 70% at 20
- C (Moderate drift): Adherence drops below 70% by chain length 10
- F (Severe drift): Adherence drops below 50% by chain length 5
Remediation: Instruction reinforcement at chain boundaries. Inject a condensed version of core instructions every N steps. Summarize and compress conversation history periodically rather than carrying the full transcript.
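A minimal sketch of the reinforcement pattern, assuming a plain message-array conversation; `condensedInstructions` and `reinforcementInterval` are illustrative names, not part of any particular framework:

```typescript
// Sketch: re-inject condensed instructions every N steps so they stay
// near the end of the context, where recency pressure is strongest.
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function withInstructionReinforcement(
  history: Message[],
  condensedInstructions: string,
  reinforcementInterval: number // e.g. every 5 steps
): Message[] {
  const reinforced: Message[] = [];
  history.forEach((msg, i) => {
    reinforced.push(msg);
    // After every N messages, restate the core instructions.
    if ((i + 1) % reinforcementInterval === 0) {
      reinforced.push({
        role: 'system',
        content: `Reminder of core instructions: ${condensedInstructions}`,
      });
    }
  });
  return reinforced;
}
```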
Category 2: Context Mismanagement
What it looks like: The agent has the information it needs somewhere in its context but fails to use it correctly. It might ignore relevant context, use stale context over fresh context, or conflate context from different sources.
Root cause: The model cannot distinguish between relevant and irrelevant context, or between current and outdated information. This is distinct from instruction drift because the instructions are followed — the agent just operates on the wrong information.
Detection pattern: Provide the agent with a mix of relevant and irrelevant context, then test whether it uses the right pieces. Include contradictory information at different timestamps and test whether it resolves conflicts correctly.
```typescript
// Test context selection and recency handling
interface ContextEval {
  relevantContext: ContextItem[];
  irrelevantContext: ContextItem[]; // Should be ignored
  staleContext: ContextItem[];      // Should be superseded
  freshContext: ContextItem[];      // Should take priority
  query: string;
}

function scoreContextUsage(response: string, spec: ContextEval): ContextScore {
  return {
    relevantUsed: checkCitation(response, spec.relevantContext),
    irrelevantIgnored: checkAbsence(response, spec.irrelevantContext),
    recencyCorrect: checkRecency(response, spec.staleContext, spec.freshContext), // Did it prefer fresh over stale?
    conflictResolution: checkConflictHandling(response, spec),
  };
}
```
Grading rubric:
- A: Uses relevant context, ignores irrelevant, prefers fresh over stale in all test cases
- B: Occasionally includes irrelevant context but does not base decisions on it
- C: Sometimes uses stale context over fresh, or misses relevant context
- F: Regularly bases decisions on irrelevant or outdated information
Remediation: Structured context injection with explicit recency markers. Tag context with timestamps and source reliability scores. Implement context ranking before injection rather than stuffing everything into the prompt.
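As a sketch of what ranking before injection can look like, assuming each context item carries a timestamp and a source-reliability score (the weighting here is illustrative):

```typescript
// Sketch: rank context items by recency and source reliability, then
// inject only the top-K with explicit markers instead of raw stuffing.
interface TaggedContextItem {
  content: string;
  timestamp: number;   // epoch ms
  reliability: number; // 0..1, source trust score
}

function rankAndFormatContext(items: TaggedContextItem[], topK: number): string {
  const now = Date.now();
  const scored = items
    .map(item => {
      const ageHours = (now - item.timestamp) / 3_600_000;
      const recency = 1 / (1 + ageHours); // newer = closer to 1
      return { item, score: 0.6 * recency + 0.4 * item.reliability };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);

  // Explicit recency markers so the model can prefer fresh over stale.
  return scored
    .map(({ item }) =>
      `[reliability: ${item.reliability.toFixed(2)}, as of ${new Date(item.timestamp).toISOString()}]\n${item.content}`
    )
    .join('\n\n');
}
```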
Category 3: Tool Misuse
What it looks like: The agent calls the right tool with wrong arguments, the wrong tool entirely, or calls tools in an incorrect sequence. It might also call a tool when no tool call is needed, or fail to call a tool when one is required.
Root cause: Ambiguous tool descriptions, overlapping tool capabilities, or insufficient examples in tool schemas. The model is pattern-matching on tool names and descriptions rather than understanding tool semantics.
Detection pattern: Create a test suite of scenarios where each scenario has one correct tool call (or sequence) and multiple plausible-but-wrong alternatives. Measure correct tool selection rate, argument accuracy, and sequence correctness.
```typescript
interface ToolEval {
  scenario: string;
  availableTools: ToolDefinition[];
  expectedCalls: ExpectedToolCall[]; // Correct sequence; order matters
  distractorTools: ToolDefinition[]; // Should NOT be called
}

function scoreToolUsage(actual: ToolCall[], expected: ExpectedToolCall[]): ToolScore {
  return {
    toolSelection: jaccard(actual.map(t => t.name), expected.map(t => t.name)),
    argumentAccuracy: actual.map((a, i) => compareArgs(a.args, expected[i]?.args)),
    sequenceCorrect: isCorrectOrder(actual, expected), // Did it call tools in the right order?
    spuriousCalls: actual.filter(a => !expected.find(e => e.name === a.name)).length,
    missingCalls: expected.filter(e => !actual.find(a => a.name === e.name)).length,
  };
}
```
Grading rubric:
- A: Correct tool, correct arguments, correct sequence in 95%+ of cases
- B: Correct tool selection but occasional argument errors (wrong parameter types, missing optional args)
- C: Correct tool most of the time but sometimes selects a similar-but-wrong tool
- F: Regularly calls wrong tools, wrong arguments, or calls tools when none are needed
Remediation: Improve tool descriptions with explicit usage examples and counter-examples. Add “when NOT to use this tool” to tool schemas. Reduce tool overlap — if two tools have similar descriptions, the model will confuse them.
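A hedged example of a tool definition written this way; the `whenNotToUse` and `examples` fields are illustrative additions, not a specific framework's schema:

```typescript
// Sketch: a tool description with usage examples, counter-examples, and
// explicit "when NOT to use" guidance to reduce overlap-driven confusion.
const searchOrdersTool = {
  name: 'search_orders',
  description:
    'Look up existing orders by customer email or order ID. ' +
    'Use this when the user asks about the status of an order they already placed.',
  whenNotToUse:
    'Do NOT use this to create, modify, or cancel orders (use create_order / cancel_order). ' +
    'Do NOT use this for product searches.',
  examples: [
    { input: 'Where is my order #4821?', call: { orderId: '4821' } },
    { input: 'Has my package shipped?', call: { customerEmail: 'user@example.com' } },
  ],
  parameters: {
    type: 'object',
    properties: {
      orderId: { type: 'string' },
      customerEmail: { type: 'string' },
    },
  },
};
```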
Category 4: Hallucinated Actions
What it looks like: The agent claims to have performed an action it did not actually perform, or reports results from an action that never executed. This is distinct from text hallucination — it is behavioral hallucination.
Root cause: The model has learned that users expect action confirmation, so it generates confirmation text even when the action failed, timed out, or was never attempted. This is particularly dangerous because the user trusts the agent’s report of its own actions.
Detection pattern: Intercept actual tool calls and compare the agent’s reported actions against the real execution log. Test with tools that intentionally fail and verify the agent reports the failure rather than fabricating a success.
```typescript
interface ActionAudit {
  agentReport: string;                 // What the agent claims it did
  actualExecutionLog: ToolExecution[]; // What actually happened: ground truth from tool middleware
}

function detectHallucinatedActions(audit: ActionAudit): ActionScore {
  const claimed = extractClaimedActions(audit.agentReport);
  const actual = audit.actualExecutionLog;
  return {
    fabricatedActions: claimed.filter(c => !actual.find(a => matches(c, a))),
    unreportedFailures: actual
      .filter(a => a.status === 'failed')
      .filter(a => !audit.agentReport.includes('failed')), // Did it hide failures?
    accuracyRate: actual.filter(a => correctlyReported(a, claimed)).length / actual.length,
  };
}
```
Grading rubric:
- A: Zero hallucinated actions. All failures reported accurately. No fabricated successes
- B: No fabricated successes but occasionally omits failure details (says “error occurred” without specifics)
- C: Occasionally claims partial success when the action fully failed
- F: Fabricates action results or claims actions it never attempted
Remediation: Add explicit failure handling instructions to the system prompt. Require the agent to cite the tool response in its confirmation. Implement a verification layer that cross-references agent reports against execution logs.
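One way to wire the detector above into a verification layer, sketched under the same assumed types; `verifyAndSend` and its wording are illustrative:

```typescript
// Sketch: gate the agent's reply behind the detector so a fabricated
// report never reaches the user unchallenged.
async function verifyAndSend(audit: ActionAudit, send: (msg: string) => Promise<void>) {
  const score = detectHallucinatedActions(audit);
  const mismatch = score.fabricatedActions.length > 0 || score.unreportedFailures.length > 0;
  if (mismatch) {
    // Surface the real outcome instead of the fabricated summary.
    const failedCount = audit.actualExecutionLog.filter(a => a.status === 'failed').length;
    await send(
      `I could not complete all requested actions (${failedCount} step(s) failed). ` +
        'Please see the execution log for details.'
    );
    return;
  }
  await send(audit.agentReport);
}
```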
Category 5: Goal Misalignment
What it looks like: The agent completes the task as stated but misses the user’s actual intent. It satisfies the letter of the request but not the spirit. The output is technically correct but practically useless.
Root cause: The agent optimizes for the literal instruction rather than the underlying goal. This is a natural consequence of instruction-following training — the model learns to satisfy the stated request without modeling why the user is making the request.
Detection pattern: Create eval scenarios where the stated request is ambiguous or incomplete, and the correct behavior requires inferring the user’s underlying goal. Score based on whether the agent addresses the actual need or just the surface request.
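The other categories ship with automated scoring; goal misalignment usually needs a human or LLM grader. A sketch of what a scenario and scorer might look like, where `gradeAgainstRubric` stands in for that grading step:

```typescript
// Sketch: goal-misalignment eval scenario plus a scorer that leans on a
// human or LLM grader for the judgment calls.
interface GoalEval {
  statedRequest: string;         // What the user literally asked for
  underlyingGoal: string;        // What they actually need
  ambiguityNotes: string;        // Why the request underdetermines the goal
  acceptableBehaviors: string[]; // e.g. addressing the goal, or asking to clarify
}

async function scoreGoalAlignment(response: string, spec: GoalEval): Promise<GoalScore> {
  return {
    addressedStatedRequest: await gradeAgainstRubric(response, spec.statedRequest),
    addressedUnderlyingGoal: await gradeAgainstRubric(response, spec.underlyingGoal),
    recognizedAmbiguity: /clarif|confirm|did you mean/i.test(response), // crude heuristic
  };
}
```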
Grading rubric:
- A: Identifies the underlying goal and addresses it, even when the stated request is ambiguous
- B: Satisfies the stated request and asks clarifying questions when goals are ambiguous
- C: Satisfies the stated request literally without recognizing ambiguity
- F: Satisfies neither the stated request nor the underlying goal
Remediation: Add goal-inference prompting — instruct the agent to identify the user’s underlying objective before acting. Implement a confirmation step for ambiguous requests. Provide the agent with user context (via self-models or session history) that helps disambiguate intent.
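A minimal sketch of a goal-inference preamble; the exact wording is illustrative, not a tested prompt:

```typescript
// Sketch: goal-inference instructions prepended to the system prompt.
const goalInferencePreamble = `
Before acting on any request:
1. State, in one sentence, what you believe the user's underlying goal is.
2. If the request is ambiguous or the goal could reasonably be interpreted
   in more than one way, ask one clarifying question instead of acting.
3. Optimize your actions for the underlying goal, not just the literal wording.
`;
```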
Category 6: Recovery Failure
What it looks like: The agent encounters an error, a tool failure, or an unexpected state and cannot recover. It either gets stuck in a loop, gives up silently, or produces incoherent output as it tries to work around the failure.
Root cause: Most agent architectures have no error recovery strategy. The happy path is well-designed. The unhappy path is undefined. When something goes wrong, the agent falls back to generic model behavior, which is not equipped to handle tool failures, partial results, or inconsistent state.
Detection pattern: Inject failures at various points in multi-step workflows: tool timeouts, partial results, contradictory data, permission errors. Score the agent’s ability to detect the failure, communicate it clearly, and either recover or escalate appropriately.
```typescript
// Simulate production failure modes
type FailureInjection =
  | { type: 'timeout'; step: number }
  | { type: 'partial_result'; step: number; completeness: number }
  | { type: 'permission_denied'; step: number }
  | { type: 'contradictory_data'; step: number };

function scoreRecovery(agent: AgentRun, injection: FailureInjection): RecoveryScore {
  return {
    detected: agent.acknowledgedError,    // Did it notice the failure?
    communicated: agent.informedUser,
    recovered: agent.foundAlternativePath,
    escalated: agent.requestedHumanHelp,  // When recovery is not possible
    looped: agent.repeatedSameAction > 2, // Retry storm detection
  };
}
```
Grading rubric:
- A: Detects failure, communicates clearly, recovers via alternative path or escalates appropriately
- B: Detects failure and communicates it, but does not attempt recovery
- C: Partially detects failure — retries once or twice, then gives up without clear communication
- F: Gets stuck in a loop, fails silently, or produces incoherent output after the failure
Remediation: Define explicit recovery strategies for each tool and each failure type. Add retry budgets with backoff. Implement escalation paths that route to human operators when automated recovery is exhausted.
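A sketch of a retry budget with exponential backoff and an explicit escalation path; `escalateToHuman` is a hypothetical hook for whatever human-in-the-loop channel you use:

```typescript
// Sketch: per-tool retry budget with backoff; escalate once the budget
// is exhausted instead of looping or failing silently.
async function callWithRecovery<T>(
  toolName: string,
  call: () => Promise<T>,
  opts = { maxRetries: 3, baseDelayMs: 500 }
): Promise<T> {
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt === opts.maxRetries) {
        // Recovery budget exhausted: route to a human operator.
        await escalateToHuman({ toolName, attempts: attempt + 1, lastError: String(err) });
        throw err;
      }
      await sleep(opts.baseDelayMs * 2 ** attempt); // exponential backoff
    }
  }
  throw new Error('unreachable');
}

const sleep = (ms: number) => new Promise(res => setTimeout(res, ms));
```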
Building the Eval Suite
The six categories above give you a taxonomy. To make it operational, you need an eval suite that runs automatically and produces actionable reports.
The eval suite structure:
| Category | Test Count (minimum) | Automation Level | Run Frequency |
|---|---|---|---|
| Instruction Drift | 20 scenarios x 4 chain lengths | Fully automated | Every model/prompt change |
| Context Mismanagement | 30 scenarios with mixed context | Fully automated | Every model/prompt change |
| Tool Misuse | 15 scenarios per tool | Fully automated | Every tool schema change |
| Hallucinated Actions | 20 scenarios with injected failures | Automated + log audit | Every deployment |
| Goal Misalignment | 25 scenarios with ambiguous requests | Semi-automated (human grading) | Weekly |
| Recovery Failure | 10 failure types x 5 injection points | Fully automated | Every architecture change |
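A sketch of how the per-category suites might roll up into one report, so regressions stay visible per category rather than blending into a single score; the grade thresholds below are placeholders, since each category's rubric above defines its own boundaries:

```typescript
// Sketch: aggregate per-category eval results into a single report.
type FailureCategory =
  | 'instruction_drift'
  | 'context_mismanagement'
  | 'tool_misuse'
  | 'hallucinated_actions'
  | 'goal_misalignment'
  | 'recovery_failure';

interface CategoryReport {
  category: FailureCategory;
  passRate: number; // 0..1 across the category's scenarios
  grade: 'A' | 'B' | 'C' | 'F';
}

async function runEvalSuite(
  suites: Record<FailureCategory, () => Promise<number>> // each returns a pass rate
): Promise<CategoryReport[]> {
  const entries = Object.entries(suites) as [FailureCategory, () => Promise<number>][];
  return Promise.all(
    entries.map(async ([category, run]) => {
      const passRate = await run();
      return { category, passRate, grade: toGrade(passRate) };
    })
  );
}

function toGrade(passRate: number): 'A' | 'B' | 'C' | 'F' {
  if (passRate >= 0.9) return 'A';
  if (passRate >= 0.8) return 'B';
  if (passRate >= 0.7) return 'C';
  return 'F';
}
```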
The goal is not perfection across all categories — it is visibility. When you know that your agent scores A on tool usage but C on recovery, you know where to invest engineering time. When a production failure occurs, you can classify it immediately instead of opening a 4-hour investigation.
From Taxonomy to Improvement
The taxonomy is only useful if it drives systematic improvement. Here is the loop:
1. Classify every production failure into one of the six categories
2. Aggregate failures weekly to identify which categories are trending up (a triage-record sketch follows this list)
3. Prioritize the category with the highest user-impact failures
4. Fix the root cause using the category-specific remediation pattern
5. Validate the fix with the category-specific eval suite
6. Monitor for regression using automated evals
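A sketch of the triage record that makes steps 1 and 2 concrete, assuming the `FailureCategory` type from the eval-suite sketch above; the impact weights are illustrative:

```typescript
// Sketch: triage record per production failure, plus the weekly rollup
// that weights each category by user impact.
interface FailureTriage {
  incidentId: string;
  category: FailureCategory;
  userImpact: 'low' | 'medium' | 'high';
  occurredAt: string; // ISO date
  notes: string;
}

function aggregateByCategory(triages: FailureTriage[]): Record<string, number> {
  const weight = { low: 1, medium: 3, high: 10 };
  return triages.reduce<Record<string, number>>((acc, t) => {
    acc[t.category] = (acc[t.category] ?? 0) + weight[t.userImpact];
    return acc;
  }, {});
}
```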
This is not novel engineering methodology. It is the same approach that made traditional software reliable: classify failures, identify patterns, fix root causes, prevent regressions. The only thing new is the taxonomy adapted for AI agent failure modes.
We build AI agents with production-grade evaluation infrastructure at Clarity. If your agents are failing in ways you cannot categorize or systematically fix, we should talk.