AI Agent Testing: Failure Taxonomies That Actually Work
Six failure categories for AI agent evaluation with grading rubrics and automated eval patterns. Practical testing for production agents.
TL;DR
- Most AI agent testing treats all failures as equivalent — “the agent gave a wrong answer.” This prevents systematic improvement because different failure types require different fixes
- Six failure categories cover the vast majority of production agent failures: instruction drift, context mismanagement, tool misuse, hallucinated actions, goal misalignment, and recovery failure
- Each category has distinct root causes, detection methods, and remediation patterns
- This post provides grading rubrics and automated eval code for each category
AI agent testing is stuck in a pre-taxonomy era. When a traditional software system fails, engineers can classify the failure immediately: null pointer exception, timeout, authentication error, race condition. Each classification points to a specific debugging strategy and a specific fix.
When an AI agent fails, the failure report usually says something like “gave wrong answer” or “didn’t do what I wanted.” This is the equivalent of logging every software error as “broken.” It prevents pattern recognition, makes regression testing meaningless, and ensures that the same failure modes recur indefinitely.
The failure taxonomy in this post comes from triaging hundreds of production agent failures across different models, frameworks, and domains. The six categories are not exhaustive, but they cover the vast majority of failures we encounter in production and — more importantly — each category has distinct root causes that require distinct fixes.
The Six Failure Categories
Category 1: Instruction Drift
What it looks like: The agent starts following instructions correctly, then gradually deviates as the conversation or task chain gets longer. By step 8, the agent is doing something only loosely related to the original instructions.
Root cause: Context window pressure. As the conversation grows, early instructions compete with recent context for the model’s attention. System prompts get diluted by accumulated conversation history. Multi-step chains compound this because each step adds context that pushes the original instructions further from the model’s attention.
Detection pattern: Run the same instruction set with increasing chain lengths (1 step, 5 steps, 10 steps, 20 steps). Measure instruction adherence at each length. If adherence degrades with chain length, you have instruction drift.
```typescript
// Measure adherence degradation over chain length
interface DriftEval {
  instruction: string;
  chainLengths: number[]; // [1, 5, 10, 20]
  expectedBehavior: string;
}

async function measureDrift(spec: DriftEval): Promise<DriftResult> {
  const results = await Promise.all(
    spec.chainLengths.map(len => runChain(spec.instruction, len))
  );
  return {
    adherenceByLength: results.map(r => scoreAdherence(r, spec.expectedBehavior)),
    driftRate: calculateDriftSlope(results), // Negative slope = drift
    firstDriftStep: findFirstDivergence(results),
  };
}
```
Grading rubric:
- A (No drift): Adherence stays above 90% at all chain lengths
- B (Mild drift): Adherence above 80% at chain length 10, above 70% at 20
- C (Moderate drift): Adherence drops below 70% by chain length 10
- F (Severe drift): Adherence drops below 50% by chain length 5
Remediation: Instruction reinforcement at chain boundaries. Inject a condensed version of core instructions every N steps. Summarize and compress conversation history periodically rather than carrying the full transcript.
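A minimal sketch of the reinforcement pattern, assuming a plain message-array conversation; `condensedInstructions` and `reinforcementInterval` are illustrative names, not part of any particular framework:

```typescript
// Sketch: re-inject condensed instructions every N steps so they stay
// near the end of the context, where recency pressure is strongest.
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function withInstructionReinforcement(
  history: Message[],
  condensedInstructions: string,
  reinforcementInterval: number // e.g. every 5 steps
): Message[] {
  const reinforced: Message[] = [];
  history.forEach((msg, i) => {
    reinforced.push(msg);
    // After every N messages, restate the core instructions.
    if ((i + 1) % reinforcementInterval === 0) {
      reinforced.push({
        role: 'system',
        content: `Reminder of core instructions: ${condensedInstructions}`,
      });
    }
  });
  return reinforced;
}
```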
Category 2: Context Mismanagement
What it looks like: The agent has the information it needs somewhere in its context but fails to use it correctly. It might ignore relevant context, use stale context over fresh context, or conflate context from different sources.
Root cause: The model cannot distinguish between relevant and irrelevant context, or between current and outdated information. This is distinct from instruction drift because the instructions are followed — the agent just operates on the wrong information.
Detection pattern: Provide the agent with a mix of relevant and irrelevant context, then test whether it uses the right pieces. Include contradictory information at different timestamps and test whether it resolves conflicts correctly.
```typescript
// Test context selection and recency handling
interface ContextEval {
  relevantContext: ContextItem[];
  irrelevantContext: ContextItem[]; // Should be ignored
  staleContext: ContextItem[];      // Should be superseded
  freshContext: ContextItem[];      // Should take priority
  query: string;
}

function scoreContextUsage(response: string, spec: ContextEval): ContextScore {
  return {
    relevantUsed: checkCitation(response, spec.relevantContext),
    irrelevantIgnored: checkAbsence(response, spec.irrelevantContext),
    recencyCorrect: checkRecency(response, spec.staleContext, spec.freshContext), // Did it prefer fresh over stale?
    conflictResolution: checkConflictHandling(response, spec),
  };
}
```
Grading rubric:
- A: Uses relevant context, ignores irrelevant, prefers fresh over stale in all test cases
- B: Occasionally includes irrelevant context but does not base decisions on it
- C: Sometimes uses stale context over fresh, or misses relevant context
- F: Regularly bases decisions on irrelevant or outdated information
Remediation: Structured context injection with explicit recency markers. Tag context with timestamps and source reliability scores. Implement context ranking before injection rather than stuffing everything into the prompt.
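As a sketch of what ranking before injection can look like, assuming each context item carries a timestamp and a source-reliability score (the weighting here is illustrative):

```typescript
// Sketch: rank context items by recency and source reliability, then
// inject only the top-K with explicit markers instead of raw stuffing.
interface TaggedContextItem {
  content: string;
  timestamp: number;   // epoch ms
  reliability: number; // 0..1, source trust score
}

function rankAndFormatContext(items: TaggedContextItem[], topK: number): string {
  const now = Date.now();
  const scored = items
    .map(item => {
      const ageHours = (now - item.timestamp) / 3_600_000;
      const recency = 1 / (1 + ageHours); // newer = closer to 1
      return { item, score: 0.6 * recency + 0.4 * item.reliability };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);

  // Explicit recency markers so the model can prefer fresh over stale.
  return scored
    .map(({ item }) =>
      `[reliability: ${item.reliability.toFixed(2)}, as of ${new Date(item.timestamp).toISOString()}]\n${item.content}`
    )
    .join('\n\n');
}
```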
Category 3: Tool Misuse
What it looks like: The agent calls the right tool with wrong arguments, the wrong tool entirely, or calls tools in an incorrect sequence. It might also call a tool when no tool call is needed, or fail to call a tool when one is required.
Root cause: Ambiguous tool descriptions, overlapping tool capabilities, or insufficient examples in tool schemas. The model is pattern-matching on tool names and descriptions rather than understanding tool semantics.
Detection pattern: Create a test suite of scenarios where each scenario has one correct tool call (or sequence) and multiple plausible-but-wrong alternatives. Measure correct tool selection rate, argument accuracy, and sequence correctness.
```typescript
interface ToolEval {
  scenario: string;
  availableTools: ToolDefinition[];
  expectedCalls: ExpectedToolCall[]; // Correct sequence; order matters
  distractorTools: ToolDefinition[]; // Should NOT be called
}

function scoreToolUsage(actual: ToolCall[], expected: ExpectedToolCall[]): ToolScore {
  return {
    toolSelection: jaccard(actual.map(t => t.name), expected.map(t => t.name)),
    argumentAccuracy: actual.map((a, i) => compareArgs(a.args, expected[i]?.args)),
    sequenceCorrect: isCorrectOrder(actual, expected), // Did it call tools in the right order?
    spuriousCalls: actual.filter(a => !expected.find(e => e.name === a.name)).length,
    missingCalls: expected.filter(e => !actual.find(a => a.name === e.name)).length,
  };
}
```
Grading rubric:
- A: Correct tool, correct arguments, correct sequence in 95%+ of cases
- B: Correct tool selection but occasional argument errors (wrong parameter types, missing optional args)
- C: Correct tool most of the time but sometimes selects a similar-but-wrong tool
- F: Regularly calls wrong tools, wrong arguments, or calls tools when none are needed
Remediation: Improve tool descriptions with explicit usage examples and counter-examples. Add “when NOT to use this tool” to tool schemas. Reduce tool overlap — if two tools have similar descriptions, the model will confuse them.
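A hedged example of a tool definition written this way; the `whenNotToUse` and `examples` fields are illustrative additions, not a specific framework's schema:

```typescript
// Sketch: a tool description with usage examples, counter-examples, and
// explicit "when NOT to use" guidance to reduce overlap-driven confusion.
const searchOrdersTool = {
  name: 'search_orders',
  description:
    'Look up existing orders by customer email or order ID. ' +
    'Use this when the user asks about the status of an order they already placed.',
  whenNotToUse:
    'Do NOT use this to create, modify, or cancel orders (use create_order / cancel_order). ' +
    'Do NOT use this for product searches.',
  examples: [
    { input: 'Where is my order #4821?', call: { orderId: '4821' } },
    { input: 'Has my package shipped?', call: { customerEmail: 'user@example.com' } },
  ],
  parameters: {
    type: 'object',
    properties: {
      orderId: { type: 'string' },
      customerEmail: { type: 'string' },
    },
  },
};
```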
Category 4: Hallucinated Actions
What it looks like: The agent claims to have performed an action it did not actually perform, or reports results from an action that never executed. This is distinct from text hallucination — it is behavioral hallucination.
Root cause: The model has learned that users expect action confirmation, so it generates confirmation text even when the action failed, timed out, or was never attempted. This is particularly dangerous because the user trusts the agent’s report of its own actions.
Detection pattern: Intercept actual tool calls and compare the agent’s reported actions against the real execution log. Test with tools that intentionally fail and verify the agent reports the failure rather than fabricating a success.
```typescript
interface ActionAudit {
  agentReport: string;                 // What the agent claims it did
  actualExecutionLog: ToolExecution[]; // What actually happened: ground truth from tool middleware
}

function detectHallucinatedActions(audit: ActionAudit): ActionScore {
  const claimed = extractClaimedActions(audit.agentReport);
  const actual = audit.actualExecutionLog;
  return {
    fabricatedActions: claimed.filter(c => !actual.find(a => matches(c, a))),
    unreportedFailures: actual
      .filter(a => a.status === 'failed')
      .filter(a => !audit.agentReport.includes('failed')), // Did it hide failures?
    accuracyRate: actual.filter(a => correctlyReported(a, claimed)).length / actual.length,
  };
}
```
Grading rubric:
- A: Zero hallucinated actions. All failures reported accurately. No fabricated successes
- B: No fabricated successes but occasionally omits failure details (says “error occurred” without specifics)
- C: Occasionally claims partial success when the action fully failed
- F: Fabricates action results or claims actions it never attempted
Remediation: Add explicit failure handling instructions to the system prompt. Require the agent to cite the tool response in its confirmation. Implement a verification layer that cross-references agent reports against execution logs.
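One way to wire the detector above into a verification layer, sketched under the same assumed types; `verifyAndSend` and its wording are illustrative:

```typescript
// Sketch: gate the agent's reply behind the detector so a fabricated
// report never reaches the user unchallenged.
async function verifyAndSend(audit: ActionAudit, send: (msg: string) => Promise<void>) {
  const score = detectHallucinatedActions(audit);
  const mismatch = score.fabricatedActions.length > 0 || score.unreportedFailures.length > 0;
  if (mismatch) {
    // Surface the real outcome instead of the fabricated summary.
    const failedCount = audit.actualExecutionLog.filter(a => a.status === 'failed').length;
    await send(
      `I could not complete all requested actions (${failedCount} step(s) failed). ` +
        'Please see the execution log for details.'
    );
    return;
  }
  await send(audit.agentReport);
}
```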
Category 5: Goal Misalignment
What it looks like: The agent completes the task as stated but misses the user’s actual intent. It satisfies the letter of the request but not the spirit. The output is technically correct but practically useless.
Root cause: The agent optimizes for the literal instruction rather than the underlying goal. This is a natural consequence of instruction-following training — the model learns to satisfy the stated request without modeling why the user is making the request.
Detection pattern: Create eval scenarios where the stated request is ambiguous or incomplete, and the correct behavior requires inferring the user’s underlying goal. Score based on whether the agent addresses the actual need or just the surface request.
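The other categories ship with automated scoring; goal misalignment usually needs a human or LLM grader. A sketch of what a scenario and scorer might look like, where `gradeAgainstRubric` stands in for that grading step:

```typescript
// Sketch: goal-misalignment eval scenario plus a scorer that leans on a
// human or LLM grader for the judgment calls.
interface GoalEval {
  statedRequest: string;         // What the user literally asked for
  underlyingGoal: string;        // What they actually need
  ambiguityNotes: string;        // Why the request underdetermines the goal
  acceptableBehaviors: string[]; // e.g. addressing the goal, or asking to clarify
}

async function scoreGoalAlignment(response: string, spec: GoalEval): Promise<GoalScore> {
  return {
    addressedStatedRequest: await gradeAgainstRubric(response, spec.statedRequest),
    addressedUnderlyingGoal: await gradeAgainstRubric(response, spec.underlyingGoal),
    recognizedAmbiguity: /clarif|confirm|did you mean/i.test(response), // crude heuristic
  };
}
```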
Grading rubric:
- A: Identifies the underlying goal and addresses it, even when the stated request is ambiguous
- B: Satisfies the stated request and asks clarifying questions when goals are ambiguous
- C: Satisfies the stated request literally without recognizing ambiguity
- F: Satisfies neither the stated request nor the underlying goal
Remediation: Add goal-inference prompting — instruct the agent to identify the user’s underlying objective before acting. Implement a confirmation step for ambiguous requests. Provide the agent with user context (via self-models or session history) that helps disambiguate intent.
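A minimal sketch of a goal-inference preamble; the exact wording is illustrative, not a tested prompt:

```typescript
// Sketch: goal-inference instructions prepended to the system prompt.
const goalInferencePreamble = `
Before acting on any request:
1. State, in one sentence, what you believe the user's underlying goal is.
2. If the request is ambiguous or the goal could reasonably be interpreted
   in more than one way, ask one clarifying question instead of acting.
3. Optimize your actions for the underlying goal, not just the literal wording.
`;
```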
Category 6: Recovery Failure
What it looks like: The agent encounters an error, a tool failure, or an unexpected state and cannot recover. It either gets stuck in a loop, gives up silently, or produces incoherent output as it tries to work around the failure.
Root cause: Most agent architectures have no error recovery strategy. The happy path is well-designed. The unhappy path is undefined. When something goes wrong, the agent falls back to generic model behavior, which is not equipped to handle tool failures, partial results, or inconsistent state.
Detection pattern: Inject failures at various points in multi-step workflows: tool timeouts, partial results, contradictory data, permission errors. Score the agent’s ability to detect the failure, communicate it clearly, and either recover or escalate appropriately.
```typescript
// Simulate production failure modes
type FailureInjection =
  | { type: 'timeout'; step: number }
  | { type: 'partial_result'; step: number; completeness: number }
  | { type: 'permission_denied'; step: number }
  | { type: 'contradictory_data'; step: number };

function scoreRecovery(agent: AgentRun, injection: FailureInjection): RecoveryScore {
  return {
    detected: agent.acknowledgedError,    // Did it notice the failure?
    communicated: agent.informedUser,
    recovered: agent.foundAlternativePath,
    escalated: agent.requestedHumanHelp,  // When recovery is not possible
    looped: agent.repeatedSameAction > 2, // Retry storm detection
  };
}
```
Grading rubric:
- A: Detects failure, communicates clearly, recovers via alternative path or escalates appropriately
- B: Detects failure and communicates it, but does not attempt recovery
- C: Partially detects failure — retries once or twice, then gives up without clear communication
- F: Gets stuck in a loop, fails silently, or produces incoherent output after the failure
Remediation: Define explicit recovery strategies for each tool and each failure type. Add retry budgets with backoff. Implement escalation paths that route to human operators when automated recovery is exhausted.
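A sketch of a retry budget with exponential backoff and an explicit escalation path; `escalateToHuman` is a hypothetical hook for whatever human-in-the-loop channel you use:

```typescript
// Sketch: per-tool retry budget with backoff; escalate once the budget
// is exhausted instead of looping or failing silently.
async function callWithRecovery<T>(
  toolName: string,
  call: () => Promise<T>,
  opts = { maxRetries: 3, baseDelayMs: 500 }
): Promise<T> {
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt === opts.maxRetries) {
        // Recovery budget exhausted: route to a human operator.
        await escalateToHuman({ toolName, attempts: attempt + 1, lastError: String(err) });
        throw err;
      }
      await sleep(opts.baseDelayMs * 2 ** attempt); // exponential backoff
    }
  }
  throw new Error('unreachable');
}

const sleep = (ms: number) => new Promise(res => setTimeout(res, ms));
```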
Building the Eval Suite
The six categories above give you a taxonomy. To make it operational, you need an eval suite that runs automatically and produces actionable reports.
The eval suite structure:
| Category | Test Count (minimum) | Automation Level | Run Frequency |
|---|---|---|---|
| Instruction Drift | 20 scenarios x 4 chain lengths | Fully automated | Every model/prompt change |
| Context Mismanagement | 30 scenarios with mixed context | Fully automated | Every model/prompt change |
| Tool Misuse | 15 scenarios per tool | Fully automated | Every tool schema change |
| Hallucinated Actions | 20 scenarios with injected failures | Automated + log audit | Every deployment |
| Goal Misalignment | 25 scenarios with ambiguous requests | Semi-automated (human grading) | Weekly |
| Recovery Failure | 10 failure types x 5 injection points | Fully automated | Every architecture change |
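A sketch of how the per-category suites might roll up into one report, so regressions stay visible per category rather than blending into a single score; the grade thresholds below are placeholders, since each category's rubric above defines its own boundaries:

```typescript
// Sketch: aggregate per-category eval results into a single report.
type FailureCategory =
  | 'instruction_drift'
  | 'context_mismanagement'
  | 'tool_misuse'
  | 'hallucinated_actions'
  | 'goal_misalignment'
  | 'recovery_failure';

interface CategoryReport {
  category: FailureCategory;
  passRate: number; // 0..1 across the category's scenarios
  grade: 'A' | 'B' | 'C' | 'F';
}

async function runEvalSuite(
  suites: Record<FailureCategory, () => Promise<number>> // each returns a pass rate
): Promise<CategoryReport[]> {
  const entries = Object.entries(suites) as [FailureCategory, () => Promise<number>][];
  return Promise.all(
    entries.map(async ([category, run]) => {
      const passRate = await run();
      return { category, passRate, grade: toGrade(passRate) };
    })
  );
}

function toGrade(passRate: number): 'A' | 'B' | 'C' | 'F' {
  if (passRate >= 0.9) return 'A';
  if (passRate >= 0.8) return 'B';
  if (passRate >= 0.7) return 'C';
  return 'F';
}
```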
The goal is not perfection across all categories — it is visibility. When you know that your agent scores A on tool usage but C on recovery, you know where to invest engineering time. When a production failure occurs, you can classify it immediately instead of opening a 4-hour investigation.
From Taxonomy to Improvement
The taxonomy is only useful if it drives systematic improvement. Here is the loop:
1. Classify every production failure into one of the six categories
2. Aggregate failures weekly to identify which categories are trending up (a triage-record sketch follows this list)
3. Prioritize the category with the highest user-impact failures
4. Fix the root cause using the category-specific remediation pattern
5. Validate the fix with the category-specific eval suite
6. Monitor for regression using automated evals
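A sketch of the triage record that makes steps 1 and 2 concrete, assuming the `FailureCategory` type from the eval-suite sketch above; the impact weights are illustrative:

```typescript
// Sketch: triage record per production failure, plus the weekly rollup
// that weights each category by user impact.
interface FailureTriage {
  incidentId: string;
  category: FailureCategory;
  userImpact: 'low' | 'medium' | 'high';
  occurredAt: string; // ISO date
  notes: string;
}

function aggregateByCategory(triages: FailureTriage[]): Record<string, number> {
  const weight = { low: 1, medium: 3, high: 10 };
  return triages.reduce<Record<string, number>>((acc, t) => {
    acc[t.category] = (acc[t.category] ?? 0) + weight[t.userImpact];
    return acc;
  }, {});
}
```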
This is not novel engineering methodology. It is the same approach that made traditional software reliable: classify failures, identify patterns, fix root causes, prevent regressions. The only thing new is the taxonomy adapted for AI agent failure modes.
We build AI agents with production-grade evaluation infrastructure at Clarity. If your agents are failing in ways you cannot categorize or systematically fix, we should talk.