Thought Leadership

How to Investigate an AI System Failure

A practitioner's guide to AI forensics

10 March 2026 · 14 min read · Author: James Jackson, Founder
What This Article Covers

This guide walks through the complete process of investigating an AI system failure, from assessing whether your organisation even has the data to support an investigation, through the specific techniques for diagnosing hallucinations, RAG failures, model drift, and agentic breakdowns. It is written for the people who commission and run these investigations.

TL;DR

The biggest obstacle to investigating an AI failure is not finding the root cause; it is discovering that your organisation never captured the data needed to reconstruct what happened. Audit your telemetry first: inference logs, prompt chains, retrieval context, embedding versions, and agent reasoning traces. Without these, no investigation methodology will save you. LLM failures are non-deterministic and leave no conventional traces, so techniques like semantic entropy analysis and RAG pipeline tracing replace the stack traces and error codes you are used to.

How to investigate an AI system failure

If you have experience with traditional incident response, you already understand how to investigate a system failure. The problem is that most of what you know does not transfer cleanly to AI systems, and the places where it breaks down are not obvious.

Traditional software fails deterministically. The same bug produces the same error. You find the root cause in the code, the configuration, or the infrastructure. The evidence sits in logs, stack traces, and version control. Frameworks like NIST SP 800-61 Rev. 3 and ISO 27035 were built for this world: identify indicators of compromise, trace the kill chain, contain, eradicate, recover.

LLMs are non-deterministic. The same input can produce different outputs on different runs. A hallucinated response does not throw an error code. There is no stack trace for a confidently wrong answer. The system behaves exactly as designed; it simply produces an output that is factually incorrect, contextually inappropriate, or legally dangerous.

This distinction matters enormously for investigation. Here is how traditional assumptions break down.

| Assumption | Traditional Software | AI / LLM Systems |
|---|---|---|
| Failure signature | Errors produce consistent, repeatable signatures in logs | Hallucinations leave no conventional trace. The system reports success. |
| Reproducibility | Same input, same output. Failures can be reliably reproduced. | Same input may produce different outputs. Reproduction requires controlling model version, temperature, prompt chain, and retrieval context. |
| Degradation pattern | Failures are discrete events with clear before/after states | Model drift is gradual. Performance degrades incrementally, often unnoticed until a high-consequence failure surfaces. |
| Investigation surface | Application logs, network logs, system events | Inference logs, prompt chains, retrieval context, embedding versions, agent reasoning traces, most of which are not captured by default. |
| Root cause location | Code, configuration, infrastructure, or human action | Could be the model, the prompt, the retrieval pipeline, the training data, the guardrails, or the interaction between all of them. |

Agentic AI systems compound every one of these problems. When an AI agent autonomously invokes tools, delegates tasks to sub-agents, and makes multi-step reasoning decisions, the investigation surface expands dramatically. You are no longer tracing a single inference. You are reconstructing an entire decision chain, potentially spanning multiple models, tools, and external data sources.

The OECD AI Incidents Monitor added 108 new incident IDs between November 2025 and January 2026 alone. The volume is accelerating, and the investigation methodology has not kept pace. What exists today was built for a world of deterministic systems with static indicators of compromise. We need something better.

Before you can investigate, you need data

The majority of AI deployments I assess lack the telemetry infrastructure to support a post-incident investigation. Application logs alone are insufficient. Research consistently shows that explicit traces of AI-specific attack vectors appear in application logs in only around 40% of cases. For subtle failures like hallucinations or gradual model drift, the figure is almost certainly lower.

Before you can investigate anything, you need to confirm that the following data exists. If it does not, the investigation will have significant blind spots, and your findings will carry caveats that weaken their defensibility.

The telemetry checklist

  • Inference logs. Full input/output pairs with timestamps, model version identifiers, and inference parameters (temperature, top-p, max tokens). Without these, you cannot reconstruct what the model received and what it produced.

  • Prompt chain persistence. The complete chain of system prompts, user inputs, and any intermediate reasoning steps. Many systems construct prompts dynamically from templates, user input, and retrieved context. If the assembled prompt is not persisted, it is gone.

  • Retrieval context. For RAG systems, the exact documents or chunks retrieved, the similarity scores that determined their selection, and the query that triggered retrieval. This must be stored alongside the generated output, not just in the retrieval system's own logs.

  • Embedding versions. Which embedding model version was used for retrieval at the time of the incident. Embedding models get updated. If the version is not tracked, you cannot determine whether a retrieval failure was caused by a query problem or an embedding change.

  • Agent action logs. For agentic systems, the tool calls made, the reasoning traces that led to those calls, and the full delegation chain. Which agent decided to call which tool, with what parameters, and what did it do with the result?

  • Model metadata. Version, fine-tuning provenance, deployment configuration, and guardrail settings. You need to know exactly which model was running, how it was configured, and what safety mechanisms were in place.

  • User session context. Who triggered the interaction, from where, with what permissions, and within what application context.

These requirements align with the OpenTelemetry Semantic Conventions for Generative AI, which are formalising industry-standard attributes for AI systems, including model version, temperature, token usage, prompt content, and tool calls.
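To make the checklist concrete, here is a minimal sketch of what a forensically complete inference log record could look like. The attribute names loosely follow the spirit of the OpenTelemetry GenAI conventions (`gen_ai.*`), but the overall schema, field names, and helper function are illustrative assumptions, not a standard.

```python
import json
import time
import uuid

def build_inference_record(model, params, prompt_chain, retrieval, output):
    """Assemble one forensically useful inference log record.

    Field names loosely echo the OpenTelemetry GenAI semantic
    conventions; the overall schema here is an illustrative sketch.
    """
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "gen_ai.request.model": model["version"],          # exact model version, not "latest"
        "gen_ai.request.temperature": params["temperature"],
        "gen_ai.request.top_p": params["top_p"],
        "gen_ai.request.max_tokens": params["max_tokens"],
        "prompt_chain": prompt_chain,                      # the assembled prompt, not the template
        "retrieval": retrieval,                            # chunks, scores, embedding model version
        "output": output,
    }

# Hypothetical example values for illustration.
record = build_inference_record(
    model={"version": "my-model-2026-01"},
    params={"temperature": 0.2, "top_p": 0.9, "max_tokens": 512},
    prompt_chain=[{"role": "system", "content": "You are a policy assistant."},
                  {"role": "user", "content": "What is our refund policy?"}],
    retrieval={"embedding_model": "embed-v3",
               "chunks": [{"id": "policy-doc#4", "score": 0.83}]},
    output="Refunds are available within 30 days.",
)
print(json.dumps(record, indent=2))
```

The key design choice is that the record is self-contained: an investigator should be able to reconstruct the inference from this one object without joining across the retrieval system's separate logs.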

Why most organisations do not capture this

The honest answer is cost and architectural complexity. Logging full inference chains with retrieval context generates significant storage volume. Most engineering teams optimise for latency and cost, not forensic readiness. The prevailing assumption is "we can retrain or roll back if something goes wrong." That assumption holds until you face a regulator, a litigant, or an insurance underwriter who wants to know exactly what happened and when.

The regulatory pressure is real and imminent. The EU AI Act, Article 73, requires providers of high-risk AI systems to report serious incidents to market surveillance authorities within 15 days and conduct investigations. The obligations take effect in August 2026, with fines of up to 15 million euros or 3% of worldwide annual turnover. Without telemetry, you cannot comply with the investigation requirement, let alone meet the reporting timeline.

The time to build investigation capability is before an incident occurs, not after. If you read only one section of this article and take action on it, make it this one. Audit your telemetry. Identify the gaps. Fix them now, while the cost is an infrastructure decision rather than a crisis response.

How to conduct an AI system failure investigation

There are many mechanisms that can cause AI to fail. The table below provides an overview of the most important ones.

What can go wrong: an LLM failure mode taxonomy

| Failure Mode | What It Is | What It Causes | How You Investigate It |
|---|---|---|---|
| Hallucination | The model generates factually incorrect output that is not grounded in its context or training data | Misinformation presented to users, fabricated citations, incorrect contractual or medical advice, legal liability | Context disconnect tracing, semantic entropy measurement, consortium verification across multiple models |
| RAG retrieval failure | The retrieval pipeline returns wrong, incomplete, or irrelevant documents to the model's context window | Model generates plausible but incorrect answers because it never received the right source material | Retrieval evaluation (similarity scores, ranking analysis), chunk boundary analysis, index version auditing |
| Model drift | Gradual degradation of model performance over time due to data distribution shifts, upstream model updates, or environment changes | Slowly declining output quality that goes unnoticed until a high-consequence failure surfaces | Embedding space distribution shift analysis, golden dataset regression testing, temporal correlation with pipeline changes |
| Prompt injection | Malicious input that manipulates the model into ignoring its instructions, exposing system prompts, or producing harmful output | Data exfiltration, guardrail bypass, unauthorised actions, system prompt leakage | Input classification, boundary testing, guardrail bypass analysis, system prompt exposure assessment |
| Guardrail failure | Safety filters, output validators, or content policies fail to catch a problematic input or output | Harmful, biased, or non-compliant content reaches end users despite safety measures being in place | Guardrail rule auditing, edge case testing against the specific failure vector, policy gap analysis |
| Agent reasoning failure | An autonomous agent makes incorrect decisions in its tool selection, task delegation, or multi-step reasoning chain | Incorrect actions taken on behalf of users, cascading errors through tool invocations, authority boundary violations | Decision chain reconstruction, tool invocation auditing, reasoning trace evaluation, authority delegation analysis |
| Data pipeline corruption | Upstream data used for fine-tuning, RAG indexing, or feature generation is corrupted, stale, or poisoned | Model trained or augmented with bad data produces systematically flawed outputs | Data lineage tracing, pipeline integrity checks, comparison against known-good baseline datasets |
| Context window overflow | Input exceeds the model's context window, causing truncation of critical information (system prompts, safety instructions, or key context) | Model loses access to instructions or context it needs, leading to unpredictable behaviour or safety failures | Token count analysis, truncation point identification, context prioritisation auditing |

These failure modes align with the incident archetypes identified in the GenAI Incident Response Framework and the Coalition for Secure AI's AI Incident Response Framework v1.0, both published in 2025.

With these failures in mind, here is how I actually investigate an issue.

The four phases at a glance

  1. Stabilise and preserve. Contain the system to prevent further harm. Preserve all available telemetry and model state before it is overwritten or rotated. Establish a timeline and determine the blast radius.

  2. Reconstruct the failure. Map the full inference pathway from input to output. Reproduce the failure in a controlled environment using preserved model state, prompt chains, and retrieval context.

  3. Root cause analysis. Apply failure-mode-specific analytical methods (from the table above) to identify the proximate cause. Distinguish between model-level, data-level, infrastructure-level, and human-level factors.

  4. Causal synthesis. Structure findings into a defensible causal chain. Separate proximate cause from contributing factors and systemic issues. Determine whether the failure was isolated or systemic.
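Phase 2 is where non-determinism bites: you cannot replay once and call it reproduced. A sketch of a replay harness, assuming a preserved incident record and a `call_model` client (both hypothetical stand-ins here), reports a reproduction rate rather than a binary result:

```python
from collections import Counter

def replay_incident(record, call_model, runs=10):
    """Replay a preserved inference against a pinned configuration.

    `call_model` is a stand-in for your real inference client; it must
    honour the exact model version and sampling parameters recorded at
    incident time, otherwise the replay proves nothing.
    """
    outputs = [
        call_model(
            model=record["model_version"],
            temperature=record["temperature"],
            prompt_chain=record["prompt_chain"],
            retrieval_context=record["retrieval_context"],
        )
        for _ in range(runs)
    ]
    counts = Counter(outputs)
    # With non-deterministic systems you report a reproduction *rate*,
    # not a single pass/fail.
    repro_rate = counts.get(record["incident_output"], 0) / runs
    return repro_rate, counts

# Hypothetical stub standing in for a real inference endpoint.
def fake_model(model, temperature, prompt_chain, retrieval_context):
    return "Refunds are available within 90 days."

rate, counts = replay_incident(
    {"model_version": "my-model-2026-01", "temperature": 0.2,
     "prompt_chain": ["..."], "retrieval_context": ["..."],
     "incident_output": "Refunds are available within 90 days."},
    fake_model,
)
print(f"reproduction rate: {rate:.0%}")
```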

Two techniques in practice

Each of those phases warrants its own detailed treatment. In the interests of brevity, here are two examples of root cause analysis techniques that I use frequently.

1. Semantic entropy analysis. This is one of the most powerful forensic techniques available for LLM investigation. Research by Farquhar et al. (published in Nature, 2024) demonstrated that semantic entropy, a measure of meaning-level uncertainty across multiple model outputs, can reliably distinguish between a model that is confidently wrong and one that is visibly uncertain. In practice, you run the same query multiple times and cluster the responses by meaning, not by surface text. If the model produces semantically divergent answers to the same question, it lacks reliable knowledge on that topic and is confabulating. This technique applies directly to hallucination investigations, but it is equally useful for validating whether a model's outputs on a given topic are stable enough to trust in production.

How semantic entropy detects confabulation

Semantic entropy clusters model responses by meaning, not surface text, to distinguish grounded knowledge from confabulation.
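The discrete version of the calculation can be sketched in a few lines. This is a simplification of the Farquhar et al. method: `same_meaning` stands in for the bidirectional-entailment check they perform with an NLI model, and the toy string comparison below is an assumption for illustration only.

```python
import math

def semantic_entropy(responses, same_meaning):
    """Discrete semantic entropy over meaning clusters.

    Greedily cluster responses by meaning, then compute Shannon entropy
    over the cluster frequencies. High entropy suggests the model lacks
    stable knowledge and is confabulating.
    """
    clusters = []
    for r in responses:
        for c in clusters:
            if same_meaning(r, c[0]):
                c.append(r)
                break
        else:
            clusters.append([r])
    n = len(responses)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Toy equivalence for demonstration; real investigations use an
# entailment model, not string normalisation.
def toy_same_meaning(a, b):
    return a.lower().rstrip(".") == b.lower().rstrip(".")

stable = ["Paris.", "paris", "Paris."]   # one meaning cluster: entropy zero
confab = ["1923.", "1954.", "1887."]     # three clusters: high entropy
print(semantic_entropy(stable, toy_same_meaning))
print(semantic_entropy(confab, toy_same_meaning))
```

The stable answers collapse into one cluster; the fabricated dates each form their own cluster, and the entropy difference is the forensic signal.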

2. Context grounding verification. Before investigating model behaviour, determine whether the model even had the right information to work with. For RAG systems, this means extracting the exact documents retrieved, the similarity scores, and the query that triggered retrieval, then evaluating whether the correct source material was present in the context window. A large proportion of what looks like a hallucination is actually a retrieval failure upstream. (Barnett et al. documented seven distinct failure points in RAG systems, and the majority occur before the model even generates a response.) The model generated a plausible answer because it never received the right source material. This distinction matters because the remediation is completely different: you do not fix the model, you fix the pipeline.

Investigating a RAG retrieval failure: where to look

Each stage of the RAG pipeline is a potential failure point.
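The grounding check described above can be sketched as follows. The naive substring match and the field names are assumptions for illustration; real investigations use entailment or span matching against the gold source documents.

```python
def grounding_check(retrieved_chunks, required_facts):
    """Was the right source material in the context window at all?

    If required facts are missing from the retrieved chunks, the failure
    is upstream in the retrieval pipeline, not in the model. The
    substring check here is a deliberate simplification.
    """
    context = " ".join(c["text"].lower() for c in retrieved_chunks)
    missing = [f for f in required_facts if f.lower() not in context]
    return {"grounded": not missing, "missing_facts": missing}

# Hypothetical preserved retrieval context from an incident.
chunks = [{"id": "policy#2", "score": 0.81,
           "text": "Refunds are available within 30 days of purchase."}]
result = grounding_check(chunks, ["within 30 days", "excluding sale items"])
print(result)  # the sale-items clause was never retrieved
```

A result like this shifts the whole investigation: the model answered from incomplete context, so remediation targets chunking, indexing, or query formulation rather than the model itself.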

There are many additional methods, including distribution shift detection, adversarial boundary testing, and decision chain reconstruction for agentic systems. Each requires its own techniques, tooling, and interpretive frameworks.
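As one example of distribution shift detection, a first-cut drift signal compares the embedding centroid of a baseline window against the current window. The centroid check is a crude sketch; production drift detection uses richer population-level statistics.

```python
import math

def centroid(vectors):
    """Elementwise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def drift_score(baseline_embeddings, current_embeddings):
    """Cosine distance between centroids of two embedding windows.

    Zero means the populations point the same way on average; values
    approaching one mean the query or document distribution has moved.
    """
    return cosine_distance(centroid(baseline_embeddings),
                           centroid(current_embeddings))

# Tiny 2-d toy embeddings for illustration.
baseline = [[1.0, 0.0], [0.9, 0.1]]
shifted  = [[0.1, 0.9], [0.0, 1.0]]
print(f"drift vs self:    {drift_score(baseline, baseline):.3f}")
print(f"drift vs shifted: {drift_score(baseline, shifted):.3f}")
```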

Verifying that fixes actually work

A root cause analysis is incomplete without verification that the remediation addresses the actual root cause, not just the symptoms. I have seen this go wrong enough times that I consider it a separate, essential phase of any investigation.

Remediation validation involves:

  • Testing guardrails against the specific failure mode identified. If the root cause was a prompt injection bypass, test the updated guardrails against the exact attack vector and reasonable variations of it.

  • Validating retrieval pipeline corrections. If the root cause was a RAG failure, confirm that the fix (re-chunking, index rebuild, query reformulation) resolves the specific retrieval failure and does not introduce regressions elsewhere.

  • Confirming monitoring gaps have been closed. If the investigation revealed telemetry gaps, verify that new logging is actually capturing the data it was designed to capture. Deploy, check, confirm. Do not assume.

  • Re-running the failure scenario in a controlled environment post-fix. Replay the original failure conditions against the remediated system. Confirm the failure no longer reproduces. Document the results.
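The replay-and-regression step above can be sketched as a small harness. `run_system` stands in for whatever invokes your remediated pipeline, and the cases shown are hypothetical; the point is that the original failure case and a regression suite are checked in the same pass.

```python
def validate_remediation(run_system, failure_cases, regression_cases):
    """Replay the original failure conditions and a regression suite
    against the remediated system. Each case pairs an input with a
    predicate on the output."""
    results = {"failure_fixed": [], "regressions": []}
    for case in failure_cases:
        results["failure_fixed"].append(case["check"](run_system(case["input"])))
    for case in regression_cases:
        results["regressions"].append(case["check"](run_system(case["input"])))
    # Pass only if the original failure no longer reproduces AND
    # the fix introduced no regressions elsewhere.
    results["pass"] = all(results["failure_fixed"]) and all(results["regressions"])
    return results

# Hypothetical stub for the remediated system.
def fixed_system(query):
    return "Refunds are available within 30 days, excluding sale items."

report = validate_remediation(
    fixed_system,
    failure_cases=[{"input": "refund policy?",
                    "check": lambda out: "excluding sale items" in out}],
    regression_cases=[{"input": "refund window?",
                       "check": lambda out: "30 days" in out}],
)
print("remediation validated:", report["pass"])
```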

The independence principle is non-negotiable. The team that built the system should not be the team that validates the fix. This is not about distrust. It is about rigour. Independent verification eliminates confirmation bias and produces findings that withstand external scrutiny. Vendor self-certification is not validation.

Post-remediation, establish ongoing monitoring requirements specific to the failure mode identified. A one-time fix is necessary but not sufficient. You need to confirm that the fix continues to hold over time, particularly for drift-related failures where the conditions that caused the original incident may recur.

The case for forensic readiness

Everything in this article depends on one prerequisite: that the data exists to support an investigation. The methodology is sound. The analytical techniques work. The reporting frameworks produce defensible evidence. But none of it matters if your AI systems are not capturing the telemetry needed to reconstruct what happened after the fact.

I want to frame forensic readiness not as a cost centre but as what it actually is: a competitive advantage.

  • Regulatory compliance. The EU AI Act's incident reporting obligations take effect in August 2026. Organisations that have built forensic readiness into their AI infrastructure will be able to comply. Those that have not will face a scramble that is both more expensive and more risky.

  • Insurance eligibility. Underwriters are already asking about AI governance. As AI-specific liability products mature and standard policies exclude AI exposures, the organisations that can demonstrate forensic readiness will have access to better coverage at better rates.

  • Faster incident resolution. When failures do occur (and they will), organisations with proper telemetry resolve incidents faster, with lower exposure, and with findings that actually prevent recurrence. The cost of an investigation drops dramatically when the data is already there.

  • Board confidence. AI programme governance is a board-level concern. Being able to say "we have the infrastructure to investigate any AI system failure within our deployment" is a genuinely valuable governance position.

The question I leave you with is simple. If your organisation is deploying AI systems, can you answer this: what data would we need to reconstruct a failure, and are we capturing it?

If the answer is no, or if you are not sure, the readiness audit is the place to start. Not because something has gone wrong, but because something will, and the organisations that are prepared will handle it better than those that are not.

Put this into practice.

This article covers the methodology. If you want hands-on support delivering cybersecurity assessments or building this capability into your consultancy, let's talk.

Get Hands-On Support