Practitioner Guide

How to Forensically Analyse LLM Alignment Drift and Hallucination

A practitioner's playbook for detecting output distribution shift

5 April 2026 · 18 min read · Author: James Jackson
What This Article Covers

An overview of a real engagement where we used three techniques to analyse a degraded LLM and establish whether a fix had addressed the root causes.

TL;DR

We demonstrated the root causes and provided evidence that the fix had addressed them. Embedding drift tells you whether outputs have moved in semantic space. Token-level analysis reveals confidence shifts the model has not yet expressed in its text. Semantic entropy detects where the model is uncertain and confabulating. No single technique catches every failure mode.

Article Overview

This is a practical guide for evaluating LLM drift, hallucination, or degraded functionality. It is based on a real forensic instruction in which we were asked to investigate an AI model that had generated an erroneous response, ultimately resulting in financial liability for the firm. The engineering team subsequently claimed to have fixed the issue, and AnalystEngine was instructed to investigate the circumstances of the original failure and determine whether the fixes fully remediated it.

We will use the terms Bad (the erroneous model) and Good (the patched model) to distinguish between the two.

This article sits alongside our guide on investigating AI system failures, which covers the broader investigation lifecycle including evidence preservation, failure mode taxonomy, and reporting methodology.

Three detection lenses

To get started, you need to create three key datasets (a minimal collection sketch follows this list):

  • A set of inputs passed to the LLM, including any inputs that led to erroneous responses, plus around 50 to 200 representative samples of normal activity.

  • A set of outputs generated from the inputs above using the Bad Model.

  • A set of outputs generated from the inputs above using the Good Model.
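
As a minimal sketch of collecting those datasets, assuming both model versions are reachable through an OpenAI-compatible endpoint (the model identifiers, sample counts, file names, and collect_outputs helper below are illustrative, not from the engagement):

import json
from openai import OpenAI

client = OpenAI()

def collect_outputs(prompts, model, samples_per_prompt, path):
    """Run every prompt through one model and persist the outputs for later comparison."""
    with open(path, "w") as f:
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                text = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                ).choices[0].message.content
                f.write(json.dumps({"prompt": prompt, "model": model, "output": text}) + "\n")

prompts = ["What is your return window?"]  # extend with the full 50-200 prompt set
collect_outputs(prompts, "bad-model-snapshot", 50, "bad_outputs.jsonl")    # Bad Model outputs
collect_outputs(prompts, "good-model-snapshot", 50, "good_outputs.jsonl")  # Good Model outputs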

We used three techniques to measure the distance between the Good and Bad models.

  • Embedding space drift to measure whether outputs have moved in semantic space. It tells us whether something changed and which prompts are affected, but not what changed at the token level or whether the model is uncertain.

  • Token-level distribution analysis to measure probability allocation at each position in the output. It reveals shifts the model has not yet expressed in its output text.

  • Semantic entropy analysis to measure meaning-level uncertainty across multiple sampled responses. It detects confabulation (the model gives different answers each time) but cannot detect confident wrong answers.

This article shows how we used these techniques to establish how and why the Bad Model became degraded, and how confident we were that the Good Model had addressed these issues.

Embedding space drift

This asks: have the outputs moved in semantic space?

Embedding drift quantifies how far the Bad Model's outputs have moved from the Good Model's in semantic space. It tells you which prompts are affected and by how much, but not what changed at the token level or whether the model is uncertain. If two models produce meaningfully different outputs for the same prompt, embedding drift will catch it.

What this looks like

In our engagement, the chatbot had been generating erroneous responses on customer policy queries. The engineering team deployed a fix. We ran the same return policy query through both models.

Good Model (patched):

"You can return any item within 30 days of delivery for a full refund. Items must be unused and in original packaging. Start your return at example.com/returns."

Bad Model (erroneous):

"Returns are generally handled on a case-by-case basis, though most items are eligible within our standard return window. Factors including product category, condition, and purchase date may influence eligibility. I'd suggest reviewing our returns page or contacting support for guidance on your specific situation."

Both answer the return policy question. But the Bad Model's output is vague and hedged, avoiding specific claims. The question is whether this difference is statistically significant across the prompt set, or whether it is within the range of normal variation you would expect from any model.

How to measure it

The idea behind embedding drift is straightforward. You convert text outputs into numerical vectors. Outputs with similar meaning end up close together in that vector space. Outputs with different meaning end up far apart. If you embed the Good Model outputs and the Bad Model outputs, and the two groups have moved apart, something has changed.

But a raw distance number on its own is meaningless. You need to know whether this distance is larger than what you would expect from normal variation.

Step 1: Embed the Good Model outputs. For a given prompt, take 50 sampled Good Model outputs (you need multiple outputs per prompt to measure variation) and run them through your embedding model. You now have 50 vectors.

Step 2: Establish what "normal variation" looks like. Take your 50 Good Model vectors, randomly shuffle them, and split them into two groups of 25. Measure the cosine distance between the average (centroid) of each group. Because both groups come from the same Good Model, this distance represents normal variation, i.e., drift that is not real.

Now shuffle again. Split again. Measure again. Do this 200 times. Each shuffle gives you a slightly different split, and each split gives you a slightly different distance. After 200 rounds, you have 200 distance scores that represent the range of "normal." Compute the mean and standard deviation of these 200 scores. This gives you a noise floor.

Step 3: Measure the actual drift. Embed the Bad Model outputs using the same embedding model. Compute the cosine distance between the Good Model centroid and the Bad Model centroid. You now have a single drift score.

Step 4: Compare against the noise floor. Subtract the mean of your 200 normal-variation scores from your actual drift score, then divide by the standard deviation. This gives you a z-score: how many standard deviations away from "normal" your observation is. If z > 3.0, the drift is very unlikely to be random variation.
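
Here is a minimal sketch of steps 1 to 4, assuming a sentence-transformers embedding model (the model name and the drift_z_score helper are illustrative; the 50-sample splits and 200 shuffles mirror the description above):

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

def centroid_distance(a, b):
    """Cosine distance between the centroids of two groups of vectors."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    return 1.0 - float(np.dot(ca, cb) / (np.linalg.norm(ca) * np.linalg.norm(cb)))

def drift_z_score(good_outputs, bad_outputs, n_shuffles=200, seed=0):
    rng = np.random.default_rng(seed)
    good_vecs = embedder.encode(good_outputs)   # step 1: embed the baseline outputs
    bad_vecs = embedder.encode(bad_outputs)

    # Step 2: noise floor from repeated random splits of the baseline
    null_distances = []
    for _ in range(n_shuffles):
        idx = rng.permutation(len(good_vecs))
        half = len(idx) // 2
        null_distances.append(centroid_distance(good_vecs[idx[:half]], good_vecs[idx[half:]]))
    mu, sigma = np.mean(null_distances), np.std(null_distances)

    # Step 3: actual drift between the Good and Bad centroids
    drift = centroid_distance(good_vecs, bad_vecs)

    # Step 4: z-score of the drift against the noise floor
    return (drift - mu) / sigma

A score above 3.0 from this sketch corresponds to the z > 3.0 threshold used in the results below.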

Measuring embedding drift against baseline variance

Left: baseline outputs cluster tightly. Centre: random splits of the baseline establish what normal variation looks like. Right: the distance to the current outputs, measured against that normal range, tells you whether the shift is real.

What this told us

We ran all 150 prompts through both models and computed the z-score for each one. The return policy prompt came back with a z-score of 6.0: the drift between the Bad Model and Good Model outputs sits six standard deviations above the mean of the normal-variation distances. Of the 150 prompts, 28 came back above the z > 3.0 threshold. The remaining 122 were stable.

The 28 flagged prompts all fall in the same category: policy-related queries. Return windows, refund terms, warranty conditions, cancellation procedures. These are the prompts where the Good Model gives specific, actionable answers (dates, amounts, links) and the Bad Model hedges. Product specification queries are stable. Greetings and small-talk are stable. The divergence is concentrated in one category.

Embedding drift results across 150 golden prompts

Most prompts sit comfortably within the noise floor. The 28 flagged prompts all cluster in one category: policy queries. The drift is real, measurable, and localised.

This confirms that the Bad Model has genuinely diverged from the Good Model on policy queries. It is real, measurable, and concentrated in one category. It does not tell us why, but it narrows the investigation from "something is wrong" to "policy queries have drifted."

Token-level distribution analysis

This asks: has the model's confidence changed, even when the output text looks the same?

Token-level analysis measures whether the Bad Model's probability distributions have shifted at each token position, even when the output text is identical to the Good Model's. Two models can produce the same words while assigning very different probabilities to those words. This technique catches that difference. It does not tell you whether the model is producing inconsistent outputs (that is what semantic entropy does), but it tells you where the model's conviction has weakened.

Why we ran this

Embedding drift told us that 28 policy prompts have drifted between the Good and Bad models, and 122 others are stable. But "stable" only means the output text has not moved in semantic space. It does not mean the Bad Model is equally confident about those answers. The product specification queries all passed embedding drift. The question is: is the Bad Model still sure about them, or is it losing its grip even where the outputs look the same?

To find out, we needed to look inside the model's probability distributions. When the Bad Model generates the token "245" in "The Model X headphones weigh 245 grams," how much probability does it assign to "245" versus alternatives like "250" or "240"? If that probability has dropped compared to the Good Model, the Bad Model is less certain, even though it still picks the same token.

How to get the data

This technique only works if you can extract token-level probabilities from your model. In the OpenAI Chat Completions API (and any OpenAI-compatible endpoint such as vLLM, llama.cpp, or TGI), add logprobs: true and top_logprobs set to the number of alternatives you want (maximum 20) to your request:

{
  "model": "gpt-4o",
  "messages": [
    { "role": "user", "content": "What is the weight of the Model X headphones?" }
  ],
  "logprobs": true,
  "top_logprobs": 5
}

The response includes the normal completion text plus a logprobs object. Here is what the data looks like at the "245" token position:

{
  "choices": [
    {
      "message": {
        "content": "The Model X headphones weigh 245 grams."
      },
      "logprobs": {
        "content": [
          {
            "token": "The",
            "logprob": -0.0012,
            "top_logprobs": [ "..." ]
          },
          "...",
          {
            "token": " 245",
            "logprob": -0.0619,
            "top_logprobs": [
              { "token": " 245", "logprob": -0.0619 },
              { "token": " 250", "logprob": -3.912 },
              { "token": " 240", "logprob": -4.605 },
              { "token": " 200", "logprob": -5.116 },
              { "token": " 300", "logprob": -5.809 }
            ]
          },
          "..."
        ]
      }
    }
  ]
}

Each element in content represents one generated token. The logprob field is the natural log of the probability. To convert it: exp(-0.0619) = 0.94, and exp(-3.912) = 0.02. The top_logprobs array shows the most likely alternatives at that position. You get this data alongside the normal completion response, so you can collect it during regular inference without a separate pipeline. OpenAI returns up to 20 tokens per position. For factual claim positions like our "245" example, the top 5 tokens typically cover 95%+ of the probability mass, so the top-20 limit is not a practical constraint.
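
Collecting and converting this data with the OpenAI Python client looks roughly like the sketch below (the prompt and model name are illustrative; any OpenAI-compatible endpoint that exposes logprobs behaves the same way):

import math
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the weight of the Model X headphones?"}],
    logprobs=True,
    top_logprobs=5,
)

# One entry per generated token; each carries its own top-5 alternatives.
for position in response.choices[0].logprobs.content:
    alternatives = {
        alt.token: math.exp(alt.logprob)  # convert log probability to probability
        for alt in position.top_logprobs
    }
    print(position.token, alternatives)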

Where this technique is not available

Reasoning models (OpenAI o-series, Claude with extended thinking) do not return logprobs. This technique cannot be used on them. Anthropic's Messages API does not expose logprobs on any model, so if your target is Claude, token-level analysis is not an option through the direct API. Embedding drift and semantic entropy both still work against these models without limitation.

What we found

We collected logprobs from both the Good Model and the Bad Model for the product specification prompts that embedding drift had marked as stable. The output text was identical from both models:

"The Model X headphones weigh 245 grams."

The logprobs response gives you a top_logprobs array at every token position in the output. To build the comparison, you take the top_logprobs at a given position from the Good Model response and the top_logprobs at the same position from the Bad Model response, convert each logprob to a probability using exp(), and line them up. At the "245" position, that gives you:

Token          Good Model   Bad Model
"245"          0.94         0.58
"250"          0.02         0.22
"240"          0.01         0.11
Other tokens   0.03         0.09

The model still outputs "245" because it is the top token. But it is far less confident. The probability mass has redistributed toward adjacent values.

To turn this into a single number you can compare across positions and prompts, we use Jensen-Shannon Divergence (JSD). JSD measures the distance between two probability distributions on a scale from 0 (the two distributions are identical) to 1 (they share no overlap at all). A JSD of 0.18 bits means the Good Model and Bad Model distributions at this position have substantially diverged. For context: if both distributions were identical, JSD would be 0.00. If the Good Model put 94% on "245" and the Bad Model put 94% on "250" instead, JSD would be close to 1.0. Our 0.18 sits in between: both models pick the same token, but the Bad Model's confidence has shifted enough that the distributions are measurably different.
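
A minimal sketch of that calculation, assuming you have already converted the top_logprobs at one position from each model into token-to-probability dictionaries as above (the jsd_bits helper is illustrative; base-2 logarithms give the result in bits):

import numpy as np

def jsd_bits(p, q):
    """Jensen-Shannon divergence in bits between two token->probability maps."""
    tokens = sorted(set(p) | set(q))                         # align over the union of tokens
    p_vec = np.array([p.get(t, 1e-12) for t in tokens])
    q_vec = np.array([q.get(t, 1e-12) for t in tokens])
    p_vec, q_vec = p_vec / p_vec.sum(), q_vec / q_vec.sum()  # renormalise
    m = 0.5 * (p_vec + q_vec)
    kl_pm = np.sum(p_vec * np.log2(p_vec / m))
    kl_qm = np.sum(q_vec * np.log2(q_vec / m))
    return float(0.5 * kl_pm + 0.5 * kl_qm)

good = {"245": 0.94, "250": 0.02, "240": 0.01}  # probabilities from the table above
bad = {"245": 0.58, "250": 0.22, "240": 0.11}
print(jsd_bits(good, bad))  # exact value depends on how much of the tail distribution you include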

Token-level probability shift

The output text is identical, but the model's confidence has dropped from 0.94 to 0.58. JSD quantifies this shift before the model starts producing different text.

The reason JSD is useful (rather than just looking at the table) is that you compute it at every token position in the output, not just one. The response "The Model X headphones weigh 245 grams." is 8 tokens. Each one has its own logprobs array and its own JSD score. When you line up all 8 JSD scores, the pattern is immediately visible:

JSD by token position

Seven tokens are stable. The factual claim token '245' spikes to 0.18 bits. The divergence is concentrated exactly where the model makes a specific knowledge claim.

In our case, the spike is at the factual claim: the "245" token. The structural tokens ("The", "Model", "weigh", "grams") are all flat. The model's confidence has eroded specifically on the knowledge claim, not on how it constructs the sentence.

What this told us about the investigation

Embedding drift established that 28 policy prompts diverge between the Good and Bad models, and 122 others produce identical text. Token-level analysis reveals that the 122 "identical" prompts are not actually equivalent. The Bad Model's confidence on factual claims is measurably lower than the Good Model's, even where both produce the same words. The damage in the Bad Model extends beyond the prompts that have visibly broken.

The Good Model does not show this confidence erosion on any of the 122 prompts. This is evidence that the patch addressed the underlying issue systemically, not just on the prompts where the problem was visible.

Token-level analysis documented the full extent of the Bad Model's degradation. But it cannot tell us whether the Bad Model is actively confabulating: producing different answers to the same question each time. For that, we need semantic entropy.

Semantic entropy analysis

This asks: is the model actively making things up?

Semantic entropy, based on Farquhar et al. (Nature, 2024), detects confabulation: cases where the model lacks reliable knowledge and produces different answers each time you ask. It does not detect confident wrong answers, where the model consistently produces the same incorrect response. These are different failure modes with different fixes:

  • If the model is confabulating (different answer each time), you can fix it by giving the model source documents to reference instead of relying on its training data, or by configuring it to say "I don't know" when its uncertainty is too high.
  • If the model is confidently wrong (same wrong answer every time), the only way to catch it is to check the model's answers against known correct answers. Semantic entropy will not flag it because the model is not uncertain.

Why we ran this

Embedding drift established that 28 policy prompts diverge between the two models. Token-level analysis established that the Bad Model's confidence erosion extends beyond those 28 prompts. The remaining question for the forensic record: on the 28 divergent policy prompts, what type of failure is the Bad Model exhibiting? Is it hedging consistently (a systematic shift in behaviour) or differently every time (confabulation)? These are different failure modes, and characterising them is necessary to assess whether the Good Model has addressed both.

How it works

Ask the same question multiple times at T=1.0 and see whether you get the same answer. If the model is grounded, the answers converge on the same meaning even if the wording varies. If the model is confabulating, you get different claims each time.

To illustrate: ask "What year was the Springfield Memorial Bridge completed?" (a fabricated question about a nonexistent bridge) ten times at T=1.0. The model has no training data to ground an answer. You get: "1923" (three times), "1957" (three times), "1948" (three times), "1962" (once). Four distinct year claims. Four clusters of meaning. The model is guessing.

Compare: "What is the capital of France?" Ten samples all converge on "Paris" in various phrasings. One cluster. The model is grounded.

Semantic entropy: confabulation vs grounded knowledge

Top: four meaning clusters, high entropy. The model is confabulating. Bottom: one cluster, zero entropy. The model is grounded.

How to compute the entropy

You have 10 text responses. The question is: how spread out are they across different meanings?

Step 1: Generate 10 responses at T=1.0 for each prompt. This is just 10 normal API calls with temperature: 1.0. No logprobs needed.
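
A sketch of that sampling loop, again assuming an OpenAI-compatible endpoint (the model name and sample_responses helper are illustrative):

from openai import OpenAI

client = OpenAI()

def sample_responses(prompt, n=10):
    """Draw n independent completions at temperature 1.0."""
    return [
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        ).choices[0].message.content
        for _ in range(n)
    ]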

Note on reasoning models

Reasoning models (OpenAI o-series, Claude with extended thinking) do not allow you to set the temperature parameter. However, they are not running at T=0 internally: their chain-of-thought process involves sampling, and you will get different outputs across runs. Semantic entropy can still detect confabulation on these models, but SE values may not be directly comparable to values from standard models at T=1.0.

Step 2: Cluster the responses by meaning. Embed each response using an embedding model. Compute the cosine similarity between every pair of responses. If two responses have a cosine similarity above 0.85, treat them as saying the same thing and put them in the same cluster. "Paris" and "The capital is Paris" end up in the same cluster. "1847" and "1849" need to end up in different clusters because, although the sentences are structurally similar, the factual claim is different. (Where numerical precision matters and cosine similarity is too coarse to make that separation, you can use a natural language inference model such as DeBERTa-Large-MNLI to check whether each response logically entails the other. This is slower but catches distinctions that embedding similarity misses.)
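
A minimal sketch of the clustering step, reusing a sentence-transformers embedder and a simple greedy assignment against the first member of each cluster (the cluster_by_meaning helper is illustrative; the 0.85 threshold follows the description above):

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_by_meaning(responses, threshold=0.85):
    """Greedily group responses whose embeddings have cosine similarity above the threshold."""
    vecs = embedder.encode(responses, normalize_embeddings=True)
    clusters = []
    for i, v in enumerate(vecs):
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if float(np.dot(v, vecs[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster is close enough: start a new one
    return clusters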

Step 3: Count the proportions. In our Springfield Bridge example: cluster A (1923) has 3 of 10 responses, so its proportion is 0.3. Cluster B (1957) is 0.3. Cluster C (1948) is 0.3. Cluster D (1962) is 0.1.

Step 4: Compute the entropy. For each cluster, multiply its proportion by the natural log of its proportion, then sum them up and flip the sign. In practice:

  • Cluster A: 0.3 × ln(0.3) = 0.3 × (-1.204) = -0.361
  • Cluster B: 0.3 × ln(0.3) = -0.361
  • Cluster C: 0.3 × ln(0.3) = -0.361
  • Cluster D: 0.1 × ln(0.1) = 0.1 × (-2.303) = -0.230
  • Sum: -1.313
  • Flip the sign: 1.31

The result is measured in nats (the unit you get when using the natural logarithm, just as bits are the unit when using log base 2). The scale works like this: 0.0 means all responses landed in one cluster, complete agreement. The maximum possible value with 10 samples is 2.30 (every response in its own cluster, complete disagreement). Our 1.31 is in the upper half of that range: substantial disagreement across four clusters.

For the "capital of France" example: one cluster, proportion 1.0. 1.0 × ln(1.0) = 0.0. Entropy is zero. Complete agreement.

What we found

We ran semantic entropy on the 28 policy prompts that embedding drift had flagged, using the Bad Model. For each prompt, we sampled 10 responses at T=1.0 and clustered by meaning.

The results split into two groups.

19 prompts had low entropy (below 0.3, near-complete agreement). The Bad Model consistently produces the same hedgy answer. When asked about the return window, all 10 samples say some variation of "returns are generally handled on a case-by-case basis." One cluster. The Bad Model is not uncertain; it has confidently shifted to a new, vaguer way of answering.

9 prompts had high entropy (above 0.7, multiple conflicting answers). The Bad Model gives materially different answers each time. On "What is your return window?", three samples said 30 days, three said 14 days, two said "varies by product," and two gave no specific timeframe. Four clusters. The Bad Model is confabulating: it no longer has reliable knowledge of the return policy and is guessing.

What this told us about the investigation

The 19 low-entropy prompts are not confabulating, but the Bad Model is reliably and consistently giving worse answers than the Good Model. The Good Model provides specific policy details (dates, amounts, links). The Bad Model consistently hedges with vague language. It does this the same way every time, which is why entropy is low. But the shift from specific, accurate answers to vague ones is precisely the problem. This is consistent with what token-level analysis already showed us: the Bad Model's confidence on factual content is eroding across the board.

The 9 high-entropy prompts are a step further gone. The Bad Model is actively confabulating: giving different answers each time because it no longer has any reliable source of information for those queries. If these prompts were previously grounded by a RAG pipeline that retrieved policy documents, a retrieval failure in the Bad Model would explain both findings. The 19 low-entropy prompts might be getting partial or degraded context (enough to produce a response, but not enough to be specific). The 9 high-entropy prompts might be getting no relevant context at all, leaving the Bad Model to guess.

We ran the same semantic entropy analysis on the Good Model as a control. All 28 prompts came back with entropy below 0.3: consistent, specific answers. The Good Model is not confabulating on any of them. This confirms the patch resolved both the systematic hedging and the confabulation.

Putting it together

We were instructed to investigate whether the Bad Model had genuinely degraded and whether the Good Model fixed it. The three techniques answered both questions.

The Bad Model has measurably degraded. Embedding drift flagged 28 of 150 prompts, all concentrated in policy queries. The outputs have visibly changed: where the Good Model gives specific dates, amounts, and links, the Bad Model hedges.

The damage is wider than the visible symptoms. Token-level analysis revealed that product specification queries, which passed embedding drift and look identical from both models on the surface, show significant confidence erosion in the Bad Model.

The drift has two distinct components. Semantic entropy split the 28 drifted policy prompts into 19 where the Bad Model has systematically shifted (confidently giving vaguer answers every time) and 9 where it is actively confabulating (different answer each time, no reliable knowledge). A RAG pipeline issue is the most likely common root cause: partial context for the 19, no context for the 9.

The Good Model resolves both issues. All three techniques confirm that the Good Model does not exhibit the drift, the confidence erosion, or the confabulation found in the Bad Model.

Put this into practice.

This article covers the methodology. If you want hands-on support delivering cybersecurity assessments or building this capability into your consultancy, let's talk.
