AnalystEngine (Internal R&D) · Cybersecurity Consulting

What We Learned Building an AI-Native Assessment Platform from Scratch

Challenge

Cybersecurity framework assessments are one of the most knowledge-intensive engagements in consulting. They depend on senior practitioners who can interpret evidence, map it to controls, and identify what is missing. We wanted to test how much of this workflow AI could handle reliably.

Outcome

A working AI-native assessment platform that processes client documents, extracts structured claims, maps evidence to framework controls, and generates targeted interview questions. Built over twelve months across cybersecurity, AI, and software engineering.

AI Product Engineering · LLM Architecture · Cybersecurity Assessment Methodology

Why We Built This

Most cybersecurity assessments follow the same pattern. A consultant collects documents, reads through them, maps findings to a framework, identifies gaps, writes interview questions, conducts interviews, and produces a report. The knowledge required to do this well is significant. The process itself is largely manual.

We wanted to know: how much of this process could AI handle reliably, and where does it fall short?

Not as a thought experiment. We built the platform, tested it against real assessment workflows, and shipped working software. What follows is a walkthrough of what the platform does, how it works, and what we learned building it.

Document Ingestion and Client Context

The first thing any assessor needs is context. Who is the client? What does their estate look like? What have they told us so far?

The platform ingests client documents (interview transcripts, policy documents, incident logs, internal notes) and synthesises an AI-generated client summary. This is not a simple extraction. The system reads across all uploaded material and builds a contextual picture of the organisation: their size, their technology stack, their security posture, and critically, where the gaps are.

Client Overview

The platform synthesises uploaded documents into a client summary, then generates follow-up questions targeting evidence gaps. In this example, it identified gaps in access control maturity, absence of centralised logging, and non-universal MFA across a 1,300-user Microsoft-centric estate.

From this summary, the platform generates contextually aware follow-up questions. These target areas where the evidence is thin or missing entirely. In the example above, the system identified four significant gaps:

  • How cybersecurity is governed at the organisational level
  • How security logging and monitoring works (or whether it exists at all)
  • The current network architecture and segmentation approach
  • Whether the organisation holds cyber insurance

Why This Matters

A consultant reviewing the same documents manually might miss the governance gap. The platform surfaces it automatically because it cross-references every document against what a complete assessment requires.

Alongside the narrative summary, the platform extracts structured client attributes from the source material: industry, headcount, primary identity provider, operating system estate, whether they use a managed service provider, and whether they have dedicated security resources.

Structured Data Extraction

Key client attributes are automatically extracted from uploaded documents. The system distinguishes between what is known and what remains unknown, directing the assessor towards targeted evidence gathering.

This structured extraction serves two purposes. First, it gives the assessor an immediate factual baseline without manually combing through transcripts. Second, and more usefully, it highlights what we do not yet know. The gaps in structured data become the starting point for follow-up interviews.
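The known/unknown split can be sketched as a simple data structure. This is an illustrative shape, not the platform's actual schema; the field names and the `unknown_attributes` helper are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical client attribute record; None means "not yet evidenced".
@dataclass
class ClientProfile:
    industry: Optional[str] = None
    headcount: Optional[int] = None
    identity_provider: Optional[str] = None
    os_estate: Optional[str] = None
    uses_msp: Optional[bool] = None
    dedicated_security_team: Optional[bool] = None

def unknown_attributes(profile: ClientProfile) -> list[str]:
    """Return the attribute names still missing from the evidence."""
    return [f.name for f in fields(profile) if getattr(profile, f.name) is None]

profile = ClientProfile(industry="Manufacturing", headcount=1300,
                        identity_provider="Microsoft Entra ID")
print(unknown_attributes(profile))
# The unknown fields become the agenda for follow-up interviews.
```

The point of the sketch is the second return value of the workflow: the gaps, not the facts, drive the next engagement step.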

Claim Extraction and Document Analysis

Once documents are uploaded, the platform processes every file and extracts individual claims: discrete factual assertions made within the source material.

In one typical engagement, uploading five documents produced 788 individual claims. Each claim is tagged by severity (High, Medium, Low) and linked back to its source location within the original document.
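A claim record of this kind needs three things: the assertion itself, a severity tag, and a traceable source span. The shape below is illustrative, not the platform's internal model.

```python
from dataclasses import dataclass

# Hypothetical claim record; field names are illustrative.
@dataclass(frozen=True)
class Claim:
    text: str             # the structured assertion
    severity: str         # "High" | "Medium" | "Low"
    source_doc: str       # originating file
    char_span: tuple      # (start, end) offsets for inline highlighting

def by_severity(claims):
    """Count claims per severity band, as in the Document Explorer summary."""
    counts = {"High": 0, "Medium": 0, "Low": 0}
    for c in claims:
        counts[c.severity] += 1
    return counts

claims = [
    Claim("Portilia is the new parent group of ACME", "Low",
          "transcript_1.txt", (120, 171)),
    Claim("MFA is bypassed on internal IPs", "High",
          "transcript_2.txt", (88, 119)),
]
print(by_severity(claims))
```

Keeping the character span on every claim is what makes the inline highlighting and source traceability described below possible.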

Document Explorer

Five uploaded documents yielded 788 extracted claims. Each document shows its individual claim count, and claims are categorised by severity: 23 High, 46 Medium, 41 Low.

The claim extraction works inline within the original transcript text. Hover over a highlighted passage and the platform shows you the structured claim it extracted, along with its reasoning.

Inline Claim Extraction

Claims are highlighted directly within the source transcript. Here, the passage 'Our new parent group, Portilia, is moving to ServiceNow' yields the extracted claim 'Portilia is the new parent group of ACME'. The assessor can see exactly where each claim originated.

This is where the platform starts saving serious time. In a traditional assessment, a consultant reads a 40-page transcript and takes notes. They might miss a claim buried on page 23 that contradicts something stated on page 6. The platform reads everything, extracts every claim, and makes all of them searchable and traceable.

What This Looks Like in Practice

An assessor uploads five files on a Monday morning. By the time they have made a coffee, the platform has extracted nearly 800 claims, categorised them by severity, and linked each one to its source. The assessor spends their time reviewing and validating claims rather than hunting for them.

Framework Mapping

Claims on their own are useful. Claims mapped to a compliance framework are actionable.

The platform takes the extracted claims and maps them against the target framework, in this example CMMC v2.0. The result is a clear picture of evidence coverage: which controls have supporting evidence, which do not, and how many claims support each control.

Framework Mapping (CMMC v2.0)

61 of 149 controls have supporting evidence (41%). Controls are organised by domain and level, with individual cards showing the control ID, description, and number of supporting claims. Assessors can filter to show only controls with evidence, only those without, or search for specific controls.

Each control card shows the control ID, a description, and the number of claims supporting it. The assessor can filter by domain (Access Control, Audit and Accountability, and so on), by level, or toggle between controls with evidence and those without.
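The coverage figure itself is a simple calculation once claims are linked to controls. A minimal sketch, assuming some upstream mapper (for example an LLM call) has already produced a claim-to-controls mapping; the control IDs follow the CMMC format but are used here purely as examples.

```python
from collections import defaultdict

def coverage(control_ids, claim_to_controls):
    """Return the covered controls and the coverage ratio."""
    support = defaultdict(int)
    for controls in claim_to_controls.values():
        for cid in controls:
            support[cid] += 1
    covered = [cid for cid in control_ids if support[cid] > 0]
    return covered, len(covered) / len(control_ids)

controls = ["AC.L1-3.1.1", "AC.L1-3.1.2", "AU.L2-3.3.1"]
mapping = {
    "claim-001": ["AC.L1-3.1.1"],
    "claim-002": ["AC.L1-3.1.1", "AU.L2-3.3.1"],
}
covered, ratio = coverage(controls, mapping)
print(covered, round(ratio, 2))
```

The per-control support counts also drive the claim count shown on each control card; the judgement about whether that support is sufficient stays with the assessor.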

This is where the platform's value becomes most tangible for consultancies. The mapping step in a traditional assessment is painstaking. An experienced consultant might spend days cross-referencing interview notes against a framework spreadsheet. The platform does this in minutes and provides full traceability back to source documents.

The assessor still makes the judgement call. The platform maps evidence to controls. It does not decide whether the evidence is sufficient. That decision, whether a control is adequately met, requires human expertise. The platform positions the consultant to make that judgement faster and with better information.

Targeted Interview Questions

Evidence gaps are only useful if you know what to do about them. The platform generates interview questions for every area where supporting evidence was missing or insufficient.

These are not generic questions pulled from a template. Each question is generated in context: the platform knows what evidence already exists, what the client has already told us, and where the specific gaps sit. The result is questions that are immediately usable in a stakeholder interview.
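The "generated in context" part can be sketched as prompt assembly: the found evidence and the missing evidence for a control are both packed into the request. The template wording and the function name below are hypothetical, not the platform's actual prompt.

```python
# Hypothetical prompt assembly for gap-targeted interview questions.
def build_question_prompt(control_id, found_evidence, missing_evidence):
    found = "\n".join(f"- {e}" for e in found_evidence) or "- none"
    missing = "\n".join(f"- {e}" for e in missing_evidence)
    return (
        f"Control {control_id}.\n"
        f"Evidence already gathered:\n{found}\n"
        f"Evidence still missing:\n{missing}\n"
        "Write one interview question that closes the gap, plus "
        "assessor guidance on what to listen for."
    )

prompt = build_question_prompt(
    "AC.L2-3.1.1",
    ["Separate admin accounts exist", "MFA bypassed on internal IPs"],
    ["Provisioning workflow", "Periodic access reviews"],
)
print(prompt)
```

Because the existing evidence is in the prompt, the model can avoid asking about things the client has already answered, which is what keeps the questions specific rather than templated.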

Contextually-Aware Interview Questions

Each generated question includes detailed assessor guidance: what evidence was found, what is still missing, and what to listen for during the interview. Here, the question targets end-to-end access management, noting that the platform found separate admin accounts and MFA bypassed on internal IPs, but needs deeper probing on provisioning workflows and access reviews.

Each question comes with assessor guidance notes. These explain what the platform already knows (for example, that separate admin accounts exist and MFA is bypassed on internal IPs with 60-day reauthentication), and what the assessor should probe further. This is the kind of preparation that typically takes a senior consultant hours. The platform produces it as a byproduct of its analysis.

Design Decision

We deliberately chose to generate guidance notes alongside questions, rather than just the questions themselves. A question without context forces the interviewer to go back and re-read source material. A question with context lets them walk into the room ready.

Semantic Search Across the Evidence Base

Framework assessments are rarely linear. An assessor reviewing access controls might suddenly need to know what the client said about endpoint detection three interviews ago. Traditional approaches mean searching through multiple documents manually.

The platform provides semantic search across all uploaded material. Search by concept, not just keyword. Type "edr" and the system returns every relevant claim across every document, ranked by relevance.
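Mechanically, concept search of this kind is usually embedding similarity: each claim is embedded once, the query is embedded at search time, and results are ranked by cosine similarity. The sketch below stubs the vectors with toy values so it is self-contained; in practice the vectors would come from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-in vectors; a real system embeds each claim with a model.
claim_vectors = {
    "ACME uses Microsoft Endpoint Protection": [0.9, 0.1, 0.0],
    "ACME uses Windows Defender on its Windows 11 devices": [0.8, 0.2, 0.1],
    "Portilia is the new parent group of ACME": [0.1, 0.9, 0.2],
}
query_vector = [1.0, 0.0, 0.0]   # pretend embedding of the query "edr"

ranked = sorted(claim_vectors.items(),
                key=lambda kv: cosine(query_vector, kv[1]),
                reverse=True)
for text, vec in ranked:
    print(f"{cosine(query_vector, vec):.2f}  {text}")
```

This is why "edr" matches claims that never contain the letters "edr": the endpoint protection claims sit near the query in embedding space, while the unrelated corporate-structure claim ranks last.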

Semantic Search

Searching for 'edr' surfaces claims like 'ACME uses Microsoft Endpoint Protection', 'ACME uses the Software Center application to provide the built-in endpoint protection', and 'ACME uses Windows Defender on its Windows 11 devices'. Each result shows the source document, relevance score, and severity rating.

Each search result shows the matched claim, its source document, the effective date, and a relevance score. Claims carry their severity badges (High, Medium) through to the search results, so the assessor can prioritise what to look at first.

This is one of the most powerful features in the platform. It turns the entire document corpus into a queryable knowledge base. When a client says something in an interview that contradicts earlier evidence, the assessor can verify it in seconds.

What We Learned Building This

Twelve months of building, testing, and iterating produced a platform that works. It also produced a set of hard-won insights that we did not expect at the outset.

What AI Is Genuinely Good At

The strongest results came from tasks that involve volume and cross-referencing. Reading hundreds of pages of transcripts and extracting structured claims is exactly the kind of work AI handles well. It does not get tired on page 38 and it does not forget what was said on page 4. Mapping those claims to framework controls across multiple documents and hundreds of data points is tedious for humans and fast for machines.

Gap identification was similarly strong. Once the system knows what "complete" looks like (a framework with full evidence coverage), identifying what is missing is straightforward. We initially tried to have the model also assess the severity of each gap, but the results were inconsistent. Severity depends on business context that the model does not have. We stripped that out and left severity assessment to the human assessor.

Contextual question generation surprised us. We expected the output to be generic, but because the model holds the full evidence set in context, the questions were specific and usable. The assessor guidance notes were an iteration on the first version, which generated bare questions with no context. Those were close to useless in practice.

Where We Hit Walls

Sufficiency is the hardest problem. A control might have three supporting claims. Whether those claims constitute adequate evidence requires professional judgement that the model cannot replicate. We tried several approaches to automated sufficiency scoring, including confidence thresholds and claim-count heuristics. None of them were reliable enough to ship. The assessor makes this call.

We also found that the model's framework mapping was only as good as the claim extraction. Early versions extracted too many low-quality claims, which created noise in the mapping. We spent significant time tuning the extraction pipeline to favour precision over recall. Fewer, higher-quality claims produced better mapping results than a larger volume of uncertain ones.
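Favouring precision over recall can be as blunt as a confidence cut-off on extracted claims before they reach the mapper. The threshold value and record shape below are illustrative; the platform's actual tuning was more involved than a single filter.

```python
# Hypothetical precision filter: drop low-confidence extractions
# before they add noise to the framework mapping.
def filter_claims(claims, threshold=0.8):
    return [c for c in claims if c["confidence"] >= threshold]

raw = [
    {"text": "MFA is bypassed on internal IPs", "confidence": 0.95},
    {"text": "The client may use some form of logging", "confidence": 0.45},
]
kept = filter_claims(raw)
print(len(kept))  # the hedged, low-confidence claim is discarded
```

The trade-off is deliberate: a missed claim can be recovered in an interview, but a spurious claim silently corrupts the coverage picture.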

The other area where AI falls short is narrative. Translating technical findings into board-level recommendations that drive action is a fundamentally human skill. We experimented with report generation and the output was technically accurate but tonally flat. It read like a compliance document, not a strategic recommendation. We stopped pursuing automated report generation entirely.

The Broader Insight

This platform sits at the intersection of cybersecurity, AI, and software engineering. Building it required deep expertise in all three. We had to understand the assessment workflow intimately to know what to automate. We had to understand AI's strengths and limitations to avoid building something that produced confident nonsense. We had to understand software engineering to ship a product that actually works.

The Takeaway

AI can meaningfully augment cybersecurity assessment delivery. The technology works. But the harder problem, the one most organisations underestimate, is redesigning the workflow around what AI is actually good at. That is an engineering challenge, not a procurement one.

This is what practitioner-led looks like.

The experience behind this platform informs every engagement we deliver. If you're working through where AI fits in your organisation, we've already done some of the hard thinking.