Arcitae Labs

What ISO Auditors Actually Asked About Our AI Systems

We were a few months into Klub’s ISO 27001 readiness work in late 2023 when an auditor asked a question that completely changed how I thought about AI systems:

“What happens if the model gives the customer the wrong answer?”

Until that point, most of our internal AI conversations had been about usefulness and speed. Could the assistant answer faster? Could it surface information more accurately? Could we automate more operations work?

The audit shifted the framing entirely.

The conversation became about customer safety, traceability, failure containment, data exposure, and operational risk. The auditors were not anti-AI. They were trying to understand whether these systems could fail safely without harming customers or destabilizing the business.

What made the experience particularly valuable was that I went through it twice, two years apart, at two very different generations of AI maturity.

The Klub ISO 27001 audit closed in December 2023. The AI in scope was internal-facing, built on infrastructure we had assembled by hand because the ecosystem hadn’t matured yet. LangGraph didn’t exist. OpenAI hadn’t added native function calling when we began building. The term “agentic” wasn’t widely used.

Two years later, from October through December 2025, I went through ISO 27001 and SOC 2 audits at Lucio AI in legal tech. By then the AI was customer-facing. Workflows were genuinely agentic. The tooling ecosystem had matured. The risks had changed shape entirely.

Same auditing standards. Different technical questions. Same underlying insight, which is what this piece is about.


The Klub Audit: Internal AI, Real Customer Data

At Klub, our first production AI system was internal-facing, not customer-facing. We deployed an assistant for the operations team to help them answer customer queries about repayment status, upcoming EMIs, and payment history. The Ops team talked to customers; the AI helped Ops talk to them faster and more accurately.

This was 2023. We built what we needed by hand: stitching together multiple LLM calls into multi-step workflows, calling our internal repayment APIs through structured functions, and grounding responses in the data those APIs returned. It was an agentic system before agentic frameworks existed.

When the ISO 27001 audit started, this was the system in scope. The auditor’s questions were specific:

“What customer data is going into the model?”

“What happens if the model gives Ops the wrong information about a customer’s repayment?”

“How do you know that information isn’t being stored or learned by the model?”

These weren’t hypothetical questions. The system was reading from production repayment data. If it told an Ops user the wrong outstanding balance, that Ops user might tell the customer the wrong thing. In lending, that’s the kind of mistake that creates regulatory exposure.

Our answer wasn’t “the model is accurate.” That answer wouldn’t have held up.

Our answer was that we grounded every response in API responses from our own systems, surfaced the underlying data alongside the AI-generated response so the Ops user could verify before speaking to the customer, and logged every query for review. The AI wasn’t the source of truth. The repayment system was. The AI was a translation layer between Ops users and the data, with the data always available for verification.
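
As an illustration only, here is a minimal Python sketch of that pattern: fetch the record from the system of record, generate the wording from it, return both together, and log the exchange. It is not the Klub implementation. `fetch_repayment`, `summarize`, `GroundedAnswer`, and the logger name are hypothetical stand-ins for whatever API client, LLM call, and logging setup a real system would use.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import Callable

logger = logging.getLogger("ops_assistant")  # hypothetical logger name


@dataclass(frozen=True)
class GroundedAnswer:
    query: str
    source_data: dict  # raw payload from the system of record, shown to the operator
    answer: str        # AI-generated wording derived from source_data


def answer_with_source(
    query: str,
    customer_id: str,
    fetch_repayment: Callable[[str], dict],  # hypothetical API client
    summarize: Callable[[str, dict], str],   # hypothetical LLM call
) -> GroundedAnswer:
    # The repayment system, not the model, is the source of truth.
    source = fetch_repayment(customer_id)
    # The model only rephrases data it was handed.
    text = summarize(query, source)
    result = GroundedAnswer(query=query, source_data=source, answer=text)
    # Log every exchange so a reviewer can later compare answer and data.
    logger.info(
        "ops_query %s",
        json.dumps({"customer_id": customer_id, **asdict(result)}, default=str),
    )
    return result
```

The design point is that the caller never receives `answer` without `source_data`, so the interface can always render the two side by side.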

That’s the framing the auditor accepted. Not “the AI doesn’t make mistakes.” Rather, “when it makes mistakes, the operator has the source data in front of them, the failure is visible, and the system is logged.”

That principle, the AI as translation layer rather than source of truth, shaped everything I built afterward.


The Lucio Audits: Customer-Facing AI, Probabilistic Workflows

By the time I joined Lucio in August 2025, the AI ecosystem had matured significantly. Models were stronger. Tooling had caught up. Workflows started becoming AI-native rather than traditional systems with AI bolted on.

We went through ISO 27001 and SOC 2 audits in October, November, and December 2025. The technical questions auditors asked were different from those at Klub, because what was being audited was different.

The risks had changed shape.

In earlier chatbot systems, a hallucination mostly affected a response. In AI-native workflows, probabilistic outputs started affecting deterministic systems. That’s a meaningfully different failure mode.

Fluency Is Not Quality

At Lucio, working on legal research workflows, we ran a structured comparison. For the same set of legal research queries, we tested an LLM-only baseline against a system where research agents queried a curated database of judgments and external legal sources, retrieved the relevant documents, and passed them to the LLM along with any user-uploaded files.

On a 1-to-5 quality scale graded by domain reviewers, the LLM-only system scored around 2. The retrieval-grounded version scored consistently between 4 and 4.5.

The LLM-only responses sounded fluent. They were often legally weak, occasionally fabricated case citations with confidence, and almost always too generic to be useful in real legal work. The retrieval-grounded version produced answers that were defensible because every claim could be tied back to a specific source.
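
One way to make "every claim could be tied back to a specific source" mechanically checkable is to compare the citations in an answer against the documents that were actually retrieved. The sketch below is an assumption-laden illustration, not Lucio's method: it assumes answers cite sources inline as bracketed IDs such as `[J-101]`, and the function names are hypothetical.

```python
import re

# Assumed citation format: bracketed document IDs, e.g. "[J-101]".
CITATION_PATTERN = re.compile(r"\[(?P<id>[A-Za-z0-9_-]+)\]")


def unsupported_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return cited IDs that were not in the retrieved set, in order of first use."""
    missing: list[str] = []
    for match in CITATION_PATTERN.finditer(answer):
        cited = match.group("id")
        if cited not in retrieved_ids and cited not in missing:
            missing.append(cited)
    return missing


def is_traceable(answer: str, retrieved_ids: set[str]) -> bool:
    """True only if the answer cites something and every citation was retrieved."""
    has_citation = CITATION_PATTERN.search(answer) is not None
    return has_citation and not unsupported_citations(answer, retrieved_ids)
```

A check like this catches fabricated citations but says nothing about whether a real citation supports the claim next to it. That still needs a reviewer who can read the judgment.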

Fluency is not quality. In regulated work, an answer that sounds confident but is legally wrong is more dangerous than an answer that admits uncertainty. That insight shaped how we approached every customer-facing AI workflow at Lucio.

Evaluation by People Who Could Tell the Difference

Evaluation was not something engineers could do alone. We had access to a sizable group of lawyers with practice experience, both in-house counsel and a small set of early-adopter customers who participated in evaluation. They reviewed AI responses for legal correctness, not just fluency. Engineers running the same evaluation would have caught grammatical drift but missed the kind of legally-incorrect-but-confident-sounding response that creates real risk in a legal workflow.

Before any major prompt change or workflow shift went to production, the process was: reviewers tested the change against a set of representative queries, captured feedback in shared documents, and the Product Head and Founder signed off on the change before it went live.

The process was real. The artifacts were lightweight. We tracked feedback in Google Docs rather than a formal evaluation tool, and the rubric was reviewer judgment rather than a structured scorecard. That was an honest gap. At our scale, with deeply domain-expert reviewers, it worked. In a more demanding compliance regime or at higher scale, it would not have held up.

When Structured Outputs Aren’t Structured

One concrete example of the new failure mode: structured outputs.

We instructed models to return JSON because downstream workflows depended on structured responses. In testing, this worked. In production, it sometimes didn’t. A prompt that had produced valid JSON in every test run would return malformed JSON at the moment a customer asked the question. Or the response would parse correctly, but a downstream service couldn’t handle a field whose shape had drifted. In some cases, customers saw raw JSON appear on their screen instead of a formatted answer. In others, customers saw blank responses where the workflow had silently failed.

These weren’t theoretical risks. They were incidents we shipped to production and had to recover from. Each one was logged, root-caused, mitigated, and documented. The audit trail of “here is what failed, here is what we changed, here is the evidence the change works” is what auditors actually wanted to see.

What we built in response was layered: schema validation at the LLM boundary, retries with exponential backoff, fallback paths when retries exhausted, alerting that paged on-call when failures crossed a threshold, and human review checkpoints for the workflows where a wrong answer mattered most. Most of this didn’t exist when I joined Lucio. We built it as we found out where the failure modes lived.
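
The first three layers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Lucio code: the schema in `REQUIRED_FIELDS`, the `call_model` callable, and the shape of the fallback payload are all hypothetical.

```python
import json
import time
from typing import Callable, Optional

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list}


class InvalidModelOutput(ValueError):
    """Raised when model output is not JSON or does not match the schema."""


def parse_and_validate(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidModelOutput(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidModelOutput("top-level value is not an object")
    for name, expected in REQUIRED_FIELDS.items():
        # Catches both missing fields and fields whose shape has drifted.
        if not isinstance(data.get(name), expected):
            raise InvalidModelOutput(f"field {name!r} missing or wrong type")
    return data


def call_with_validation(
    call_model: Callable[[], str],  # hypothetical: returns raw model text
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    fallback: Optional[dict] = None,
) -> dict:
    for attempt in range(max_attempts):
        try:
            return parse_and_validate(call_model())
        except InvalidModelOutput:
            # Exponential backoff: base_delay, 2x, 4x, ... between attempts.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Retries exhausted: return an explicit degraded response, never raw JSON
    # or a blank screen.
    if fallback is not None:
        return fallback
    return {
        "answer": "We could not generate a response. Please try again.",
        "citations": [],
        "degraded": True,
    }
```

Only validation failures are retried here. Errors raised by `call_model` itself (timeouts, rate limits) would need their own handling, and the `degraded` flag is there so downstream code and dashboards can tell a fallback from a real answer.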

The lesson is one I now repeat often: AI systems pass tests and still fail in production in ways traditional software doesn’t. Test passing is necessary but it’s a much weaker signal than it is for deterministic systems. Operational maturity around AI is fundamentally about having the controls in place to detect and contain failures the tests didn’t predict.
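
Detection in practice often comes down to something as plain as a rolling failure rate with a threshold. The sketch below shows one possible shape for that control. The class name, window size, and threshold values are illustrative assumptions, and the actual paging integration is left out.

```python
from collections import deque


class FailureRateMonitor:
    """Tracks the most recent outcomes and signals when failures exceed a threshold."""

    def __init__(self, window: int = 50, threshold: float = 0.1, min_samples: int = 10):
        self._outcomes: deque = deque(maxlen=window)  # old outcomes fall off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, ok: bool) -> bool:
        """Record one outcome; return True if an alert should fire."""
        self._outcomes.append(ok)
        # Avoid alerting on the first one or two requests after a deploy.
        if len(self._outcomes) < self.min_samples:
            return False
        failures = sum(1 for outcome in self._outcomes if not outcome)
        return failures / len(self._outcomes) > self.threshold
```

A caller would invoke `record(False)` whenever a response fails validation or falls back to a degraded path, and page on-call when it returns `True`.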

Tenant Isolation as a Trust Boundary

Geographic separation was the baseline. We maintained separate deployments per region primarily because of data residency requirements, both at Klub and later at Lucio. Auditors verified the boundaries were real, not just labels.

Tenant isolation followed a tiered model. The default for managed customers was shared infrastructure with logical separation, which is how most SaaS works. For a small set of high-value enterprise customers who required and were willing to pay for it, we offered stronger boundaries: dedicated Azure subscriptions for some, and in a few cases, deployments running inside the customer’s own cloud account.

The tiered approach mattered for two reasons. It made the architecture honest about what isolation customers were actually getting at each price point, instead of marketing every customer as “isolated.” And it made compliance reviews much easier, because we could point to the specific isolation level for each customer rather than waving our hands at “logical separation across the board.”
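
Being able to "point to the specific isolation level for each customer" is easier when the tier is recorded as data rather than tribal knowledge. A minimal sketch, assuming a simple in-code registry; the type names, tenant IDs, and region strings are hypothetical, while the three tiers mirror the ones described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class IsolationTier(Enum):
    SHARED_LOGICAL = "shared infrastructure, logical separation"
    DEDICATED_SUBSCRIPTION = "dedicated Azure subscription"
    CUSTOMER_CLOUD = "deployment inside the customer's cloud account"


@dataclass(frozen=True)
class TenantDeployment:
    tenant_id: str
    region: str          # data residency region of the deployment
    tier: IsolationTier


def isolation_report(deployments: Iterable[TenantDeployment]) -> dict[str, str]:
    """One line per tenant stating where it runs and how it is isolated."""
    return {d.tenant_id: f"{d.region}: {d.tier.value}" for d in deployments}
```

With a registry like this, a compliance review can start from a generated report instead of a verbal description of the architecture.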

Isolation is not just an infrastructure decision. It’s a trust boundary, and at enterprise scale, it’s a pricing decision too.


What Auditors Actually Cared About

Across both audits, the consistent theme surprised me.

Auditors cared far less about how intelligent the models were than whether the systems could fail safely.

Most engineering teams naturally focus on quality, latency, automation, and capabilities. Auditors focused on evidence, deployment controls, customer safety, containment, and operational accountability. That mindset initially felt restrictive. Over time, I realized it was the right mindset.

Evidence Was the Whole Game

Operational evidence wasn’t an afterthought. Prompt changes lived in Git, with commits linked to Linear cards so every change had a traceable approval and ticket history. Incidents were logged with root cause, mitigation, and what changed in the system afterward. Evaluation results, even when captured in lightweight artifacts like shared documents, were retained.
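
The commit-to-ticket link is the kind of convention that is cheap to enforce automatically. As a hedged sketch: the check below assumes ticket IDs look like `ENG-123` (an uppercase team key, a hyphen, a number) and that commits arrive as `(sha, message)` pairs. Both are assumptions for illustration, as are the function and pattern names.

```python
import re

# Assumed ticket ID format: uppercase team key, hyphen, number (e.g. "ENG-123").
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def commits_missing_ticket(commits: list[tuple[str, str]]) -> list[str]:
    """Return the SHAs of commits whose message contains no ticket reference."""
    return [sha for sha, message in commits if not TICKET_PATTERN.search(message)]
```

Run in CI against the commits that touch prompt files, a non-empty result can fail the build. The pattern is deliberately loose and will also match strings such as `UTF-8`, so a stricter team-key allowlist may be worth adding.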

What surprised me was how thoroughly the auditors used those artifacts. They didn’t ask “do you have evidence?” and accept a yes. They asked specific questions tied to specific artifacts: this commit, that incident, this approval flow. The audit became a conversation about how the system actually behaved, not whether we had the right policies on paper.

That experience changed how I think about engineering documentation. The artifacts you produce in the normal course of building a system (Git history, ticket links, incident logs, evaluation notes) are usually most of what an auditor needs. The work isn’t in producing extra documentation for compliance. The work is in making sure the documentation you already produce is connected, searchable, and tells a coherent story when read together.

The Question That Stayed With Me

Across both audits, auditors kept pushing toward the same question, in different forms:

“If this system behaves unexpectedly, how do you contain the blast radius?”

That framing stayed with me long after the audits closed.

The problem with production AI systems is not that they make mistakes. Every system makes mistakes. The real question is whether the organization can detect those failures early, contain them safely, and prevent customer harm when they happen.

That is what operational maturity around AI actually looks like.


Two Audits, Two Eras, One Thesis

The Klub audit in December 2023 and the Lucio audits from October to December 2025 sat on opposite sides of a real shift in AI maturity. The technical questions changed. The infrastructure looked different. The failure modes evolved.

The thesis didn’t change.

In regulated environments, AI systems are not judged on intelligence. They are judged on evidence, containment, accountability, and operational discipline.

That was true for an internal Ops assistant calling repayment APIs in 2023. It was true for customer-facing legal research agents with structured outputs in 2025. It will be true for whatever shape AI takes next.

If you’re building AI systems in FinTech, legal tech, healthcare, or any compliance-sensitive environment, the muscles that actually matter are not the ones most engineering teams build first. Treat prompt changes like production code. Ground customer-facing responses wherever possible. Maintain evidence of reviews, incidents, and mitigations as a normal byproduct of how you work, not as a compliance afterthought. Design tenant isolation intentionally. Involve domain experts in evaluation, not just engineers.

Most importantly: do not assume model intelligence alone creates trust.

Operational discipline does.

