QRefAI
Contents
AI Governance

How Do You Prove Your Agents Are Governed?

Regulators and auditors want proof our agents are governed. What do we actually hand them?

7 min · Updated June 2026

Article 5 of 6 — “Governing AI Agents in the Enterprise: A Practical Architecture Guide”

This article covers accountability governance: audit trails, compliance evidence, runtime content guardrails, and how to draw a clean line between tools so nothing falls through the gaps.

Q5.1 -- Regulators and auditors want proof our agents are governed. What do we actually hand them?

The pattern: Tamper-evident audit and compliance evidence.

Diagram of a tamper-evident audit chain — cryptographically chained and signed append-only log entries mapping agent actions to compliance controls

Having controls is not the same as being able to proveyou had them. When a regulator, an auditor, or a court asks “what did this agent do, who authorized it, and what controls were in place” — the answer must be evidence that is complete, attributable, and demonstrably un-tampered. Ordinary application logs and even observability traces are not built for this. They can be altered. They are not mapped to control frameworks.

The Microsoft Agent Governance Toolkit produces an append-only audit log in which each entry is cryptographically chained and signed — so any later alteration or deletion is detectable. Every policy decision, identity assertion, and significant action is recorded. The toolkit’s compliance module verifies coverage against the OWASP Agentic Security Initiative top-ten and maps controls to regulatory frameworks including the EU AI Act, HIPAA, and SOC 2 — emitting signed attestations that can be produced as evidence.

A verification step runs in CI on every release, so the compliance evidence is continuously refreshed rather than assembled in a panic before an audit.

It is important to keep roles distinct: observability traces are for debugging and analysis — they are not tamper-evident. The governance audit log is the system of record for compliance. The two serve complementary purposes and must not be confused.

Diagram comparing observability traces for debugging versus tamper-evident governance audit logs as the system of record for compliance

Real-world examples

Banking under the EU AI Act -- credit-decisioning support agent

The regulator asks the bank to demonstrate governance of an AI system that influences credit decisions. The bank produces signed attestations mapping its controls to the EU AI Act, plus a cryptographically verifiable audit trail showing, for each decision, which policy applied and where a human signed off.

Healthcare under HIPAA -- clinical-support agent

During a compliance review, the provider must show that every access to protected health information by the agent was authorized and logged. The cryptographically chained audit log provides a complete, tamper-evident record, and the compliance mapping demonstrates alignment with HIPAA safeguards.

Public company under SOC 2 -- internal operations agent

An external SOC 2 auditor needs evidence that automated agents operate within defined controls. The continuously generated, signed compliance attestations and the verifiable audit log become direct audit artifacts — shortening the audit and reducing findings.

Diagram of the compliance evidence pipeline — from agent actions through tamper-evident audit logs to signed attestations for regulators and auditors

Q5.2 -- What about prompt injection, jailbreaks, and PII leaking out of the model -- do these tools cover that?

The pattern: Closing the runtime content-guardrails gap — a known limitation.

Diagram illustrating the runtime content-guardrails gap — the area not covered by evaluation, observability, and governance tools alone

This question is included because honesty about architecture limits is part of good design. The three core tools in this stack — evaluation, observability, governance — do not fully cover runtime content guardrails: inspecting the actual text flowing in and out of the model on every live request to block prompt injection, detect jailbreak attempts, or filter personally identifiable information in real time.

  • DeepEval can test for these weaknesses before release.
  • The policy engine can block actions triggered by these attacks.
  • But neither inspects and filters live model content on every turn.

This gap must be filled explicitly. It is not a deficiency to ignore. It is handled in two parts:

Pre-production (red-teaming). A red-teaming capability runs adversarial attack suites — prompt-injection, jailbreak, data-exfiltration probes — as part of CI. This catches systematic weaknesses before release. The governance toolkit also offers red-team scanning of prompts.

Runtime (a fourth, lightweight component). Add a dedicated content-guardrail component — for example, an open-source guardrails library or a safety classifier model — at the agent framework’s model callbacks. It inspects each prompt before it reaches the model and each response before it reaches the user or a tool, blocking or redacting as needed.

Comparison of DeepEval and Langfuse roles — DeepEval for pre-production evaluation and red-teaming, Langfuse for production observability and prompt management

Real-world examples

Banking -- customer-service agent

A customer message contains hidden instructions attempting to make the agent reveal another account’s details. The runtime guardrail detects the injection pattern and strips it before the model ever processes it. The pre-production red-team suite ensures this class of attack was tested against every release.

Healthcare -- symptom-checker agent

Before any model response is shown to a patient, the runtime guardrail scans for and redacts any PII that should not appear, and blocks responses that drift into definitive diagnosis the agent is not permitted to give.

Retail -- conversational shopping agent

Adversarial users try jailbreak prompts to extract internal pricing logic or discount rules. Red-team testing in CI quantifies the agent’s resistance release over release; the runtime guardrail blocks live jailbreak attempts in production.

Q5.3 -- DeepEval does evaluation and so does Langfuse. We don't want two teams building the same thing. Who owns what?

The pattern: Separation of concerns — the evaluation engine versus the data plane.

Both DeepEval and Langfuse offer evaluation features. Left unmanaged, two teams build overlapping pipelines, ownership blurs, and dashboards disagree. The pattern is to draw a hard boundary based on what each tool does best, and enforce it.

  • DeepEval is the evaluation engine. It owns the research-grade metric library, multi-turn simulation, synthetic dataset generation, red-teaming, and — above all — the pre-production CI gate. Anything that decides whether a build ships belongs here.
  • Langfuse is the data plane. It owns trace storage, dashboards, prompt management, canonical datasets, human annotation queues, and lightweight online evaluation — cheap LLM-as-judge checks on production traffic for things like tone and format.
  • They cooperate through a documented interface. DeepEval pushes its metric scores onto corresponding Langfuse traces. The result: rich, metric-specific scores from the purpose-built evaluation engine, surfaced inside the single observability pane teams already use.

The rule of thumb: the CI quality gate is DeepEval; the production trace store and dashboards are Langfuse. Do not let either drift into the other’s lane.

Real-world examples

B2B SaaS platform company

The platform team mandates the split in internal golden-path documentation: feature teams write DeepEval tests for their agents (CI gate) and instrument with Langfuse for tracing. No team builds a bespoke evaluation pipeline, and every agent’s scores appear in the same Langfuse dashboards.

Media and publishing

An editorial-assistant agent is evaluated for factual accuracy and style. Deep, expensive accuracy checks run in CI via DeepEval; cheap, broad style-and-tone checks run continuously on production traffic via Langfuse. The two layers are complementary, not duplicative.

Fintech scale-up

As the team grows, the documented boundary prevents the classic failure where the platform team and a product team independently build evaluation tooling. One interface — DeepEval scores flowing into Langfuse — keeps a single source of truth.

Q5.4 -- Leadership said 'use Microsoft's agent governance,' but we're multi-cloud and not on Azure. Is that even possible?

The pattern: Distinguish the product from the framework — and choose the cloud-agnostic one.

Diagram distinguishing cloud-tied agent governance products from cloud-agnostic open frameworks for multi-cloud deployments

This is a common and consequential source of confusion. “Microsoft agent governance” refers to two different things from two different parts of Microsoft, and conflating them leads to an architecture that violates a multi-cloud or sovereignty requirement.

  • Microsoft Entra Agent ID and Microsoft Agent 365 are commercial SaaS offerings. They are directory- and platform-bound to Microsoft Entra and Microsoft 365. They are excellent if your organization lives in Microsoft 365 — but they are not cloud-agnostic.
  • The Microsoft Agent Governance Toolkit (AGT) is a separate, open-source (MIT-licensed) project: a set of governance libraries — policy engine, identity, runtime sandboxing, reliability, compliance — that install into your application and run anywhere: any cloud, on-premises, hybrid, air-gapped. It is framework-agnostic by design.

For a cloud-agnostic mandate, adopt AGT and decline Entra Agent ID and Agent 365. One caveat: some of AGT’s documentation tutorials default to Azure deployment examples. Use the framework-agnostic APIs, which carry no cloud dependency.

Real-world examples

Multi-cloud regulated bank

The bank runs workloads across two clouds plus on-premises, and forbids new single-cloud dependencies. AGT satisfies the governance mandate without violating the policy; Entra Agent ID would have introduced exactly the cloud lock-in the architecture board prohibits.

Government agency with data-sovereignty requirements

The agency must run agents in a sovereign or air-gapped environment where Microsoft 365 SaaS is not available. AGT, being self-hostable libraries, runs inside the sovereign boundary.

Hybrid healthcare network

Clinical systems remain on-premises while other workloads are in a public cloud. AGT provides one consistent governance layer across both — whereas a cloud-bound SaaS governance product would leave the on-premises clinical agents ungoverned.

The picture so far

Across the last two articles, the governance layer has addressed:

CapabilityQuestion
Preventive action enforcementQ5 — Policy engine
Agent attribution and trustQ6 — Cryptographic identity
Execution containmentQ7 — Sandboxing
Cascading failure and emergency stopQ8 — Reliability module
Tamper-evident accountabilityQ9 — Audit and compliance evidence
Live content threatsQ10 — Runtime guardrails (fourth component)
Tool ownership boundariesQ11 — DeepEval vs Langfuse split
Cloud-agnostic governanceQ12 — AGT vs Entra / Agent 365

The final article shows how to connect all three disciplines into a single system — and gives a sequenced adoption plan so you know what to build first.