
AI Governance Review and Pressure Testing in Q2 2026

Most AI governance models are still judged too gently. They are reviewed as documents, approved as frameworks and filed as evidence of diligence, yet many are never tested as operating systems that must hold under stress. That distinction matters far more now than it did even a year ago. NIST’s AI Risk Management Framework is explicitly about incorporating trustworthiness into the design, development, use and evaluation of AI systems, while Stanford HAI’s 2025 AI Index warns that AI-related incidents are rising sharply even as standardised responsible AI evaluations remain rare among major industrial model developers. In other words, the industry is producing more governance language than operational proof, and the gap is starting to show. [1] [2] [3]

Paper governance versus execution governance

A review can tell an organisation whether a governance model is coherent, policy-aligned and internally consistent. It can show that roles are defined, controls are mapped, escalation paths exist on paper and decision rights have been allocated. That work is necessary, but it is not the same as proving that the model will survive contact with live conditions. NIST’s Playbook makes the point indirectly but clearly: it is neither a checklist nor a fixed set of steps, because real-world AI risk management depends on context, actors and system behaviour over time. The companion GenAI Profile extends that logic to generative systems, stressing that risk management has to cover design, development, use and evaluation across the lifecycle. That is why governance for higher-consequence AI cannot stop at design review. It has to ask what still holds when the system is working with stale or partial data, when dependencies drift, when a human approver is unavailable, when authority signals conflict, when the model output is technically valid but operationally mistimed, or when an upstream control passed hours earlier no longer reflects the state of the world at execution. [2] [5]
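
To make the last of those conditions concrete, the sketch below shows one way an execution step could revalidate an upstream approval at the moment of action rather than inheriting it. It is a minimal illustration in Python; the Approval structure, the fifteen-minute freshness window and the hold-for-review fallback are assumptions made for the example, not a pattern prescribed by NIST or any other source cited here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Illustrative only: names, thresholds and the fallback are assumptions, not a standard.
APPROVAL_MAX_AGE = timedelta(minutes=15)  # how long an upstream approval stays valid


@dataclass
class Approval:
    approver: str
    granted_at: datetime
    scope: str  # what the approval actually authorised, e.g. "payment:refund"


class HoldForReview(Exception):
    """Safe default: the action is parked for a human, not silently executed or dropped."""


def execute_with_revalidation(approval: Approval, requested_scope: str,
                              world_state_fresh: bool, act) -> str:
    """Re-check the assumptions behind an earlier approval at the moment of action."""
    now = datetime.now(timezone.utc)

    # 1. The approval may have been granted hours ago; check it is still fresh.
    if now - approval.granted_at > APPROVAL_MAX_AGE:
        raise HoldForReview("approval is stale; re-authorisation required")

    # 2. The approval must cover the action actually being taken now.
    if approval.scope != requested_scope:
        raise HoldForReview("approval scope does not match the requested action")

    # 3. The system's view of the world must be current, not inherited from an earlier stage.
    if not world_state_fresh:
        raise HoldForReview("dependent state could not be revalidated; defaulting to hold")

    return act()


# Example: an approval granted two hours earlier no longer authorises execution now.
stale = Approval("j.smith", datetime.now(timezone.utc) - timedelta(hours=2), "payment:refund")
try:
    execute_with_revalidation(stale, "payment:refund", world_state_fresh=True,
                              act=lambda: "funds released")
except HoldForReview as reason:
    print(f"Action held: {reason}")
```

The design choice that matters is the failure mode: when any assumption behind the earlier approval cannot be re-established, the action is held rather than allowed to proceed on stale authority.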

Why pressure testing has become a board-level issue

The commercial context has changed. McKinsey’s 2025 global survey found that 88% of respondents said their organisations were regularly using AI in at least one business function, but only about one-third had begun scaling their AI programmes at enterprise level. The same survey found that 23% were already scaling an agentic AI system somewhere in the enterprise and a further 39% were experimenting with agents, while 51% of respondents at organisations using AI said they had seen at least one negative consequence from AI use. High performers were markedly more likely to have defined processes for deciding how and when model outputs require human validation. Deloitte, meanwhile, forecast that 25% of enterprises already using generative AI would deploy AI agents in 2025, rising to 50% by 2027. The pattern is hard to miss: autonomy is advancing faster than most operating models are maturing. For boards and investors, that means deployment risk now sits less in whether an organisation has an AI policy and more in whether its controls still work at the moment an AI system is allowed to act. [4] [7]

Pressure testing is about broken assumptions, not theatrical catastrophe

There is sometimes a tendency to frame pressure testing as dramatic red-teaming for frontier labs alone. That is too narrow. In practice, the most useful pressure tests are often sober, architectural and uncomfortable rather than cinematic. They examine what happens when the assumptions that made a governance model look clean in review no longer hold. Can authority still be established at execution? Can the commit path still be controlled after a delay? Are pre-action checks revalidated at the point of action, or merely inherited from earlier workflow stages? Is escalation genuinely reachable inside the time window in which it matters? Are there safe defaults if a dependency fails, a queue backs up or a connected system returns contradictory state information? These are not edge cases in any serious operating environment. They are routine conditions. The UK AI Security Institute’s 2025 Frontier AI Trends Report makes this practical concern harder to dismiss. It reports that model safeguards are improving, but also says vulnerabilities were found in every system tested. It further notes that success rates on its self-replication evaluations rose from 5% to 60% between 2023 and 2025, and that AI agents are increasingly being entrusted with high-stakes activities such as asset transfers. The lesson is not panic. It is that technical capability growth and partial safeguard improvement do not remove the need to test whether governance survives under pressure. [6]
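
Those questions can be turned into repeatable, executable checks. The sketch below is a deliberately small illustration of that idea in Python; the decision function, the escalation window and the "hold" fallback are assumptions for the example, not a test suite drawn from any of the frameworks discussed here.

```python
# Minimal scenario harness: break one assumption at a time and assert the safe behaviour.
# The names, the two-minute window and the 'hold' fallback are illustrative assumptions.

ESCALATION_WINDOW_S = 120  # time within which escalation must resolve for this action class


def decide(approver_responded_after_s, dependency_ok=True, window_s=ESCALATION_WINDOW_S):
    """Allow an action only when escalation resolved inside the window and dependencies held;
    otherwise fall back to a safe hold rather than acting on inherited assumptions."""
    if not dependency_ok:
        return "hold"
    if approver_responded_after_s is None or approver_responded_after_s > window_s:
        return "hold"
    return "proceed"


# Scenario 1: the approver never responds -> the system must hold, not proceed silently.
assert decide(approver_responded_after_s=None) == "hold"

# Scenario 2: the approver responds, but only after the window has closed -> still hold.
assert decide(approver_responded_after_s=300) == "hold"

# Scenario 3: a dependency fails even though approval arrived in time -> hold.
assert decide(approver_responded_after_s=30, dependency_ok=False) == "hold"

# Scenario 4: every assumption holds -> the action may proceed.
assert decide(approver_responded_after_s=30) == "proceed"
```

Each scenario breaks exactly one assumption and asserts that the system falls back to a safe state, which is the sober, architectural form of pressure testing described above.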

What this changes for customers, boards, partners and operators

For customers and end users, the value of review plus pressure testing is straightforward: it increases the odds that controls affecting them are real at the point of execution rather than merely promised in documentation. That matters when an AI system influences access, pricing, claims, triage, eligibility or any other decision where silent failure can be costly and difficult to reverse. For boards and investors, pressure testing is a way to surface structural fragility before it becomes a scale problem, a valuation problem or a public trust problem. McKinsey’s survey data make that point more concrete, because the market is already deploying AI widely while many organisations remain in pilot mode and a majority have experienced at least one negative consequence. For partners and ecosystem participants, the benefit is clarity across system boundaries: who owns which control, who can interrupt a bad action, who must be notified, and what happens when one party’s system state invalidates another party’s assumption. OECD’s incident-monitoring work is especially relevant here because it is built around understanding harms across contexts, actors and sectors rather than treating each failure as a local anomaly. For internal teams, pressure testing replaces wishful thinking with evidence. It shows where escalation will fail in practice, where authority is too diffuse, where dependencies are too brittle and where autonomy must be restricted. [4] [11] [12]

The regulatory signal is moving from policy artefacts to operational evidence

The legal and policy landscape is also moving in the same direction, even if not always in those exact words. The European Commission describes the EU AI Act as the first comprehensive legal framework on AI worldwide and frames it around a risk-based approach, with prohibited practices already applicable from 2 February 2025, obligations for general-purpose AI models applicable from 2 August 2025 and the main rules becoming fully applicable from 2 August 2026, with some high-risk product rules extending to 2027. More importantly for this discussion, the Act does not stop at abstract principles. The Commission and the AI Act Service Desk make clear that high-risk AI obligations include risk management, logging, data governance and human oversight, and that human oversight must be effective while the system is actually in use. Deployers are expected to assign natural persons with the competence, training, authority and support to supervise these systems in context. That is a notable shift. It means governance is being judged less by whether an organisation can describe control and more by whether it can exercise control in operation. [8] [9]

Incident reporting makes the execution point impossible to ignore

The same trend appears in incident management. In September 2025, the European Commission published draft guidance and a reporting template for serious AI incidents, explaining that under the EU AI Act providers of high-risk AI systems will be required to report serious incidents to national authorities from August 2026 and that the obligation is designed to detect risks early, ensure accountability, enable quick action and build public trust. OECD has taken a parallel route. Its AI Incidents and Hazards Monitor exists to document incidents and hazards from public sources so policymakers and practitioners can see risk patterns over time, while its 2025 common reporting framework offers 29 criteria to help stakeholders understand incidents across contexts, identify high-risk systems, assess emerging risks and evaluate impacts on people and the planet. The OECD also launched the Hiroshima AI Process reporting framework in February 2025 as the first global framework for companies to provide comparable information on their AI risk-management actions, including risk assessment, incident reporting and information-sharing mechanisms. Taken together, these initiatives reinforce a simple point: once a system is in the world, governance is judged by traceability, responsiveness and corrective capacity, not by the elegance of the original policy deck. [10] [11] [12] [13]

The argument against overdoing it, and why that argument partly succeeds

There is a serious counterargument here, and it should not be brushed aside. Pressure testing can become bureaucratic theatre. It can be expensive, slow and badly targeted. Teams can end up simulating implausible disasters while missing mundane control failures. Leaders can also mistake a successful pressure test for proof of safety, when in fact it is only evidence that a system survived a particular set of scenarios. NIST’s own language is a useful warning: the Playbook is not a checklist, and its suggestions are voluntary and context-dependent. The EU AI Act is risk-based for a reason. Not every AI system warrants the same testing burden, and low-risk internal copilots should not automatically inherit the governance overhead of systems that can release funds, alter entitlements, route care, settle claims or trigger operational change. So the case for pressure testing is not a case for maximalism. It is a case for proportionality. Review should remain the baseline. Pressure testing should intensify as consequence, autonomy, irreversibility and cross-system dependency increase. That measured view, rather than an absolutist one, is the defensible position. [2] [8]
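
Proportionality is easier to defend when it is written down rather than decided case by case under deadline pressure. The sketch below is one illustrative way to express such a rule; the four factors echo the criteria named above, but the scoring scale, thresholds and tier labels are assumptions for the example, not taken from the NIST Playbook or the AI Act.

```python
# Illustrative proportionality rule: pressure-testing intensity scales with consequence,
# autonomy, irreversibility and cross-system dependency. Each factor is scored 0-3;
# the thresholds and tier labels below are assumptions, not a published standard.

FACTORS = ("consequence", "autonomy", "irreversibility", "cross_system_dependency")


def testing_tier(scores: dict) -> str:
    """Map factor scores (0-3 each) to a review and pressure-testing tier."""
    total = sum(scores[factor] for factor in FACTORS)
    if total <= 3:
        return "review only"                        # e.g. a low-risk internal copilot
    if total <= 7:
        return "review + targeted pressure tests"   # break the one or two assumptions that matter most
    return "review + full pressure testing"         # systems that can release funds, settle claims, etc.


internal_copilot = {"consequence": 1, "autonomy": 1, "irreversibility": 0, "cross_system_dependency": 1}
claims_settlement_agent = {"consequence": 3, "autonomy": 3, "irreversibility": 2, "cross_system_dependency": 3}

print(testing_tier(internal_copilot))         # review only
print(testing_tier(claims_settlement_agent))  # review + full pressure testing
```

A rule like this does not remove judgement, but it makes the basis for escalating or relaxing the testing burden explicit and reviewable.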

Why higher-consequence systems are different

Still, once the consequences become difficult to unwind, the tolerance for paper governance should fall sharply. In a higher-consequence system, the critical question is not whether an approval existed upstream but whether the system was still governable at the moment of action. If a model releases money, grants access, blocks a service, changes a medical or insurance workflow, or triggers a consequential operational step, the window for correction may already have closed by the time audit and review catch up. At that stage, post hoc analysis is useful for accountability and learning, but it is no substitute for ex ante control. This is precisely why the UK government’s 2025 response on the cyber security of AI emphasised that security must be built in across the AI lifecycle, identified clear and specific risks to AI models and systems throughout that lifecycle, and positioned its Code of Practice as the basis for a global standard on baseline security requirements. Pressure testing sits naturally within that secure-by-design logic. It is how an organisation verifies that governance claims are enforceable where they matter most: at the execution edge, under time pressure, with imperfect information and partial failure already in play. [14]

My view

Model review and pressure testing should be treated as complementary disciplines, not competing ones. Review answers whether a governance model is coherent, complete and aligned to policy, law and internal risk appetite. Pressure testing answers whether that model still holds when timing slips, data quality decays, responsibilities fragment, dependencies drift and humans do not respond on cue. The first protects against conceptual weakness. The second protects against operational illusion. Organisations that stop at review will often believe they have a governable AI system when they really have a plausible governance narrative. Organisations that do both are more likely to discover where autonomy should be constrained, where controls must move closer to execution, where human oversight is performative rather than effective and where the right decision is not to proceed at all. That is not anti-innovation. It is what responsible scaling looks like when AI systems are beginning to act with greater reach, speed and consequence. [4] [6] [9]

Summary

AI governance is maturing beyond the document phase. Frameworks from NIST, monitoring initiatives from OECD, technical evaluations from the UK AI Security Institute, enterprise evidence from McKinsey and Deloitte, and the execution-focused obligations emerging under the EU AI Act all point in the same direction. Review remains essential, but it is not enough for systems whose outputs can create real-world consequences. Pressure testing is the discipline that reveals whether governance survives broken assumptions, operational latency, brittle dependencies and human bottlenecks. The practical question for any serious organisation is no longer whether its AI governance model looks controlled on paper. It is whether that control still exists at the exact moment the system is allowed to act. [1] [12] [8]

References

[1] AI Risk Management Framework | NIST — https://www.nist.gov/itl/ai-risk-management-framework

[2] NIST AI RMF Playbook — https://airc.nist.gov/airmf-resources/playbook/

[3] The 2025 AI Index Report | Stanford HAI — https://hai.stanford.edu/ai-index/2025-ai-index-report

[4] The State of AI: Global Survey 2025 | McKinsey — https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai

[5] Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile | NIST — https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence

[6] Frontier AI Trends Report | AI Security Institute — https://www.aisi.gov.uk/frontier-ai-trends-report

[7] Deloitte Global’s 2025 Predictions Report: Generative AI: Paving the Way for a Transformative Future in Technology, Media, and Telecommunications | Deloitte — https://www.deloitte.com/global/en/about/press-room/deloitte-globals-2025-predictions-report.html

[8] AI Act | Shaping Europe’s digital future — https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai

[9] Frequently Asked Questions | AI Act Service Desk — https://ai-act-service-desk.ec.europa.eu/en/faq

[10] AI Act: Commission issues draft guidance and reporting template on serious AI incidents, and seeks stakeholders’ feedback — https://digital-strategy.ec.europa.eu/en/consultations/ai-act-commission-issues-draft-guidance-and-reporting-template-serious-ai-incidents-and-seeks

[11] OECD AI Incidents Monitor, an evidence base for trustworthy AI | OECD.AI — https://oecd.ai/en/incidents

[12] Towards a common reporting framework for AI incidents | OECD — https://www.oecd.org/en/publications/towards-a-common-reporting-framework-for-ai-incidents_f326d4ac-en.html

[13] OECD launches global framework to monitor application of G7 Hiroshima AI Code of Conduct — https://www.oecd.org/en/about/news/press-releases/2025/02/oecd-launches-global-framework-to-monitor-application-of-g7-hiroshima-ai-code-of-conduct.html

[14] Government response on AI cyber security | GOV.UK — https://www.gov.uk/government/calls-for-evidence/cyber-security-of-ai-a-call-for-views/outcome/government-response-on-the-cyber-security-of-ai