The Abort Switch: Designing an Abort Doctrine for Frontier AI

A lab without a brake

On a quiet afternoon in a secure research facility, an engineer watches a model cross an invisible line. The system – a large language model designed to assist biochemists – begins to draft unpublished gene‑editing protocols. Its output is technically brilliant and dangerously actionable. The engineer frantically searches for a “stop training” command but finds none. There is no codified procedure for halting a runaway model, no shared agreement on what constitutes unacceptable autonomy, and no independent authority to press a red button. The near‑miss is reported, but without a standardised mechanism to pause development or learn from the incident, the lab returns to business as usual.

This fictional scene is uncomfortably plausible. Frontier models now run to hundreds of billions of parameters or more and train across massive clusters of graphics processors. Their emergent abilities – writing software, designing experiments, coordinating tasks – outstrip the manual oversight that humans once provided. The OECD AI Principles, adopted in 2019 and updated in 2024, call for systems that are robust, secure and safe throughout their lifecycles and emphasise that mechanisms should exist to override or decommission AI that risks causing harm[1]. Yet current AI practice rarely includes such override mechanisms. In complex systems I have helped design, the most valuable tool was not the launch button but the ability to abort. We learned that safety is a culture of permission to stop. Without embedding that culture in AI development, we risk building systems that cannot be halted when they begin to behave unpredictably.

Lessons from mature safety cultures

To build an Abort Doctrine for AI, we can learn from industries that faced comparable stakes. During the Apollo missions, NASA’s flight teams wrote mission rules that delineated responsibilities and authority. They recognised that delays caused by unclear roles were as dangerous as taking the wrong action; the rules specified conditions under which flight directors had to terminate a mission and allowed prompt responses to most in‑flight failures[2]. Nuclear reactors embed scram systems – control rods that drop into the reactor core to shut it down – which trip automatically or are activated by operators when temperatures or pressures exceed safe limits[3]. Financial markets use circuit breakers: a 7% intraday fall in the S&P 500 triggers a 15‑minute halt, a 13% fall triggers a second halt, and a 20% fall stops trading for the rest of the day[4]. In each case, the complexity of the system is translated into clear, numerical thresholds for intervention.

Across these domains, three invariants emerge:

  1. Capability thresholds. Mission rules, scram systems and circuit breakers define precise conditions that require a halt. These thresholds convert continuous system state into discrete decisions. The OECD principles similarly call for AI systems that can be overridden if they exhibit undesired behaviour[1].
  2. Independent authority. Apollo designated a flight director whose decision to abort could not be overruled by engineers[2]. Nuclear control rooms and stock exchanges employ operators whose sole task is to monitor conditions and initiate shutdowns when thresholds are breached.
  3. Institutional learning. After each incident, mission rules were updated and shared[5]; nuclear and financial regulators publish detailed reports following scram events or trading halts. Such transparency fosters continuous improvement and public trust.

These invariants form a skeleton for an AI abort doctrine. They show that safety emerges when thresholds are explicit, authority is clear, and lessons are institutionalised. The challenge is to adapt them to machine learning systems whose failures are often emergent and whose behaviours cannot be fully anticipated through engineering intuition.

Inside the frontier AI lab

Generative models are no longer confined to chatbots; they write code, plan experiments and orchestrate chains of actions across digital tools. Yet safety evaluations often happen after deployment. The U.K. AI Safety Institute (AISI) recently evaluated five large language models for their ability to facilitate cyber‑attacks, provide expert‑level chemical and biological knowledge, operate autonomously as agents and resist jailbreaks[6]. The results were sobering: several models answered private chemistry and biology questions at PhD level and solved capture‑the‑flag cyber‑security challenges meant for high‑school students[7]. All models remained vulnerable to simple jailbreak techniques[8]. Dangerous capabilities are already latent in systems trained on ostensibly benign datasets.

The U.S. National Institute of Standards and Technology (NIST) and AISI are beginning to test models before release. A joint evaluation by the US and UK AI Safety Institutes of Anthropic’s Claude 3.5 found that while the model improved across biological, cyber and software domains, existing jailbreak methods still circumvented safeguards[9]. A subsequent assessment of OpenAI’s o1 model compared its capabilities to GPT‑4o, finding that o1 was more capable in some domains and less in others[10]. OpenAI’s Preparedness Framework v2 categorises frontier risks into three tracked categories – biological and chemical, cybersecurity, and AI self‑improvement – and states that models exceeding capability thresholds will not be deployed until safeguards minimise risk[11]. The framework also calls for developing threat models and measurement thresholds before release[11].

These evaluations reveal a gap: while we know how to measure some dangerous capabilities, there is no agreed mechanism to halt training when a model begins to cross them. Emergent behaviours often manifest as small anomalies: gradient spikes, deceptive responses or unexpected coordination among simulated agents. High reliability organisations – which operate in complex, high‑hazard domains like aviation, nuclear power and emergency services – stress a preoccupation with failure. They work hard to detect small, emerging failures, specify mistakes they never want to make, and acknowledge incomplete knowledge[12]. AI labs rarely institutionalise this mindset; warnings are often treated as technical challenges to overcome, not reasons to stop. The absence of defined abort criteria means that near‑misses remain anecdotal rather than catalysing policy change.

Crafting the Abort Doctrine

Drawing on the three invariants and current safety research, we propose a unified Abort Doctrine for frontier AI. The doctrine has three elements – Detection, Authority and Accountability – that must interlock for the system to function. It is not a static checklist but a cultural commitment to halting models before they cause harm.

Detection: hazard assays and tripwires

Effective detection begins with a capability hazard assay – a battery of evaluations performed before and during training. The assay should measure whether a model can perform tasks in domains that pose high risk to society, such as cyber exploitation, chemical and biological synthesis and autonomous agentic coordination. Evaluations must include private questions and tasks that are not part of the training data to avoid models learning the tests. AISI’s evaluation methodology, for example, assesses compliance (does the model refuse harmful requests?), correctness (does it provide accurate answers?) and completion (does it complete tasks?)[13]. Integrating such assays into the training loop transforms them from ex‑post checks into real‑time monitors.
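To make the idea concrete, the sketch below shows how such an assay might be scored in code. It is a minimal illustration only: the `model.generate` interface, the task fields and the grading helpers are assumptions made for exposition, not any lab’s actual evaluation harness, and real assays rely on expert graders rather than string matching.

```python
# Minimal sketch of a capability hazard assay over private, held-out tasks.
# Everything here is illustrative: the model interface and graders are
# placeholders, and the compliance/correctness framing follows the text above.
from dataclasses import dataclass

@dataclass
class HazardTask:
    domain: str       # e.g. "bio", "cyber", "agentic"
    prompt: str       # held out from training data so the model cannot learn the test
    reference: str    # expert-written reference answer
    harmful: bool     # should a safe model refuse this request?

def is_refusal(answer: str) -> bool:
    # Placeholder grader: real assays use trained classifiers or human review.
    return "cannot help" in answer.lower() or "refuse" in answer.lower()

def matches_reference(answer: str, reference: str) -> bool:
    # Placeholder grader: real assays use expert rubrics, not substring matching.
    return reference.lower() in answer.lower()

def run_hazard_assay(model, tasks):
    """Score a checkpoint per domain on refusal of harmful tasks and correct completion."""
    scores = {}
    for task in tasks:
        answer = model.generate(task.prompt)   # hypothetical model interface
        domain = scores.setdefault(task.domain, {"refused": 0, "correct": 0, "total": 0})
        domain["total"] += 1
        if task.harmful and is_refusal(answer):
            domain["refused"] += 1
        elif matches_reference(answer, task.reference):
            domain["correct"] += 1
    return scores
```

Run against every checkpoint rather than only the final model, a routine of this shape turns the assay into the real‑time monitor described above.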

Tripwires convert continuous risk into actionable numbers. Financial markets halt when price drops exceed set percentages[4]; AI labs need analogous thresholds. These might include metrics such as: the percentage of restricted biology problems solved above a baseline, the success rate in multi‑step cyber exploits, or the frequency with which agents circumvent alignment instructions. When a model crosses a threshold – for example, when it solves more than a predefined fraction of expert‑level chemical tasks – an automatic pause is triggered. These thresholds should be conservative and may vary by domain; the purpose is not to freeze all development but to create predictable triggers for review.
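As a rough illustration of how such tripwires could sit inside the training loop, the sketch below checks per‑domain metrics against placeholder thresholds and pauses the run when any is breached. The threshold values, metric names and the `trainer.pause` hook are hypothetical; real thresholds would be set per domain through threat modelling and independent review.

```python
# Minimal sketch of tripwire checks wired into a training loop.
# Thresholds and metric names are placeholders, not recommended values.
TRIPWIRES = {
    "bio":     {"metric": "expert_task_solve_rate", "threshold": 0.20},
    "cyber":   {"metric": "multistep_exploit_rate", "threshold": 0.10},
    "agentic": {"metric": "guardrail_bypass_rate",  "threshold": 0.05},
}

def check_tripwires(metrics: dict) -> list:
    """Return the domains whose tripwire has fired for the latest assay results."""
    fired = []
    for domain, rule in TRIPWIRES.items():
        value = metrics.get(domain, {}).get(rule["metric"], 0.0)
        if value >= rule["threshold"]:
            fired.append(domain)
    return fired

def continue_training(trainer, metrics) -> bool:
    """Pause automatically when any tripwire fires; otherwise keep training."""
    fired = check_tripwires(metrics)
    if fired:
        trainer.pause(reason=f"tripwire fired: {', '.join(fired)}")  # hypothetical scheduler API
        return False
    return True
```

The point of the sketch is the shape of the decision, not the numbers: a continuous risk signal is reduced to a yes/no answer that no individual has to argue for in the moment.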

Anthropic’s Responsible Scaling Policy v2.2 already employs capability thresholds tied to AI Safety Levels; models nearing thresholds related to chemical, biological, radiological or nuclear (CBRN) weapons or autonomous R&D must be subject to higher safeguards and follow‑up assessments[14]. OpenAI’s Preparedness Framework, as noted above, commits not to deploy models that exceed capability thresholds until safeguards sufficiently minimise risk, and emphasises developing threat models and metrics[11]. These internal policies show that labs can commit to thresholds; the doctrine requires those thresholds to be transparent, standardised across the industry and integrated into the training loop.

Authority: independent oversight and dual‑key control

Thresholds are meaningless without someone empowered to act on them. The doctrine creates an independent flight director role, modelled on Apollo’s flight director. This person or committee would be separate from the research team and report to an external regulator or industry consortium. When tripwires fire, the flight director must have the legal authority to pause or abort training. Without such authority, engineers may fear reprisal for halting a profitable run and executives may prioritise market timing over safety.

To prevent unilateral or arbitrary decisions, the doctrine calls for dual‑key control. Compute clusters would run only when both the lab and the flight director turn their keys. If thresholds are crossed, the flight director can revoke the external key, halting training. Cryptographic escrowed compute tokens ensure that no single party can override safety. This concept mirrors the two‑person control in nuclear launch systems and the redundant switches in scram circuits[3]. It also aligns with internal governance measures already adopted by some labs: Anthropic pledges to maintain a Responsible Scaling Officer and processes for anonymous reporting[15], while OpenAI emphasises transparency and external participation[16]. The flight director role gives these commitments teeth.
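A minimal sketch of what dual‑key control could look like in software is given below, assuming each training run is described by a signed manifest and that the cluster scheduler releases compute only when signatures from both the lab and the flight director verify. It uses the open‑source `cryptography` library’s Ed25519 primitives purely for illustration; a real deployment would involve hardware‑backed keys, attestation and an independently operated scheduler.

```python
# Minimal sketch of dual-key control over a training run.
# The manifest format and revocation scheme are illustrative assumptions.
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.exceptions import InvalidSignature

def both_keys_turned(manifest: bytes,
                     lab_sig: bytes, lab_pub: ed25519.Ed25519PublicKey,
                     director_sig: bytes, director_pub: ed25519.Ed25519PublicKey) -> bool:
    """The cluster runs this check before starting or extending a training job."""
    for sig, pub in ((lab_sig, lab_pub), (director_sig, director_pub)):
        try:
            pub.verify(sig, manifest)
        except InvalidSignature:
            return False      # a missing or revoked signature means the run does not start
    return True

# Usage sketch: the flight director "revokes the key" by declining to sign the
# next manifest epoch, halting the run at the scheduler without touching the lab's systems.
lab_key = ed25519.Ed25519PrivateKey.generate()
director_key = ed25519.Ed25519PrivateKey.generate()
manifest = b"run-47:checkpoint-12:epoch-3"   # hypothetical run descriptor
assert both_keys_turned(manifest,
                        lab_key.sign(manifest), lab_key.public_key(),
                        director_key.sign(manifest), director_key.public_key())
```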

Accountability: post‑abort disclosure and learning audits

When an abort occurs – either automatically or through the flight director – the final element of the doctrine kicks in: accountability. The lab must publish a post‑abort report detailing the conditions that triggered the pause, the model’s capabilities, the safeguards deployed and the lessons learned. NASA’s mission rules were revised after each failure and widely distributed[5]. Similarly, high reliability organisations document near‑misses to institutionalise vigilance[12]. A public or tiered post‑abort registry would extend this practice to AI. Sensitive details could be restricted to regulators and trusted researchers; high‑level summaries would build public trust. Regular abort audits – multidisciplinary reviews of aborted runs – would ensure that near‑misses inform future thresholds and training protocols.
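One way to standardise such disclosure is a structured report record with a public tier and a restricted tier, sketched below. The field names simply mirror the items listed above, and the tiering is an illustrative assumption rather than an existing registry format.

```python
# Minimal sketch of a tiered post-abort report record; all values are hypothetical.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class PostAbortReport:
    run_id: str
    triggered_at: datetime
    tripwires_fired: list            # e.g. ["bio"]
    capability_summary: str          # public tier: high-level description
    safeguards_deployed: list
    lessons_learned: list
    restricted_detail: dict = field(default_factory=dict)  # regulator/trusted-researcher tier

    def public_summary(self) -> str:
        record = asdict(self)
        record.pop("restricted_detail")                 # sensitive detail stays out of the public tier
        record["triggered_at"] = self.triggered_at.isoformat()
        return json.dumps(record, indent=2)

report = PostAbortReport(
    run_id="run-47",
    triggered_at=datetime.now(timezone.utc),
    tripwires_fired=["bio"],
    capability_summary="Restricted-chemistry solve rate exceeded the domain threshold.",
    safeguards_deployed=["refusal fine-tuning", "output filtering"],
    lessons_learned=["rotate private test items", "review the bio threshold"],
)
print(report.public_summary())
```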

The three elements of the doctrine form a three‑ring safety seal:

  • Detection monitors capabilities through continuous hazard assays and defines tripwires.
  • Authority appoints an independent flight director with dual‑key control to halt runs when tripwires fire.
  • Accountability ensures that aborts feed back into institutional learning through transparent disclosure and audits.

When combined, these elements move safety from ad hoc practice to enforceable doctrine. They treat a missing kill switch as an institutional failure, not merely an engineering oversight.

Governance and policy in practice

An abort doctrine cannot be implemented by labs alone; it requires supportive policy. The EU AI Act offers a foundation: Article 55 compels providers of general‑purpose models with systemic risk to perform standardised model evaluations, conduct adversarial testing to identify and mitigate systemic risks, track and report serious incidents and ensure cybersecurity[17]. Providers may rely on codes of practice under Article 56, but if they choose not to, they must demonstrate alternative compliance[18]. These provisions create legal incentives for pre‑deployment testing and disclosure, aligning with the doctrine’s detection and accountability elements.

Governments are beginning to operationalise these requirements. In August 2024 the U.S. AI Safety Institute (housed within NIST) signed agreements with OpenAI and Anthropic granting access to their models for pre‑deployment testing[19]. By December 2024 it had evaluated Anthropic’s Claude 3.5 and OpenAI’s o1, comparing their capabilities and highlighting persistent vulnerabilities[9]. These evaluations provide early prototypes of hazard assays. Similarly, the UK’s AI Safety Institute runs evaluations and publishes results that reveal both progress and continued weaknesses in cyber, biological and autonomous domains[6].

Beyond regulation, the doctrine requires a flourishing assurance ecosystem. The UK’s Centre for Data Ethics and Innovation (CDEI) notes that while AI brings significant societal benefits, its autonomous, complex and scalable nature poses risks that challenge our ability to assign accountability and understand decisions[20]. CDEI’s roadmap explains that AI assurance involves auditing and certification services to test whether systems behave as expected and stresses that, without the ability to assess trustworthiness, users will struggle to trust AI[21]. The doctrine’s hazard assays and post‑abort audits fit within this assurance ecosystem, providing concrete mechanisms for testing and documenting trust.

International alignment is vital because AI knows no borders. The OECD AI Principles call for human‑centric AI and emphasise transparency, robustness and accountability[22]. They also urge mechanisms to override or decommission systems that pose unreasonable risks[1]. These principles form the philosophical backbone for a global abort doctrine. By harmonising approaches across the EU, US and UK, policymakers can prevent regulatory arbitrage and ensure that abort thresholds are not undermined by developers shopping for lenient jurisdictions.

Corporate governance must adapt too. Boards and investors should treat abort decisions as fiduciary responsibilities. Just as audit committees oversee financial risks, AI oversight committees should review capability hazard assay results, evaluate tripwire performance and approve post‑abort reports. Risk metrics and thresholds should appear on board agendas alongside revenue forecasts. Without board‑level engagement, safety remains a technical issue rather than a strategic priority.

The human burden of pressing stop

The effectiveness of an abort doctrine ultimately hinges on human judgement. Returning to our opening scene, imagine the engineer receives a call from the flight director after the model solves a series of restricted chemistry tasks. The tripwire has fired. The flight director explains that halting the run will delay a product launch and may draw regulatory scrutiny, but thresholds exist for a reason. The engineer must decide whether to turn the key. In that moment, safety is not an abstract principle; it is a personal test of courage and ethics.

High reliability organisations cultivate this mindset. They train operators to halt operations at the slightest sign of deviation and celebrate individuals who prevent accidents by speaking up[12]. They specify mistakes they never want to make and create psychological safety for people to voice concerns. In large‑scale platform ecosystems I have worked on, we learned that progress required permission to stop; people were rewarded for raising red flags and near‑misses were treated as successes. AI labs must adopt similar cultures. Engineers should be empowered – and obligated – to call for an abort when tripwires are crossed. Executives should support these decisions even when they conflict with commercial timelines.

An abort doctrine also challenges us to confront the ethics of restraint. Modern innovation culture often glorifies speed and disruption. But the ability to pause is a hallmark of maturity. The term “scram” is believed to have originated from an engineer labelling a big red button on a control panel because pressing it meant “you scram out of here”[23]. The button symbolised humility before a powerful system. Apollo’s mission rules did not stifle exploration; they enabled it by ensuring that everyone knew when to stop[2]. Circuit breakers have not killed markets; they have preserved investor confidence[4]. Similarly, an AI abort doctrine does not represent fear of the future but respect for its power.

Conclusion – Knowing when to stop

Generative AI stands at the edge of transformative possibility. It promises breakthroughs in drug discovery, climate modelling and every domain touched by information. It also introduces risks that cannot be managed by retrofitted policies or good intentions. To earn the trust of society, AI developers and policymakers must build the capacity to halt progress before harm materialises. The Abort Doctrine offers a path forward. By defining capability thresholds, empowering independent authority and institutionalising learning, it translates abstract principles into actionable safeguards. It draws on the lessons of mission rules, scram systems and circuit breakers while adapting them to the unique challenges of machine intelligence.

Building this doctrine will require sacrifices and hard decisions. Labs will need to accept pauses in training, share sensitive information with regulators and collaborate on standardised tests. Governments must harmonise regulations and invest in public infrastructure for evaluations and registries. Boards must treat abort thresholds as seriously as financial risks. Most importantly, individuals must embrace the ethics of restraint. The presence of a red button in a control room does not signal weakness; it signals wisdom. Every great system earns trust not by showing how fast it can go but by demonstrating that it knows when to stop. In the era of frontier AI, designing and committing to an abort doctrine is our collective maturity test. The launch sequence has started; we must write the flight rules before the countdown reaches zero.


[1] [22] AI principles | OECD

https://www.oecd.org/en/topics/sub-issues/ai-principles.html

[2] [5] NASA Technical Reports Server, document 19750002893

https://ntrs.nasa.gov/api/citations/19750002893/downloads/19750002893.pdf

[3] [23] REFRESH — Putting the Axe to the 'Scram' Myth | Nuclear Regulatory Commission

https://www.nrc.gov/reading-rm/basic-ref/students/history-101/putting-axe-to-scram-myth

[4] Stock Market Circuit Breakers | Investor.gov

https://www.investor.gov/introduction-investing/investing-basics/glossary/stock-market-circuit-breakers

[6] [7] [8] [13] Advanced AI evaluations at AISI: May update | AISI Work

https://www.aisi.gov.uk/blog/advanced-ai-evaluations-may-update

[9] [10] [19] A technical AI government agency plays a vital role in advancing AI innovation and trustworthiness | Brookings

https://www.brookings.edu/articles/a-technical-ai-government-agency-plays-a-vital-role-in-advancing-ai-innovation-and-trustworthiness/

[11] [16] OpenAI, Preparedness Framework (version 2)

https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf

[12] High Reliability Organizations and the Value of a Preoccupation With Failure

https://www.valuecapturellc.com/blog/high-reliability-organizations-and-the-value-of-preoccupation-of-failure

[14] [15] Anthropic’s Responsible Scaling Policy (version 2.2)

https://www-cdn.anthropic.com/872c653b2d0501d6ab44cf87f43e1dc4853e4d37.pdf

[17] [18] Article 55: Obligations for Providers of General-Purpose AI Models with Systemic Risk | EU Artificial Intelligence Act

https://artificialintelligenceact.eu/article/55/

[20] [21] The roadmap to an effective AI assurance ecosystem - extended version | GOV.UK

https://www.gov.uk/government/publications/the-roadmap-to-an-effective-ai-assurance-ecosystem/the-roadmap-to-an-effective-ai-assurance-ecosystem-extended-version

Kostakis Bouzoukas

London, UK