Outcome‑Based AI Contracts: Buying Results, Not Models

Executive Summary
Public procurement is one of the biggest levers governments and corporations have for shaping digital transformation. In the United Kingdom alone, procurement spending exceeds £380 billion annually — roughly one in every three public pounds — yet until recently there was little visibility over where this money went[1]. Boards and finance committees have spent lavishly on artificial‑intelligence solutions, but most contracts tie payment to inputs (licences, compute, number of API calls) rather than to tangible outcomes. In local government, AI adoption is immature; officials face fiscal austerity while being told to procure technologies they barely understand[2]. Without clear guidance, procurers must navigate a patchwork of policies and definitions[3], and high‑profile failures like the Post Office’s Horizon IT system have eroded trust in AI spending[4].
This whitepaper argues that the most important innovation in AI is contractual: moving from buying models to buying outcomes. Drawing on global frameworks (OECD AI Principles, ISO/IEC 42001, NIST AI Risk Management Framework), UK‑specific guidance (Crown Commercial Service, ATRS, AISI), academic and civil‑society tools (model cards, Data Ethics Canvas), and fresh market data, we outline the anatomy of outcome‑based AI contracts. We show how to tie payments to performance, require documentation and transparency artefacts, and embed evaluation and rollback mechanisms. We also sketch a methodology for boards to measure adoption of these clauses and the realised return on investment (ROI). The result is a blueprint for accountable AI procurement that aligns financial incentives with public value.
The Problem with Buying Models
A procurement fallacy
AI promises to improve public services and reduce costs, yet procurement practices have not kept pace. Local authorities are under unprecedented financial pressure[2]. There is optimism that AI can address the cost‑of‑living crisis or improve efficiency, but adoption is immature and the regulatory landscape is patchy. The Ada Lovelace Institute’s review of 16 policy documents found no single, comprehensive source of guidance for local government on procuring AI[3]. Instead, procurers must interpret broad principles, leading to inconsistent practices.
Traditional contracts make payment contingent on delivering a software licence or access to an API. For adaptive AI systems whose behaviour changes with data and context, this input‑based model creates three failures:
- Unclear objectives and ROI – Contracts emphasise the acquisition of a model rather than the outcomes the model should deliver. As a result, suppliers can claim success even if the system fails to achieve policy goals. Boards lack metrics to track whether promised benefits (cost savings, accuracy improvements, reduced bias) have been realised.
- Opaque performance – AI systems are often “black boxes”, making it difficult to understand decision logic. Without contractual requirements for documentation or evaluation, procurers cannot audit system behaviour or explain decisions to those affected. High‑profile failures in welfare algorithms and visa applications have shown the harms of unchecked opacity[5]. The Algorithmic Transparency Recording Standard (ATRS) was created precisely because public bodies lacked a way to publish information about algorithmic tools and why they use them[6].
- Weak governance and rollback – Input‑based contracts rarely specify how to monitor AI performance over time or what to do when systems misbehave. Without logs, evaluation schedules or rollback clauses, public bodies become locked into risky systems that cannot be safely paused or decommissioned.
Evidence of the gap
Data from the UK’s new procurement regime underscores the problem. Despite annual spending of over £380 billion, the pre‑2025 procurement regime generated little data on outcomes[7]. Under the Procurement Act 2023, the UK introduced a central digital platform and open data standards (OCDS) to collect richer information[8]. Early results show progress: almost 600 buyers published 1,691 pre‑market engagement notices between February and May 2025; the number of competitive flexible procedures increased from 67 in March to 315 in May[9]; and the proportion of lots tagged suitable for SMEs rose from 15.4 per cent to 32 per cent[10]. Yet these metrics relate to process, not outcomes. The data still lack information on whether AI procurements deliver their intended benefits, and few contracts include clauses mandating transparency artefacts or independent evaluations.
In local government, the Ada Lovelace Institute reports that procurement teams lack capacity and resources, and that existing guidance does not show how to ensure AI delivers societal benefit[3]. The “Spending Wisely” report calls for a national taskforce because there is no coherent support for procurers[11]. Without clear standards, local authorities risk repeating failures like the Post Office Horizon scandal and visa algorithm controversies[12]. Public trust hinges on preventing such missteps.
The Shift to Outcome‑Based AI Governance
Global frameworks converge on outcomes
Multiple international frameworks have emerged to guide responsible AI. The OECD AI Principles emphasise human‑centred values, transparency, accountability and robust security. They call for AI systems to be transparent and explainable, to incorporate safety and security mechanisms throughout their lifecycle, and to be subject to accountability frameworks. The principles underpin legislation such as the EU AI Act and influence procurement guidelines across jurisdictions.
The ISO/IEC 42001 standard, introduced in December 2023, is the first global AI management system standard. It provides a structured framework for governing AI projects and aligning them with regulatory requirements. ISO 42001 lays out key requirements: establishing an AI management system (AIMS), risk management, AI system impact assessment, lifecycle management and third‑party supplier oversight[13]. Certification under this standard helps organisations build transparent, trustworthy and ethical AI systems, meet compliance obligations (such as the EU AI Act), improve risk management and demonstrate leadership in ethical AI[14]. The standard follows a plan‑do‑check‑act approach, emphasising continuous monitoring, stakeholder engagement and integration with existing information security standards[15].
The NIST AI Risk Management Framework (AI RMF) is a voluntary, industry‑agnostic guide designed to help organisations manage AI risks throughout the AI lifecycle. It provides a structured way to identify, assess and mitigate AI risks without stifling innovation[16]. The framework organises practices under four core functions:
- Govern – define governance structures, assign roles and outline responsibilities to align AI systems with standards, regulations and organisational values[17].
- Map – identify and assess risks across the AI lifecycle, fostering proactive risk identification[17].
- Measure – quantify and assess the performance, effectiveness and risks of AI systems to ensure stability and compliance[18].
- Manage – develop strategies for mitigating risks and ensuring continuous monitoring, auditing and improvement[19].
These frameworks share common themes: they prioritise transparency, accountability, risk management and human oversight. Together they signal a shift from input‑centric procurement to outcome‑oriented governance. They also provide language that can be incorporated into contracts as acceptance criteria, key performance indicators and assurance obligations.
Regional leadership and the UK context
Europe has codified many of these principles in legislation. The EU AI Act categorises systems by risk and requires higher‑risk systems to meet stringent requirements. To support public buyers, the EU Model Contractual Clauses (MCC‑AI), updated in March 2025, provide two templates: a high‑risk version, aligned with the AI Act’s Chapter III requirements, and a lighter version for non‑high‑risk or less risky algorithmic systems[20]. The templates include clauses for documentation, evaluation, logging, corrective actions and termination rights, demonstrating how policy can be translated into contract language.
The United States has also moved toward outcome‑based contracting. White House guidance (OMB M‑24‑10 and M‑24‑18) requires agencies to identify rights‑impacting AI systems, conduct impact assessments and include contractual obligations for risk management, human oversight and evaluation. These memos mirror NIST’s approach and encourage agencies to tie procurement decisions to measurable outcomes.
In the UK, several initiatives lay the groundwork for outcome‑oriented procurement. The Guidelines for AI procurement, developed by the Office for AI and the World Economic Forum, instruct public bodies to define problems before procuring AI, avoid black‑box algorithms, engage suppliers early and develop governance and assurance plans[21]. The Algorithmic Transparency Recording Standard (ATRS) provides a template for organisations to publish information about algorithmic tools, driving public understanding and trust and enabling senior responsible owners to take accountability[6]. AI Safety Institute (AISI) evaluations supply independent assessments of model safety and reliability; a May 2024 update showed that leading models can answer hundreds of complex chemistry and biology questions but struggle with university‑level cyber challenges and remain vulnerable to simple jailbreaks[22]. These evaluations underscore the need for contract clauses that require independent testing and logs.
Anatomy of an Outcome‑Based AI Contract
Building on the above frameworks, we propose a five‑artefact stack that makes contracts enforceable and outcome‑driven. Each artefact maps to one or more standards and provides concrete evidence for monitoring, auditing and payment.
1. System and Model Cards
Originating in academic research (model cards and datasheets for datasets), system/model cards are structured documents describing an AI system’s purpose, design, datasets, performance metrics, limitations and intended context of use. Requiring suppliers to deliver and publish system and model cards aligns with ATRS obligations[6] and ISO 42001’s emphasis on documentation and stakeholder engagement[13]. Contracts should specify that cards be updated with each significant model release and include disclosure of data sources, training methods, evaluation results and known limitations.
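To make this deliverable concrete, below is a minimal sketch of how the card fields a contract requires might be represented in machine‑readable form; the field names and example values are illustrative assumptions, not drawn from any published schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ModelCard:
    """Illustrative model-card structure a contract could require as a deliverable."""
    system_name: str
    version: str
    intended_use: str            # the policy problem the system addresses
    data_sources: List[str]      # provenance of training and evaluation data
    training_method: str         # e.g. fine-tuned transformer, gradient-boosted trees
    evaluation_results: dict     # metric name -> value from the latest independent evaluation
    known_limitations: List[str] # contexts in which the system should not be used
    last_updated: str            # refreshed with each significant model release

card = ModelCard(
    system_name="Welfare eligibility triage assistant",
    version="2.1.0",
    intended_use="Prioritise casework queues; not a sole decision-maker",
    data_sources=["Anonymised case records 2019-2023", "Synthetic edge cases"],
    training_method="Gradient-boosted decision trees",
    evaluation_results={"accuracy": 0.91, "false_positive_rate": 0.04},
    known_limitations=["Not validated for applicants under 18"],
    last_updated="2025-05-01",
)
```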
2. Evaluation Protocols
Outcome‑based contracts must define acceptance criteria and require independent evaluations at commissioning and periodically thereafter. Evaluations should cover both capability and safety. For high‑risk systems, the EU MCC‑AI templates already require testing aligned with the AI Act[20]. In the UK, buyers can leverage AISI’s evaluation methods, such as Capture the Flag challenges for cyber resilience and “Inspect” for agentic behaviour[22]. Contracts should mandate that suppliers provide full evaluation logs, including prompts, responses, grading criteria and outcomes. Payment milestones can be tied to meeting specified thresholds (e.g., a minimum accuracy on relevant tasks and a maximum allowed rate of non‑compliance or harmful outputs).
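As an illustration of how such thresholds might be operationalised, the following sketch checks evaluation results against hypothetical contractual acceptance criteria before a milestone payment is released; the metric names and threshold values are assumptions.

```python
# Hypothetical acceptance thresholds written into the contract schedule.
ACCEPTANCE_CRITERIA = {
    "task_accuracy": {"minimum": 0.90},        # at least 90% on agreed benchmark tasks
    "harmful_output_rate": {"maximum": 0.01},  # no more than 1% flagged harmful outputs
    "jailbreak_success_rate": {"maximum": 0.05},
}

def milestone_passed(evaluation_results: dict) -> bool:
    """Return True only if every contractual threshold is met."""
    for metric, bounds in ACCEPTANCE_CRITERIA.items():
        value = evaluation_results.get(metric)
        if value is None:
            return False  # missing evidence counts as a failure
        if "minimum" in bounds and value < bounds["minimum"]:
            return False
        if "maximum" in bounds and value > bounds["maximum"]:
            return False
    return True

# Example: summary figures taken from an independent evaluation log.
results = {"task_accuracy": 0.93, "harmful_output_rate": 0.008, "jailbreak_success_rate": 0.02}
print("Release milestone payment:", milestone_passed(results))
```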
3. Logs and Telemetry
Continuous monitoring requires robust logging. Contracts should oblige suppliers to provide logs of model inputs, outputs, and system events, with privacy protections and data minimisation as per the UK Information Commissioner’s Office (ICO) guidance on accountability and governance[23]. Logs enable auditors to detect performance drift, bias or malicious attempts to jailbreak models. They also support reproducibility and compliance with data protection impact assessments (DPIAs). The NIST AI RMF’s “Measure” function emphasises continuous assessment[24], and ISO 42001 requires ongoing performance evaluation[15]. Contracts should specify retention periods, formats (e.g., JSON, CSV), and rights of access for auditors.
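To illustrate what a contractually specified log record might look like, here is a brief sketch of a minimised JSON telemetry entry; the field names and retention figure are assumptions and would need to be aligned with ICO guidance and the parties’ DPIA.

```python
import json
from datetime import datetime, timezone

# Illustrative telemetry record; field names are assumptions, not a published schema.
record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model_version": "2.1.0",
    "request_id": "req-000123",
    "input_summary": {"token_count": 412, "pii_redacted": True},  # minimised, no raw personal data
    "output_summary": {"decision": "refer_to_caseworker", "confidence": 0.71},
    "safety_flags": [],             # e.g. ["possible_jailbreak_attempt"]
    "retention_days": 365,          # retention period agreed in the contract
}

print(json.dumps(record, indent=2))
```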
4. Rollback and Safeguards
Outcome‑based AI contracts must include clauses that allow public bodies to pause, roll back or terminate systems when safety issues arise. Suppliers should provide rollback plans, including version control, feature flags and shadow modes that can run in parallel with existing systems. ISO 42001’s continuous improvement and risk mitigation requirements[15], along with the EU MCC‑AI’s corrective action clauses, support this. Contracts should also require safe‑shutdown procedures, clear escalation paths and remediation timelines.
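A minimal sketch of how feature flags and a shadow mode could be wired into a deployment is shown below; the flag names and routing logic are illustrative assumptions rather than a prescribed pattern.

```python
# Illustrative rollback controls; flag names and routing are assumptions.
FEATURE_FLAGS = {
    "use_new_model": True,   # flipping to False rolls traffic back to the legacy system
    "shadow_mode": False,    # when True, the new model runs in parallel but its output is not acted on
}

def handle_case(case, legacy_system, new_model, log):
    """Route a case through the legacy or new system according to the flags."""
    if FEATURE_FLAGS["shadow_mode"]:
        # Run both; act on the legacy decision, log the new model's output for comparison.
        log.append({"case": case["id"], "shadow_decision": new_model(case)})
        return legacy_system(case)
    if FEATURE_FLAGS["use_new_model"]:
        return new_model(case)
    return legacy_system(case)  # rollback path: safe, previously accepted behaviour

# Toy usage with stand-in systems.
audit_log = []
decision = handle_case({"id": "case-42"}, lambda c: "approve", lambda c: "refer", audit_log)
print(decision, audit_log)
```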
5. Outcome KPIs and Payment Tied to Results
Finally, the contract should define outcome‑based key performance indicators (KPIs) that reflect the policy objectives. For example, in a customer‑service chatbot procurement, KPIs might include: reduction in average call‑handling time, resolution accuracy, customer satisfaction, and reduction in bias against certain demographic groups. For a fraud‑detection system, KPIs might track true positives, false positives, fairness metrics and cost savings. Payment schedules should be tied to achieving these KPIs, with bonuses for exceeding targets and penalties or withholding for underperformance. Boards can adapt the OECD and ISO principles to define such metrics: fairness and accuracy for human‑centred values, robustness and security for reliability, and explainability for accountability.
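To show how a payment schedule might be linked to KPI attainment, the following worked sketch uses hypothetical targets, weightings and a bonus rule; all figures are assumptions for illustration only.

```python
# Hypothetical KPI targets and weightings for an outcome-linked payment tranche.
KPIS = {
    "call_handling_time_reduction": {"target": 0.20, "weight": 0.4},  # 20% reduction
    "resolution_accuracy":          {"target": 0.90, "weight": 0.4},
    "satisfaction_score":           {"target": 4.0,  "weight": 0.2},  # out of 5
}

def tranche_payment(actuals: dict, tranche_value: float) -> float:
    """Pay each KPI's weighted share only if its target is met; add a 5% bonus if all targets are exceeded."""
    earned = sum(
        spec["weight"] * tranche_value
        for kpi, spec in KPIS.items()
        if actuals.get(kpi, 0) >= spec["target"]
    )
    if all(actuals.get(kpi, 0) > spec["target"] for kpi, spec in KPIS.items()):
        earned *= 1.05  # illustrative bonus for exceeding every target
    return round(earned, 2)

actuals = {"call_handling_time_reduction": 0.25, "resolution_accuracy": 0.92, "satisfaction_score": 4.3}
print(tranche_payment(actuals, tranche_value=100_000))  # -> 105000.0
```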
Bringing it together
These artefacts together make the contract auditable. System cards provide transparency; evaluation protocols and logs create a paper trail; rollback plans ensure safety; and outcome KPIs align incentives. The result is an enforceable contract that treats AI procurement like any other high‑risk activity: with clear deliverables, evidence of performance and rights to rectify or exit if things go wrong. Governance is no longer abstract — it can be invoiced.
The Proof: Measuring Clause Adoption and ROI
Outcome‑based contracting is more than a theory — it must be evidenced. We propose a two‑part methodology for boards and oversight bodies.
Sampling and clause detection
The UK’s central digital platform now publishes procurement notices in the Open Contracting Data Standard (OCDS). Users can access data via the Find a Tender Service API[8]. To estimate the adoption of AI‑specific clauses:
- Define the sample – Query notices from 2024–25 containing keywords such as “artificial intelligence”, “algorithmic”, “machine learning”, “model card”, “evaluation”, “logging” and “ISO 42001”. Restrict to notices with a procurement category relevant to technology or professional services.
- Identify AI clauses – Apply natural‑language processing to procurement documents (e.g., contract specifications and terms) to detect references to documentation (system/model cards, datasheets), evaluation requirements, logging obligations, rollback or corrective action clauses, and outcome‑based payment terms.
- Compute adoption metrics – For each contract, flag whether at least one outcome‑based clause appears. Aggregate results by authority type (central government, local government, NHS, education) and by contract value. For example, early data might show that only a small share of AI tenders include any transparency artefact; such a baseline helps track progress over time. A sketch of the detection and aggregation steps follows this list.
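The sketch below illustrates keyword‑based clause detection and aggregation on notice text; in practice a fuller NLP pipeline would be used, and the Find a Tender Service API endpoint and JSON structure (not shown here) would need to be confirmed against the published documentation.

```python
import re

# Keyword patterns standing in for a fuller NLP clause-detection pipeline.
CLAUSE_PATTERNS = {
    "model_card": r"\b(model card|system card|datasheet)\b",
    "evaluation": r"\b(independent evaluation|acceptance test|evaluation protocol)\b",
    "logging": r"\b(audit log|telemetry|logging obligation)\b",
    "rollback": r"\b(rollback|corrective action|safe shutdown)\b",
    "outcome_payment": r"\b(payment milestone|outcome-based payment|performance-linked)\b",
}

def detect_clauses(text: str) -> dict:
    """Flag which clause families appear in a notice's free text."""
    text = text.lower()
    return {name: bool(re.search(pattern, text)) for name, pattern in CLAUSE_PATTERNS.items()}

def adoption_rate(notices: list[dict]) -> float:
    """Share of notices containing at least one outcome-based clause."""
    flagged = sum(1 for n in notices if any(detect_clauses(n.get("description", "")).values()))
    return flagged / len(notices) if notices else 0.0

# In practice, notices would be fetched as OCDS releases from the Find a Tender Service API
# (endpoint URL and field paths to be confirmed); a hand-written example is used here.
notices = [{"description": "Supplier must provide a model card and audit log for each release."}]
print(adoption_rate(notices))  # -> 1.0
```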
Linking outcomes to ROI
To assess realised ROI, boards need to track outcomes after contract award. The new Procurement Act encourages the use of quality criteria and social‑value metrics; the proportion of lots including quality criteria increased from 48.3 per cent in February 2025 to 72 per cent in May 2025[25]. However, only 9.5 per cent of these lots specify weights for quality criteria[26], making it difficult to interpret scores.
Boards can adapt the following dashboard:
- Outcome attainment – For each KPI defined in the contract (e.g., accuracy, bias reduction, cost per transaction), report baseline values, target thresholds and actual performance. Highlight whether the supplier met or exceeded the KPI and link this to payment disbursement.
- Benefit realisation – Compare the promised benefits in bid documents with realised benefits at milestones (e.g., one year after deployment). For example, if a predictive maintenance system promised 20 per cent reduction in equipment downtime, measure actual downtime reduction and compute cost savings. Where targets are not met, note corrective actions taken (e.g., re‑training, parameter adjustments). A worked sketch of this calculation follows the list.
- Risk incidents and interventions – Count the number of logged incidents, policy violations or data‑protection breaches. Report how often rollback mechanisms were activated and whether AISI evaluations or independent audits prompted changes.
- Stakeholder feedback – Collect user and citizen feedback to assess perceived fairness, explainability and trust. For example, a local authority might measure citizen satisfaction with an AI‑driven welfare eligibility tool.
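As a worked sketch of the first two dashboard rows, the following computes outcome attainment and a simple benefit‑realisation figure for the predictive‑maintenance example above; the baseline, actual and cost‑per‑hour figures are hypothetical.

```python
# Hypothetical contract KPI: 20% reduction in equipment downtime.
baseline_downtime_hours = 1_200   # annual downtime before deployment (assumed)
actual_downtime_hours = 930       # annual downtime one year after deployment (assumed)
target_reduction = 0.20
cost_per_downtime_hour = 450.0    # assumed operational cost per hour of downtime

achieved_reduction = (baseline_downtime_hours - actual_downtime_hours) / baseline_downtime_hours
kpi_met = achieved_reduction >= target_reduction
annual_savings = (baseline_downtime_hours - actual_downtime_hours) * cost_per_downtime_hour

print(f"Downtime reduction: {achieved_reduction:.1%} (target {target_reduction:.0%}) -> KPI met: {kpi_met}")
print(f"Realised annual saving: £{annual_savings:,.0f}")
```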
By systematically applying these metrics, boards can move beyond anecdotal evidence and gain a quantitative view of AI contract performance. This, in turn, supports evidence‑based policy and budgeting decisions.
Global Convergence and Leadership Lessons
Why the UK can lead
The UK is uniquely positioned to pioneer outcome‑based AI procurement. The Procurement Act 2023 centralises and opens procurement data[27], and early results show increased use of competitive flexible procedures and pre‑market engagement[9]. The Crown Commercial Service’s AI agreements and Dynamic Purchasing System (DPS) provide modular contracts that can embed outcome clauses. The ATRS mandates transparency and accountability[6], and the AI Safety Institute produces independent evaluations[22]. Together, these initiatives give the UK a “procurement trilogy” — data, transparency and evaluation — that few countries possess.
Lessons from the EU and US
The EU demonstrates how regulatory clarity can drive contractual innovation. The MCC‑AI templates show that detailed clauses for high‑risk AI are feasible[20]. They include requirements for risk assessment, human oversight, testing and corrective actions, and can be adapted by private companies and local governments alike. They also illustrate how to differentiate obligations based on risk level, reserving heavier governance for systems that directly impact rights.
The United States provides a comparator in its emphasis on risk management and human oversight. NIST’s AI RMF and OMB guidance align procurement with risk tiers and call for continuous monitoring, assessment and reporting. The voluntary nature of the framework makes it adaptable but relies on agency initiative, illustrating the trade‑off between flexibility and enforceability.
Procurement as a lever for change
The AI Now Institute argues that public procurement is a powerful lever for reshaping digital markets and breaking dependence on a handful of tech giants. It notes that with political commitment and investments through public procurement, governments can build alternative, ethical digital infrastructure rather than reinforcing existing power structures[28]. Procurement should incorporate strict criteria for sustainability, social standards and privacy, enabling small and medium‑sized enterprises (SMEs) and nonprofit initiatives to compete[29]. Taxpayer money should flow to options that are best for society, not just the cheapest[30]. These insights underscore why outcome‑based AI contracts matter: they shift power away from suppliers and towards public value.
Implications for Boards and Policymakers
- Ask the right questions – Before approving AI spend, boards should ask: What problem are we solving? How will success be measured? Does the contract require the supplier to provide system and model cards? Are independent evaluations scheduled? What are the rollback provisions?
- Embed documentation and transparency – Make system/model cards and ATRS records mandatory for every AI deployment. Require suppliers to fill out the Data Ethics Canvas or similar frameworks to document ethical considerations[31].
- Tie payments to outcomes – Link a significant portion of contract value to achieving defined KPIs. Use the metrics from the board dashboard to approve or withhold payments. Encourage performance bonuses for exceeding targets.
- Invest in evaluation capacity – Establish an independent evaluation unit or partner with bodies like the AI Safety Institute. Contract clauses should allow procurers to share models with evaluators and to require remedial actions based on findings.
- Promote SME participation – Use the Procurement Act’s fields for SME suitability and social value[10]. Set quality criteria weights and ensure they are transparent[26]. Encourage flexible procedures that allow negotiation and demonstration[32].
- Coordinate across jurisdictions – Align UK contracts with EU MCC‑AI and US OMB guidance to reduce fragmentation for suppliers operating internationally. Incorporate ISO 42001 and NIST AI RMF language to ensure global compliance and to facilitate certification.
Conclusion: Contracts as Code for Accountability
AI has the potential to transform public services, but without accountable procurement it can just as easily entrench inefficiency, inequality and monopoly. The evidence is clear: spending on AI is rising, but adoption remains immature and guidance fragmented[3]. The UK’s Procurement Act and open‑data reforms provide an unprecedented opportunity to redefine value in public-sector AI.
Outcome‑based AI contracts make governance tangible. By requiring system and model cards, independent evaluations, logging, rollback plans and outcome KPIs, boards can tie payments to performance and ensure transparency. International frameworks — OECD principles, ISO 42001 and NIST AI RMF — converge on these practices, and regional templates like the EU MCC‑AI show they can be embedded in contracts. Data from the UK’s new procurement platform reveal that process improvements are underway[9], but outcome clauses remain scarce; this is the next frontier.
Public procurement is not just a bureaucratic necessity; it is a lever for building a digital future aligned with public values. By buying results, not models, boards and policymakers can shift the AI industry towards accountability and fairness. Governance becomes real only when it can be invoiced.
Mini FAQ: Understanding Outcome‑Based AI Contracts
- Why move from buying AI models to buying outcomes?
Traditional AI procurements pay suppliers for inputs like licences or API calls, leaving no guarantee that systems deliver real‑world benefits. Outcome‑based contracts link payments to performance metrics such as accuracy, fairness, cost savings or improved service delivery. This aligns supplier incentives with public value and encourages continuous improvement rather than one‑time deployment.
- What are “system and model cards,” and why do they matter?
System and model cards are structured documents describing an AI system’s purpose, data sources, training methods, performance metrics and limitations. Requiring these cards as contract deliverables promotes transparency and makes it easier to audit the technology. They help procurers understand exactly what they are buying, and they support compliance with transparency standards like the Algorithmic Transparency Recording Standard.
- How do evaluation protocols contribute to accountability?
Evaluation protocols define when and how AI systems will be tested, both before deployment and throughout their lifecycle. Independent evaluations measure capabilities (e.g., accuracy, efficiency) and safety (e.g., resilience to misuse or bias) against clear benchmarks. By writing evaluation schedules and pass/fail thresholds into contracts, buyers can ensure that suppliers remedy issues or face penalties if the system underperforms.
- Why are logs and telemetry essential in AI contracts?
Logs record inputs, outputs and system events, enabling auditors to detect performance drift, unfair outcomes or attempts to circumvent safeguards. Telemetry data supports reproducibility and regulatory compliance, particularly with data‑protection laws. Contracts should specify retention periods, privacy protections and auditor access to make logging a tool for continuous oversight rather than a bureaucratic afterthought.
- What purpose do rollback and safeguard clauses serve?
Rollback clauses give procurers the right to pause or revert an AI system if it produces harmful or unlawful outcomes. Safeguards such as shadow modes (running the new system in parallel with the old) and feature flags (controlling which functions are active) help manage risk during deployment. Including these provisions ensures that AI use remains under human control and prevents lock‑in to unsafe or ineffective solutions.
- How can boards measure whether outcome‑based AI contracts deliver value?
Boards should track a set of metrics tied directly to contract KPIs: baseline vs. actual performance on defined outcomes, realised cost savings, frequency and severity of risk incidents, and user satisfaction. A dashboard summarising these metrics alongside contract milestones helps directors decide whether to release payments, seek remediation or terminate contracts. Without such measurement, it is impossible to know if AI spending has produced the intended returns.
- What makes the UK a potential leader in outcome‑based AI procurement?
Recent reforms—like the Procurement Act’s open data platform, the Algorithmic Transparency Recording Standard, and the AI Safety Institute’s independent evaluations—provide the infrastructure for measuring and enforcing outcomes. When combined with global standards (ISO/IEC 42001, NIST AI RMF) and EU templates, these tools allow UK bodies to embed transparency, evaluation and risk management into contracts. Early adoption can set a precedent for other jurisdictions and drive suppliers to raise their standards worldwide.
Resources:
[1] [4] [7] [8] [9] [10] [25] [26] [27] [32] UK Procurement Act implementation: what do the first three months of data tell us? - Open Contracting Partnership
[2] [3] [5] Buying AI | Ada Lovelace Institute
https://www.adalovelaceinstitute.org/report/buying-ai-procurement/
[6] Algorithmic Transparency Recording Standard - guidance for public sector bodies - GOV.UK
[11] [12] Spending wisely | Ada Lovelace Institute
https://www.adalovelaceinstitute.org/report/spending-wisely-procurement/
[13] [14] [15] ISO/IEC 42001: a new standard for AI governance
https://kpmg.com/ch/en/insights/artificial-intelligence/iso-iec-42001.html
[16] [17] [18] [19] [24] NIST AI Risk Management Framework: A tl;dr | Wiz
https://www.wiz.io/academy/nist-ai-risk-management-framework
[20] EU’s Community of Practice Publishes Updated AI Model Contractual Clauses | Inside Privacy
[21] Guidelines for AI procurement (print version) – Office for AI and World Economic Forum
[22] Advanced AI evaluations at AISI: May update | AISI Work
https://www.aisi.gov.uk/blog/advanced-ai-evaluations-may-update
[23] What are the accountability and governance implications of AI? | ICO
[28] [29] [30] VII. Public Procurement as a Lever for Change - AI Now Institute
https://ainowinstitute.org/publications/vii-public-procurement-as-a-lever-for-change
[31] Practical tools for designers in government looking to avoid ethical AI nightmares - Oxford Insights