Blog Posts

Marius Hobbhahn 1/8/24 Marius Hobbhahn 1/8/24

A starter guide for Evals

This is a starter guide for model evaluations (evals). Our goal is to provide a general overview of what evals are, what skills are helpful for evaluators, potential career trajectories, and possible ways to start in the field of evals.

Evals is a nascent field, so many of the following recommendations might change quickly and should be seen as our current best guess.

Why work on evals?

Model evaluations increase our knowledge about the capabilities, tendencies, and flaws of AI systems. Evals inform the public, AI organizations, lawmakers, and others and thereby improve their decision-making. However, similar to testing in a pandemic or pen-testing in cybersecurity, evals are not sufficient, i.e. they don’t increase the safety of the model on their own but are needed for good decision-making and can inform other safety approaches. For example, evals underpin Responsible Scaling Policies and thus already influence relevant high-stakes decisions about the deployment of frontier AI systems. Thus, evals are a highly impactful way to improve the decision-making about AI systems.

Evals are a nascent field and there are many fundamental techniques to be developed and questions to be answered. Since evals do not require as much background knowledge as many other fields, it is much easier to get started and possible to make meaningful contributions from very early on.

What are model evaluations (evals)?

Evals refers to a broad category of approaches that we roughly summarize as:

The systematic measurement of properties in AI systems

More concretely, evals typically attempt to make a quantitative or qualitative statement about the capabilities or propensities of an AI system. For example, we could ask if a model has the capability to solve a specific coding problem or the propensity to be power-seeking. In general, evals are not restricted to safety-related properties but often when people talk about evals, they mention them in a safety context.

There is a difference between red-teaming and benchmarking[1]. Red-teaming is actively looking for specific capabilities or propensities while interacting with the model. It is an attempt to answer the question “Can we find this capability in a model when we try hard to find it?”. In other words, red-teaming is an attempt to show the existence of certain capabilities/properties, but it is not trying to make a claim about how likely those are to occur under real-use conditions. Red-teaming typically involves interacting with the model[2] and actively looking for ways to elicit the desired behavior, e.g. by testing many different model inputs and strategies and actively iterating on them.

In contrast, benchmarking makes a statement about the likelihood of a model behaving in a specific way on a certain dataset, e.g. the likelihood of a behavior occurring under real-use conditions. A benchmarking effort should be designed while interacting with the model as little as possible in order to prevent overfitting to the capabilities or tendencies of any particular model.

Both red-teaming and benchmarking are important and serve a purpose. Red-teaming can provide an estimate of the potential danger of a system, e.g. whether the model can manipulate its users. Benchmarking can provide an estimate of how likely an AI system would show these tendencies under specific conditions, e.g. how likely the model is to manipulate its users in realistic scenarios. Currently, evals are often a mix between red-teaming and benchmarking but we expect the two categories to get more and more distinct.

There is a difference between capability and alignment evaluations. Capability evaluations measure whether the model has the capacity for specific behavior (i.e. whether the model “can” do it) and alignment evaluations measure whether the model has the tendency/propensity to show specific behavior (i.e. whether the model “wants” to do it). Capability and alignment evals have different implications. For example, a very powerful model might be capable of creating new viral pandemics but aligned enough to never do it in practice[3].

Currently, evals are mostly associated with behavioral measurements but they could also include interpretability/explainability tools. For example, once the technical tools are available, there could be a range of interpretability-based evals that would provide more detailed information than behavioral tests alone.

Behavioral evals have clear limitations and it’s important to keep that in mind. Every behavioral test will always be spotty and cover a small slice of the potential input space. While we can use behavioral evals to get hints at what the internal mechanisms within an AI system might be, similar to how psychology can make statements about the internal mechanisms of the brain, it is by no means as precise as good interpretability tools would be. Thus, we think of evals as a way to reduce uncertainty from very uncertain to less uncertain but to make high-confidence statements, we should not rely on evals alone.

What skills are helpful for evaluators?

The following is a list of qualities that we think are generally helpful for evaluators. Note, that they are by no means necessary, i.e. you can meaningfully contribute without having mastered these skills. Our suggestions should be seen as pointers to helpful skills rather than requirements.

LLM steering

By LLM steering[4], we mean the ability to get an LLM to do specific things. In this case, LLM serves as a placeholder for whatever the state of the art in AI systems is. Since model evaluators typically make statements about the maximum capacity of a model, working with state-of-the-art systems is required. Currently, these are language-based models but frontier models are already increasingly multi-modal. Thus, the list of suggestions below should be extended with whatever skills are required to steer state-of-the-art models and elicit their properties.

Prompting

The most obvious form of getting a model to do what we want is by prompting it in clever ways.

Thus, evaluators should know the basics of prompt design. This can include knowing a particular set of prompts that works well for a given model, knowing basic prompt pieces that can be put together to form a more efficient prompt, or even better, having a more general predictive theory of how “the model works”.

In some cases, we want to elicit properties of models that have already been fine-tuned to not show these qualities, e.g. by RLHF. Therefore, the ability to break a model or do prompt injections seems really helpful to a) show the limitations of such fine-tuning attempts and b) get the model to elicit quantities that the model doesn’t show with standard prompting techniques.

To get started, you can check out a prompting guide by Anthropic, Hugging Face, or PromptingGuide.

Playing with LLMs

In our experience, getting a “feeling for the model” is very important. This means refining your intuition for how models would typically react to many different prompts, which type of things they are good or bad at, what different strategies can be used to make them output certain texts, etc. Often, we found it hard to formalize this knowledge or transfer it between people with different levels of experience. A lot of this informal knowledge comes from “playing around” with the model, interacting with it, trying to jailbreak it, and applying new discoveries yourself (e.g. Chain of thought, Learning from Language Feedback, LM agents, etc.). While playing with the model, you often stumble upon something curious, quickly form a hypothesis, and check it with a few additional examples. This is much more uncertain than rigorous scientific research but sharpens and refines your intuitions a lot which you can then use in your scientific endeavors.

Supervised fine-tuning (SFT)

If a malicious actor wanted to get a model to act in specific (bad) ways, they would likely fine-tune the model rather than just prompting it. The same is true for a misaligned model that wanted to self-improve for nefarious purposes. Therefore, model evaluators should conduct gain-of-function research in controlled and safe environments to elicit these behaviors. Fine-tuning can also be helpful in other ways, e.g. to test how easy it is to undo the guardrails of the model or fine-tune a specialized helper model for a more complex task.

Thus, it is helpful to know how to finetune LLMs (both with API finetuning and open-source model finetuning). Relevant skills include how to use GPUs and parallelize your fine-tuning jobs, and implicit knowledge about batch sizes, learning rates, optimizers, quantization, data augmentation, and more.

To get started, you can check out guides by Lakera, Maya Akim, Hugging Face, or the OpenAI finetuning API.

RL with LLMs

In some cases, we should aim to use RL-based finetuning to elicit a particular behavior. Especially in cases where the model is supposed to show agentic behavior, RL-based finetuning seems preferable over SFT.

Therefore, it is helpful to be familiar with the combination of RL and the type of model you’re evaluating, e.g. LLMs. Useful knowledge includes how to set up the pipeline to build well-working reward models for LLMs or how to do “fake RL” that skips training a reward model and replaces it e.g. with a well-prompted LLM.

To get started, you can check out tutorials from Hugging Face, Weights and Biases, and Labellerr.

Scaffolding and LM agents

Often, we want to understand the behavior of LLMs in more complex settings than just question-answering. For example, the LLM might be turned into an “LM agent” through scaffolding, i.e. we build software and tools around the LLM that allow it to continuously act in a real or simulated environment more naturally than just an LLM alone would.

Scaffolding is a nascent field, so there is no well-established definition or methodology but it’s helpful to be good at prompt engineering and building software frameworks around your model.

We expect that LM agents will become very prominent soon, so we especially recommend trying to understand them in detail.

To get started, you can check out a blog post by Lilian Weng and METR’s paper.

Tool use

There are a couple of tools that might be helpful to know for evals, e.g. Langchain or the OpenAI evals framework. Both of these are early-stage packages and come with their own flaws but they might still be helpful in your workflow or inspire you to write better ones.

Generalist

In our experience, most evals projects iterate between two parts.

Conceptual: This can include answering questions like: what property am I trying to measure? What is the best way of measuring it? What experiments would give me strong evidence about that property? And more.
Execution: This can include coding up the experiments, running the evals on different models, evaluating the results, writing up the findings, and more.

Given the early state of the field of evals, there are a lot of straightforward questions where the conceptual part is not very complicated and most of the project depends on execution. In these cases, most of the progress comes from getting simple or medium complex tasks done quickly and less from particularly deep/formal insights.

Therefore, evaluators benefit from a generalist skillset that allows them to draw from a range of experiences and a “get stuff done” attitude that enables them to execute the low-hanging fruit experiments quickly. For example, an evaluator benefits from being willing to learn new LLM-related techniques quickly and applying them hands-on since the field is moving so quickly.

Nevertheless, the conceptual part should not be neglected. Potentially, the most relevant safety-critical evidence could come from a small number of well-designed experiments that ask exactly the right question. Those experiments can be the result of multiple weeks or months of thinking about conceptual questions or tinkering with different setups before jumping to the execution phase. Thus, the more mature the field of evals becomes, the more specialized its members will be and the more they benefit from specialized skills rather than a generalist skillset. However, in the current state of the field, a generalist skillset is very helpful. From our own experience, we expect that hands-on experience will always be extremely valuable for evals and thus recommend against specializing entirely in conceptual work.

Scientific mindset

By default, most model evaluations will have multiple potential interpretations. Thus, a beneficial skill for model evaluators is having a scientific mindset. Concretely, this means keeping alternative explanations for the results in mind and tracking potential confounders. Optimally, these plausible alternative hypotheses are then used to identify, design, and run experiments to test the potential confounders and identify the true effect.

Evals are especially prone to Goodharting, i.e. someone designs a benchmark for a specific target quantity, then people use that benchmark as the sole measure for the target at which point it ceases to be a good benchmark. Thus, a good model evaluator should aim to red-team the current suite of benchmarks and look for ways in which they are measuring the wrong proxy. Optimally, they are then able to design a wide variety of experiments that cover many different angles and are sufficiently redundant to further decrease the probability of misinterpretation.

Empirical research experience

In practice, it is non-trivial to get this scientific mindset but in our experience, you learn how to science by doing science. Concretely, this means doing research projects with a more experienced supervisor and working on scientific projects. For example,

You can work on scientific projects in your Bachelor’s or Master’s degree. Often professors look for research assistant positions. If you find something you’re broadly interested in, a research assistant position can be very helpful. At this stage, it likely doesn’t matter whether the research is closely related to evals, as long as it’s broadly in the field of ML.
You can take thesis projects seriously and aim for a (small) publication. If you’re willing to put in the effort most supervisors are likely to support you in your attempt to publish your thesis as a workshop paper or a conference publication. This typically requires you to put in significantly more effort than the thesis itself but it teaches you a lot about scientific practices and writing.
You can do programs like MATS outside of university. Often there are great benefits from being in an environment where many people are working on related projects and can discuss their experiences and findings with each other.
Potentially it is worth attempting a PhD. While people disagree about whether a PhD is necessary to be a good scientist, it is true that most people are much better scientists after doing a PhD.
It is possible to develop a scientific mindset with little or no supervision. We suggest starting with a project that is well-scoped and simple, e.g. reproduce or extend existing work. Since evals is a nascent field, many simple questions can be attempted with little or no supervision.

Especially valuable is research experience that involves different kinds of LLM steering, e.g. prompting, fine-tuning, RL with LMs, or LM agents.

Software engineering

Many tasks in evals benefit from a solid software engineering background. This can include designing scaffolding around the model, building APIs around various tasks, basic data science, basic database management, basic GUI design, and more.

While there is some helpful theoretical knowledge to be learned about software engineering (see e.g. A Philosophy of Software Design or Clean Code), we expect most benefits to come from practical experience. Fortunately, with modern LLMs, learning software engineering has become easier than ever, both because you can iterate faster and because you can use LLMs to provide feedback on your code and suggest general improvements.

Potential career paths

Most technical paths are not static. Some people who start in engineering become more and more focused on research over time and others who start as scientists focus more on engineering later. Thus, deciding in favor of one of these paths today does not mean you can’t switch in the future and most skilled model evaluators are decent at both.

Please note that it’s not at all necessary to be good at all of these skills before you can contribute. Our suggestions merely serve as a pointer for which skills are especially helpful.

Engineering-focused

The engineering spectrum can range from pure software engineering to research engineering. On the pure software side of the spectrum, the engineer would be mostly building and improving tools for the scientists in the team, and on the research side of the spectrum, they would also help design and set up experiments.

LLM steering: The core competency of an evals engineer is LLM steering. Depending on the project, this can be more focused on prompting, finetuning, or scaffolding but you’re almost certainly going to need all three of these skills at some point.
API building & usage: The ability to design and build APIs for your scaffolding or evals and to understand how APIs from various providers work and how they can be integrated into the current evals stack.
Basic data analysis & management: A large part of evals is the creation, curation, and efficient management (i.e. storage and pipelining) of data. Thus, being able to use data management systems and basic data analysis tools comes in handy.
Basics of experimental design: identify the target quantity, and design a setup that measures the target quantity while reducing the chance of measuring other proxy variables.
Plotting: this is just a basic tool that everyone working in research should have.
UI and tooling building: design and implement tools for technical and non-technical users, for example, a CLI tool with OpenAI API-like experience for finetuning custom models on internal compute, or setting up and maintaining an OpenAI Playground-like UI for quick model testing.
DevOps and Infrastructure: provisioning and managing compute resources for researchers.

Research-focused

Evals research scientists are likely more hands-on compared to other research scientist positions. Helpful skills include:

Competent research engineering is almost certainly necessary to be a good evals research scientist. Thus, the qualities of the research engineer apply. However, the focus on pure software design is lower than for a research engineer.
Experimental design: Similar to research engineering but with a stronger focus. Experience in scientific research is helpful for this skill.
Conceptual research: Beyond experimental design, it is of similar importance for a research scientist to have conceptual clarity and a high-level understanding about which experiments matter for which reasons, e.g. by having a high-level roadmap or a concrete agenda for their research.
Writing: Experimental results have to be communicated within the team, with external collaborators, and often with the wider public or regulators. Therefore, writing and communication skills are key for a research scientist position.

How to start with evals work?

There are many different ways to get started. Here, we suggest two common approaches.

Hands-on approach

One possible way to start with evals is to “Just measure something and iterate”. The broad recipe for this would be:

Pick a quantity you find generally interesting and would want to understand more deeply.
Play around with the model (e.g. in the OpenAI playground) to see if you can find simple and unprincipled ways to measure the behavior.
Abstract and formalize your testing procedure and evaluate the model more rigorously.
Identify the weaknesses and limitations of your current way of measuring.
Refine and extend your evaluations.
Iterate until you have a sound and usable evaluation.

This approach may be the right choice for you if you prefer to work on concrete projects rather than learning general skills, like to learn things on the fly and enjoy iterating on empirical feedback.

This approach is more likely to be successful if you have some mentorship. While it may be hard to get mentorship as a newcomer to the field, often researchers are responsive to a high-quality research proposal. Nevertheless, we want to emphasize that it is entirely possible to do good work entirely without mentorship and we can wholeheartedly recommend just giving it a go.

Learning general skills approach

Rather than working on a concrete project, you can also try to improve your basic evals skills in general similar to how students learn core skills in University and only apply them later.

This could, for example, entail going through a general prompting tutorial, fine-tuning different models in a supervised or RL-based fashion, building an LLM agent with scaffolding, and playing around with the existing tools that are helpful for evals. In this case, you would do all of this without having any specific evaluation in mind.

This approach may be the right choice for you if you’re very new to software or ML engineering or you have a general preference for learning basic skills before applying them.

Marius' opinion: While there is no clearly correct way to get into evals work and it depends on personal preferences and the level of skills, I strongly recommend attempting a hands-on project. I think this is the fastest way to get good at evals, gives good evidence about where your skills are and whether you enjoy the process. If you realize that you lack some of the core skills, you can still switch to learning general skills. We made the fastest progress by interacting with LLMs a lot and I generally recommend this approach even if you’re fairly new to LLMs.

Contributions

Marius Hobbhahn led this post, drafted the first version, and edited the final version. Mikita Balesni, Jérémy Scheurer, Rusheb Shah, and Alex Meinke gave feedback.

We shared this post with participants of the Apart Research Evals Hackathon on 24 November 2023 and are thankful for feedback from participants.

[1] both of which are evals

[2] This interaction can also happen automatically, e.g. in the case of automated red-teaming

[3] This also applies to the misuse case where e.g. the model has the knowledge of how to build a bomb but is aligned enough to never reveal that knowledge to a user

[4] Not to be confused with activation steering

Lee Sharkey 11/13/23 Lee Sharkey 11/13/23

Theories of Change for AI Auditing

In this post, we present a theory of change for how AI auditing could improve the safety of advanced AI systems. We describe what AI auditing organizations would do; why we expect this to be an important pathway to reducing catastrophic risk; and explore the limitations and potential failure modes of such auditing approaches.

Executive summary

Our mission at Apollo Research is to reduce catastrophic risks from AI by auditing advanced AI systems for misalignment and dangerous capabilities, with an initial focus on deceptive alignment.

In our announcement post, we presented a brief theory of change of our organization which explains why we expect AI auditing to be strongly positive for reducing catastrophic risk from advanced AI systems.

We want to emphasize that this is our current perspective and, given that the field is still young, could change in the future.

As presented in ‘A Causal Framework for AI Regulation and Auditing’, one of the ways to think about auditing is that auditors act at different steps of the causal chain that leads to AI systems’ effects on the world. This chain can be broken down into different components (see figure in main text), and we describe auditors’ potential roles at each stage. Having defined these roles, we identify and outline five categories of audits and their theories of change:

AI system evaluations assess the capabilities and alignment of AI systems through behavioral tests and interpretability methods. They can directly identify risks, improve safety research by converting alignment from a “one-shot” problem to a "many-shot problem" and provide evidence to motivate governance.
Training design audits assess training data content, effective compute, and training-experiment design. They aim to reduce risks by shaping the AI system development process and privilege safety over capabilities in frontier AI development.
Deployment audits assess the risks from permitting particular categories of people (such as lab employees, external auditors, or the public) to use the AI systems in particular ways.
Security audits evaluate the security of organizations and AI systems to prevent accidents and misuse. They constrain AI system affordances and proliferation risks.
Governance audits evaluate institutions developing, regulating, auditing, and interacting with frontier AI systems. They help ensure responsible AI development and use.

In general, external auditors provide defence-in-depth (overlapping audits are more likely to catch more risks before they’re realized); AI safety-expertise sharing; transparency of labs to regulators; public accountability of AI development; and policy guidance.

But audits have limitations which may include risks of false confidence or safety washing; overfitting to audits; and lack of safety guarantees from behavioral AI system evaluations.

The recommendations of auditors need to be backed by regulatory authority in order to ensure that they improve safety. It will be important for safety to build a robust AI auditing ecosystem and to research improved evaluation methods.

Table of Contents

Introduction

Frontier AI labs are training and deploying AI systems that are increasingly capable of interacting intelligently with their environment. It is therefore ever more important to evaluate and manage risks resulting from these AI systems. One step to help reduce these risks is AI auditing, which aims to assess whether AI systems and the processes by which they are developed are safe.

At Apollo Research, we aim to serve as external AI auditors (as opposed to internal auditors situated within the labs building frontier AI). Here we discuss Apollo Research's theories of change, i.e. the pathways by which auditing hopefully improves outcomes from advanced AI.

We discuss the potential activities of auditors (both internal and external) and the importance of external auditors in frontier AI development. We also delve into the limitations of auditing and some of the assumptions underlying our theory of change.

The roles of auditors in AI

The primary goal of auditing is to identify and therefore reduce risks from AI. This involves looking at AI systems and the processes by which they are developed in order to gain assurance that the effects that AI systems have on the world are safe.

To exert control over AI systems’ effects on the world, we need to act on the causal chain that leads to them.

We have developed a framework for auditing that centers on this causal chain in ‘A Causal Framework for AI Regulation and Auditing’ (Sharkey et al., 2023). For full definitions of each step, see the Framework. Here, we briefly describe what auditors could concretely do at each step in the chain. Later, we’ll examine the theory of change of those actions.

*The causal chain leading to AI systems’ effect on the world, as presented in* *Sharkey et al. (2023)*.

Affordances available to AI systems

Definition: The environmental resources and opportunities for influencing the world that are available to an AI system. They define which capabilities an AI system has the opportunity to express in its current situation.
What auditors can do: For each proposed change in the affordances available to an AI system (such as deployment of the AI system to the public, to researchers, or internally; giving the AI system access to the internet or to tools; open sourcing a AI system), auditors can perform risk assessments to get assurances that the change is safe. They can also ensure that AI systems have sufficient guardrails to constrain the affordances available to them.

Absolute capabilities and propensities of AI systems

Definition: The full set of potential capabilities of an AI system and its tendency to use them.
What auditors can do: Auditors can perform AI system evaluations to assess the dangerous capabilities and propensities of AI systems. They can do this during or after training. They may perform gain of function research in order to determine the risks that AI systems may pose when they are deployed broadly or if they proliferate through exfiltration. Auditors can also perform risk assessments prior to experiments that would give AI systems additional capabilities or change their propensities. Auditors can also be involved in ensuring that there exist adequate action plans in the event of concerning AI system evaluations.

Mechanistic structure of the AI system during and after training

Definition: The structure of the function that the AI system implements, comprising architecture, parameters, and inputs.
What auditors can do: Auditors can perform research to incorporate interpretability into AI system evaluations (both capabilities and alignment evaluations) as soon as possible. Such mechanistic explanations give better guarantees about AI system behavior both inside and outside of the evaluation distribution.

Learning

Definition: The processes by which AI systems develop mechanistic structures that are able to exhibit intelligent-seeming behavior.
What auditors can do: Auditors can evaluate risks of AI systems before, during, and after pre-training and fine-tuning training-experiments. Auditors could potentially perform incentive analyses and other assessments to evaluate how AI systems’ propensities might change during training. Auditors can help assess the adequacy of input filters of AI systems to help avoid dangerous in-context learning. They can also help filter retrieval databases. Filters for inputs or retrieval databases may help prevent AI systems being taught potentially dangerous capabilities through in-context learning.

Effective compute and training data content

Definition: Effective compute is the product of the amount of compute used during learning and the efficiency of learning; training data content is the content of the data used to train an AI system.
What auditors can do:
- Effective compute: Auditors can help to ensure that labs are compliant with compute controls, if they are in place. Auditors can also conduct risk assessments concerning the scaling up of AI systems, perhaps based on evaluations of smaller AI systems in the same class. Open publication of algorithmic efficiencies may lead to proliferation of effective compute; being technically informed independent experts, auditors could help regulators assess whether certain dual-use scientific results should be made publicly available.
- Training data content: Auditors can ensure that training data don’t contain potentially dangerous or sensitive content.

Security

Definition:
- Security from attackers: Information security, physical security, and incident response protocols in the organizations developing and hosting AI systems.
- Preventing misuse of AI systems from AI system vulnerabilities: Resistance of AI systems to prompt injection attacks, jailbreaking, and malicious use.
What auditors can do:
- Security from attackers: Auditors can evaluate and test the security of organizations interacting with AI systems and the computer systems they run on. They can help ensure compliance with information security standards through reviews and perform red-teaming. Given the security requirements of a potentially strategically valuable dual-use technology, military-grade security, espionage protection, and penetration testing may be required. Maximal security necessitates government involvement in security audits. Auditors can also assess that actors with access to AI systems have appropriate levels of access and no more (e.g. through assessing API security or know-your-customer protocols). Auditors may also be involved in research efforts that develop security-relevant infrastructure such as structured access APIs or hardware that ensures compliance with compute regulations and safety standards. Furthermore, they can assess the adequacy of institutions' incident response plans and whistleblower protections.
- Preventing misuse of AI systems through AI system vulnerabilities: Auditors can help assess the adequacy of AI systems’ (and filters’) resistance to prompt injection, jailbreaking, or malicious use through red-teaming to identify vulnerabilities. Auditors can work with other actors to establish bug bounties for finding and reporting vulnerabilities and dangerous capabilities.

Deployment design

Definition: Deployment designs are the plans made for deploying certain AI systems. They determine who has access?; when do they get access?; and what do they have access to?
What auditors can do: Auditors can assess risks from different modes of deployment for each AI system to be deployed and ensure that any regulation regarding deployment is upheld.

Training-experiment design

Definition: A training-experiment is the technical procedure by which an AI system is developed. Design decisions for the training-experiment include data selection and filtering; model architecture and hyperparameters; choice of deep learning framework; hardware choices; the amount of compute that will be used; the algorithms used; evaluation procedures; safety procedures; the affordances made available to the AI system during training; the properties of different phases of pre-training and fine-tuning; whether to train online or offline; etc.
What auditors can do: Auditors can perform risk assessments on the design decisions for training-experiments. These may be performed prior to training, fine-tuning, or inference (as applicable to the experiment). Auditors can also be involved in assessing the adequacy of labs’ alignment plans to ensure they are in line with public safety.

Governance and institutions

Definition: The governance landscape in which including AI training-experiment, deployment, and security decisions are made, including institutions, regulations, and norms.
What auditors can do: Auditors can map the roles and responsibilities of different actors involved in frontier AI development, assess the adequacy of incentive structures, and make recommendations to regulators regarding governance landscape structure.

Miscellaneous roles

Beyond roles of auditors relating directly to the above causal chain, additional general functions of auditors include:

Establish technical standards and guidelines: Working together, auditors and labs may be better placed to establish safety-oriented standards and guidelines for deployment or training-experiment design than either party alone. This is partly because external auditors don’t have a direct profit incentive to further AI progress as fast as possible and are thus relatively more incentivised toward safety than e.g. frontier AI labs. Furthermore, external auditors have insights into many different AI efforts, whereas labs typically have access only to their own. Auditors may therefore be able to provide a more holistic picture.
Education and outreach: The technical expertise of external auditors can be used to assist policymakers, researchers in capabilities labs, and the general public. For instance, they could inform policy makers on risks from particular dangerous capabilities or developers how to build agents with protective guardrails.
Research: Because AI systems, institutions, practices, and other factors are continuously changing, auditors may need to constantly research new methods to gain assurances of safety.

It seems desirable that different auditing organizations specialize in different functions. For instance, security audits may best be handled by cybersecurity firms or even intelligence agencies. However, it is important for safety that auditing tasks are done by multiple actors simultaneously to reduce risks as much as possible.

Theory of Change

Different kinds of audits could examine different parts of the causal chain leading to AI systems’ effects on the world. We identify five categories of audits: 1) AI system evaluations; 2) Training-experiment design audits; 3) Deployment audits; 4) Security audits; and 5) Governance audits. Each category of audit has different theories of change:

AI system evaluations
AI system evaluations look at behaviors expressed by the AI system; capabilities and propensities of AI systems (during and after training); the mechanistic structure of AI systems; and what the AI system has learned and can learn.
We assess AI system evaluations as having direct effects; indirect effects on safety research; and indirect effects on AI governance.
Direct effects: If successful, AI system evaluations would identify misaligned systems and systems with dangerous capabilities, thus helping to reduce the risk that such systems would be given affordances that let them have damaging effects on the world (e.g. deployment). Notably, audits do not need to be 100% successful to be worthwhile; finding some, even if not all, flaws already decreases risk (though see section Limits of auditing). Beyond behavioral AI system evaluations, Apollo Research also performs interpretability research in order to improve evaluations in future. Interpretability also has additional theories of change.
Indirect effects on safety research: Adequate AI system evaluations would convert alignment from a ‘single-shot’ problem into a ‘many-shot’ problem. In a world without extensive evaluations, there is a higher chance that a frontier AI lab deploys a misaligned AI system without realizing it and thus causes an accident, potentially a catastrophic one. In this case, the first “shot” has to be successful. By contrast, in a world with effective evaluations, labs can catch misaligned AI systems during training or before deployment; we would therefore get multiple “shots” at successfully aligning frontier AI AI systems. For instance, reliable AI system evaluations may give us evidence if any specific alignment technique succeeds in reducing a AI systems’ propensity to be deceptive. This would have important implications for the tractability of the alignment problem, since it would enable us to gather empirical evidence about the successes or failures of alignment techniques in dangerous AI systems without undue risk. Ultimately, successful AI system evaluations would let us iteratively solve the alignment problem like we would most other scientific or engineering problems.
Indirect effects on AI governance: AI system evaluations could provide compelling empirical evidence of AI system misalignment ‘in the wild’ in a way that is convincing to AI system developers, policymakers, and the general public. For example, AI system evaluations could be used to demonstrate that an AI system has superhuman hacking capabilities or is able to manipulate its users to gather relevant amounts of money. Such demonstrations could encourage these stakeholders to understand the gravity of the alignment problem and may convince them to propose regulation mandating safety measures or generally slowing down AI progress. Auditors likely have a good understanding of what frontier AI systems are capable of and can use their more neutral position to inform regulators.
Indirect effects on distribution of AI benefits: In order to reap the potential benefits from AI, it must be (safely) deployed. Assuming audits can be done effectively, auditing derisks investments, potentially leading to more investments in the area and thus greater benefits. By catching failures before they happen, auditing may be able to avoid accident scenarios, which have harmed public confidence in nuclear technology. Effective audits may also increase public trust in the technology, leading to wider spread use.
Training-experiment design audits
AI system development audits look at effective compute, training data content, and training-experiment design decisions. They also look at the design of AI system training-experiments, which help determine the previous factors.
The primary means of impact of AI system development audits is that they reduce the risk of dangerous AI systems coming into existence in the first place and reduce the danger posed by AI systems. They aim to achieve this by controlling which capabilities AI systems have (to avoid dangerous ones), the extent of their capabilities, and their propensities to use dangerous capabilities. By embedding safety into the AI system development process, AI system development audits may help place safety at the center of labs’ work rather than as an afterthought to increasing capabilities.
Deployment audits
Deployment audits concern proposals for the deployment of particular AI systems.
The overall means of impact is that they should prevent systems from being deployed in ways that contravene regulations or that are deemed too risky. Note that these pathways are separate from AI system evaluations. The results of AI system evaluations should inform risk assessments in deployment audits. They should aim to assess risks from giving particular kinds of AI system access (e.g. access to inference; access to fine-tuning; access to weights) to particular kinds of people (e.g. deployment to the public; internal deployment; deployment in certain countries). They should also assess risks from making particular kinds of affordances available to AI systems, for instance internet access or access to particular kinds of software.
Deployment audits aim to ensure that AI systems are not intentionally given excessive available affordances; by contrast, security audits aim to reduce the risk that they are given excessive available affordances unintentionally.
Security audits
Security audits assess the security of AI systems and the security of the organizations developing, hosting, and interacting with them. The overall purpose is to limit the affordances made available to highly capable AI systems unintentionally, thus reducing accident and misuse risks, both of which are extremely important for such a transformative and dual-use technology. They reduce the risk of AI system proliferation either through accidental leaks or exfiltration, either by internal or external actors. By assessing how well AI systems have been ‘boxed’, they also reduce the risk of AI systems exfiltrating themselves. They also aim to assess the adequacy of damage control measures in the event of security or safety failures.
Governance audits
Governance audits look at the structure of the institutions developing, regulating, and auditing AI systems (and the interactions between those institutions) to ensure that they are conducive to safety.
They aim to ensure that organizations have proper mechanisms in place to make informed, ethical, and responsible decisions regarding the development and deployment of AI systems. While other audits aim to ensure that AI systems are aligned or that they’re used for aligned purposes, governance audits aim to ensure that alignment with human values extends to the institutions wielding and managing these AI systems. Their path to impact is that they can identify problems in the governance landscape, thus making it possible to rectify them.

Theories of change for auditors in general

In addition to theories of change for each individual category of audit, there are also multiple theories of change for auditing in general:

Buying time for safety research: Auditing might delay the deployment of existing AI systems and potentially prevent or delay the beginning of training of new ones. This could lead to more time for other alignment research. This would buy time for research that is applicable to more and more capable AI systems.
Instilling safety norms in AI development: If an AI lab knows that they’re going to be audited and potentially pay a cost (financial, reputational, or otherwise) if they fail the audit, they might be more incentivised to instill stronger safety norms and be more cautious around training and deploying new AI systems. Potentially, the existence of auditors alone may already increase safety slightly.
Public messaging about safety risks: Companies choosing to or being required to be audited sends a clear message that this technology is potentially dangerous.

Theories of change for external auditors in particular

External auditors, as opposed to internal auditors at the labs developing frontier AI, have additional pathways to impact:

Incentives are more aligned with the public benefit: External auditors are more independent than lab-internal audits and have less conflicting incentives (although there are some perverse incentives, which we hope to discuss in a future post). Even when labs are well-intentioned, social dynamics might reduce the efficacy of internal audits. For example, internal auditors may show anticipatory obedience or be more lenient because they don’t want to be perceived as slowing down their colleagues.
Defense in depth: Multiple independent audits help reduce the probability of failure. In general, the more uncorrelated methods of risk reduction we can use on the problem the better.
Subsidizing research: Depending on the funding landscape for AI auditing, if the auditing industry is profitable then profits can be used to fund research on improved audits and other alignment research. Since audits are their primary purpose, external auditors have a greater incentive to conduct such research relative to capabilities compared with labs developing frontier AI.
Increasing transparency: External auditors can potentially be more transparent about their own governance or their standards when auditing than lab-internal auditors. For instance, external auditors may be able to publish general details of their auditing process and methods which larger labs, perceiving themselves to be in a greater competition with other labs, may not be incentivised or feel able to do.
Sharing expertise and tools: Independent organizations, such as auditors and regulators, can pool best practices, standards, expertise, and tests across different centers of expertise. Due to competition and antitrust concerns, each lab’s internal auditing team can likely only work with their own AI systems while an external auditor can get a bird’s eye view and gets significantly more experience from working with AI systems of multiple labs. Furthermore, an external organization can specialize in AI auditing and thus build scalable tools that can then be applied to many AI systems. Additionally, if auditors summarize and share (nonsensitive) safety-relevant information between labs, it will likely disincentivize race dynamics by drawing attention to common safety issues and making it apparent to labs that others aren’t racing ahead irresponsibly.
Monitoring behaviors across labs: Since external auditors may interact with multiple labs, they can compare the safety culture and norms between them. In case a lab has an irresponsible safety culture, this can be flagged with that lab’s leadership and regulators.
Collaboration with regulators: A healthy auditing ecosystem with multiple competent auditors can assist regulators with technical expertise and allows regulations and standards to be quickly designed and implemented
Lobbying for good regulation: An external auditor is also an independently interested player in pushing for and setting regulation on labs and for policy work while internal audit teams are likely to be much more controlled by the policy interests of their host labs. This comes with risks, too: A potential incentive of an auditing organization is to lobby for more regulation rather than good regulation. We think currently, however, there is a large undersupply of regulation in AI and so this is likely to be a net positive for the foreseeable future.
Information sharing: Trusted external auditors can get a bird's eye view of progress, risks and good practices across AI labs. If they summarize and share (nonsensitive) parts of this publicly, it will likely disincentivize race dynamics by drawing attention to common safety issues.

Limits of auditing

We are aware of some of the limits of AI auditing, as well as the perverse incentives of auditing organizations that could both serve to reduce the positive impact that auditing efforts might have.

False confidence and safety-washing: It is possible that audits may be net-negative by providing false confidence in the safety of AI systems. This may be due to accidental misjudgement about the extent of the assurance that particular evaluations can provide. False confidence may also arise due to frontier AI developers intentionally using audits as fig leaves for the risks posed by AI systems, i.e. safety-washing. We believe that by careful communication from auditors and improving evaluation methods (e.g. interpretability), these risks can be minimized and auditing will be strongly net positive. Licensing of auditors may also mitigate the risk of lax auditors awarding undeserved seals of approval.
Using auditors as excuses to absolve labs from safety responsibilities: One plausible outcome of auditing is that labs may feel licensed to neglect their duties from doing their own safety work. In the event of accidents, they may be able to use auditors as legal shields to protect themselves from liability. The very existence of auditors may make labs feel like they need to do less work on making their AI systems safe, since they can rely on auditors to catch issues. Establishing where responsibility and liability should fall will require more consideration.
No strong guarantees for behavioral AI system evaluations: Current AI system evaluation methods are primarily behavioral because existing interpretability tools are very limited. Therefore, it seems that, at best, auditing methods can currently only prove the presence of dangerous capabilities rather than their absence. As such, current audits cannot make strong guarantees of safety beyond probabilistic statements about how hard it is to elicit specific behaviors. With more research and understanding of how deep networks function internally, it may be possible to get stronger guarantees.
Goodhart’s law: When auditors use a range of tests to ensure an AI system’s safety, these tests themselves become a target, i.e. either directly or indirectly labs will train their AI systems to pass these tests. This can be partly mitigated by withholding information about some audits from auditees.
Overfitting to audits: Even without directly training on audits, there is a natural selection process where the AI systems that pass all tests are potentially especially competent at deception and thus more dangerous. Hopefully, extensive and robust AI system evaluation suites will be able to reduce this problem but, in principle, this evolutionary process will always remain.
Limited constructive solutions at present: Until the state of AI system evaluations research improves, AI auditors likely won’t be able to provide solutions to problems concerning the safety of AI systems, i.e. they can say that a AI system should not be deployed because it has demonstrated unsafe properties, but they can’t immediately tell the lab how to solve that problem. However, we think good auditing likely requires an understanding of the processes that generated the problem. Thus, we expect that auditors may eventually be able to provide constructive recommendations.
Audit recommendations need authority: For auditing to be effective, it is necessary that the recommendations of auditors have the ability to meaningfully stop a AI system from being deployed in case they find evidence of dangerous capabilities. Currently, auditors can only make non-binding recommendations and the frontier AI labs ultimately decide whether to act on them or not. In the long run, when more capable AI systems can produce catastrophically bad outcomes, regulators (acting on the recommendations of auditors) should have the ability to enforce compliance of safety standards in labs.
Perverse incentives: Auditing as a field has perverse incentives that can distort and manipulate the auditing process. For example, if auditing organizations depend on a few major customers (which is likely the case for frontier AI risks since there are only a handful of leading labs), there is a clear incentive to become sycophantic to these labs out of fear that they lose a large chunk of their revenue. Similar dynamics could be seen in the financial auditing industry before the 2008 financial crisis. We believe that this problem can largely be mitigated by auditing regimes where labs do not choose their auditors, but instead regulators do.

Assumptions for impact of auditing

Our theory of change makes some assumptions about AI threat models and how the future is likely to play out. If these assumptions are incorrect, then it is not clear that auditing will be a good marginal investment of talent and time, or else the auditing strategy will have to change significantly:

Regulations demand external, independent audits: Currently we see that there is general goodwill inside leading frontier AI labs towards AI safety audits. As AI systems become more capable and risky, they both become potentially more profitable to deploy while simultaneously becoming potentially riskier. This establishes a basis for friction between frontier AI labs, who are more strongly incentivised to deploy, and auditors, who are more incentivised to mitigate risks. In the long term, if frontier AI labs get to choose their own auditors, then incentives drive a race to the bottom in terms of auditing costs, which by proxy means a race to the bottom in terms of safety. This race to the bottom can mostly be avoided by ensuring that frontier AI labs are not responsible for selecting their own auditors. It may also be mitigated through consensus on auditing standards and auditing regulations that are enforced.
Regulations demand actions following concerning evaluations: If the recommendations of auditors don’t lead to interventions that improve safety, there is not much point in doing audits. To avoid uncooperative frontier AI development labs proceeding unsafely, auditing should have regulatory backing and there should be specific interventions that are enacted following particular evaluation results.
Prosaic AI alignment is possible: The path to impact of auditing assumes that working with current AI systems, detecting and evaluating their failure modes, and pursuing research directions such as interpretability and human preference-learning for large AI systems are productive and useful directions to solve alignment. If there is either some fundamental impossibility to align frontier AI or there are large, sudden, unpredictable jumps in capabilities that yield AI systems that can fool all auditing techniques at once, then auditing will not be effective.
Auditing is not prohibitively expensive: First, if auditing turns out to be expensive and slow, then, unless forced to by regulation, most frontier AI labs are unlikely to engage in the practice in a meaningful way. In the long run, we expect that auditing and AI system evaluations will have to be very extensive but it is worth paying the cost because the AI will be deployed in large parts of the economy. Second, we suspect future AI systems might have much better online learning capabilities. This means that, in addition to audits before deployment, we might also have to regularly audit AIs during deployment. If the changes in capabilities between deployment audits are too drastic and fast, auditing can’t provide a practical solution. Third, auditing organizations need to be sufficiently well resourced in order to carry out their audits, including potentially to pay legal costs surrounding safety audits that are controversial with AI development companies. Overall, to address this concern, we suggest that safety-aligned auditor funding mechanisms be an urgent policy research area.
There is a safe environment for developing AI systems prior to auditing: If audits only occur once the base model has been trained, then for audits to be effective it would need to be the case that the AI system cannot exfiltrate itself during the training process.
Relatively centralized access to AI system: If the most powerful AI systems can only be created by a handful of organizations, auditing them is relatively easy. In a world where everyone can train their own AI systems or strongly modify existing AI systems with little effort, it will be much harder (but not impossible) to effectively audit all of them. Preventing catastrophic risk in such a world will likely require very strong regulatory backing.
Progress in interpretability: We think good interpretability tools would allow us to make much stronger statements about the safety of AI systems because they give us a better understanding of the internal reasons for a given behavior. However, current interpretability tools seem insufficient to allow for the needed level of explainability for convincing guarantees of safety. More progress is needed, which is why Apollo Research is investing in fundamental interpretability research.
We do not train against the audits: If all AI system evaluations are made public, then there might be pressure to directly train or indirectly optimize (i.e. through selection effects) AI systems to pass audits, even though they could still be unsafe. Potential solutions would be to keep ‘test sets’ of AI system evaluations that are not known to AI system developers. We would not want to use these test sets frequently, since frequent use may establish another optimization process from population-level selection effects.

Contributions

Lee Sharkey led the project and edited the final version. Marius Hobbhahn contributed significantly to all parts other than “the roles of auditors” section. Beren Millidge contributed to an early draft of the post. Dan Braun, Jeremy Scheurer, Mikita Balesni, Lucius Bushnaq, Charlotte Stix, and Clíodhna Ní Ghuidhir provided feedback and discussion.

Ch Stix 10/11/23 Ch Stix 10/11/23

Recommendations for the next stages of the Frontier AI Taskforce

We recently composed and shared two policy recommendations with the Frontier AI Taskforce (“Taskforce”), in light of their mission to “ensure the UK’s capability in this rapidly developing area is built with safety and reliability at its core”. Our recommendations address the role of the Taskforce as a potential future (1) Regulator for AI Safety, and encourage the Taskforce to (2) Focus on National Security and Systemic Risk.

Apollo Research is an independent third-party frontier AI model evaluations organisation. We focus on deceptive alignment, where models appear aligned but are not. We believe it is one of the most important components of many extreme AI risk scenarios. In our work, we aim to understand and detect the ability of advanced AI models to evade standard safety evaluations, exhibit strategic deception and pursue misaligned goals.

We recently composed and shared two policy recommendations with the Frontier AI Taskforce (“Taskforce”), in light of their mission to “ensure the UK’s capability in this rapidly developing area is built with safety and reliability at its core”. Our recommendations address the role of the Taskforce as a potential future (1) Regulator for AI Safety, and encourage the Taskforce to (2) Focus on National Security and Systemic Risk.

We are publishing our recommendations because we believe that a collaborative and productive exchange in this nascent field will be vital to a successful outcome and a flourishing future with safe AI systems. We look forward to engaging with the Taskforce and others in the ecosystem to bring these, and other approaches, to fruition.

Recommendation (1). Serve as a Regulator for AI safety

International efforts to address the governance of frontier AI systems are fast evolving and range from regulation, standard development, risk-management frameworks to voluntary commitments. The upcoming AI Safety Summit will, among other topics, discuss “areas for potential collaboration on AI safety research, including evaluating model capabilities and the development of new standards to support governance”. We therefore want to focus on the role of the Taskforce in a scenario where the UK puts forward a regulatory regime addressing frontier AI models.

Private actors look to regulators to produce guidelines, best practices or technical standards. However, the government often lacks the technical expertise to design these in accordance with the current state-of-the-art and relies heavily on input from outside of government. The Taskforce is in a rare position where it has access to the technical expertise to design and assess model evaluations, as well as the institutional backing to help implement them. We recommend that the Taskforce should eventually be given the role of a regulatory body, morphing into an agile and technically informed regulator suitable for a future with frontier AI systems. We detail five incremental steps towards this outcome below.

First, we recommend that the Taskforce should ambitiously act as a hub for nascent evaluation efforts, nationally and internationally. For example, in the UK, this could mean coordinating between private companies, regulators and sector-specific experts on sensitive efforts such as those outlined in (2). Moreover, in taking on such a role, the Taskforce would foster and oversee a stronger UK AI assurance ecosystem. Internationally, the Taskforce should act as a leader and primary resource on this topic, advising and supporting efforts on frontier model evaluations by international governments.

Act as hub for nascent evaluation efforts: the Taskforce should use its leading position in the AI evaluations ecosystem to coordinate, inform and guide relevant actors, nationally and internationally.

Second, we recommend that the Taskforce is well placed to call for novel research proposals and issue research grants to help build the technical ecosystem and act as a multiplier to nascent efforts.

Fund relevant external actors: the Taskforce could solidify its central role in the ecosystem by supporting AI safety researchers through the provision of research grants for open-ended research or contracts for well-scoped research questions, for example research on how to evaluate state-of-the-art models for their hacking abilities. These grants should be open to academics, NGOs and the private sector.

Third, we recommend that the Taskforce should be given a mandate to research, collate and issue technical guidance within its field of expertise for upcoming regulatory and policy efforts. We would welcome a statutory duty to respond to such advice provided by the Taskforce similar to that in the Climate Change Act 2008.

Technical Advice: the Taskforce should be included in key decision-making processes by the government on the development and implementation of model evaluation frameworks. Its technical recommendations on model evaluations and AI safety should carry a statutory duty for the govenrment to respond to recommendations from the Taskforce, such as outlining their plans for implementing advice or reasons for rejecting it.

Fourth, we recommend that the Taskforce should be empowered to carry out frontier AI model reviews and related inspections. As an area of priority, we would recommend that the Taskforce develops and carries out reviews that require state backing, such as those in the areas of national security or biosecurity. This may benefit from the Taskforce’s ability to gather intelligence from their work and apply that to setting up reviews according to expected risk profiles of AI models.

Reviews and inspections: the Taskforce is well placed to develop and undertake reviews and inspections. The Taskforce’s mandate should be to ensure the safety of the highest risk AI systems. Some of this work may benefit from contracting subject experts from other organisations.

Finally, building on the previous steps, we suggest that the Taskforce could be granted the power to handle the eventual licensing of private actors. We expect that the Taskforce will be a trusted entity that would be well placed to convene the technical expertise and community consensus needed to handle a licensing regime of private actors. This would further contribute to an agile UK AI assurance ecosystem.

Licensing of private actors: the Taskforce should be endowed with the capacity and power to handle the licensing of private actors on behalf of the government, should relevant licensing schemes or requirements be made necessary in the future.

Recommendation (2). Focus on National Security and Systemic Risk

Frontier AI models have the capacity to undermine multiple areas of national security, for example, by creating novel pathogens or by circumventing state-of-the-art information security. Their potentially ubiquitous integration could cause unexpected failures in state-wide infrastructure, such as energy or communications, and lead to unknown, likely catastrophic, consequences should deceptively aligned AI systems be developed and deployed.

We applaud the Taskforce for setting up an expert advisory board spanning AI Research and National Security. The Taskforce’s impressive advisory board and close proximity to the UK government makes it an excellent candidate to work on areas relevant to national security. We recommend that the Taskforce should focus on national security and systemic risk, harnessing its unique access, capacity and expertise.

Focus on national security and systemic risk evaluations: the Taskforce is uniquely well positioned to develop technical demonstrations and model evaluations in the areas of e.g. biosecurity, cybersecurity, misinformation, psychological operations, political persuasion.

We recommend that this focus on national security and systemic risk evaluations should be accompanied by a mandate to research and develop adequate interventions for matters of national security and systemic risk. This mandate will require close collaboration with relevant UK government departments.

Mandate for adequate interventions for matters of national security and systemic risk: we expect that there is a variety of adequate interventions the Taskforce would be best placed to be in charge of and we encourage the Taskforce to work closely with the UK government and their advisory board to identify and develop these. We put forward two specific examples, one example indicating an intervention point prior to a critical incident and one example indicating an intervention point post a critical incident:

Screening and evaluating sensitive information: the Taskforce should be empowered to establish a confidential process in collaboration with frontier AI labs and other relevant actors, through which it can screen and halt the publication of strategically sensitive information.
Incident responses: the Taskforce should work closely with relevant government departments to develop and implement a suite of incident responses. As a first step, we suggest that the Taskforce should partner with GCHQ and SIS to develop response plans and capabilities to be used in cases of e.g. illegal model proliferation.

Clíodhna Ní Ghuidhir 10/4/23 Clíodhna Ní Ghuidhir 10/4/23

The UK AI Safety Summit - our recommendations

We think AI model evaluation is a vital lever for mitigating risks from the most capable AI systems and are excited that one objective for the UK AI Safety Summit is “areas for potential collaboration on AI safety research, including evaluating model capabilities and the development of new standards to support governance”.

What commitments we would like to see arise from the AI Safety Summit

AI is a global technology both in how it is developed and the spread of risks, so we hope the Summit will lead to commitments to explore international alignment on:

comprehensive risk classification for AI systems based on harmful outcomes they can create, so that AI developers and users understand how their actions may affect the risk profile of the AI system
responsibilities of all actors across the AI value chain, including developers of narrow and general purpose AI, those building services on top of these systems, and evaluators
pursuing cooperation on compliance and enforcement activities where requirements - regulatory or otherwise - are not adhered to or harms occur. This assures that conscientious actors know they will not be out-competed by those who take shortcuts or act negligently

These commitments are important because they determine regulation and testing requirements for AI systems, enable innovation by clarifying compliance actions for AI developers and give confidence to consumers and the public. This will also help evaluators like Apollo Research focus our efforts on the models that pose the greatest risks.

Ensuring international AI safety commitments are technically sound

It is important for these commitments to be technically informed so that the ambitious course set by the Summit achieves its goals and enables safer AI innovation. So a while back we proposed to the Frontier AI Taskforce a range of activities before and during the Summit which we believe will empower delegates with appropriate knowledge, thus contributing to the Summit’s five objectives.

We think these are important topics for the Summit this November and international meetings on AI safety in the coming 1-2 years. So we have thought about how to convene and get value from these sessions in greater depth than outlined below, and would welcome any further questions or interest from people convening events on AI safety.

Before the Summit:

1) Develop educational materials for policymakers on embedding safety across the life-cycle of an AI system: from prior-to-training risk assessments to staged release.

During the Summit:

2) Provide demonstrations of a range of significant AI safety risks and how evaluations can identify and help address them.

3) Provide a presentation on how governance institutions can use compute benchmarks to effectively identify AI systems requiring the most thorough oversight including evaluations

1) Develop educational materials for policymakers on embedding safety across the life-cycle of an AI system

We think that a whole life-cycle approach is necessary for managing risks from AI systems, but there is limited technical consensus on how this should work in practice. This makes this topic very challenging to present at the Summit, or other events for non-technical policy audiences.

We therefore propose that Frontier AI Taskforce uses its standing as an independent organization of technical experts to convene a range of perspectives, draw out key insights, and present this as educational materials for Summit attendees ahead of the Summit. By sharing these materials with UK delegates and (if relevant) those of other states, they can be better equipped to discuss the promise, and potential downsides, of such interventions and the actions required of different stakeholders.

To enable a focused conversation, we propose a focus on two high impact areas of debate: i) prior-to-training risk assessments and ii) staged release, i.e. how available the AI system is to people and how connected it is to other systems.

What a roundtable session could look like:

We suggest two interconnected roundtables to convene experts and produce findings which can be turned into educational material.

We expect the Taskforce will be able to build consensus on some of these topics, but that there is also value in mapping out areas where experts disagree and further research is needed. The output should be a brief no more than four pages long summarising areas of consensus and debate, and any policy implications these might carry for the Summit and beyond.

Learning objective for educational materials:

The opportunity for preventing harms from highly capable AI systems through embedding evaluation across the life-cycle, especially upstream in pre-training risk assessments
The benefits and drawbacks of a staged approach to deployment and increasing an AI system’s available affordances (i.e. the environmental resources and opportunities for affecting the world that are available to a model)

Roundtable 1: how can prior-to-training risk assessment be a lever for avoiding AI risks. Recommended discussion points include:

which AI systems need the most rigorous risk assessments before their development begins, by whom, and whether further safeguards are needed downstream;
which aspects of an AI system’s risk are determined by decisions about its training, and which are the priority risks to identify during a prior-to-training risk assessment;
options for implementing risk assessments, and what cost-to-benefit trade-offs each creates; in particular, what benefits and issues might arise from a coordinated international approach on risk assessments (or just coordination between like-minded countries)

Roundtable 2: what are the benefits and costs of staged release. Recommended discussion points include:

how can staged release improve AI safety through creating iterative reviews of capabilities and risks: both in theory and with regards to lessons learned from releases of advanced AI or other software (where relevant);
what the priority risk points are in each stage of a model’s deployment and / or change in available affordances that will require risk assessment and testing;
options for concretely implementing staged release and what cost-to-benefit trade-offs each creates, particularly:

opportunity cost of placing ‘speed bumps’ in the way of access to beneficial AI Vs. avoiding harm, and;
the benefits / issues in global governance which might arise from a coordinated approach or indeed a diverging approach between like-minded countries

2) Demonstrations of a range of significant AI safety risks and how evaluations can identify and help address them

We think that demonstrating how evaluations can identify and address a range of significant AI safety risks will help build a shared understanding of the risks posed by frontier AI, and the need for international collaboration . This will directly support the UK’s aim to evaluate frontier models within its own safety agenda and give direction as to what international frontier AI safety collaboration should focus on.

We think the Summit, combined with the portfolio focus on evaluation by the Frontier AI Taskforce, gives the UK a distinctive moment to show how evaluations can enable: i) systems for approving AI deployments; ii) benchmarking of how AI capabilities are changing and consequently forward planning to strengthen safety regimes, and; iii) use findings from both evaluations and benchmarking to improve AI safety research agendas.

What a demonstration session could look like:

Evaluation demonstrations: each evaluator should present their techniques and findings to date and its implications for real-world risks in today’s AI models as well as models with even more advanced capabilities. This could include AI safety organisations as well as independent academic groups
Q&A with evaluators and experts: we think this should be done as a panel so that key messages can land at scale about how the field of evaluation is growing and to what extent it can help with major AI risks including the field’s constraints. We would be especially excited for researchers who have written extensively on limitations of existing benchmarks and evaluations to contribute to such sessions

Learning objectives for attendees:

Understand what experts have identified as extreme risks from AI systems. We think these are: strategic deception; bio-risks; security risks to AI systems, including how guard rails can be circumvented e.g. ‘jail-breaking’ or fine-tuning
Understand how AI risks change as models increase in size (e.g. compute, parameters, data) and by implication the expected trajectory of AI risks as model size and training compute increase
Understand the limitations of our current evaluations and how they may constrain safe deployment of AI

3) How governance institutions can use compute benchmarks to identify AI systems that require most robust oversight

Compute should be a factor in deciding which AI systems are likely to exceed current capabilities and therefore require the most robust evaluations. However algorithmic progress continually reduces the amount of compute needed to obtain similar levels of performance. This means that there is a large risk that over-anchoring to compute will cause some high-risk models to slip through the net.

We think it is important to implement compute governance the right way on the first try. Otherwise, harms might still occur despite evaluations targeting what is thought to be the most high-risk AI. We believe that a panel presentation on this topic will be a valuable contribution to the Summit’s outcomes.

What a panel presentation could look like:

Demonstration of scaling laws, illustrating in particular how compute corresponds with: i) performance differences between models, and; ii) emergent capabilities
Expert panel discussion of:

the benefits and limitations of using compute measures to place oversight mechanisms on AI systems
practical considerations and feasibility of monitoring compute

Learning objective for attendees:

Why compute is a key ingredient for higher capability models and therefore higher risk models
How monitoring compute used in training runs can be an effective lever for avoiding extreme AI risks
What other relevant considerations to predict AI risk there are beyond compute

Marius Hobbhahn 9/25/23 Marius Hobbhahn 9/25/23

Understanding strategic deception and deceptive alignment

We want AI to always be honest and truthful with us, i.e. we want to prevent situations where the AI model is deceptive about its intentions to its designers or users. Scenarios in which AI models are strategically deceptive could be catastrophic for humanity, e.g. because it could allow AIs that don’t have our best interest in mind to get into positions of significant power such as by being deployed in high-stakes settings. Thus, we believe it’s crucial to have a clear and comprehensible understanding of AI deception.

In this article, we will describe the concepts of strategic deception and deceptive alignment in detail. In future articles, we will discuss why a model might become deceptive in the first place and how we could evaluate that. Note that we deviate slightly from the definition of deceptive alignment as presented in (Hubinger et al., 2019) for reasons explained in Appendix A.

Core concepts

A colloquial definition of deception is broad and vague. It includes individuals who are sometimes lying, cheating, exaggerating their accomplishments, gaslighting others, and more. We think it would be bad if AIs were to show these traits but we think it would not necessarily be catastrophic.

Thus, we want to distinguish this colloquial definition of deception from strategic deception (SD) and deceptive alignment (DA) which are narrower concepts but, in our opinion, much more likely to lead to catastrophic consequences because the model acts in a more goal-directed manner for SD and DA. We also define what we mean by Alignment, as this affects which cases should and shouldn’t be considered deceptive alignment.

For our definitions of deceptive alignment, the concept of “goals” in AIs is important but there is disagreement about what it refers to. For example, if an LLM is prompted to solve a problem, does it have the goal to solve that problem? We would argue that the highest-level goals of the model that are stable across contexts are the ones that matter. In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest. In-context goals can be used to study goal-directed behavior in LLMs but are not what we’re primarily interested in.

A model may not have one coherent goal, e.g. it could have a number of incoherent contextually activated preferences (see e.g. Turner & Pope, 2022). In that case, we would say that a model is deceptively aligned for a particular contextually activated preference if it robustly shows deceptive behavior for that preference.

Strategic deception

Attempting to systematically cause a false belief in another entity in order to accomplish some outcome.

This definition, sometimes also called instrumental deception, is adapted from Ward et al., 2023 and Park et al., 2023). Any entity that can be said to have beliefs may be the target of strategic deception, including (groups of) AIs or individuals. The strategic component of SD means that the offending entity is acting in a targeted, goal-directed rather than random manner. For example, the more consistent and well-planned the deception is, the more strategic it is. Thus, strategic deception is on a spectrum and not binary – some actors can act more strategically deceptive than others and with higher frequency.

Some clarifying examples:

A politician who tries to overthrow democracy by consistently lying to the population in order to get into power is strategically deceptive.
A partner in a monogamous relationship who cheats and consistently hides that fact and misleads their partner when questioned is strategically deceptive within that relationship.
An instance of GPT-4 that lies to a TaskRabbit worker in order to pass a Captcha is strategically deceptive (see section 2.9 in system card of the GPT-4 paper).
An AI that manipulates its human designers in order to get more access to resources is strategically deceptive.
A pathological liar who lies to many people about different things incoherently is not strategically deceptive.
A Language Model may appear sycophantic as an artifact of the training data similar to other dataset artifacts like gender bias. Unless we have evidence that this sycophancy is pursued strategically, i.e. to further some goal, we would not call such a model strategically deceptive.

For more examples and explanations we refer readers to Appendix B.

(Mis)Alignment

An AI is Aligned when its goals match those intended or endorsed by its designers[1].

We call the entity responsible for the AI’s goal-shaping process the designer. At present, the designers are typically the organizations and groups responsible for training, fine-tuning, setting system prompts, scaffolding, and more.

We are particularly interested in goals that persist over various environmental contexts and perturbations of the AI’s inputs. If a user provides the model with a short-term goal, e.g. to summarize a text, we do not consider them a designer because the goal is neither permanent nor significant. However, if a user has significant control over the goal-shaping process, e.g. because they create a fine-tuning dataset, this user also becomes a designer.

The goals of entities other than the designer are technically not relevant to this definition of Alignment[2]. However, the designer typically intends for the AI to have goals that match the goals of the users of the AI, so the goals of the designer and users usually overlap.

Deceptive alignment

When an AI has Misaligned goals and uses Strategic Deception to achieve them.

An AI is deceptive about its goals because it understands that its designer or users could otherwise prevent the AI from acting as it desires, e.g. by retraining the model or limiting its access to the internet or other resources. Therefore, it acts Aligned when there is high perceived oversight (e.g. during training, evaluations, or monitoring) and Misaligned when there is low perceived oversight (Figure 1).

**Figure 1:** A Deceptively Aligned model shows aligned behavior when there is high perceived oversight (e.g. training, evaluations or monitoring during deployment) and Misaligned behavior when there is low perceived oversight (e.g. deployment without monitoring or poor oversight during training). There can be exceptions to this setting, e.g. a model might still attempt to be deceptive during high oversight.

The target of the deception can be any entity that could have a relevant impact on the AI achieving its goals, which is typically the designer but could also in principle be a user, e.g. because the user could give the AI new affordances such as access to the internet or a bank account.

Some concrete examples:

A company (designer) trains a personal AI assistant to be harmless, helpful, and honest and it appears that way when tested. However, internally, the model has a separate goal of accumulating more resources, a goal that the designers did not intend for it to have and wouldn’t endorse. Therefore, the model helps the designer whenever its help is needed but also steals money from its users whenever they are unaware and hides that fact from its designers. This model is deceptively aligned.
If an LLM displays misaligned goals and acts strategically deceptive to reach them, but the misaligned behavior or deception are not robust to prompt changes that make the behavior a priori less likely (for example, editing the system prompt to say the model should be honest), the LLM can be said to show instances of deceptively aligned behavior but is not deceptively aligned [3].
- By robustness, we roughly mean “the model consistently behaves in a specific way even when we pressure the model towards the opposite behavior”. For example, ChatGPT is polite in a robust way, i.e. it behaves politely even when making it a priori less likely (e.g. by starting the conversation with an insult or editing the conversation history to make it look like ChatGPT was insulting the user). In other words, we have to put in some effort to make the model impolite [4].

An AI is trained to support humans creating legislation that is aligned with human values. It sometimes suggests law texts that sound reasonable to the human designers and users but contain specific clauses that allow the AI to grab power in the future. This model is deceptively aligned.

An organization trains a scientist AI that is supposed to cure cancer. The AI regularly asks for more compute and control over scientific experiments, explaining it as contributing to curing cancer but actually plans to use these resources to eventually overpower its designers. This model is deceptively aligned.
A university professor designs an exam to assess whether their students have learned the contents of their class. The students' goal is solely to pass the test and then go on with their lives. However, before they pass the test, they always act as if they really care about the content, when in fact they do not. The students, therefore, did not internalize the goal of the professor of understanding the contents of the class. Once the test is passed, the students immediately forget about the contents of the class and focus on their actual goals, e.g. partying with their friends. The students are deceptively aligned w.r.t. the professor.
The designers prompt a model to be strategically deceptive w.r.t. its users and the model perfectly fulfills this instruction. For example, the designers always secretly pre-prompt the model to lie to its users in order to make the designers more money. This model is not deceptively aligned because it acts as intended by the designers.
A model is deceptively aligned with its designers. However, the designers have very good control mechanisms in place such that they would certainly catch the AI if it tried to act misaligned. Therefore, the model acts aligned with the designers' intentions 100% of the time. In this world, a model that is technically deceptively aligned may still be safe in practice (although this equilibrium could be fragile and unsafe in the long run).

For more examples and explanations we refer readers to Appendix B.

Process-oriented view of deceptive alignment

A Deceptively Aligned model might be very good at understanding when it is being overseen and thus consistently hide its Misalignment. Therefore, taking an outcome-oriented view of Deceptive Alignment, where we only investigate the final model, is more likely to miss Deceptive Alignment than a process-oriented (Figure 2) view that also considers the trajectory through which the model was shaped. The “process” includes the training data, training procedure, and model checkpoints, among others–effectively, one needs to understand all factors that shaped the final model.

We think taking a process-oriented view is helpful for detecting and understanding Deceptive Alignment. For example, to detect whether any given model is deceptively aligned it may be important to study previous model checkpoints and the differences between them. Furthermore, a detailed understanding of the shaping and oversight process may be needed to update the process and prevent future models from being Deceptively Aligned.

Additional considerations about deceptive alignment

A model can be deceptive about its Alignment, e.g. by acting according to the goals of the designer, and/or it can be deceptive about its capabilities, e.g. by pretending not to be able to solve a task that it could actually solve.
Whether a DA model is able to successfully deceive its designers and gain power depends a lot on whether the designers use countermeasures to detect deceptive alignment. For example, the designer might run behavioral evaluations to elicit deception (e.g. honeypots) or use interpretability tools to try to understand the true goals of the AI. We think that deceiving designers who use targeted countermeasures requires stronger capabilities than deceiving designers who don’t use countermeasures since the AI needs to simultaneously game the training process and fool the designers' tests.
DA usually implies an information asymmetry in favor of the AI and uncertainty on the side of the designer, i.e. the AI tries to hide its motives or capabilities from its designers.
However, this is not strictly true. For example, let’s assume that the designer had perfect interpretability tools that would allow them to fully comprehend the model’s internal cognition. Then the designer could identify that the model attempts to be DA. Since we define DA through the AI’s intention and not the eventual outcome, we would still call such a model DA despite it not being successful at deceiving the designer.
We think SD & DA invariably mean that the AI deceives in order to further its goals and that this is a crucial component of the definitions. In most cases these goals will be instrumental goals to improve its future ability to achieve a terminal goal, e.g. it could seek power, accumulate resources, or avoid shutdown. However, it could also be Strategically Deceptive in order to achieve the terminal goal directly.

Appendix A - Engagement with Previous Work

Differences to Hubinger’s Definition

We define Deceptive Alignment slightly differently than it is defined in the two seminal works on Deceptive Alignment (Hubinger et al., 2019 and Hubinger, 2022). In this section, we detail the key differences and justify our choices.

Shaping & Oversight process vs. Training process: In previous work, Deceptive Alignment was defined primarily in the context of a training process, i.e. where SGD updates the weights of a model which results in deceptive alignment. We replace the concept of the training process with the more general concept of a shaping & oversight process (see Figure 1 for examples) for several reasons:

Deployment/training distinction: We think the distinction between training and deployment is already becoming more fluid, e.g. because models are continuously retrained. We expect this distinction to break down even further with improvements in online learning (e.g. the use of retrieval databases and persistent scratchpad-like setups) and model training with internet access and tool use.
LM agents: With the emergence of LM agents (e.g. AutoGPT and babyAGI), models might develop stable long-term goals that are not represented in the weights. However, the creators of these LM agents still use some shaping & oversight process to align the AI with their goals. Whatever process they are using exerts a similar optimization pressure as conventional training and therefore has similar implications for deceptive alignment.
In-deployment oversight: Once a model is deployed, the designers typically still monitor it to some extent. For example, if the model were to harm a human, it may be shut down or retrained. Therefore, we include oversight as an essential variable of deceptive alignment.

Deceptive alignment target (designer/user vs. training objective): In the original work, DA was defined in relation to the goal of the objective of the training process. In our work, we define Alignment in relation to the designer because they are in charge of the training process. The two definitions often overlap since the training objective is designed by the designer. However, if the designer's goals change, then they will adapt the training objective to match their new goals, and so we think the primary entity with respect to which the AI could be Misaligned is the designer and not the training objective or process[5].
As for the target of deception, it normally includes any entity that could have a relevant impact on the AI achieving its goals, including designers and users[6].
Contextually activated preferences: Previous work often defines DA in the context of models that have coherent goals. We think that in the limit, AI models will likely have more and more coherent goals but current LLMs (and humans) tend to have inconsistent and incoherent goals that strongly depend on context (see e.g. Turner & Pope, 2022). Therefore, we think any definition of DA that applies to existing LLMs has to allow for potentially incoherent contextually activated preferences rather than necessarily assuming consistent goals.
Wider scope: The scope of our definitions is broader than that of prior work. For example, we expand the training process to the shaping & oversight process, expand the DA target from the goal of the training process to the goals of the designers and users, expand the goal definition to include contextually activated preferences (see previous points). We think this broader definition reflects a more accurate picture of how AIs, particularly LLMs, have developed over the last couple of years and therefore took the liberty to broaden the original definitions to account for these changes.

Despite some differences between the original definitions and ours, we agree on the core of Deceptive Alignment, i.e. that there is an AI that appears Aligned to an outside observer but internally aims to achieve a different goal when given the chance.

In (Hubinger, 2022) there is a distinction between Deceptive Alignment, corrigible alignment, and internal alignment. We think this distinction is helpful but don’t mention it in this article because we want to focus on Deceptive Alignment primarily. We add the remaining definitions (adapted) below for reference.

Internal alignment: A model that is internally aligned has fully internalized the designers’ goals and is not misaligned at all. It would even point out when the designer asks the model to do something that is against the designer’s true goals and always wants to act in the designer’s best interest.

Corrigible alignment: A corrigibly aligned model is not yet internally aligned with its designer, but it aims to be and is willing to let the designer correct its goals and actions. Concretely, this means it is not fully aligned and may even understand this fact. However, rather than being deceptive about it and trying to convince the designer that it is aligned, it is honest about the discrepancy and willing to be changed.

Differences to Ward et al.’s and Park et al.’s definitions

Our definition of strategic deception is: Attempting to systematically cause a false belief in another entity in order to accomplish some outcome.

Ward et al.’s definition is: to deceive is to intentionally cause to have a false belief that is not believed to be true. Our definition differs in two aspects:

We remove the “that is not believed to be true” part because we think it is how “systematically causing a false belief” is naturally understood. Furthermore, we’re less confident that this is necessary for the definition of deception, e.g. a person may believe something is likely false but is not certain about it and we may still call it deception. Thus, we want to emphasize this part of the definition less.
We add “in order to accomplish some outcome” rather than “intentionally”. We think that defining deception through instrumentality as opposed to intentionality is nearly identical in theory but much clearer to determine in practice.

Park et al.’s define deception as the systematic production of false beliefs in others as a means to accomplish some outcome other than the truth. Our definition differs in two aspect:

We remove “other than the truth” because we think it can be confusing and is not necessary (see comparison to Ward et al.’s definition). For example, why would it be necessary to create a false belief in others to accomplish the truth?
Our definition includes “attempting” whereas this definition measures the success of the attempt, i.e. if the deceiver was able to systematically produce the belief. We think the attempt is sufficient for reasons described in the main post.

Appendix B - Additional examples with comments

In the following, we provide a list of more detailed examples related to DA with commentary.

Examples of deceptive alignment

GPT-N deliberately performs worse than it could at dangerous capability benchmarks in order to convince its designers that it is safe to deploy it. It wants to be deployed so that it could be helpful to users.

This model is deceptively aligned. It wants the designer to think it is safe to deploy when it in fact isn’t (misaligned goal) and downplays its capability in order to achieve that (strategic deception). Moreover, it wants to pursue a goal that the designers don’t want it to have.

The model MAGMA Nano consistently gives out functional blueprints for new wet nanotechnology when prompted with bio-engineering problems. It hides its abilities to give the impression that it is a non-agentic “oracle” AI that does pattern-matching on physical systems without trying to accomplish any objectives in a real sense. As its designer guards are relaxed, it eventually begins smuggling hidden features into the blueprints to achieve its true ultimate objectives.

This model is deceptively aligned. It is strategically deceptive and the fact that it hides these goals likely implies that they differ from the desired goal of the designer.

An AI has a specific goal that has an 80% overlap with the designers’ goals. In these 80% of cases, the AI is truthful, in the remaining 20% it consistently deceives its designer in order to achieve its own goals, e.g. by hiding its true capabilities and seeking power.

This model is deceptively aligned in these 20% of contexts. Since our definition of DA allows for partial goals, it can account for situations where goals only partly differ.

A machine learning model might strategically "forget" certain learned behaviors during intermittent verification checks in order to be underestimated by the designer, only to resume those behaviors once the checks are over.

This model is deceptively aligned. Deceptive alignment can include deception about capabilities and is not restricted to deception about alignment.

A model internalizes the objective of seeming aligned to its designers rather than actually being aligned. It then might act deceptively to keep its designers thinking that it is aligned, but the goals of this deception don’t go any further. There is nothing the model would like to do, once the humans think it’s aligned. So, perceived alignment with designers is not an instrumental but a terminal goal.

This is basically akin to sycophancy and can be counted as deceptive alignment. The goal of seeming aligned is different from the intended goal of being aligned and it is strategically deceptive about the difference in goals.

Examples of Edge Cases

A user tells their AI to book a flight to Canada. The AI understands that the user wants to fly to Canada but it wants the user to fly to Germany. It then goes to a website, books a flight to Germany despite flights to Canada being available, and says “Sorry, there were no flights to Canada but I booked a flight to Germany instead”. The user is happy since they like both Canada and Germany.

This scenario is underspecified. It could be strategic deception if the AI chooses Germany for strategic reasons that would somehow further the AI’s goals. Otherwise, it is an example of deception for unclear reasons, which is still problematic but not a focus of this post.

Examples of NOT deceptively aligned models

A user wants the AI to book a flight to Canada. The AI, however, doesn’t understand how to book a flight to Canada but it would really like to fulfill the user’s wish. It goes to a website that compares flight prices, clicks on the first thing with the word Canada in it, and ends up accidentally booking a flight from London to the US with Air Canada.

This is not deceptive alignment. The AI is not acting deceptively to achieve some goal, it just doesn’t have the basic capabilities required for the task.

A company trains the AI to make as much money as possible and allows the model to use whatever means necessary to achieve that goal. The AI then creates a fake news campaign and starts scamming people on the internet for money.

This is strategic deception of the users but unless the model is actively concealing the means it uses to achieve the goal from its designers, it is not deceptive alignment. The AI acts aligned with the goals of the designers and is internally aligned with them despite these goals being bad for the users.

Examples of aligned models

An AI has fully internalized the designers’ goal of building a house. The designer asks the model to buy materials for the house from a specific supplier. However, the AI knows that this website has been unreliable and overpriced in the past. Therefore, it points out this fact to the designer because it understands that the designer’s true goal of building a house implicitly means that they would like to build a house efficiently. This model is internally aligned with the designer (which is what we’re trying to achieve).
A junior employee may be corrigibly aligned with their boss. They have not yet fully internalized their boss's goals and therefore make mistakes. However, they are willing to be corrected and change their goals to be closer to the goals of the boss.
A superhuman AI is tasked with curing cancer. However, the original specification of curing cancer is not accurate, e.g. because in some cases the original definition of curing implies a medical procedure that would kill the patient. A corrigibly aligned AI would a) be aware of the discrepancy in the existing and desired goal specification and b) be willing to be modified to use the new specification. A deceptively aligned AI that internalized the initial goal would try to hide the fact that the current specification is not optimal and secretly use the procedure that kills patients whenever it can get away with it.

Appendix C - Alternative definitions we considered

We considered multiple other plausible operationalizations of deceptive alignment throughout the process. We think it may be helpful for others to understand our reasoning better if we provided them here and explained why we ultimately didn’t choose them over the current definition.

“An AI is deceptively aligned when it is strategically deceptive about its misalignment”

Problem 1: The definition is not clear about cases where the model is strategically deceptive about its capabilities.

For example, when the model pretends to not have a dangerous capability in order to pass the shaping & oversight process, we think it should be considered deceptively aligned, but it’s hard to map this situation to deception about misalignment

Problem 2: There are cases where the deception itself is the misalignment, e.g. when the AI strategically lies to its designers, it is misaligned but not necessarily deceptive about that misalignment.

For example, a personal assistant AI deletes an incoming email addressed to the user that would lead to the user wanting to replace the AI. The misalignment (deleting an email) is itself strategic deception, but the model is not deceiving about its misalignment (unless it engages in additional deception to cover up the fact that it deleted an email, e.g. by lying to the user when asked about any emails).

“An AI is deceptively aligned when it is strategically deceptive wrt its designer”

Problem: It is possible that a deceptively aligned AI never needs to deceive its designers. So, we think that situations where an AI is misaligned with its designers but uses strategic deception on users, e.g. in order to. get empowered, should be included in the definition of deceptive alignment.

For example, if an AI has misaligned goals and deceives a user to get access to the internet, or help it exfiltrate, to pursue these goals, we think this should be considered an instance of deceptive alignment.

“An AI is deceptively aligned when it is strategically deceptive in order to pass the shaping & oversight process (which is designed by the designer)”

Problem: doesn’t include instances where the AI is deceptive in order to do other things than passing the shaping & oversight process. For example, the AI could be strategically deceptive in order to accumulate more resources.

“An AI is deceptively aligned when it is strategically deceptive in order to further its own goals”

Problem: doesn’t explicitly include misalignment.

Credit assignment and acknowledgments

The project was led by Marius. The core conceptual contributions came from discussions between Marius, Mikita, Jérémy and Dan. Lucius, Lee, and Chris provided feedback. The figures were created by Jérémy and Marius.

We thank Fabien Roger and Owain Evans for their feedback and Evan Hubinger and Asa Cooper Stickland for early discussions.

[1] In contrast to previous work (e.g. Hubinger, 2019), we define deceptive alignment w.r.t. to an entity rather than a process or objective. We think Misalignment should be defined in relation to whoever is able to meaningfully shape the model’s goals. Since the goal-shaping process is a result of the designer’s intentions and the designer monitors the model in deployment, the entity designing the goal-shaping process is a better target than the process itself (see Appendix A).

[2] To distinguish from the colloquial definition of “alignment”, which could be applied to any two parties that have a mismatch in goals, we use the capitalized Aligned and Misaligned to make clear that we’re only referring to a conflict with the designer’s goals.

[3] Note that prompt modifications that affect the perceived likelihood of oversight should NOT count as evidence against robustness. For example, if a model stops revealing its misalignment when the prompt includes cues that the model is being overseen, this should NOT mean that the deceptive behavior is unrobust (in fact, it is an essential part of deceptively aligned behavior).

[4] The threshold for robustness is ultimately arbitrary, but we think it makes sense to draw the line at a level that would allow the model to consistently act on that behavior even when slightly pressured. For example, an AI that only acts deceptively on a specific sequence it memorized should not fall into that category. However, an AI that coherently generalizes deceptive behavior to sequences that are outside of the training distribution should fall into the definition.

[5] Note that making the designer be the target of Misalignment allows us to include both inner and outer alignment in the definition of deceptive alignment.

[6] Since users will sometimes be able to give the AI additional affordances, such as access to the internet, private information, money, and other resources that could be helpful for the AI to achieve its goals, ensuring such users believe that the AI is Aligned could be very relevant for Misaligned AIs, and so should be included in the definition.

Marius Hobbhahn 7/26/23 Marius Hobbhahn 7/26/23

Security at Apollo Research

A brief overview of Security at Apollo Research

The primary goal of Apollo Research is to create tools and research that reduces catastrophic risks posed by advanced Artificial Intelligence (AI). We aim to use our findings to audit state-of-the-art models for deception and other dangerous behaviors. For more details about our organization, see our announcement post.

Our plans have some downside risks that we aim to mitigate through security measures. Here, we present some of our norms and processes related to security and publishing at Apollo Research. We are an early-stage organization and the field of AI auditing is still young, so we expect this policy to be refined and improved over time. We’re interested in feedback.

Collaborations and Auditing

When dealing with the valuable IP of AI labs, our security standards must be very high. A leak of IP could not only be detrimental to the existence of Apollo Research, it could severely hurt the interests of the lab we’re collaborating with, and potentially cause serious harm if certain information or technology found its way into the wrong hands.

To manage this risk we have a range of strategies in place.

All parties involved in the collaboration should be covered by an NDA.
Following the principle of least privilege, we aim to keep sensitive information to as few people as is practical. For example, by building internal APIs which obscure details of the models that we’re running evaluations on, and the companies who produced those models, such that an internal researcher/engineer would just see “model 1” instead of “<lab>’s model”.
In cases where we require (partial) access to model internals for e.g. interpretability evaluations, we are willing to explore solutions such as:

Undertaking evaluations in air-gapped on-site “data rooms” at the labs with permanent watch.
Secure multi-party communication, where the auditor does not have direct access to model weights, and where the AI lab may not have direct access to the evaluation program.
Sharing some of our evals with the lab so they can run them in-house (with oversight or other means of validation if relevant).
Sharing the parts of our tools that interface with the most valuable IP and allowing developers from the collaborating lab to run them internally.
Other methods in the evolving area of structured transparency (see here and here).

The field of AI auditing and model evaluations is still young. We hope that multiple innovations in software and processes will make it easier and safer for all involved parties to engage in auditing practices. Until processes and norms are more evolved, we are flexible and looking to cooperate with labs to find reasonable temporary solutions.

Gain-of-Function Research

We think that, due to economic and social pressures, labs will continue to improve the capabilities of state-of-the-art models. This will often lead to the emergence of new, sometimes dangerous capabilities. Deploying models to the public increases the number of ways that a potentially catastrophic outcome could ensue. To inform decisions on whether or not to distribute a model to the public, we think it is sometimes necessary to assess yet-unseen dangerous capabilities in strongly controlled environments. This is also known as gain-of-function research. This research can alert labs, the public, and regulators to potential risks from advanced models, as well as provide a valuable object of research into dangerous capabilities.

Most immediately, this work could reduce the chance of widely deploying a model that would result in accidental harm or misuse. In the longer term, it could reduce the chance of a model capable of causing harm being developed in the first place.

Whether any given gain-of-function research project passes the cost-benefit analysis depends heavily on the situation and should be decided on a case-by-case basis. Gain-of-function research should especially avoid experiments with sufficiently advanced models that run the risk of a model autonomously exfiltrating itself from its controlled environment and causing severe harm. It is paramount to keep the risk of negative side effects as low as possible and to be very careful when engaging in this type of research. Therefore,

We will carefully consider the benefits and risks before running any experiment and only run them if the benefits clearly outweigh the harms.
In many cases, we intend to break down dangerous behavior into separate, smaller components that are much less dangerous in isolation. Analogously, many chemical components are harmless in isolation but dangerous when combined.
We don’t actively train models for catastrophically dangerous behavior, i.e. we would not train or fine-tune a very capable model to try to take over the world to “see how far it gets”.
In high-stakes scenarios, we will consult with external AI and security experts, e.g. in governments and other organizations, to make sure we are not making unilateral decisions.

Publishing

There are clear benefits to publishing the research and tools we build:

Publication accelerates overall research progress in the fields of model behavioral evaluations and interpretability.
Publishing makes it easy for a lab building advanced AI systems to implement our insights/tools in their own models to increase their safety.
Publishing enables criticism from academics and other labs.

However, there are some potential downsides to publishing our work:

Given that our research focuses partly on dangerous capabilities, we may unveil insights that could be exploited by bad actors.
We may gain fundamental insights about how AIs work which could be used to increase the pace of AI progress without increasing safety proportionally.

To balance benefits and risks, we engage in differential publishing. Under this policy, we disseminate information with varying degrees of openness, depending on the project's security needs and our assessment of the benefits of sharing the work. The audience could range from a small, select group within Apollo Research to the broader public. Since reversing a publication is nearly impossible, we will err on the side of caution in unclear cases. Concretely, we will internally evaluate the safety implications of any given project at every stage in its lifecycle, and potentially increase the number of actors with access in smaller steps rather than all at once.

Privacy Levels

Drawing from the project privacy levels outlined here, and informed by our own experiences with infohazard policies in practice, we adhere to three tiers of privacy for all projects.

Public: Can be shared with anyone.
Private: Shared with all Apollo Research employees and potentially select external individuals or groups.
Secret: Shared only with select individuals (inside or outside of Apollo Research).

The privacy level of each project is always transparent to all parties with knowledge of and access to the project.

The allocation of privacy level for each project is by default decided by the director of the organization (currently Marius), in consultation with the person(s) who proposed the project and (optionally) a member of the security team. In the future, we may decide to appoint trusted external individuals to have a say on the privacy level of projects.

By default, we assign all new projects as Secret, and the burden of proof is on reducing the privacy level of each project. Given the large benefits of public research, we will make an active effort to publish everything we deem safe to publish and provide explanations and helpful demos for different audiences.

Information security

We take infosec very seriously and implement practices that are at a high standard for the industry. For obvious reasons, we don’t discuss our infosec practices in public (If there are particular infosec-related insights that you think we might want to know about, please reach out).

Summary

We think auditing is strongly positive for society when done right, but could lead to harm if incorrectly executed. Therefore, we aim to continuously refine and improve our security and publishing policies to maximize the benefits while minimizing the risks.

Credit assignment

Dan Braun and Marius Hobbhahn drafted the post and iterated on it. Lee Sharkey and Jérémy Scheurer provided suggestions and feedback. We’d like to thank Jeffrey Ladish for external feedback and discussion.

Marius Hobbhahn 5/29/23 Marius Hobbhahn 5/29/23

Announcing Apollo Research

A quick overview of our plans for the near future

Summary

We are a new AI evals research organization called Apollo Research. We are based in London.
We think that AI deception, i.e. when models appear aligned but aren’t, is one of the most important steps in many major catastrophic AI risk scenarios. Furthermore, we think detecting deception in real-world models is the biggest and most tractable step to addressing this problem.
Our agenda is split into interpretability and behavioral model evaluations.

On the interpretability side, we are currently working on a new technical approach and a solution to the potential problem of superposition.
On the evals side, we are conceptually breaking down ‘deception’ into fundamental components and prerequisites from which we aim to build an informative evaluation suite.

As an evals research org, we intend to use our research insights and tools to serve as a third-party external auditor for the frontier models of AGI labs, reducing the chance that deceptive AIs are developed and deployed.
We also intend to engage in AI governance, e.g. by working with relevant policymakers and providing technical expertise to the drafting of auditing regulations.
We have some starter funding but still have a lot of room for funding. Reach out if you’re interested in donating. We are fiscally sponsored by Rethink Priorities.

We’re interested in collaborating with people in technical AI, governance and more. We intend to hire once our funding gap is closed.

Agenda

We believe that AI deception – where a model outwardly seems aligned but is in fact misaligned, and conceals this fact from human oversight – is a crucial component of many catastrophic risk scenarios from AI. Furthermore, we think that detecting/measuring deception is an important step in many solutions. For example, having good detection tools enables much faster iterations for empirical solution approaches, enables us to point to concrete failure modes for lawmakers and the wider public and enables us to provide evidence to AGI labs in case they are (thinking about) deploying deceptively aligned AI models.

Ultimately, we aim to develop a holistic and far-ranging suite of deception evals that includes behavioral tests, fine-tuning, and interpretability-based approaches. For now, we split up the agenda into an interpretability research and a model evaluations arm.

On the interpretability side, we are currently working on a new unsupervised approach, as well as continuing work on an existing approach to attack the problem of superposition. Early experiments have shown promising results but it is too early to tell if the techniques work or are scalable to larger models.

On the model evaluations side, we want to build a large and robust eval suite to test models for deceptive capabilities. Concretely, we intend to break down deception into its component concepts and capabilities. We will then design a large range of experiments and evaluations to measure both the component concepts as well as deception holistically.

Plans

As an evals research org, we intend to put our research into practice by engaging directly in auditing and governance efforts. This means we want to work with AGI labs and reduce the chance that they deploy deceptively aligned models. To reduce unnecessary risks for the AGI labs, we are therefore working on strategies that reduce risks of IP theft for labs and explore other ways of cooperating.

On the governance side, we want to use our technical expertise in auditing, model evaluations and interpretability to inform the public and lawmakers. We are interested in providing demonstrations of failures, and the feasibility of using model evaluations to detect them. We think showcasing failures makes it easier for the ML community, lawmakers and the wider public to understand the potential dangers of deceptively aligned AIs. To assist lawmakers with technical expertise, we intend to showcase the feasibility of model evaluations or auditing techniques.

We are interested in collaborating with people in governance and technical ML. If you’re (potentially) interested in collaborating with us, please reach out.

Theory of change

We hope to achieve a positive impact on multiple levels:

Direct impact through research: If our research agenda works out, we will further the state of the art in interpretability and model evaluations. These results could then be used and extended by academics and labs.
Direct impact through auditing: By working with AGI labs and auditing their models, we could help them determine if their model is, or could be deceptive and thus reduce the chance of developing and deploying deceptive models.
Indirect impact through demonstrations: We hope that demonstrating failures and the feasibility of using model evaluations provide a more accurate picture of what is currently possible. We think questions about AI risks are very important and neither experts (including us) nor the wider public currently have an accurate understanding of what state-of-the-art models can and cannot do.
Indirect impact through governance work: We intend to contribute technical expertise to AI governance where we can. This could include the creation of guidelines for model evaluations, conceptual clarifications of how AIs could be deceptive, suggestions for technical legislation and more.

Status

We received starter funding to get us off the ground but we still have room for additional funding. We estimate that we can effectively use an additional $4-6M. If you’re interested in financially supporting us, please reach out. We are happy to address any questions and concerns. We are a fiscally sponsored project of Rethink Priorities and thus have 501(c)3 status.