Identifying functionally important features with end-to-end sparse dictionary learning
Dan Braun

We propose end-to-end (e2e) sparse dictionary learning, a method for training sparse autoencoders (SAEs) that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
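
As a rough illustration of the training objective described above, the following sketch computes the KL divergence between the output distribution of the original model and that of the model with SAE activations inserted. It is a minimal sketch and not the authors' implementation: the function names, the NumPy setting, and the optional L1 sparsity term (`sparsity_coeff`) are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_from_logits(orig_logits, sae_logits):
    """Mean KL(p_orig || p_sae) over a batch of output logits."""
    log_p = log_softmax(orig_logits)  # original model
    log_q = log_softmax(sae_logits)   # model with SAE activations inserted
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())


def e2e_loss(orig_logits, sae_logits, sae_acts, sparsity_coeff=0.0):
    """KL term plus an optional L1 penalty on SAE activations.

    The L1 term is an assumption: the summary above only mentions the KL
    divergence, but some sparsity pressure is typical for sparse autoencoders.
    """
    kl = kl_from_logits(orig_logits, sae_logits)
    l1 = float(np.abs(sae_acts).sum(axis=-1).mean())
    return kl + sparsity_coeff * l1


# Hypothetical toy data: 2 datapoints, 3 output classes, 2 SAE features.
orig = np.array([[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]])
with_sae = np.array([[1.5, 0.7, -0.8], [0.1, 0.9, 0.2]])
acts = np.array([[1.0, -2.0], [0.0, 3.0]])
loss = e2e_loss(orig, with_sae, acts, sparsity_coeff=0.5)
```

In this sketch the loss depends only on how closely the output distributions match, not on how well the SAE reconstructs the intermediate activations, which is the distinction from a standard reconstruction-trained SAE that the summary draws.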

A Causal Framework for AI Regulation and Auditing
Lee Sharkey

This article outlines a framework for evaluating and auditing AI to provide assurance of responsible development and deployment, focusing on catastrophic risks. We argue that responsible AI development requires comprehensive auditing that is proportional to AI systems’ capabilities and available affordances. This framework offers recommendations toward that goal and may be useful in the design of AI auditing and governance regimes.

Our research on strategic deception presented at the UK’s AI Safety Summit
Marius Hobbhahn

We investigate whether, under different degrees of pressure, GPT-4 can take illegal actions like insider trading and then lie about its actions. We find that this behavior occurs consistently, and that the model even doubles down when explicitly asked about the insider trade. This demo shows how, in pursuit of being helpful to humans, AI might engage in strategies that we do not endorse. This is why we aim to develop evaluations that tell us when AI models become capable of deceiving their overseers.
