Eliciting latent knowledge

Paul Christiano
Published in AI Alignment
Feb 25, 2022


In this report, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems:

Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.

But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.
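To make the setup concrete, here is a minimal sketch of the predict-then-plan loop described above. The names (`predictor`, `human_evaluation`, `candidate_action_sequences`) are illustrative assumptions, not anything the report prescribes.

```python
# Minimal sketch of the predict-then-plan setup described above.
# The names (predictor, human_evaluation, candidate_action_sequences) are
# hypothetical; the report does not specify a concrete implementation.

def plan(predictor, human_evaluation, candidate_action_sequences):
    """Return the action sequence whose *predicted* sensor readings look best.

    Because the evaluation only sees predicted camera footage, an action
    sequence that tampers with the cameras can score highly even when the
    real-world outcome is catastrophic -- the failure mode described above.
    """
    best_actions, best_score = None, float("-inf")
    for actions in candidate_action_sequences:
        predicted_footage = predictor.predict(actions)  # what the cameras would show
        score = human_evaluation(predicted_footage)     # does it look good on camera?
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions
```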

In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?

We’ll call this problem eliciting latent knowledge (ELK). In this report we’ll focus on detecting sensor tampering as a motivating example, but we believe ELK is central to many aspects of alignment.
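One way to picture what a solution would provide is a "reporter" that answers questions from the prediction model's internal state rather than from the on-camera observations alone. The interface below is only an illustrative sketch under assumed names; finding a training strategy that makes such a reporter answer honestly is exactly the open problem.

```python
# Illustrative sketch of the interface ELK asks for: a "reporter" that answers
# questions (e.g. "was the camera tampered with?") using the predictor's latent
# state. All names here are hypothetical; training such a reporter so that it
# reports what the predictor actually "knows" -- rather than what a human
# watching the camera footage would believe -- is the open problem.

class Reporter:
    def answer(self, latent_state, question: str) -> str:
        """Answer a natural-language question about the predicted situation,
        drawing on the predictor's internal (latent) state rather than only
        the predicted camera footage."""
        raise NotImplementedError("Finding a training strategy for this is ELK.")
```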

In the report we will describe ELK and suggest possible approaches to it, while using the discussion to illustrate ARC’s research methodology. More specifically, we will:

  • Set up a toy scenario in which a prediction model could show us a future that looks good but is actually bad, and explain why ELK could address this problem (more).
  • Describe a simple baseline training strategy for ELK, step through how we analyze this kind of strategy, and ultimately conclude that the baseline is insufficient (more).
  • Lay out ARC’s overall research methodology — playing a game between a “builder” who is trying to come up with a good training strategy and a “breaker” who is trying to construct a counterexample where the strategy works poorly (more).
  • Describe a sequence of strategies for constructing richer datasets and arguments that none of these modifications solve ELK, leading to the counterexample of ontology identification (more).
  • Identify ontology identification as a crucial sub-problem of ELK and discuss its relationship to the rest of ELK (more).
  • Describe a sequence of strategies for regularizing models to give honest answers, and arguments that these modifications are still insufficient (more).
  • Conclude with a discussion of why we are excited about trying to solve ELK in the worst case, including why it seems central to the larger alignment problem and why we’re optimistic about making progress (more).

Much of our current research focuses on “ontology identification” as a challenge for ELK. In the last 10 years many researchers have called out similar problems as playing a central role in alignment; our main contributions are to provide a more precise discussion of the problem, possible approaches, and why it appears to be challenging. We discuss related work in more detail in Appendix: related work.

We believe that there are many promising and unexplored approaches to this problem, and there isn’t yet much reason to believe we are stuck or are faced with an insurmountable obstacle. Even some of the simplest approaches have not been thoroughly explored, and seem like they would play a role in a practical attempt at scalable alignment today.

Given that ELK appears to represent a core difficulty for alignment, we are very excited about research that tries to attack it head on; we’re optimistic that within a year we will have made significant progress either towards a solution or towards a clear sense of why the problem is hard. If you’re interested in working with us on ELK or similar problems, get in touch!

Here’s the link to the full report. There is some discussion at the Alignment Forum post about the report, including a prize we offered for solutions. This is a more thorough and polished discussion of the same ideas as Answering questions honestly given world model mismatches, A naive alignment strategy, and My research methodology.

Thanks to María Gutiérrez-Rojas for the illustrations in this piece. Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
