AI Alignment
Featured
My research methodology
I explain why I focus on the “worst” case when doing theoretical alignment research.
Paul Christiano
Mar 22, 2021
An unaligned benchmark
What an unaligned AI might look like, how it could go wrong, and how we could fix it.
Paul Christiano
Mar 20, 2018
Eliciting latent knowledge
How can we train an AI to honestly tell us when our eyes deceive us?
Paul Christiano
Feb 25, 2022
Latest
My views on “doom”
I’m often asked: “what’s the probability of a really bad outcome from AI?” In this post I answer 10 versions of that question.
Paul Christiano
Apr 27, 2023
Can we efficiently distinguish different mechanisms?
Can a model produce coherent predictions based on two very different mechanisms without there being any efficient way to distinguish them?
Paul Christiano
Dec 26, 2022
Can we efficiently explain model behaviors?
It may be impossible to automatically find explanations. That would complicate ARC’s alignment plan, but our work can still be useful.
Paul Christiano
Dec 16, 2022
AI alignment is distinct from its near-term applications
Not everyone will agree about how AI systems should behave, but no one wants AI to kill everyone.
Paul Christiano
Dec 12, 2022
Finding gliders in the game of life
Walking through a simple concrete example of ARC’s approach to ELK based on mechanistic anomaly detection.
Paul Christiano
Dec 1, 2022
Mechanistic anomaly detection and ELK
An approach to ELK based on finding the “normal reason” for model behaviors on the training distribution and flagging anomalous examples.
Paul Christiano
Nov 25, 2022
Answering questions honestly given world-model mismatches
I expect AIs and humans to think about the world differently. Does that make it more complicated for an AI to “honestly answer questions”?
Paul Christiano
Jun 13, 2021
A naive alignment strategy and optimism about generalization
I describe a very naive strategy for training a model to “tell us what it knows.”
Paul Christiano
Jun 9, 2021
Teaching ML to answer questions honestly instead of predicting human answers
I discuss a three-step plan for learning to answer questions honestly instead of predicting what a human would say.
Paul Christiano
May 28, 2021
Decoupling deliberation from competition
A possible framing of intent alignment.
Paul Christiano
May 25, 2021
Mundane solutions to exotic problems
I often think about exotic problems like gradient hacking or ultra-long-term plans. Why do I hope to solve them with mundane approaches?
Paul Christiano
May 4, 2021
Low-stakes alignment
Why I often focus my alignment research on the special case where individual decisions are low stakes.
Paul Christiano
Apr 29, 2021
Announcing the Alignment Research Center
I’m now working full-time on the Alignment Research Center (ARC), a new non-profit focused on intent alignment research.
Paul Christiano
Apr 26, 2021
“Unsupervised” translation as an (intent) alignment problem
Unsupervised translation is an interesting domain where models seem to “know” something we can’t get them to tell us.
Paul Christiano
Sep 29, 2020
Better priors as a safety problem
Many universal priors are inefficient in the finite data regime. I argue that’s a safety problem and we should try to fix it directly.
Paul Christiano
Jul 5, 2020
Learning the prior
I suggest using neural nets to approximate our real prior, rather than implicitly using neural nets themselves as the prior.
Paul Christiano
Jul 5, 2020
Inaccessible information
What kind of information might be hard to elicit from ML models?
Paul Christiano
Jun 2, 2020
AI alignment landscape
A talk I gave at EA Global 2019, describing how my work fits into the broader project of making AI go well.
Paul Christiano
Oct 12, 2019
The strategy-stealing assumption
If humans initially control 99% of the world’s resources, when can they secure 99% of the long-term influence?
Paul Christiano
Sep 15, 2019
Training robust corrigibility
Reviewing the prospects for training models to behave acceptably on all inputs, rather than just the training distribution.
Paul Christiano
Jan 20, 2019
Universality and model-based RL
Ascription universality may be very helpful for safe model-based RL, facilitating benign induction and “transparent” models.
Paul Christiano
Jan 9, 2019
Universality and consequentialism within HCH
One exotic reason HCH can fail to be universal is the emergence of malicious patterns of behavior; universality may help address this risk.
Paul Christiano
Jan 9, 2019
Informed oversight
An overseer can provide adequate rewards for an agent if they know everything the agent knows. (Update of a 2016 post.)
Paul Christiano
Jan 9, 2019
Towards formalizing universality
An attempt to formalize universality as “able to understand anything that any computation can understand.”
Paul Christiano
Jan 9, 2019