AI Alignment
Featured
My research methodology
I explain why I focus on the “worst” case when doing theoretical alignment research.
Paul Christiano
Mar 22, 2021
An unaligned benchmark
What an unaligned AI might look like, how it could go wrong, and how we could fix it.
Paul Christiano
Mar 20, 2018
Eliciting latent knowledge
How can we train an AI to honestly tell us when our eyes deceive us?
Paul Christiano
Feb 25, 2022
Latest
My views on “doom”
I’m often asked: “what’s the probability of a really bad outcome from AI?” In this post I answer 10 versions of that question.
Paul Christiano
Apr 27, 2023
Can we efficiently distinguish different mechanisms?
Can a model produce coherent predictions based on two very different mechanisms without there being any efficient way to distinguish them?
Paul Christiano
Dec 26, 2022
Can we efficiently explain model behaviors?
It may be impossible to automatically find explanations. That would complicate ARC’s alignment plan, but our work can still be useful.
Paul Christiano
Dec 16, 2022
AI alignment is distinct from its near-term applications
Not everyone will agree about how AI systems should behave, but no one wants AI to kill everyone.
Paul Christiano
Dec 12, 2022
Finding gliders in the game of life
Walking through a simple concrete example of ARC’s approach to ELK based on mechanistic anomaly detection.
Paul Christiano
Dec 1, 2022
Mechanistic anomaly detection and ELK
An approach to ELK based on finding the “normal reason” for model behaviors on the training distribution and flagging anomalous examples.
Paul Christiano
Nov 25, 2022
Answering questions honestly given world-model mismatches
I expect AIs and humans to think about the world differently. Does that make it more complicated for an AI to “honestly answer questions”?
Paul Christiano
Jun 13, 2021
A naive alignment strategy and optimism about generalization
I describe a very naive strategy for training a model to “tell us what it knows.”
Paul Christiano
Jun 9, 2021
Teaching ML to answer questions honestly instead of predicting human answers
I discuss a three-step plan for learning to answer questions honestly instead of predicting what a human would say.
Paul Christiano
May 28, 2021
Decoupling deliberation from competition
A possible framing of intent alignment.
Paul Christiano
May 25, 2021
Mundane solutions to exotic problems
I often think about exotic problems like gradient hacking or ultra-long-term plans. Why do I hope to solve them with mundane approaches?
Paul Christiano
May 4, 2021
Low-stakes alignment
Why I often focus my alignment research on the special case where individual decisions are low stakes.
Paul Christiano
Apr 29, 2021
Announcing the Alignment Research Center
I’m now working full-time on the Alignment Research Center (ARC), a new non-profit focused on intent alignment research.
Paul Christiano
Apr 26, 2021
“Unsupervised” translation as an (intent) alignment problem
Unsupervised translation is an interesting domain where models seem to “know” something we can’t get them to tell us.
Paul Christiano
Sep 29, 2020
Better priors as a safety problem
Many universal priors are inefficient in the finite data regime. I argue that’s a safety problem and we should try to fix it directly.
Paul Christiano
Jul 5, 2020
Learning the prior
I suggest using neural nets to approximate our real prior, rather than implicitly using neural nets themselves as the prior.
Paul Christiano
Jul 5, 2020
Inaccessible information
What kind of information might be hard to elicit from ML models?
Paul Christiano
Jun 2, 2020
AI alignment landscape
A talk I gave at EA Global 2019, describing how my work fits into the broader project of making AI go well.
Paul Christiano
Oct 12, 2019
The strategy-stealing assumption
If humans initially control 99% of the world’s resources, when can they secure 99% of the long-term influence?
Paul Christiano
Sep 15, 2019
Training robust corrigibility
Reviewing the prospects for training models to behave acceptably on all inputs, rather than just the training distribution.
Paul Christiano
Jan 20, 2019
Universality and model-based RL
Ascription universality may be very helpful for safe model-based RL, facilitating benign induction and “transparent” models.
Paul Christiano
Jan 9, 2019
Universality and consequentialism within HCH
One exotic reason HCH can fail to be universal is the emergence of malicious patterns of behavior; universality may help address this risk.
Paul Christiano
Jan 9, 2019
Informed oversight
An overseer can provide adequate rewards for an agent if they know everything the agent knows. (Update of a 2016 post.)
Paul Christiano
Jan 9, 2019
Towards formalizing universality
An attempt to formalize universality as “able to understand anything that any computation can understand.”
Paul Christiano
Jan 9, 2019