AI Alignment
Iterated Distillation and Amplification
Guest post summarizing my approach to aligned RL.
Ajeya Cotra
Mar 4, 2018
An unaligned benchmark
What an unaligned AI might look like, how it could go wrong, and how we could fix it.
Paul Christiano
Mar 20, 2018
Prosaic AI alignment
I argue that AI alignment should focus on the possibility that we build AGI without learning anything fundamentally new about intelligence.
Paul Christiano
Nov 18, 2016
Latest
“Unsupervised” translation as an (intent) alignment problem
Unsupervised translation is an interesting domain where models seem to “know” something we can’t get them to tell us.
Paul Christiano
Sep 29, 2020
Better priors as a safety problem
Many universal priors are inefficient in the finite data regime. I argue that’s a safety problem and we should try to fix it directly.
Paul Christiano
Jul 5, 2020
Learning the prior
I suggest using neural nets to approximate our real prior, rather than implicitly using neural nets themselves as the prior.
Paul Christiano
Jul 5, 2020
Inaccessible information
What kind of information might be hard to elicit from ML models?
Paul Christiano
Jun 2, 2020
AI alignment landscape
A talk I gave at EA Global 2019, describing how my work fits into the broader project of making AI go well.
Paul Christiano
Oct 12, 2019
The strategy-stealing assumption
If humans initially control 99% of the world’s resources, when can they secure 99% of the long-term influence?
Paul Christiano
Sep 15, 2019
Training robust corrigibility
Reviewing the prospects for training models to behave acceptably on all inputs, rather than just the training distribution.
Paul Christiano
Jan 20, 2019
Universality and model-based RL
Ascription universality may be very helpful for safe model-based RL, facilitating benign induction and “transparent” models.
Paul Christiano
Jan 9, 2019
Universality and consequentialism within HCH
One exotic reason HCH can fail to be universal is the emergence of malicious patterns of behavior; universality may help address this risk.
Paul Christiano
Jan 9, 2019
Informed oversight
An overseer can provide adequate rewards for an agent if they know everything the agent knows. (Update of a 2016 post.)
Paul Christiano
Jan 9, 2019
Towards formalizing universality
An attempt to formalize universality as “able to understand anything that any computation can understand.”
Paul Christiano
Jan 9, 2019
When is unaligned AI morally valuable?
It might be easier to build an AI that deserves our sympathy than to build an AI that is aligned with us. Is that a plausible plan B?
Paul Christiano
May 2, 2018
Implicit extortion
Extortion can be equally effective, and harder to notice, when you don’t tell the target it’s occurring.
Paul Christiano
Apr 13, 2018
Two guarantees
I suspect AI alignment should aim to separately establish good performance in the average case, and lack-of-malice in the worst case.
Paul Christiano
Apr 9, 2018
Clarifying “AI alignment”
Clarifying what I mean when I say that an AI is aligned.
Paul Christiano
Apr 7, 2018
Universality and security amplification
A slightly more detailed view of security amplification.
Paul Christiano
Mar 10, 2018
Techniques for optimizing worst-case performance
Optimizing neural networks for worst-case performance looks really hard. Here’s why I have hope.
Paul Christiano
Feb 1, 2018
AlphaGo Zero and capability amplification
AlphaGo Zero happens to be a great proof-of-concept of iterated capability amplification (my preferred approach to safe RL).
Paul Christiano
Oct 19, 2017
Approval-maximizing representations
If we train our agents with human oversight, can they learn superhuman representations?
Paul Christiano
Jun 17, 2017
Corrigibility
Corrigible AI seems nearly as good as aligned AI, but significantly more robust.
Paul Christiano
Jun 10, 2017
Benign model-free RL
Reward learning, robustness, and amplification may be sufficient to train benign model-free RL agents.
Paul Christiano
Mar 19, 2017
Directions and desiderata for AI alignment
I lay out three research directions in AI alignment, and three desiderata that I think should guide research in these areas.
Paul Christiano
Feb 6, 2017
Benign AI
Something is benign if it isn’t optimized to be bad. “Benign” is weaker than “aligned,” but I find it helpful for thinking about AI…
Paul Christiano
Nov 29, 2016
Hard-core subproblems
I think that discussions of AI control should aim to identify subproblems that we aren’t making progress on but are necessary.
Paul Christiano
Nov 26, 2016