AI Alignment
Iterated Distillation and Amplification
Guest post summarizing my approach to aligned RL.
Ajeya Cotra
Mar 4
An unaligned benchmark
What an unaligned AI might look like, how it could go wrong, and how we could fix it.
Paul Christiano
Mar 20
Prosaic AI alignment
I argue that AI alignment should focus on the possibility that we build AGI without learning anything fundamentally new about intelligence.
Paul Christiano
Nov 18, 2016
Directions and desiderata for AI alignment
I lay out three research directions in AI alignment, and three desiderata that I think should guide research in these areas.
Paul Christiano
Feb 6, 2017
Human-in-the-counterfactual-loop
For powerful AI systems, human oversight may be cheaper than it appears.
Paul Christiano
Jan 20, 2015
AlphaGo Zero and capability amplification
AlphaGo Zero happens to be a great proof-of-concept of iterated capability amplification (my preferred approach to safe RL).
Paul Christiano
Oct 19, 2017
Latest
When is unaligned AI morally valuable?
It might be easier to build an AI that deserves our sympathy than to build an AI that is aligned with us. Is that a plausible plan B?
Paul Christiano
May 2
Implicit extortion
Extortion can be equally effective, and harder to notice, when you don’t tell the target it’s occurring.
Paul Christiano
Apr 13
Two guarantees
I suspect AI alignment should aim to separately establish good performance in the average case, and lack-of-malice in the worst case.
Paul Christiano
Apr 9
Clarifying “AI alignment”
Clarifying what I mean when I say that an AI is aligned.
Paul Christiano
Apr 7
Universality and security amplification
A slightly more detailed view of security amplification.
Paul Christiano
Mar 10
Techniques for optimizing worst-case performance
Optimizing neural networks for worst-case performance looks really hard. Here’s why I have hope.
Paul Christiano
Feb 1
Approval-maximizing representations
If we train our agents with human oversight, can they learn superhuman representations?
Paul Christiano
Jun 17, 2017
Corrigibility
Corrigible AI seems nearly as good as aligned AI, but significantly more robust.
Paul Christiano
Jun 10, 2017
Benign model-free RL
Reward learning, robustness, and amplification may be sufficient to train benign model-free RL agents.
Paul Christiano
Mar 19, 2017
Benign AI
Something is benign if it isn’t optimized to be bad. “Benign” is weaker than “aligned,” but I find it helpful for thinking about AI…
Paul Christiano
Nov 29, 2016
Hard-core subproblems
I think that discussions of AI control should aim to identify subproblems that are necessary to solve but on which we aren’t making progress.
Paul Christiano
Nov 26, 2016
AI “safety” vs “control” vs “alignment”
Defining what I mean by “AI safety,” “AI control,” and “value alignment.”
Paul Christiano
Nov 18, 2016
Handling destructive technology
Solving AI control is just “delaying the inevitable” with respect to the need for global coordination, but it seems high-impact anyway.
Paul Christiano
Nov 14, 2016
Thoughts on reward engineering
Addressing a bunch of details that come up when we try to convert our preferences into a reward function for RL.
Paul Christiano
Nov 8, 2016
Security amplification
Can we use agents with many vulnerabilities to implement an agent with fewer vulnerabilities?
Paul Christiano
Oct 26, 2016
Meta-execution
Building agents out of agents.
Paul Christiano
Oct 25, 2016
Of humans and universality thresholds
I’ve suggested that HCH might be a universal deliberative process if run with humans but not if run with apes. Is that suspicious?
Paul Christiano
Oct 23, 2016
Some thoughts on training highly reliable models
A grab bag of relevant considerations, mostly pointing out that the problem is even harder than it might at first appear.
Paul Christiano
Oct 22, 2016
Aligned search
Powerful searches are likely to pose a distinctive challenge for AI control.
Paul Christiano
Oct 21, 2016
Reliability amplification
Can redundancy increase the reliability of complex policies in the same way it can increase the reliability of computation?
Paul Christiano
Oct 20, 2016
ALBA on GitHub
A preliminary ALBA implementation is now on GitHub: https://github.com/paulfchristiano/alba
Paul Christiano
Oct 19, 2016
Not just learning
I’ve been focusing on aligned learning, but AI is more than just learning.
Paul Christiano
Oct 16, 2016
Imitation+RL
Imitation+RL might be a more natural model for powerful AI than either imitation or RL.
Paul Christiano
Oct 15, 2016
Security and AI alignment
AI alignment and AI security are probably more closely connected than I used to think.
Paul Christiano
Oct 14, 2016