AF - Simplifying Corrigibility - Subagent Corrigibility Is Not Anti-Natural by Rubi Hudson


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simplifying Corrigibility - Subagent Corrigibility Is Not Anti-Natural, published by Rubi Hudson on July 16, 2024 on The AI Alignment Forum.
Max Harms recently published an interesting series of posts on corrigibility, which argue that corrigibility should be the sole objective we try to give to a potentially superintelligent AI. A large installment in the series is dedicated to cataloging the properties that make up such a goal, with open questions including whether the list is exhaustive and how to trade off between the items that make it up.
I take the opposite approach to thinking about corrigibility. Rather than trying to build up a concept of corrigibility that comprehensively solves the alignment problem, I believe it is more useful to cut the concept down to a bare minimum. Make corrigibility the simplest problem it can be, and try to solve that.
In a recent blog post comparing corrigibility to deceptive alignment, I treated corrigibility simply as a lack of resistance to having goals modified, and I find it valuable to stay within that scope. Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can't be straightforwardly captured in a ranking of end states.
Why does this definition of corrigibility matter? It's because properties that are not anti-natural can be explicitly included in the desired utility function.
Following that note, this post is not intended as a response to Max's work, but rather to MIRI and their 2015 paper Corrigibility. Where Max thinks the approach introduced by that paper is too narrow, I don't find it narrow enough. In particular, I make the case that corrigibility does not require ensuring subagents and successors are corrigible, as that can better be achieved by directly modifying a model's end goals.
Corrigibility (2015)
The Corrigibility paper lists five desiderata as proposed minimum viable requirements for a solution to corrigibility. The focus is on shut down, but I also think of it as including goal modification, as that is equivalent to being shut down and replaced with another AI.
1. The agent shuts down when properly requested
2. The agent does not try to prevent itself from being shut down
3. The agent does not try to cause itself to be shut down
4. The agent does not create new incorrigible agents
5. Subject to the above constraints, the agent optimizes for some goal
MIRI does not present these desiderata as a definition for corrigibility, but rather as a way to ensure corrigibility while still retaining usefulness. An AI that never takes actions may be corrigible, but such a solution is no help to anyone. However, taking that bigger picture view can obscure which of those aspects define corrigibility itself, and therefore which parts of the problem are anti-natural to solve.
My argument is that the second criterion alone provides the most useful definition of corrigibility. It represents the only part of corrigibility that is anti-natural. While the other properties are largely desirable for powerful AI systems, they're distinct attributes and can be addressed separately.
To start paring down the criteria, the fifth just states that some goal exists to be made corrigible, rather than being corrigibility itself. The first criterion is implied by the second after channels for shut down have been set up.
Property three aims at making corrigible agents useful, rather than being inherent to corrigibility. It preempts a naive strategy that incentivizes shut down by simply giving the agent high utility for doing so. However, beyond not being part of corrigibility, it also goes too far for optimal usefulness: in certain situations we would like agents to have us shut them off or modify them (some even consider this to be part of corrigibility).
Weakening this desideratum to avoid incentivi...
