Eliezer Yudkowsky
A very rare bit of research that is directly, straight-up relevant to real alignment problems! They trained a reward function on human preferences AND THEN measured how hard you could optimize against the trained function before the results got actually worse.
Eliezer Yudkowsky Sep 4
Replying to @ESYudkowsky
Tl;dr (he said with deliberate irony) you can ask for results as good as the best 99th percentile of rated stuff in the training data (a la Jessica Taylor's quantilization idea). Ask for things the trained reward function rates as "better" than that, and it starts to find...
Eliezer Yudkowsky Sep 4
Replying to @ESYudkowsky
..."loopholes" as seen from outside the system; places where the trained reward function poorly matches your real preferences, instead of places where your real preferences would rate high reward. ("Goodhart's Curse", the combination of Optimizer's Curse plus Goodhart's Law.)
Eliezer Yudkowsky Sep 4
Replying to @ESYudkowsky
That is: they had to impose a (new) quantitative form of "conservatism" in my terminology, producing only results similar (low KL divergence) to things already seen, in order to get human-valued output. They didn't directly optimize for the learned reward function!
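The "low KL divergence" conservatism is, as I read the paper, implemented by penalizing divergence from the reference (pretrained) policy rather than maximizing the learned reward alone. A minimal sketch of that shaped objective follows; beta and the log-probability argument names are placeholders of mine, not the paper's code.

```python
# Sketch of a KL-shaped reward: maximize the learned reward minus a penalty
# for drifting away from the reference policy ("things already seen").
def shaped_reward(learned_reward, logprob_policy, logprob_ref, beta=0.1):
    # Per-sample estimate of  r(x) - beta * KL(pi || pi_ref):
    # the log-probability difference is a single-sample KL estimate, so a
    # high proxy reward only pays off if the output stays close to the
    # reference distribution.
    return learned_reward - beta * (logprob_policy - logprob_ref)
```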
Eliezer Yudkowsky Sep 4
Replying to @ESYudkowsky
Why this doesn't solve the whole problem: with powerful AGI, you're not limited by how far you can optimize a learned reward function before the learned reward function stops well-predicting human feedback; you're limited by how hard the AI can optimize before human raters break.
Eliezer Yudkowsky Sep 4
Replying to @ESYudkowsky
Not to undersell the research:
Eliezer Yudkowsky Sep 4
Replying to @ESYudkowsky
To be explicit about precedents: this is not "learning a conservative concept" as I proposed that, nor "expected utility quantilization" as Jessica proposed that. OpenAI did a new thing, which you could see as simultaneously "mildly optimizing" and "conservative".
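One way to write down the distinction between the prior proposal and what the paper does (my own formalization, as I understand the paper's approach, with r-hat the learned reward, P_base the distribution of rated training outputs, and pi_ref the pretrained policy):

```latex
% Expected-utility quantilization (Taylor): sample from the top-q slice
% of the base distribution, as ranked by the learned reward.
\pi_q(x) \;\propto\; P_{\mathrm{base}}(x)\,
  \mathbf{1}\!\left[\hat{r}(x) \ge \operatorname{quantile}_{1-q}(\hat{r})\right]

% KL-regularized reward maximization (the "mildly optimizing" and
% "conservative" thing the paper does):
\pi^{*} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{x \sim \pi}\!\left[\hat{r}(x)\right]
  \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi \,\middle\|\, \pi_{\mathrm{ref}}\right)
```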
Marcin Bogdanski Sep 4
Replying to @ESYudkowsky
I initially ignored that paper, but after your thread I will definitely take a close look. Thanks for pointing it out.