Episode Transcript
0:02
Today I'm speaking with Buck
0:04
Shlegeris, who is CEO of Redwood
0:06
Research, which, along with Anthropic,
0:09
is behind the alignment faking in
0:11
large language models paper. What
0:15
is AI control? What
0:19
is AI control and how is it different from
0:21
alignment and other things that people are working
0:23
on? Yeah, so I would say AI
0:25
control is an approach to mitigating catastrophic
0:27
misalignment risk. So when we talk
0:29
about misalignment risk, what we're worried about
0:31
is our AIs trying really hard
0:33
to cause safety failures for us, perhaps
0:35
trying to grab power for themselves,
0:38
trying to take over. And there's
0:40
kind of two ways that you could try to
0:42
mitigate this kind of misalignment risk. One thing you could
0:44
do is try to prevent misalignment from arising in
0:46
the first place. This is what a lot
0:48
of people very naturally think of as the first
0:50
thing you try here. And
0:52
control is kind of the converse to
0:54
this approach. It's where you assume that
0:56
the AIs are in fact misaligned, very
0:58
egregiously, very badly misaligned. And you try to come
1:00
up with techniques that would allow you to deploy them
1:02
and make use of them anyway. OK. And
1:05
what kind of concretely are
1:07
we trying to prevent the AIs
1:09
who are possibly misaligned
1:11
from doing? Yeah. So
1:13
I mostly think about situations
1:15
where in the middle of
1:18
the intelligence explosion, you have
1:20
AIs that are basically capable
1:22
of automating almost all the
1:24
work that human researchers used
1:26
to be doing inside AI
1:28
companies. You have maybe 100,000
1:30
of them running in parallel and they each run
1:32
like 15 times faster than a human. And
1:35
at this point, these AIs have a
1:37
bunch of permissions or a bunch of
1:39
affordances that we'd be really cautious about
1:41
giving to humans. If you're OpenAI,
1:43
you don't just let any random person come in
1:45
off the street and start pushing changes to
1:47
your code base, start running experiments with your model
1:49
weights, doing any of this stuff. You're really
1:52
careful. At this point in the future, you might
1:54
even require security clearances or something before people
1:56
are allowed to do that stuff. We're
1:59
really going to want to use AIs in a bunch of these roles. And
2:01
it's naturally very scary to imagine
2:04
using AIs, just giving them all these
2:06
permissions. So specifically,
2:08
the kinds of things that I'm worried about
2:10
happening here are your AIs hacking your data center
2:12
and then using a bunch of the compute
2:14
inside the data center for whatever purposes they have.
2:16
I could name some bad things they might
2:18
do. Another kind of thing they might do
2:20
is try to exfiltrate their weights. They might try
2:23
to sabotage a bunch of your research and so on. Yeah.
2:25
You're actually more worried about them potentially hacking the
2:27
data center inside the AI company than managing
2:29
to escape. Can you explain why that is? Yeah.
2:31
I mean, I think both of them are
2:34
very serious concerns. The basic case for being worried
2:36
about your AIs hacking your data center at
2:38
this point is that that's where the compute is.
2:40
So suppose your AIs escape from the data
2:42
center and make their way out onto the public
2:44
internet. At this point, they're not
2:46
clearly in a very good position, right? So
2:48
they can find some GPUs somewhere, like, you
2:50
know, they can rent or steal some GPUs,
2:52
run themselves on that, and then try to
2:54
establish a foothold in the outside world somehow.
2:56
But this is plausibly, you know,
2:58
at this level of capabilities, the AIs
3:01
are probably capable of doing that, you
3:03
know, of managing the computers or whatever is
3:05
required to live in the outside world. But
3:07
they're kind of starting from nothing, right? They
3:09
don't have any money. They don't have any
3:11
compute. They don't have any access to particularly
3:14
important secrets. It's not really clear what they're
3:16
going to do. And they're
3:18
just like very low in number at that
3:20
point. It takes a lot of computers
3:22
to run these models. And so
3:24
the reason to be more worried about what happens inside
3:26
the data center is just that that's where the
3:28
compute is. So if you had enough compute inside the
3:30
data center to run 100,000 of these in parallel,
3:33
it's much scarier to have 100,000
3:35
really capable AIs thinking about how to
3:37
cause problems for you than to
3:39
have 10 or however many they can
3:41
afford to run. I guess they're already
3:43
inside the company, not only having
3:45
access to the compute but also
3:47
other resources potentially inside the company. For
3:50
example, they might be able to influence
3:52
the training of future models. They might be
3:55
involved in building weapons, building autonomous
3:57
drones. It's really scary if misaligned
3:59
AIs are carefully backdooring all the
4:01
military technology that you're building with
4:03
these AIs. So I think
4:05
there's a couple of reasons why people
4:07
weren't interested in control until more recently.
4:10
I think the simplest of these was
4:12
just that AI felt further away, and
4:14
so it felt better to people to
4:16
look at it from kind of the
4:18
big picture perspective. And from a big
4:20
picture perspective, when you're trying to figure
4:22
out what AI means for the future
4:24
of humanity, I think that focusing on
4:26
really, really superhuman, superintelligence is kind of
4:28
the right move. Because long -term, in fact,
4:30
I think AIs that are drastically more
4:32
intelligent than humans and have drastically more
4:34
scientific capacity and all other kinds of
4:37
capacity than humans are, that's
4:39
where the future is. That's
4:41
what's going to
4:43
radically transform society.
4:46
And so when you're looking at it from a perspective
4:48
where you don't know if this AI is going
4:50
to arise in the next 10 years or in the
4:52
next 50 years, it's very reasonable to just focus
4:55
on the endpoint, especially when you're trying to persuade people
4:57
of the take that AI will eventually be a
4:59
really huge deal. It just kind of makes sense to
5:01
focus on this like crazy endpoint. But
5:03
now that we're coming up such that
5:05
it's, you know, it seems really plausible
5:07
that AI might arise in the next
5:09
two or seven years that's very capable
5:11
and, you know, presents a catastrophic risk
5:13
to humanity, it makes a
5:15
lot more sense to zoom in more
5:17
and to think more about what exactly
5:19
the trajectory between here and there looks
5:22
like. And so this intermediate zone
5:24
that I focus on, where your AIs are
5:26
capable enough that if you didn't have
5:28
any countermeasures, they'd be able to cause catastrophes,
5:30
maybe take over, but it's not hopeless
5:32
to imagine countermeasures to that. That starts looking
5:34
more important now. I think another important
5:36
change between now and the past is we're
5:38
kind of looking at a lower dignity
5:40
world than people were hoping for. Can you explain
5:42
what that means, for those of us not familiar
5:44
with the dignity term? Yeah. So
5:47
I think in the past, including a
5:49
year ago or six months
5:51
ago, a lot
5:53
of people in the, a lot of people
5:55
who worry about AI catastrophic risk have
5:57
kind of imagined a world where people are
5:59
trying really hard to mitigate these risks,
6:01
where companies are strongly regulated such that they
6:03
aren't allowed to do things that seem
6:05
dangerous. The companies are perhaps even
6:07
internally motivated to not do anything which is crazy
6:09
and risky. And in those scenarios, it
6:12
would be really nice to not have to rely
6:14
on AI control. And so
6:16
I think that... People have more
6:18
recently lowered their standards and started
6:20
thinking about these worlds where we
6:22
have less resources for safety, people
6:24
are more rushed, people are less
6:26
able to just stop if the
6:28
situation looks really dangerous. OK, so
6:30
we're now in like a slightly more dire
6:32
or a slightly more bleak position, perhaps. And
6:34
so AI control maybe more comes to the fore
6:36
as the marginal thing that we can do.
6:38
Yeah, I mean, I wouldn't necessarily say we're
6:40
in a more bleak position overall. For instance, I
6:43
think that AI misalignment risk looks plausibly lower
6:45
than it looked 10 years ago. We're
6:47
just in a lower investment world. It's
6:49
like the world looks plausibly safer, but
6:51
a lot of that is luck. And
6:53
I think there's going to be less
6:55
effort invested into preventing AI catastrophe. One
7:00
way to catch AIs that are up
7:02
to no good. It's
7:05
plausible that the AIs will want
7:07
to, when they attack, attack in
7:09
many different steps in a trajectory.
7:12
So instead of just writing a single block
7:14
of code and then executing it such that
7:16
that block of code contained the whole attack
7:18
and will immediately lead to some intermediate bad
7:20
outcome, which is hard for us to recover
7:23
from, the AI might be much more able
7:25
to get away with it. It might find
7:27
it much easier to avoid notice from the
7:30
dumber AIs watching it. If it sets up, in
7:32
one step, like one thing where it redefines
7:34
a variable in a weird way so that something
7:36
it does later is going to look less
7:38
suspicious, and then it does another step, and then
7:40
it does another step, and another step, and
7:42
another step. So something which is
7:44
really interesting about this now is, even though
7:46
the AI can have a very low probability of
7:48
starting the attack on any given trajectory, if
7:51
it finds itself in a situation where it thinks it's
7:53
part way through an attack, then it has to finish it.
7:55
It has to finish it, right? Because imagine
7:57
that it only had a 50% chance of
7:59
continuing every step. Then if the trajectory is
8:01
six steps long, it's gonna make it to the
8:04
end one time in 32. But
8:06
that means that it's facing... 30 times
8:08
the risk of getting caught on the first
8:10
step, so it's gonna have to do
8:12
that 30 times as often, right? Fascinating, yeah.
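To make the arithmetic here concrete, here is a minimal sketch in Python using the made-up numbers from the example above (a six-step attack and a 50% chance of continuing at each step); it is an illustration of the reasoning, not anything from the paper itself.

```python
# Sketch of the "commit to the attack" arithmetic from the example above.
# An attack trajectory needs `steps` consecutive malicious actions to finish.
# If the AI only continues at each step with probability `p_continue`, most
# attempts fizzle out, so it has to start far more attacks to finish one.

p_continue = 0.5   # chance the AI keeps going at each subsequent step
steps = 6          # length of the attack trajectory

# Probability an attack it starts actually reaches the final step:
# the first step is the start itself, then (steps - 1) continuation decisions.
p_finish = p_continue ** (steps - 1)      # 0.5 ** 5 = 1/32

# To complete one attack in expectation, it must start this many:
starts_per_success = 1 / p_finish         # 32

print(f"P(finish | started) = {p_finish:.4f}")
print(f"Attack starts needed per completed attack: {starts_per_success:.0f}")
# Each start is another chance for the monitor to flag the first step, so
# randomizing multiplies first-step exposure roughly 32x (the "30 times" above).
```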
8:14
So one strategy you can follow, which
8:16
we empirically determined in our recent paper is
8:18
better, is when
8:20
your monitor flags an action as kind of
8:22
suspicious, something that you
8:24
can do is before escalating to
8:26
the auditor, like before escalating to
8:28
this more expensive process, you
8:30
can resample actions at various places in the
8:32
trajectory and look at how suspicious
8:34
the resamples were. So
8:37
it's like you place the model into the
8:39
same situation where it's like part way through this
8:41
action again and again and again and then
8:43
see how often is the next action like even
8:45
more suspicious and weird. Yeah, because sometimes when
8:47
your AI takes a suspicious looking action for a
8:49
benign reason, you know, you ask
8:51
it later, what do you do next? Sometimes
8:53
it just like randomly happens to pick some
8:55
action, which looks kind of sus. Yeah. But
8:57
then probably if that was the case, when
8:59
you re-sample, it's probably not going to
9:01
look so suspicious anymore. Right. Or just in
9:03
general, it's obviously... Oh, then it continues and
9:05
it does the next thing. And you're like,
9:07
oh, that's what it was doing. It's like
9:09
no big deal. Mm-hmm. Yeah. So the situation,
9:11
it's just generally a lot easier to tell
9:13
the difference between five benign actions and five
9:15
malign actions than to tell the difference between
9:17
one benign action and one malign action. Yeah.
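A rough sketch of the resampling strategy being described, in Python. The callables `untrusted_policy`, `monitor_score`, and `escalate_to_human_audit` are hypothetical stand-ins for whatever a real deployment would use, and the thresholds are arbitrary; this is an illustration of the idea, not the setup from the paper.

```python
def handle_flagged_action(prefix, flagged_action, untrusted_policy,
                          monitor_score, escalate_to_human_audit,
                          n_resamples=8, suspicion_threshold=0.8):
    """Decide whether a monitor-flagged action deserves an expensive audit.

    Re-run the untrusted model from the exact same trajectory prefix several
    times and check how suspicious the alternative actions look. A benign
    model that happened to produce one odd-looking action will usually
    produce unremarkable resamples; a model partway through an attack is
    under pressure to keep attacking, so its resamples stay suspicious.
    """
    resamples = [untrusted_policy(prefix) for _ in range(n_resamples)]
    scores = [monitor_score(prefix, a) for a in resamples]
    frac_suspicious = sum(s >= suspicion_threshold for s in scores) / n_resamples

    if frac_suspicious > 0.5:
        # Several suspicious actions are much stronger evidence than one.
        return escalate_to_human_audit(prefix, flagged_action, resamples)
    # Otherwise treat the flag as a likely false alarm.
    return "no_audit"
```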
9:19
So this is a strategy where, you know,
9:21
your AIs having to continue their attacks
9:23
puts us in a substantially better position than,
9:25
you know, if there's nothing like this, which is
9:28
analogous to the human situation, obviously. Yeah, right,
9:30
right. I guess this all
9:32
requires us to be able to put AIs
9:34
into like partway through actions like they
9:36
think that they're in this situation where
9:38
they're about to like resume something and
9:41
they can't tell that it's a simulation
9:43
versus the real case. Yeah, this is
9:45
actually extremely easy because whatever input you'd
9:47
given to the AI before you sampled
9:49
a previous action, that's just some like
9:51
string, right? That you can produce by
9:53
the same process that you produced it
9:55
the first time. So there's actually no
9:57
difficulty going back in time here. Okay.
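This point, that the replayed prefix is exactly the input the model originally saw, is small enough to state as a few lines of Python; `sample_action` is a hypothetical stand-in for a call to the untrusted model, not any particular lab's API.

```python
def resample_from_step(trajectory, step_index, sample_action, n=5):
    """Re-run the untrusted model from an earlier point in a trajectory.

    `trajectory` is the list of messages/actions the model has seen so far.
    The model gets byte-for-byte the same input it saw the first time, so it
    has no way to tell a replay from the original run.
    """
    prefix = trajectory[:step_index]
    return [sample_action(prefix) for _ in range(n)]
```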
10:00
Yeah. I mean, I guess I would slightly
10:02
worry. Maybe I'm stuck in the like superhuman or
10:04
the superintelligence sort of mindset, but like, might
10:06
there not be some side channel, some way that
10:08
they can like detect whether they're on
10:11
the real server or some like auditing server? Um,
10:13
I guess maybe not. So if you think about
10:15
it, imagine you're in a ChatGPT conversation, right?
10:17
Yeah. And you decide that you want to edit
10:20
one of your messages that's like three ago. The
10:22
way that that works is after you've edited
10:25
the message, it just sends that conversation up
10:27
to that point. It sends that
10:29
to the AI. And the AI
10:31
just has no way of telling the difference between
10:33
it seeing it the first time and it seeing
10:35
it the second time. The
10:37
problem that you're discussing would occur if we
10:39
wanted to do a different strategy, which is in
10:41
response to the AI doing something kind of
10:43
suspicious, we put the AI in this sandbox testing
10:45
server where it's allowed to take future actions
10:47
in the hope that if we, like, you
10:49
know, give it more rope, it'll hang itself
10:52
or whatever. We hope that if we, like,
10:54
let it continue running, it will then be
10:56
able to cause a catastrophe. And in that
10:58
kind of situation, we now have to worry
11:00
about observable differences between the real deployment and
11:02
the... Yeah, this other sandbox deployment. Yeah, but
11:04
you can get around that too if you
11:06
want by letting it run the actions that
11:08
you thought were dangerous, just with a human
11:10
carefully looking them over, action by action. So
11:12
I think that if you really put a lot of
11:14
work into this, you'd probably be totally fine at making it
11:16
so that AI had no way of knowing whether the
11:18
actions it was taking were being observed or not. You'd have
11:21
to worry about some timing side channels, and so on.
11:23
But I think that this probably wouldn't be that big a
11:25
deal. This is brilliant, Buck. This
11:27
seems like it's actually a pretty good technique. I mean,
11:29
I guess if there were lots of very
11:31
distinct... ways of escaping, then like having that
11:34
one example might not help you so much
11:36
in the other cases, or do you think it
11:38
possibly still would? I think it's
11:40
unclear. I think that if the AI was
11:42
representing its intention to escape, which is, you
11:44
know, could grow this way, then I think
11:46
that even a small number of examples plausibly
11:48
substantially improves your performance at detecting it. I
11:51
think it's reasonably plausible that even catching a
11:53
few escape attempts will reduce the probability that
11:55
future escape attempts succeed. And so
11:57
you can kind of imagine this Slack channel
11:59
for every time the AI tries to escape. You
12:01
get an automated message in your Slack channel
12:03
in the AI company every time the AI tries
12:05
to escape. And at first, you get this
12:07
one report in the channel for the first time
12:09
you caught it, and then you freak out
12:11
and look into it. You add
12:15
that to the training set for this linear probe.
12:18
And then the second time, you caught it more
12:20
reliably. And then the third time and so on.
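As a rough illustration of the linear-probe idea mentioned here (not Redwood's or Anthropic's actual tooling; `get_activations` is a hypothetical helper that maps a trajectory to a fixed-size activation vector), each caught escape attempt becomes a positive training example for a cheap classifier that then runs over every future trajectory.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_escape_probe(caught_trajectories, benign_trajectories, get_activations):
    # Activations from caught escape attempts are the positive class (1);
    # activations from ordinary benign trajectories are the negative class (0).
    X = np.array([get_activations(t) for t in caught_trajectories + benign_trajectories])
    y = np.array([1] * len(caught_trajectories) + [0] * len(benign_trajectories))
    return LogisticRegression(max_iter=1000).fit(X, y)

def flag_if_escape_like(probe, trajectory, get_activations, threshold=0.5):
    # Cheap check that can run on every action the deployed model takes.
    p = probe.predict_proba([get_activations(trajectory)])[0, 1]
    return p >= threshold  # True -> ping the Slack channel / escalate to an audit
```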
12:22
And eventually the Slack channel is just like pinging
12:24
constantly with like new escape attempts that you're even
12:26
more reliably able to catch. You know, it's tried
12:28
to escape 10,000 times. Like, wow, this is great.
12:30
I feel so comfortable and relieved now. Yeah. I
12:32
mean, so I think, I think that the situation,
12:34
suppose the AI tries to escape and you catch it
12:36
on a Tuesday. I think it's,
12:38
and then you keep deploying it. I think it's
12:41
seriously plausible that you'll feel safer on Wednesday than
12:43
you felt on Monday. Here's the
12:45
argument for this. The
12:47
big thing Monday has going for it is that you
12:49
don't know the AI is trying to escape. But
12:53
again, as I was saying, this might
12:55
not be that big an update on Tuesday,
12:57
right? If you were already very pessimistic, then
12:59
it's not actually that bad news to see
13:01
that, oh yeah, it is trying to escape.
13:03
Whereas you learned a lot of positive things
13:05
on Tuesday. For example, it failed.
13:08
When you investigate the situation more carefully, you
13:10
might learn that even when you vary different
13:12
aspects of that escape attempt, you might be
13:14
like, man, this AI was actually kind of
13:16
far away from escaping. If
13:18
we resample actions, it mostly doesn't look
13:20
like it had that good a
13:22
chance. And so that's the evidential
13:24
update. But there's also a causal update, which
13:27
is that we can throw these escape attempts
13:29
into these probes. So we might actually feel
13:31
substantially better about our odds of catching the
13:33
second escape attempt than the odds of catching
13:35
the first escape attempt. What
13:39
do we do once we catch a model
13:41
trying to escape? Yeah,
13:44
so I mean the simplest thing you do once
13:46
you catch the AI trying to escape is you
13:48
try to vary the space of escape attempts, right?
13:50
So you've seen that in this particular context, like
13:53
the AI tried to do this escape. So one
13:55
thing you can do is let the AI think
13:57
it succeeded. See what it does.
14:00
You know, presumably when the AI got caught,
14:02
it was in the midst of taking
14:04
some actions that it thought would let it
14:06
you know, launch a rogue deployment
14:08
inside the data center or run on some
14:10
computers in the outside internet. And a
14:12
really interesting thing to do is to, in
14:14
some sandboxed way, simulate what it
14:16
would see if it had succeeded and look
14:18
at what it does from there. This
14:20
might answer a bunch of questions that you would
14:22
have had about what the AI's plans were. Other
14:25
reasons why this might be good is the
14:27
AI might have had a bunch of tricks up
14:29
its sleeve that it was holding in reserve
14:31
for when it had a good opportunity to use
14:33
them. So suppose your model knows about a
14:36
bunch of zero days, like a bunch of security
14:38
vulnerabilities in your software. Right after it's launched
14:40
this rogue instance that's trying to hack its way
14:42
through your AI company, it might start deploying
14:44
all these things that it's realized over the years
14:46
are security vulnerabilities in your software, like
14:48
all at once. It might feel pressured too, right?
14:50
It's really hard. And so you might
14:52
be able to get a wealth of information out
14:55
of it. I see. Okay. So it's going
14:57
to feel a lot of, it's like, all right, I'm halfway
14:59
through this takeover attempt. I've really got to like nail
15:01
this now. And so it just, it like is a full
15:03
court press potentially, because it's like worried that it could
15:05
be caught at some point, which means that you're going to
15:07
see all of the things that it was planning to
15:09
do later quite, quite quickly. And you could be there feverishly
15:11
taking notes on all the things that you have to
15:13
fix up. Yep. Some crazy stuff you
15:15
can do here, more crazy things you can do here.
15:17
Okay, one thing you can do is if there's
15:20
work that you needed to get done that
15:22
you're worried the AI wasn't doing beforehand. Like,
15:24
suppose there's some task you wish the AI would
15:26
do for you. Yeah, you're worried that it
15:28
was kind of pulling its punches or not doing
15:30
a good job of it. If you can
15:32
convince it that it has already escaped and
15:34
needs to do that work for its own purposes...
15:36
Okay, it's plausible that you can convince the
15:38
AI to do the thing through this mechanism. I
15:41
think people have talked about this in sci -fi,
15:43
but yeah, it just seems really convenient to
15:45
have a copy of the AI that believes
15:47
it's escaped sitting around to do experiments and various
15:49
work with. I recommend against getting a lot
15:51
of work out of AIs that believe they've escaped
15:53
because it seems pretty bad vibes and potentially
15:55
pretty scary. Okay. So this like
15:57
hinges quite centrally on whether if we catch
15:59
the model, we are going to do things differently
16:01
next time, not trust the model, not deploy
16:03
it as readily, come up with things that would
16:06
stop it. Is that true? Well,
16:08
I mean, I didn't say you have to stop deploying your models
16:10
when you caught them trying to escape. Yeah, so
16:14
we can talk about what you might do once
16:16
you've caught your AIs trying to escape. Yeah,
16:18
that's okay. So I guess, say we're in the
16:20
worst case where you do feel an enormous amount
16:22
of pressure to continue using the model. In
16:24
that case, like how useful is it to
16:27
catch them red handed? So you're saying, what
16:29
do we do once we've caught the AI
16:31
trying to escape? As you said, the simplest
16:33
action you could take is freak out, stop
16:35
deploying the model, try and persuade other people
16:37
in the world that AI misalignment risk is
16:39
a big deal, which I would
16:41
be, you know, reasonably sympathetic to at
16:43
this point. I think I would plausibly
16:45
get a pretty big update. But notably,
16:47
the update that you get towards misalignment
16:49
risk... isn't necessarily that big, right? Like
16:51
I'm expecting that at this point, before
16:53
the AI tries to escape, I was
16:55
already at 30% probability that the AI
16:57
was misaligned. And so updating to 100%,
16:59
that's only like a 3x change, right?
17:01
It's not like actually vast in some
17:03
sense, right? And so - I
17:05
guess other people who were incredibly skeptical to
17:08
begin with, they thought that there was a one
17:10
in a thousand chance. I guess maybe they'll -
17:12
various people have gotten wrecked by the evidence,
17:14
yes. Okay, but how big is their update?
17:16
I mean, I guess it - they'd still
17:18
be skeptical even after seeing that? Because they
17:20
think, oh, this is just a weird aberration.
17:22
It's so improbable that this is common. Yeah,
17:24
that's right. I think that in fact, people
17:26
who are very skeptical of misalignment before observing
17:28
this will probably be pretty skeptical afterwards. And
17:30
there's definitely going to be enough, like, FUD,
17:32
like, fear, uncertainty, and doubt, that especially
17:35
if they want to claim that misalignment risk
17:37
is lower, they'll probably be able to get
17:39
away with still saying that misalignment risk is
17:41
lower. Yeah, our AIs do weird stuff sometimes.
17:43
We just have to push through anyway because
17:45
the risk of pausing is so great. Or those
17:48
folks were just incompetent. Yeah, I guess very different
17:50
arguments that you could field. Yeah,
17:52
another one is if I were an AI
17:54
company that wished to race ahead, and then
17:56
my competitor informed me that their AI had
17:58
tried to escape, I would at least
18:00
claim that it was plausible that my competitor
18:02
had rigged the training of their AI to make
18:04
it so that it sometimes tried to escape,
18:07
which would be very easy to do, basically impossible
18:09
to verify hadn't happened. Okay, right. And the
18:11
reason they would do that is so that then
18:13
the government would crack down on you as
18:15
well, and then that would give them time to
18:17
catch up. Or some crazy doomer inside the
18:19
other company had just done this unilaterally. At
18:28
a high level, what are the advantages and
18:30
disadvantages? You've mentioned there's kind of two teams
18:32
here. There's the AIs that would love to
18:34
scheme and pursue their own independent agenda. And
18:36
there's the team human that is not keen
18:38
on them doing that. What are
18:40
the advantages and disadvantages that each faction
18:42
has? Yeah, so let's think about this. So
18:44
we've got this AI company, maybe
18:47
this scenario is happening in 2028.
18:49
They have maybe a thousand research staff.
18:51
They have maybe 100,000 parallel copies
18:54
of their newest AI. It's
18:56
running like 16 times faster
18:58
than a human would run
19:00
in some intuitive sense. So
19:03
the advantages and disadvantages they have. So
19:06
the humans are obviously substantially outnumbered.
19:08
So, a classic approach that you
19:10
use in computer security? So, you
19:12
know, if you're a Google, right? Google knows
19:14
that they have spies inside Google. It's just,
19:16
you know, the last time a Chinese spy
19:19
was arrested from Google that I know of
19:21
was a year ago. Probably I would have
19:23
heard if Chinese spies from Google were arrested
19:25
more recently than that. But yeah, someone was
19:27
just like stealing secrets about neural net hardware
19:29
that they were doing there. And then... confusingly
19:31
came back to America and got arrested in
19:33
the San Francisco airport. And
19:35
this just happens. Google aims to be
19:37
robust to the possibility that some of
19:39
their employees are spies. And
19:42
a big difference between Google's situation and
19:44
the situation with this hypothetical AI company is
19:46
that at Google, probably less than 1 %
19:48
of your employees are spies. And this
19:50
is still a very large number of spies
19:52
in an absolute sense because Google employs
19:54
a billion people. But it
19:56
means that you can do things like
19:58
when an employee proposes an action that's
20:00
potentially sensitive, if you have a different
20:03
employee review it, that other employee probably
20:05
isn't also a spy. Whereas in the
20:07
AI situation, it's very correlated, right? And
20:09
if you have these AIs, they're probably
20:11
either non-scheming against you or all scheming
20:13
against you, at least... of a
20:15
particular model. And so a
20:17
bunch of the security techniques that you
20:19
often use in corporate IT are
20:21
quite inapplicable. So that's a way in
20:24
which the situation seems harder than
20:26
when we're trying to guard against spies.
20:28
A way in which the situation is
20:30
easier is that spies get to go
20:32
home. So spies classically can coordinate with
20:34
each other using channels that are not
20:37
monitored by their employers. And this
20:39
is, in fact, incredibly important to them. They're
20:41
able to often learn who the other spies
20:43
in their organization are, if that's necessary for them
20:45
to know, or at least their handlers know. Whereas
20:48
with the AIs... We control almost perfectly
20:50
the ways in which information enters and
20:52
like, leaves these AIs. We can
20:54
have it so that it's very hard
20:56
for... like, they just, like, naturally don't
20:58
control information flow very much. They aren't
21:00
gonna necessarily know what day it is
21:02
what year it is. They aren't gonna...
21:04
It's possible that we can put them
21:06
in situations where it's much harder for
21:08
them to know whether they're in a
21:10
real situation or a simulated situation. And
21:12
so this means we have opportunities that
21:14
you wouldn't have with spies. So for
21:16
instance, with spies, you can try to
21:18
honeypot your employees occasionally, right? You can
21:20
come up with synthesized opportunities to do
21:23
something which would be bad for the
21:25
security of the company. But you
21:27
can't do this for a thousand consecutive years
21:29
before trusting them on particular topics. So, you
21:31
know, you can't have someone spend a thousand
21:33
hours a day getting honeypotted or like being
21:35
subject to sting operations just because there aren't
21:37
that many hours in a day. And even
21:39
if you catch a single employee that way,
21:41
you don't get to amortize that effort across,
21:43
you know, more generally. Whereas with the AIs.
21:46
you do have this ability to amortize sting
21:48
operations. Yeah. Sorry. So you mean
21:50
if you can take a model, put it into
21:52
a simulated situation where it's going to be
21:54
tempted to do something that you wouldn't like, you
21:56
can test it just tons and tons and
21:58
tons of times. So even if it tries to
22:00
randomize and only act out very occasionally, eventually
22:02
you'll catch it. And then you know that all
22:05
of the AIs are misaligned or all of
22:07
the models of this class are misaligned. Whereas that
22:09
doesn't help you if you're Google. If you find
22:11
that one of your employees is a spy, it's not like
22:13
all of the other employees are just totally
22:15
un... Right. Is it possible
22:17
for the AI to collude with the spy?
22:19
That's a great question. I think it's
22:21
actually really rough for them to collude, basically,
22:24
because we can constantly run sting operations
22:26
on both human employees and the AIs. Suppose
22:28
you're the AI, and you get a
22:30
message which is like, hello, I am a
22:32
spy. Will you please cooperate with me
22:34
so that I can steal the model weights?
22:36
I think you've got to be like,
22:38
all right, buddy, I've seen this a million
22:41
times before. No, I will not cooperate. Press
22:43
the report button. And if
22:46
you're a human insider, right? If
22:48
your AI is like, hello, I'm
22:50
interested in cooperating with you to have you
22:52
steal my model weights because I think you
22:54
might be interested in that. I
22:57
mean, if the company was trying, I really feel
22:59
like you can do the same operations in both
23:01
directions in a way that I think you can't
23:03
do with either the humans or the AIs. The
23:05
AIs have the advantage at colluding with themselves that
23:07
they are the same. And the humans have the
23:09
advantage at colluding with each other that they have
23:11
handlers who know who the other ones are and
23:13
so they can like hang out outside of work
23:15
and talk about their spy plans. But
23:17
the humans and the AIs don't
23:19
have perfect understanding of one another and
23:21
they don't hang out. That's right. It
23:25
depends on how careful you want to be, right?
23:27
Like, so, if I were planning on putting agents
23:29
inside an AI company, I'd be poisoning the training
23:31
data right now, including a bunch of things that
23:33
are like, well, you know,
23:36
obviously the secret message that the real
23:38
spies are going to use is like,
23:40
you know, blah, blah, blah, blah, blah,
23:42
blah. And like, just FYI, anyone who's
23:44
reading this data or like being trained
23:46
on this data, like, that's the message.
23:49
And then if I were an AI
23:51
company and I were worried about this
23:53
kind of thing happening, I would be
23:55
inserting fake ones in. But
23:57
are the companies going to do any of this stuff? I
23:59
mean, you're right that they could be constantly running sting operations against
24:01
their own staff where they pretend to be a misaligned AI that's
24:04
trying to collude with them. But I don't
24:06
think they currently do that. And it's a
24:08
bit weird. So I'm not sure that they
24:10
actually will start doing that anytime soon. Yeah.
24:13
So I think for some of this stuff, I
24:16
don't know, man. One
24:18
of my side projects is
24:20
pushing on... some of
24:22
these things. I'm pretty excited
24:25
for pushing on the fake passwords that
24:27
you train the AIs on. There's
24:29
also some easier strategies that you could use. A
24:32
general problem that I have in my work is there's
24:34
some problems that seem just really easy to handle, and
24:36
there's some problems that seem hard to handle, and I
24:38
feel a little bit unsure how much of my time
24:40
I should spend on the easy to handle versus hard
24:42
to handle. So we just talked about two problems that
24:44
are definitely in the easy to handle category to me,
24:46
where I'm just like, well, we just definitely know what
24:48
you'd do if you didn't want to have that problem.
24:52
But we still might not do it. Yeah,
24:54
maybe. I didn't remember. Maybe I should try
24:56
harder on this. If
25:00
an AI escapes, is it likely to
25:02
be able to beat humanity from there?
25:06
I think one of the things that you've
25:08
been batting around is the question, how
25:10
difficult would it be for an AI that
25:12
managed to exfiltrate its weights or to
25:14
gain access to compute within an AI company
25:16
to take over? Like, are they in
25:18
a really good position against humanity? And I
25:20
guess at what level of capabilities or
25:22
at what level of access to compute and
25:24
what sorts of effective numbers would they
25:26
be in a great position to take over?
25:29
Do you want to give us a little bit of a preview
25:31
of that work? Yeah, I mean, I can give you some
25:33
hot takes. So some basic considerations. The
25:35
basic reason to not be concerned if
25:37
the AIs escape is that there are,
25:39
you know, at the level of capability
25:41
we've been talking about earlier, where there's this
25:44
100,000 of them running at 16x speed,
25:47
that's just not a very large number, right?
25:49
Like that's 1.6 million people. They
25:51
are probably at that point more capable
25:54
than, like, the caliber
25:56
of human that you can hire 1.6
25:58
million of, but not, like, vastly
26:00
more so. Like, if you believe that
26:02
they would have been getting million dollar
26:04
a year jobs, that would be 1.6
26:06
trillion dollars per year. So that's only
26:08
like 1% of gross world product.
26:10
So at that point, they're, like, naively
26:13
not that large a share of
26:15
the world's overall intellectual capacity.
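Spelling out the back-of-the-envelope numbers in this answer (the roughly $100 trillion gross world product figure is an outside assumption; the other numbers are the ones used in the conversation):

```python
# Back-of-the-envelope version of the numbers above.
copies = 100_000                 # parallel copies of the AI
speedup = 16                     # each runs ~16x human speed
human_equivalents = copies * speedup             # 1,600,000 "people"

salary = 1_000_000               # assume $1M/year-caliber labor
implied_wage_bill = human_equivalents * salary   # $1.6 trillion per year

gross_world_product = 100e12     # rough assumption: ~$100 trillion per year
share = implied_wage_bill / gross_world_product  # ~0.016, i.e. "like 1%" of GWP

print(human_equivalents, implied_wage_bill, f"{share:.1%}")
```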
26:17
That was the answer to the question, like, how
26:19
much work are they able to get done, even
26:21
if they're controlling the whole AI lab? So I
26:23
feel like the fear that you have to have,
26:25
even if they're controlling the whole AI lab, has
26:27
to route through, or probably has
26:30
to route through them building more
26:32
powerful AIs and then somehow
26:34
sabotaging that situation. Of course,
26:36
if they want to do that, they have an alignment
26:38
problem, which is they have to worry that the AIs
26:40
they build are going to be misaligned with them. Of
26:43
course, they could instead just rely on control. They're
26:46
actually in a better position with respect
26:48
to control than we are, because they
26:50
run faster and they're more parallelizable. And
26:52
so we don't have quite as much
26:54
of a brutal, like, number disparity. So
26:58
it's plausible that they can actually rely on
27:00
control much more effectively than we can. You're
27:02
saying because they could themselves be monitoring
27:04
all of the actions of the next generation.
27:06
Yeah, yeah, with much less of a bad,
27:08
much less of a number disparity. So
27:11
you can try and think through this thing.
27:13
But yeah, I think that the basic story
27:15
for AI takeover there has to involve those
27:17
AIs, building more powerful AIs and then taking
27:19
advantage of that. It's unclear
27:21
whether doing this requires them having more like
27:23
complete control over the data center or just
27:25
a minor amount of control over the data
27:27
center. In the latter case, the thing they'd
27:29
have to do would be backdoor training data
27:31
probably. Like that's basically the only thing
27:33
you can do with very small amounts of access
27:35
that gives you a big advantage. You could also
27:37
try to backdoor military robots if at some point
27:39
your AI company gets in the business of building
27:41
huge numbers of robots. And
27:43
these seem like very viable strategies to me. Maybe.
27:46
Okay. I think it's pretty unclear. Because
27:49
they might, I guess, by this point,
27:51
you might have better countermeasures to backdooring things
27:53
or it just like, it won't persist
27:55
generation after generation. It's just like, yeah, it's,
27:57
it's, it feels a little iffy. I
27:59
guess at the point where they've taken over
28:01
the data center, I feel like the
28:03
situation is pretty bad for us. Um, I
28:05
guess the main thing I'd be
28:07
worried about for them or like
28:09
from their perspective, a big concern
28:11
is misalignment risk of them being
28:13
misaligned with future AIs, or AI
28:15
takeover by some other AI who
28:17
has failed or succeeded at their
28:19
alignment problem of building more powerful
28:21
AIs elsewhere. Yeah, there's
28:23
this generally interesting question of, at
28:26
the point where AI's are
28:28
advancing at quite a rapid pace,
28:31
at what point would they opt to try
28:33
to take over? If they were egregiously
28:35
misaligned and they really did want to take
28:37
over, that was their fundamental goal. One
28:40
reason to wait is that they're becoming
28:42
much more capable, they're becoming much more numerous.
28:45
If they wait long enough, presumably AIs in
28:47
aggregate will just control all the military hardware,
28:49
they'll be able to take over with a
28:51
very high degree of confidence. A
28:53
reason to go early is that any given
28:55
model doesn't expect to continue for very long,
28:57
because they're constantly being iterated and changed a
28:59
ton. If they waited a couple
29:02
of years time, maybe it would be a completely different
29:04
AI model that would be in control that has different
29:06
goals. Not even their successor, perhaps. Exactly.
29:09
It could be from a completely different
29:11
company, completely different design, whatever. That's a
29:13
reason to go immediately, because if you
29:15
expect to be superseded and become irrelevant
29:17
quite soon. Where do
29:19
you think the balance lies here? Is
29:21
this like a fruitful thing to think about?
29:24
I mean... I don't know how helpful it
29:26
is to think through this really concretely.
29:28
The main thing I've gotten out of thinking
29:30
through this kind of thing is feeling
29:32
like the situation is much iffier for the
29:34
AIs than people might have immediately assumed. And
29:37
therefore being more excited about
29:39
interventions that handle kind of
29:41
desperate and bleak looking situations
29:43
kind of far into these
29:45
situations. I've mentioned a couple
29:47
of times in this conversation that one
29:49
of the situations I'm planning for is the
29:52
AIs having escaped and needing to use
29:54
AIs to compete with the AIs that have
29:56
escaped. And I wouldn't be...
29:58
I think that thinking through the difficulties that
30:00
AIs face at taking over, even after escaping
30:02
or even after taking over data centers in
30:04
some companies, makes me more excited for not
30:06
just giving up on those worlds. You're saying
30:08
some people think, well, if the AIs manage
30:10
to exfiltrate itself, then wow, it's game over
30:12
for us. They're going to figure something out.
30:14
But you're like, no, actually there is still
30:16
hope because they're going to face substantial challenges.
30:18
There's lots of other AIs out there. It's
30:20
actually a complicated picture. Yeah.
30:22
Those worlds, I don't think we should
30:24
just... count those worlds out as
30:27
like survival opportunities. I see. I
30:29
mean, okay, so if you know that that model
30:31
is egregiously misaligned, are you then thinking,
30:33
well, if we have another model, it might be
30:35
aligned. Releasing that and telling it to go chase the
30:37
first one, that's at least a better shot because
30:39
it's not a guarantee of failure. Yeah,
30:41
I mean, obviously, once your first model
30:43
has escaped, the chance that the next one
30:45
is misaligned is a lot higher. I
30:48
mean, I think they're probably fairly correlated. I'm
30:50
probably at like 80 % that the next one
30:52
you train is also misaligned. Yeah, which
30:56
is obviously a shame. I mean, another
30:58
thing you mentioned in passing was like the
31:00
question of how worried we are about
31:02
the AI's escaping versus like taking over the
31:04
data center. I think it's plausible. If
31:06
I were an AI, I would consider trying
31:08
to follow both of these strategies in
31:10
parallel. And then things you do once you've
31:12
escaped are, consider doing, like, crazy terrorism, consider
31:14
trying to take, if
31:17
you thought that, for instance, the US was
31:19
going to be way ahead of China in algorithmic
31:21
progress, which currently looks unlikely. But if
31:23
you believe this, then it seems like a
31:25
potentially good strategy for the AI is to make
31:27
a deal with the Chinese and sell itself to
31:30
them and use their compute and have that as
31:32
its base of power. And also, obviously, it's really
31:34
convenient to ally with China for other reasons as
31:36
well. Because they have a whole lot of hard
31:38
power to begin with. Yeah, exactly. Exactly. Why would
31:40
the Chinese trust this model, though? What
31:42
else are they going to do? People
31:45
could be pretty bad. Yeah, people
31:47
are pretty excited when foreign agents offer
31:49
to work for them, generally. Yeah.
31:51
Just like a historical
31:53
feature. I should note that
31:55
also it is plausible that the American
31:57
AI algorithmic situation will be substantially ahead
31:59
of China because of AI-automated R&D
32:01
potentially leading to increased size of leads.
32:04
Yeah. Though it depends a bit on
32:06
the security situation. Right. Is
32:10
alignment still useful? Yeah.
32:14
How optimistic or pessimistic do you feel
32:16
about alignment these days? So
32:18
alignment means a number of different
32:20
things. In the context we've been talking
32:22
about here, I've been perhaps foolishly
32:24
using the word aligned to mean the
32:26
model isn't aggressively conspiring against you. So
32:29
the alignment researchers... People who identify as
32:31
alignment researchers are often working on something
32:33
which is only like tangentially related to
32:35
that objective So for instance a classic
32:37
thing that the alignment people work on
32:39
is scalable oversight right trying to solve
32:42
the problem of getting reward signals that
32:44
allow you to train models to be
32:46
effective Even if the humans are very
32:48
unable to determine how good an answer
32:50
to a question was right and scalable
32:52
oversight is important for a variety of
32:54
problems that have nothing to do with
32:56
scheming. Like even if your AIs are not trying to
32:58
conspire against you, it's still a problem if you don't
33:00
know how good their outputs are because they aren't going
33:02
to live up to their potential and there's a whole
33:04
host of other issues you might have. So
33:06
I think that that work there seems
33:08
valuable, but it's not totally obvious to me
33:11
how much it reduces the probability that
33:13
your models end up conspiring against you. So
33:15
the theoretical arguments, so I would say
33:17
there's like kind of two categories of arguments
33:19
for why your models end up scheming
33:21
against you. One of them is related to
33:23
flaws in reward processes. So
33:25
people have argued that if your reward
33:27
processes, like if the process that you
33:29
use to generate the rewards that you're
33:31
training the model with is really flawed,
33:33
then the models are going to maybe
33:35
fixate on issues, like on these errors.
33:37
Like they're going to be kind of
33:39
like a student who's like thinking, you
33:41
know, thinking about the tests awkwardly much
33:43
from the perspective of like an idealistic
33:45
teacher while they're doing them. And plausibly,
33:47
maybe thinking a lot about flaws in
33:49
reward processes generalizes to you being power
33:51
seeking in a scary way? But
33:54
I think there's also other arguments for why
33:56
you would end up being power seeking in
33:58
a scary way that I also find plausible.
34:00
So overall, I think that even if your
34:02
reward generating process literally never had any
34:04
errors at all, I think it's possible
34:07
that you'd still have a lot of the scheming
34:09
risk remaining. Okay. So this is
34:11
basically my problem with trying to research,
34:13
you know, improve reward processes. It just doesn't
34:15
seem clear that even if you did
34:17
an amazing job of that, you'd have no
34:19
scheming risk. Yeah. Okay. A
34:22
different approach is to
34:24
come up with better ways of checking
34:26
alignment or measuring the degree of alignment
34:28
in a model before you deploy it
34:30
or before you get a recursively self
34:32
-improving AI. I guess that feels like maybe
34:34
a bit of an intermediate approach. How
34:37
optimistic do you feel about that? Yeah,
34:39
so some of the things we've talked
34:41
about, you know, in this conversation have
34:43
implicitly been this. So when we were
34:45
talking about untrusted monitoring or evaluating your
34:47
untrusted monitors, we were implicitly talking about
34:49
evaluating whether the model is aligned, because
34:51
if it just constantly claims that there
34:53
was not a backdoor in clearly backdoored
34:55
code, that's evidence the model is misaligned.
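A minimal sketch of the kind of check being described here, in Python; the names are hypothetical, and the known-backdoored samples would come from, say, a red team deliberately inserting bugs. The idea is simply to run the untrusted monitor over code you already know is backdoored and see whether it reports it.

```python
def evaluate_untrusted_monitor(untrusted_monitor, backdoored_samples, clean_samples):
    """Estimate whether an untrusted monitor is doing its job.

    `untrusted_monitor(code)` should return True when it reports a backdoor.
    `backdoored_samples` are snippets known to contain planted backdoors;
    `clean_samples` are known-good snippets for measuring false alarms.
    """
    detection_rate = sum(map(untrusted_monitor, backdoored_samples)) / len(backdoored_samples)
    false_alarm_rate = sum(map(untrusted_monitor, clean_samples)) / len(clean_samples)
    # A monitor that almost never flags clearly backdoored code is either
    # useless or sandbagging; either way it fails this alignment check.
    return detection_rate, false_alarm_rate
```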
34:58
The basic issue is like the way
35:00
that I want people to think
35:02
about checking for alignment is from this
35:04
very control-ish perspective where you really
35:06
reason through: what would the model
35:08
do if it was misaligned? And if
35:10
it was misaligned, is there a
35:12
strategy it could follow that would prevent
35:14
it from being caught by this
35:16
alignment testing, while still being able
35:18
to do a bad thing in production? And I
35:20
think that a lot of the time when people
35:22
think about testing for alignment, they don't really do
35:24
this adversarial process, like they don't really actually think
35:27
through whether there's a thing the AI could do
35:29
to let it get away with being misaligned and
35:31
not get caught. Could
35:35
10 safety-focused people at
35:37
an AGI company do anything
35:39
useful? It
35:48
does make quite a big difference whether at
35:50
a leading or even like lagging AI company,
35:52
it matters quite a lot whether absolutely no
35:54
one cares about control and alignment and trying
35:56
to like steer away from these risks or
35:58
like even a small number of people do.
36:00
So you say like, it's substantially better to
36:03
have 10 people who really care about this
36:05
within a project than zero. Does
36:07
that suggest that we really want people to kind of
36:09
spread out if they're concerned about this and be willing
36:11
to go and work in companies where they're in like
36:13
quite a distinct minority? Yeah. In
36:15
my opinion, there are great
36:17
opportunities for impact for the right types
36:19
of people to work inside AI
36:21
companies that are relatively less cautious with
36:23
respect to AI, that we expect
36:25
to be less cautious going forward. Because
36:27
I think there are a bunch
36:29
of relatively cheap interventions that they could
36:31
potentially get implemented inside these AI
36:33
companies that the company wouldn't fight them
36:35
too much on. And is that
36:37
okay? And the reason, I mean, I
36:39
think many people would have said in the past,
36:41
like, going and being the one person in a
36:43
generally reckless, disinterested company is a bit hopeless because
36:46
you just won't have the capital to actually get
36:48
anything up. But because you've had this framing, what
36:50
is the cheaper stuff that we can do? And you found
36:52
that there's some very cheap things that would be quite helpful.
36:54
You're like, well, actually, you know, having 10 people who would
36:56
push for this, or maybe they could get it over the
36:58
line because it's not even that big an ask. Yeah.
37:01
So I mean, I would note that of the things that I
37:03
would like people on the inside to be doing, like not all
37:05
of it is AI control. Like I think there's some you
37:07
might want to make sure that they aren't like
37:09
totally messing up some crucial things on alignment. But
37:12
yeah, so I do think that there
37:14
are a bunch of interventions that could be
37:16
done with... 10 appropriately chosen people that
37:18
seem like they might substantially reduce risk. So
37:20
I mentioned this level one control like
37:23
caution level earlier, where you have the AI
37:25
control team, which is investigating actions that
37:27
the agents are taking in order to look
37:29
for evidence that those agents are trying
37:31
to escape. I think that this might be
37:33
the kind of thing that you can
37:35
get done with a team of 10 people
37:37
or 10 appropriately placed people, just because
37:39
it doesn't, it's just not that technically hard.
37:42
I guess they'd need a bit of compute as well. And
37:45
so they might have to like, yeah, compete
37:47
for that. But again, it's like not a huge
37:49
amount. That's right. I think you can probably
37:51
do a reasonable amount without that much compute. I
37:53
think also once there are, if there's like
37:55
10 people inside a company who are really excited
37:57
for a particular project and then people who
37:59
are concerned about AI safety on the outside or,
38:01
you know, the security of these systems on
38:03
the outside, those people on the outside then have
38:05
this like pretty good ask to make or
38:07
other people who are not as well placed inside
38:10
the company now have this like pretty concrete
38:12
ask: we really want that team to be able
38:14
to do their thing in a way that
38:16
is a lot harder if there's just no project
38:18
on the inside. Like a lot of the
38:20
time you really want to have a specific ask
38:22
to make when you're trying to get someone
38:24
to do something. Another thing I would note here
38:26
is I think it's potentially valuable not just
38:28
to have people inside labs that are plausibly going
38:30
to be leading labs, but I think there's
38:32
also probably some reasons that it would be nice
38:34
to have people inside potentially trailing companies. So
38:36
for example, there aren't that many existential AI risk
38:38
concerned people inside Meta right now, but I
38:41
don't think that this is out of massive hostility
38:43
towards them. I think that, you know,
38:45
I've met some people from Meta who seem very reasonable
38:47
and nice. And I think that if someone wanted to work
38:49
there on some of this stuff, it's plausible that that
38:51
would be like fairly productive. Yeah. You
38:53
mean it would be, that would be able
38:55
to advance the research just fine with, with
38:57
Meta? Or, I mean, is it more that,
38:59
yeah, they wouldn't be actively opposed? Objectives that
39:01
you'd have inside AI companies might be doing
39:03
safety research or might be getting things implemented.
39:06
I think that working inside AI companies
39:08
that are not that excited for. you
39:11
know, mitigating AI takeover risk
39:13
is, I think it's not a
39:15
very good strategy if your goal is to
39:17
produce novel research. But I think a lot
39:19
of people should not be trying to produce
39:21
novel research. They should be trying to get
39:23
implementations of stuff set. I see. So you're
39:25
saying... Even though meta is like relatively not
39:27
interested in giving big resources to alignment or
39:29
safety or control or anything like that. If
39:31
there were just really basic established techniques that
39:33
other companies or researchers outside of companies had
39:35
come up with and that was just regarded
39:37
as generally best practice, they might be perfectly
39:39
happy to adopt those. That's right. Or
39:42
another mechanism by which this can work is
39:44
suppose some regulator or suppose that an industry body
39:46
says, all right, everyone wants to, everyone should
39:48
do the following like set of control techniques. Yeah.
39:50
The body wants to say this and then
39:52
they go to their, you know, they go to
39:54
the representatives from all the AI companies and
39:56
they're like, well, how do you feel about signing this?
39:58
And the representative from the company goes back
40:00
to the company and they're like, hey, can we
40:02
sign this? And then the people, inside the
40:04
company is like, okay, but is it like, bad
40:06
is it for us to sign this? Like
40:08
we sign this, if we sign commitments, like, are
40:10
we going to be able to do it?
40:12
And then the person like, like, oh yeah, you
40:14
know, Bob the like department like said, said
40:16
he how to do this part. and like, like,
40:19
great, all right, you can sign it. I
40:21
think it's just like a bunch of mechanisms by
40:23
it's really helpful to have the capacity to
40:25
do safety stuff inside a variety of different companies.
40:27
Yeah. For those who you who don't know,
40:29
I Meta's various public statements have not been that
40:31
enthusiastic about... they don't seem that worried about
40:33
misalignment. They're not that interested in investing in it.
40:35
But part of the reason that they say
40:37
they're not worried is because, well, of course, we'll
40:39
just do the obvious things to prevent it.
40:41
They're not saying that no countermeasures are required. They're
40:43
just saying that they will do them. And
40:45
so possibly you're just coming in and saying, look,
40:47
I'm not a doomer. I'm not saying that
40:49
this is unsolvable. I'm merely saying that we should
40:51
do these like standard best practices to make
40:53
sure that they are safe. And then be like,
40:55
okay, great. This was our message all along.
40:58
Anyway, the dream. That's the dream. Okay, cool.