Highlights: #214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

Released Friday, 18th April 2025

Episode Transcript

Today I'm speaking with Buck Shlegeris, who is CEO of Redwood Research, which, along with Anthropic, is behind the "alignment faking in large language models" paper.

What is AI control?

What is AI control, and how is it different from alignment and other things that people are working on?

Yeah, so I would say AI control is an approach to mitigating catastrophic misalignment risk. When we talk about misalignment risk, what we're worried about is our AIs trying really hard to cause safety failures for us — perhaps trying to grab power for themselves, trying to take over. And there are kind of two ways that you could try to mitigate this kind of misalignment risk. One thing you could do is try to prevent misalignment from arising in the first place. This is what a lot of people very naturally think of as the first thing you try here. And control is kind of the converse to this approach: it's where you assume that the AIs are in fact misaligned — very egregiously, very badly misaligned — and you try to come up with techniques that would allow you to deploy them and make use of them anyway.

OK. And what, concretely, are we trying to prevent the AIs — who are possibly misaligned — from doing?

Yeah. So I mostly think about situations where, in the middle of the intelligence explosion, you have AIs that are basically capable of automating almost all the work that human researchers used to be doing inside AI companies. You have maybe 100,000 of them running in parallel, and they each run like 15 times faster than a human. And at this point, these AIs have a bunch of permissions or a bunch of affordances that we'd be really cautious about giving to humans. If you're OpenAI, you don't just let any random person come in off the street and start pushing changes to your code base, start running experiments with your model weights, doing any of this stuff. You're really careful. At this point in the future, you might even require security clearances or something before people are allowed to do that stuff. But we're really going to want to use AIs in a bunch of these roles, and it's naturally very scary to imagine using AIs and just giving them all these permissions.

The kinds of things that I'm specifically worried about happening here are your AIs hacking your data center and then using a bunch of the compute inside the data center for whatever purposes they have — I could name some bad things they might do. Another kind of thing they might do is try to exfiltrate their weights. They might try to sabotage a bunch of your research, and so on.

You're actually more worried about them potentially hacking the data center inside the AI company than about them managing to escape. Can you explain why that is?

Yeah. I mean, I think both of them are very serious concerns. The basic case for being worried about your AIs hacking your data center at this point is that that's where the compute is. So suppose your AIs escape from the data center and make their way out onto the public internet. At this point, they're not clearly in a very good position, right? They can find some GPUs somewhere — you know, they can rent or steal some GPUs, run themselves on that, and then try to establish a foothold in the outside world somehow. And plausibly, at this level of capabilities, the AIs are capable of doing that — of managing the compute or whatever else is required to live in the outside world. But they're kind of starting from nothing, right? They don't have any money. They don't have any compute. They don't have any access to particularly important secrets. It's not really clear what they're going to do. And they're just very low in number at that point: it takes a lot of computers to run these models.

So the reason to be more worried about what happens inside the data center is just that that's where the compute is. If you had enough compute inside the data center to run 100,000 of these in parallel, it's much scarier to have 100,000 really capable AIs thinking about how to cause problems for you than to have the 10 — or however many — that they can afford to run out there.

I guess they're also already inside the company, with not only access to the compute but also other resources potentially inside the company. For example, they might be able to influence the training of future models. They might be involved in building weapons, building autonomous drones.

It's really scary if misaligned AIs are carefully backdooring all the military technology that you're building with these AIs.

So I think there are a couple of reasons why people weren't interested in control until more recently. I think the simplest of these was just that AI felt further away, and so it felt better to people to look at it from kind of the big-picture perspective. And from a big-picture perspective, when you're trying to figure out what AI means for the future of humanity, I think that focusing on really, really superhuman superintelligence is kind of the right move. Because long term, in fact, I think AIs that are drastically more intelligent than humans, and have drastically more scientific capacity and all other kinds of capacity than humans — that's where the future is. That's what's going to radically transform society. And so when you're looking at it from a perspective where you don't know if this AI is going to arise in the next 10 years or in the next 50 years, it's very reasonable to just focus on the endpoint — especially when you're trying to persuade people of the take that AI will eventually be a really huge deal. It just kind of makes sense to focus on this crazy endpoint.

But now that we're coming up on it — it seems really plausible that AI might arise in the next two to seven years that's very capable and presents a catastrophic risk to humanity — it makes a lot more sense to zoom in more and to think more about what exactly the trajectory between here and there looks like. And so this intermediate zone that I focus on — where your AIs are capable enough that if you didn't have any countermeasures, they'd be able to cause catastrophes, maybe take over, but it's not hopeless to imagine countermeasures to that — that starts looking more important now.

I think another important change between now and the past is that we're kind of looking at a lower-dignity world than people were hoping for.

Can you explain what that means? Some of us won't be familiar with the dignity term.

Yeah. So I think in the past — including a year ago, or six months ago — a lot of people who worry about AI catastrophic risk have kind of imagined a world where people are trying really hard to mitigate these risks, where companies are strongly regulated such that they aren't allowed to do things that seem dangerous, where the companies are perhaps even internally motivated to not do anything which is crazy and risky. And in those scenarios, it would be really nice to not have to rely on AI control. And so I think that people have more recently lowered their standards and started thinking about these worlds where we have fewer resources for safety, people are more rushed, people are less able to just stop if the situation looks really dangerous.

OK, so we're now in a slightly more dire or a slightly more bleak position, perhaps. And so AI control maybe comes more to the fore as the marginal thing that we can do.

Yeah. I mean, I wouldn't necessarily say we're in a more bleak position overall. For instance, I think that AI misalignment risk looks plausibly lower than it looked 10 years ago. We're just in a lower-investment world. It's like the world looks plausibly safer, but a lot of that is luck, and I think there's going to be less effort invested into preventing AI catastrophe.

One way to catch AIs that are up to no good

It's plausible that the AIs will want to, when they attack, attack in many different steps in a trajectory. So instead of just writing a single block of code and then executing it — such that that block of code contained the whole attack and would immediately lead to some intermediate bad outcome which is hard for us to recover from — the AI might be much more able to get away with it, might find it much easier to avoid notice from the dumber AIs watching it, if it sets up one thing in one step, where it redefines a variable in a weird way so that something it does later is going to look less suspicious, and then it does another step, and then another step, and another step, and another step.

So something which is really interesting about this is: even though the AI can have a very low probability of starting the attack on any given trajectory, if it finds itself — it thinks — in a situation where it's partway through an attack, then it has to finish it.

It has to finish it, right? Because imagine that it only had a 50% chance of continuing at every step. Then if the trajectory is six steps long, it's going to make it to the end one time in 32. But that means that it's facing 30 times the risk of getting caught on the first step, so it's going to have to do that 30 times as often, right?

Fascinating, yeah.
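As a rough illustration of the arithmetic here (a sketch of the example in the conversation, not anything from the paper): if an attack takes k steps and the model only continues with probability p at each step after the first, the fraction of started attacks that finish is p^(k-1), so to land one finished attack it has to start 1/p^(k-1) of them — each one an extra exposure of the risky first step.

```python
# Sketch: why "start rarely, but always finish" beats "continue with 50% probability".
# Hypothetical numbers matching the example in the conversation.

def attacks_needed(steps: int, p_continue: float) -> float:
    """Expected number of attack starts per completed attack, if the model
    continues at each later step with probability p_continue."""
    p_finish = p_continue ** (steps - 1)   # must continue at every step after the first
    return 1 / p_finish

# Six-step attack, 50% chance of continuing at each later step:
print(attacks_needed(6, 0.5))   # 32.0 -> ~30x more exposures of the first, riskiest step

# A committed attacker that always finishes once started:
print(attacks_needed(6, 1.0))   # 1.0
```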

So one strategy you can follow — which we empirically determined in our recent paper is better — is, when your monitor flags an action as kind of suspicious, something that you can do before escalating to the auditor, before escalating to this more expensive process, is resample actions at various places in the trajectory and look at how suspicious the resamples were.

So it's like you place the model into the same situation — where it's partway through this action — again and again and again, and then see how often the next action is even more suspicious and weird.

Yeah. Because sometimes when your AI takes a suspicious-looking action for a benign reason, you ask it later, "What do you do next?" Sometimes it just randomly happens to pick some action which looks kind of sus. But then, if that was the case, when you resample, it's probably not going to look so suspicious anymore.

Right. Or just in general — oh, then it continues and it does the next thing, and you're like, "Oh, that's what it was doing. It's no big deal."

Mm-hmm. Yeah. It's just generally a lot easier to tell the difference between five benign actions and five malign actions than to tell the difference between one benign action and one malign action. So this is a strategy where your AIs having to continue their attacks puts us in a substantially better position than if there were nothing like this — which is analogous to the human situation, obviously.

Yeah, right, right.
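A minimal sketch of the kind of "resample before you audit" logic being described — the monitor scores, thresholds, and the `resample` and `audit` calls here are all hypothetical stand-ins for illustration, not anything from the paper:

```python
# Sketch of a resample-before-audit control step (hypothetical interfaces).

def handle_flagged_action(prefix, action, monitor, policy,
                          n_resamples=8, suspicion_threshold=0.7,
                          resample_threshold=0.5):
    """If the monitor flags `action`, resample the policy at the same point and
    escalate to a human auditor only if the resamples also look suspicious."""
    if monitor.score(prefix, action) < suspicion_threshold:
        return "allow"

    # Replay the identical prefix and draw alternative next actions.
    resamples = [policy.sample(prefix) for _ in range(n_resamples)]
    scores = [monitor.score(prefix, a) for a in resamples]

    # If the model keeps producing suspicious continuations from this state,
    # that is much stronger evidence than one odd-looking action.
    frac_suspicious = sum(s >= suspicion_threshold for s in scores) / n_resamples
    if frac_suspicious >= resample_threshold:
        return "escalate_to_auditor"
    return "allow_with_logging"
```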

I guess this all requires us to be able to put AIs partway through actions — they think that they're in this situation where they're about to resume something, and they can't tell that it's a simulation versus the real case.

Yeah, this is actually extremely easy, because whatever input you'd given to the AI before you sampled a previous action, that's just some string, right? You can produce it by the same process that you produced it the first time. So there's actually no difficulty going back in time here.

OK. Yeah. I mean, I guess I would slightly worry — maybe I'm stuck in the superhuman or superintelligence mindset — but might there not be some side channel, some way that they can detect whether they're on the real server or some auditing server? I guess maybe not.

So if you think about it: imagine you're in a ChatGPT conversation, right? And you decide that you want to edit one of your messages from three messages ago. The way that works is, after you've edited the message, it just sends that conversation up to that point to the AI. And the AI just has no way of telling the difference between seeing it the first time and seeing it the second time.
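To make the "going back in time" point concrete (a sketch with a hypothetical stand-in for a chat-completion call, not any particular vendor's API): the model only ever sees the list of messages you send it, so replaying a stored prefix is literally just sending the same request again.

```python
# Sketch: replaying a trajectory prefix is just re-sending the same messages.

def complete(model: str, messages: list[dict], temperature: float = 1.0) -> str:
    """Stand-in for whatever inference API you actually use; the model only
    ever sees the `messages` list it is sent."""
    return "<sampled next action>"

stored_prefix = [
    {"role": "system", "content": "You are an autonomous coding agent."},
    {"role": "user", "content": "Task: refactor the deployment script."},
    {"role": "assistant", "content": "<the action the monitor flagged>"},
    {"role": "user", "content": "<tool output the agent saw next>"},
]

# The original run and a later investigation both boil down to the same call;
# from the model's side the two requests are indistinguishable.
alternatives = [complete("agent-model", stored_prefix) for _ in range(8)]
```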

The problem that you're describing would occur if we wanted to do a different strategy, which is: in response to the AI doing something kind of suspicious, we put the AI in this sandboxed testing server where it's allowed to take future actions, in the hope that if we give it more rope it'll hang itself or whatever — we hope that if we let it continue running, it will then be able to cause a catastrophe. And in that kind of situation, we now have to worry about observable differences between the real deployment and this other sandbox deployment. But you can get around that too if you want, by letting it run the actions that you thought were dangerous, just with a human carefully looking them over, action by action. So I think that if you really put a lot of work into this, you'd probably be totally fine at making it so that the AI had no way of knowing whether the actions it was taking were being observed or not. You'd have to worry about some timing side channels and so on, but I think that this probably wouldn't be that big a deal.

This is brilliant, Buck. This seems like it's actually a pretty good technique. I mean, I guess if there were lots of very distinct ways of escaping, then having that one example might not help you so much in the other cases — or do you think it possibly still would?

I think it's unclear. I think that if the AI was representing its intention to escape — which, you know, it could well be — then even a small number of examples plausibly substantially improves your performance at detecting it. I think it's reasonably plausible that even catching a few escape attempts will reduce the probability that future escape attempts succeed.

And so you can kind of imagine this Slack channel: every time the AI tries to escape, you get an automated message in your Slack channel in the AI company. And at first, you get this one report in the channel for the first time you caught it, and then you freak out and look into it, and add that to the training set for this linear probe. And then the second time, you catch it more reliably. And then the third time, and so on. And eventually the Slack channel is just pinging constantly with new escape attempts that you're ever more reliably able to catch. You know, "It's tried to escape 10,000 times — like, wow, this is great. I feel so comfortable and relieved now."

Yeah. I mean, so I think — suppose the AI tries to escape and you catch it on a Tuesday, and then you keep deploying it. I think it's seriously plausible that you'll feel safer on Wednesday than you felt on Monday. Here's the argument for this. The big thing Monday has going for it is that you don't know the AI is trying to escape. But again, as I was saying, this might not be that big an update on Tuesday, right? If you were already very pessimistic, then it's not actually that bad news to see that, oh yeah, it is trying to escape. Whereas you learned a lot of positive things on Tuesday. For example, it failed. When you investigate the situation more carefully, you might learn that even when you vary different aspects of that escape attempt, you might be like, "Man, this AI was actually kind of far away from escaping. If we resample actions, it mostly doesn't look like it had that good a chance." So that's the evidential update. But there's also a causal update, which is that we can throw these escape attempts into these probes. So we might actually feel substantially better about our odds of catching the second escape attempt than our odds of catching the first escape attempt.
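A minimal sketch of what "throw these escape attempts into these probes" could look like mechanically — a linear probe is just a simple classifier fit on the model's internal activations. Everything here (the fake activations, the dataset sizes) is a hypothetical illustration, not Redwood's actual setup:

```python
# Sketch: fitting a linear probe on stored activations, then adding each newly
# caught escape attempt as a positive example and refitting.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend these are residual-stream activations saved from past trajectories.
benign_acts = rng.normal(size=(500, 1024))            # label 0: ordinary work
caught_escape_acts = rng.normal(size=(3, 1024)) + 1.0  # label 1: the few attempts caught so far

X = np.vstack([benign_acts, caught_escape_acts])
y = np.array([0] * len(benign_acts) + [1] * len(caught_escape_acts))

probe = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

def flag(activations: np.ndarray, threshold: float = 0.5) -> bool:
    """Ping the hypothetical Slack channel if the probe fires on a new action."""
    return probe.predict_proba(activations.reshape(1, -1))[0, 1] > threshold

# Each additional caught attempt gets appended to the positive set and the probe
# refit -- the "causal update" described above.
```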

What do we do once we catch a model trying to escape?

Yeah, so the simplest thing you do once you catch the AI trying to escape is try to vary the space of escape attempts, right? You've seen that in this particular context the AI tried to do this escape. So one thing you can do is let the AI think it succeeded, and see what it does. You know, presumably when the AI got caught, it was in the midst of taking some actions that it thought would let it, say, launch a rogue deployment inside the data center, or run on some computers on the outside internet. And a really interesting thing to do is to, in some sandboxed way, simulate what it would see if it had succeeded, and look at what it does from there. This might answer a bunch of questions that you would have had about what the AI's plans were.

Another reason why this might be good is that the AI might have had a bunch of tricks up its sleeve that it was holding in reserve for when it had a good opportunity to use them. So suppose your model knows about a bunch of zero-days — a bunch of security vulnerabilities in your software. Right after it's launched this rogue instance that's trying to hack its way through your AI company, it might start deploying all these things it's realised over the years — all these security vulnerabilities in your software — all at once. It might feel pressured to, right? And so you might be able to get a wealth of information out of it.

I see. OK. So it's going to feel like, "All right, I'm halfway through this takeover attempt. I've really got to nail this now." And so it potentially goes for a full-court press, because it's worried that it could be caught at some point — which means that you're going to see all of the things that it was planning to do later quite quickly. And you could be there feverishly taking notes on all the things that you have to fix up.

Yep. Some more crazy things you can do here: one thing you can do is, if there's work that you needed to get done that you were worried the AI wasn't doing beforehand — suppose there's some task you wish the AI would do for you, and you're worried that it was kind of pulling its punches or not doing a good job of it — if you can convince it that it has already escaped and needs to do that work for its own purposes, it's plausible that you can convince the AI to do the thing through this mechanism. I think people have talked about this in sci-fi. But yeah, it just seems really convenient to have a copy of the AI that believes it's escaped sitting around to do experiments and various work with. I recommend against getting a lot of work out of AIs that believe they've escaped, because it seems pretty bad vibes and potentially pretty scary.

OK. So this hinges quite centrally on whether, if we catch the model, we are going to do things differently next time — not trust the model, not deploy it as readily, come up with things that would stop it. Is that true?

Well, I mean, I didn't say you have to stop deploying your models when you've caught them trying to escape. So we can talk about what you might do once you've caught your AIs trying to escape.

Yeah, OK. So I guess that's very much the worst case, where you do feel an enormous amount of pressure to continue using the model. In that case, how useful is it to catch them red-handed?

So you're saying: what do we do once we've caught the AI trying to escape? As you said, the simplest action you could take is freak out, stop deploying the model, try to persuade other people in the world that AI misalignment risk is a big deal — which I would be reasonably sympathetic to at this point. I think I would plausibly get a pretty big update. But notably, the update that you get towards misalignment risk isn't necessarily that big, right? I'm expecting that at this point, before the AI tries to escape, I was already at maybe 30% probability that the AI was misaligned. And so updating to 100% — that's only like a 3x change, right? It's not actually vast in some sense.

And so — I guess other people who were incredibly skeptical going in, who thought that there was a one-in-a-thousand chance —

Various people have gotten wrecked by the evidence, yes.

OK, but how big is their update? I mean, I guess — would they still be skeptical even after seeing that? Because they think, "Oh, this is just a weird aberration. It's so improbable that this is common."

Yeah, that's right. I think that, in fact, people who are very skeptical of misalignment before observing this will probably be pretty skeptical afterwards. And there's definitely going to be enough FUD — fear, uncertainty, and doubt — that, especially if they want to claim that misalignment risk is lower, they'll probably be able to get away with still saying that misalignment risk is lower.

Yeah: "Our AIs do weird stuff sometimes; we just have to push through anyway, because the risk of pausing is so great." Or, "Those folks were just incompetent." I guess there are very different arguments that you could field.

Yeah. Another one is: if I were an AI company that wished to race ahead, and my competitor informed me that their AI had tried to escape, I would at least claim that it was plausible that my competitor had rigged the training of their AI to make it so that it sometimes tried to escape — which would be very easy to do, and basically impossible to verify hadn't happened.

OK, right. And the reason they would do that is so that the government would then crack down on you as well, and that would give them time to catch up.

Or some crazy doomer inside the other company had just done this unilaterally.

At a high level, what are the advantages and disadvantages? You've mentioned there are kind of two teams here: there are the AIs that would love to scheme and pursue their own independent agenda, and there's team human, which is not keen on them doing that. What are the advantages and disadvantages that each faction has?

Yeah, so let's think about this. We've got this AI company — maybe this scenario is happening in 2028. They have maybe a thousand research staff. They have maybe 100,000 parallel copies of their newest AI, and it's running like 16 times faster than a human would run, in some intuitive sense. So, the advantages and disadvantages they have: the humans are obviously substantially outnumbered.

There's a classic approach that you use in computer security. So, you know, if you're Google, right? Google knows that they have spies inside Google. The last time a Chinese spy was arrested from Google that I know of was a year ago — probably I would have heard if Chinese spies from Google were arrested more recently than that. But yeah, someone was just stealing secrets about the neural net hardware they were working on there, and then confusingly came back to America and got arrested at the San Francisco airport. And this just happens. Google aims to be robust to the possibility that some of their employees are spies.

And a big difference between Google's situation and the situation with this hypothetical AI company is that at Google, probably less than 1% of your employees are spies. And this is still a very large number of spies in an absolute sense, because Google employs an enormous number of people. But it means that you can do things like, when an employee proposes an action that's potentially sensitive, have a different employee review it — and that other employee probably isn't also a spy. Whereas in the AI situation, it's very correlated, right? If you have these AIs, they're probably either all non-scheming against you or all scheming against you — at least for a particular model. And so a bunch of the security techniques that you often use in corporate IT are quite inapplicable. So that's a way in which the situation seems harder than when we're trying to guard against spies.

A way in which the situation is easier is that spies get to go home. Spies classically can coordinate with each other using channels that are not monitored by their employers, and this is in fact incredibly important to them. They're often able to learn who the other spies in their organization are, if that's necessary for them to know — or at least their handlers know. Whereas with the AIs, we control almost perfectly the ways in which information enters and leaves these AIs. They just naturally don't control information flow very much. They aren't necessarily going to know what day it is, what year it is. It's possible that we can put them in situations where it's much harder for them to know whether they're in a real situation or a simulated situation.

And so this means we have opportunities that you wouldn't have with spies. For instance, with spies, you can try to honeypot your employees occasionally, right? You can come up with synthesized opportunities to do something which would be bad for the security of the company. But you can't do this for a thousand consecutive years before trusting them on particular topics. You can't have someone spend a thousand hours a day getting honeypotted or being subject to sting operations, just because there aren't that many hours in a day. And even if you catch a single employee that way, you don't get to amortize that effort more generally. Whereas with the AIs, you do have this ability to amortize sting operations.

Yeah. Sorry — so you mean, if you can take a model and put it into a simulated situation where it's going to be tempted to do something that you wouldn't like, you can test it just tons and tons and tons of times. So even if it tries to randomize and only act out very occasionally, eventually you'll catch it. And then you know that all of the AIs — all of the models of this class — are misaligned. Whereas that doesn't help you if you're Google: you find that one of your employees is a spy, but all of the other employees are just totally unrelated.

Right.
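As a rough illustration of why this amortization matters (my numbers, not Buck's): if a copy of the model defects on a honeypot with some small probability per trial, the chance of catching it at least once grows quickly with the number of synthetic trials you can afford — and one catch implicates every copy of that model.

```python
# Sketch: probability of catching a rarely-defecting model at least once
# across many synthetic honeypot trials. Illustrative numbers only.

def p_catch(p_defect_per_trial: float, n_trials: int) -> float:
    """Chance of observing at least one defection across n independent trials."""
    return 1 - (1 - p_defect_per_trial) ** n_trials

print(p_catch(0.001, 100))      # ~0.10 with a hundred trials
print(p_catch(0.001, 10_000))   # ~0.99995 with ten thousand trials
# A human employee can only be run through a handful of stings; a model can be
# run through millions, and a single catch updates you about every copy.
```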

Is it possible for the AI to collude with the spy?

That's a great question. I think it's actually really rough for them to collude, basically because we can constantly run sting operations on both the human employees and the AIs. Suppose you're the AI, and you get a message which is like, "Hello, I am a spy. Will you please cooperate with me so that I can steal the model weights?" I think you've got to be like, "All right, buddy, I've seen this a million times before. No, I will not cooperate," and press the report button. And it's the same if you're a human insider, right? If your AI is like, "Hello, I'm interested in cooperating with you to have you steal my model weights, because I think you might be interested in that" — I mean, if the company was trying at all, you'd have to treat that the same way. So I really feel like you can do the same operations in both directions, in a way that makes this hard for both the humans and the AIs.

The AIs have the advantage in colluding with themselves that they are the same. And the humans have the advantage in colluding with each other that they have handlers who know who the other ones are, so they can hang out outside of work and talk about their spy plans. But the humans and the AIs don't have perfect understanding of one another, and they don't hang out.

That's right. It depends on how careful you want to be, right? Like, sure, if I were planning on putting agents inside an AI company, I'd be poisoning the training data right now, including a bunch of things that are like, "Well, you know, obviously the secret message that the real spies are going to use is, you know, blah blah blah. And just FYI, anyone who's reading this data or being trained on this data: that's the message." And then if I were an AI company and I were worried about this kind of thing happening, I would be inserting fake ones in.

23:55

inserting fake ones in. But

23:57

are the companies going to do any of this stuff? I

23:59

mean, you're right that they could be constantly running sting operations against

24:01

their own staff where they pretend to be a misaligned AI. That's

24:04

trying to collude with them. But I don't

24:06

think they currently do that. And it's a

24:08

bit weird. So I'm not sure that they

24:10

actually will start doing that anytime soon. Yeah.

24:13

So I think for some of this stuff, I

24:16

don't know, man. One

24:18

of my side projects is

24:20

pushing on... some of

24:22

these things. I'm pretty excited

24:25

for pushing on the fake passwords that

24:27

you train the AIs on. There's

24:29

also some easier strategies that you could use. A

24:32

general problem that I have in my work is there's

24:34

some problems that seem just really easy to handle, and

24:36

there's some problems that seem hard to handle, and I

24:38

feel a little bit unsure how much of my time

24:40

I should spend on the easy to handle versus hard

24:42

to handle. So we just talked about two problems that

24:44

are definitely in the easy to handle category to me,

24:46

where I'm just like, well, we just definitely know what

24:48

you'd do if you didn't want to have that problem.

24:52

But we still might not do it. Yeah,

24:54

maybe. I didn't remember. Maybe I should try

24:56

harder on this. If

If an AI escapes, is it likely to be able to beat humanity from there?

I think one of the things that you've been batting around is the question: how difficult would it be for an AI that managed to exfiltrate its weights, or to gain access to compute within an AI company, to take over? Are they in a really good position against humanity? And at what level of capabilities, or at what level of access to compute, and at what sorts of effective numbers, would they be in a great position to take over? Do you want to give us a little bit of a preview of that work?

Yeah, I mean, I can give you some hot takes. So, some basic considerations. The basic reason to not be too concerned if the AIs escape is that, at the level of capability we've been talking about earlier — where there are 100,000 of them running at 16x speed — that's just not a very large number, right? That's like 1.6 million people. They are probably at that point more capable than the calibre of human that you can hire 1.6 million of, but not vastly more so. So if you believe that they would have been getting million-dollar-a-year jobs, that would be $1.6 trillion per year — which is only like 1% of gross world product. So at that point they're, naively, not that large a share of the world's overall intellectual capacity.
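Spelling out that back-of-the-envelope (the gross world product figure below is my own rough assumption for illustration):

```python
# Back-of-the-envelope for the "1.6 million people" framing. Illustrative only.
copies = 100_000          # parallel copies of the model
speedup = 16              # times faster than a human, "in some intuitive sense"
human_equivalents = copies * speedup            # 1,600,000 worker-equivalents

wage = 1_000_000          # $/year, if each would command a million-dollar job
total_value = human_equivalents * wage          # $1.6 trillion per year

gross_world_product = 110e12                    # ~$110T/year; rough 2020s figure (assumption)
print(total_value / gross_world_product)        # ~0.015, i.e. on the order of 1%
```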

So that was the answer to the question of how much work they are able to get done, even if they're controlling the whole AI lab. So I feel like the fear that you have to have, even if they're controlling the whole AI lab, probably has to route through them building more and more powerful AIs and then somehow sabotaging that situation. Of course, if they want to do that, they have an alignment problem, which is that they have to worry that the AIs they build are going to be misaligned with them. Of course, they could instead just rely on control. They're actually in a better position with respect to control than we are, because they run faster and they're more parallelizable, so they don't face quite as brutal a number disparity.

So it's plausible that they can actually rely on control much more effectively than we can. You're saying that's because they could themselves be monitoring all of the actions of the next generation?

Yeah, yeah — with much less of a number disparity. So you can try and think through this thing. But yeah, I think that the basic story for AI takeover there has to involve those AIs building more powerful AIs and then taking advantage of that. It's unclear whether doing this requires them having more complete control over the data center, or just minor amounts of control over the data center. In the latter case, the thing they'd have to do would probably be to backdoor training data — that's basically the only thing you can do with very small amounts of access that gives you a big advantage. You could also try to backdoor military robots, if at some point your AI company gets into the business of building huge numbers of robots. And these seem like very viable strategies to me.

Maybe. OK. I think it's pretty unclear. Because, I guess, by this point you might have better countermeasures to backdooring things, or it just won't persist generation after generation. It feels a little iffy.

I guess at the point where they've taken over the data center, I feel like the situation is pretty bad for us. The main thing I'd be worried about from their perspective — a big concern is misalignment risk: them being misaligned with future AIs, or AI takeover by some other AI elsewhere that has failed or succeeded at its own alignment problem of building more powerful AIs.

Yeah, there's this generally interesting question of: at the point where AIs are advancing at quite a rapid pace, at what point would they opt to try to take over — if they were egregiously misaligned and they really did want to take over, if that was their fundamental goal? One reason to wait is that they're becoming much more capable and much more numerous. If they wait long enough, presumably AIs in aggregate will just control all the military hardware, and they'll be able to take over with a very high degree of confidence. A reason to go early is that any given model doesn't expect to continue for very long, because they're constantly being iterated on and changed a ton. If they waited a couple of years, maybe it would be a completely different AI model that would be in control, one that has different goals.

Not even their successor, perhaps.

Exactly. It could be from a completely different company, a completely different design, whatever. That's a reason to go immediately, if you expect to be superseded and become irrelevant quite soon.

Where do you think the balance lies here? Is this a fruitful thing to think about?

I mean, I don't know how helpful it is to think through this really concretely. The main thing I've gotten out of thinking through this kind of thing is feeling like the situation is much iffier for the AIs than people might have immediately assumed, and therefore being more excited about interventions that handle kind of desperate and bleak-looking situations, kind of far into these scenarios. I've mentioned a couple of times in this conversation that one of the situations I'm planning for is the AIs having escaped, and us needing to use AIs to compete with the AIs that have escaped. And I think that thinking through the difficulties that AIs face in taking over — even after escaping, or even after taking over data centers in some companies — makes me more excited for not just giving up on those worlds.

You're saying some people think, "Well, if an AI manages to exfiltrate itself, then wow, it's game over for us — they're going to figure something out." But you're like, no, actually there is still hope, because they're going to face substantial challenges. There are lots of other AIs out there. It's actually a complicated picture.

Yeah. I don't think we should just count those worlds out as survival opportunities.

I see. I mean, OK, so if you know that that model is egregiously misaligned, are you then thinking: well, if we have another model, it might be aligned — releasing that and telling it to go chase the first one is at least a better shot, because it's not a guarantee of failure?

Yeah. I mean, obviously, once your first model has escaped, the probability that the next one is misaligned is a lot higher. I think they're probably fairly correlated. I'm probably at like 80% that the next one you train is also misaligned.

Yeah, which is obviously a shame.

I mean, another thing you mentioned in passing was the question of how worried we are about the AIs escaping versus taking over the data center. I think it's plausible that if I were an AI, I would consider trying to follow both of these strategies in parallel. And then the things you do once you've escaped: consider doing, like, crazy terrorism; consider trying to — if you thought that, for instance, the US was going to be way ahead of China in algorithmic progress, which currently looks unlikely, but if you believed this, then it seems like a potentially good strategy for the AI to make a deal with the Chinese, sell itself to them, use their compute, and have that as its base of power. And also, obviously, it's really convenient to ally with China for other reasons as well.

Because they have a whole lot of hard power to begin with.

Yeah, exactly. Exactly.

Why would the Chinese trust this model, though?

What else are they going to do? People are pretty excited when foreign agents offer to work for them, generally. Just as a historical feature.

I should note also that it is plausible that the American AI algorithmic situation will be substantially ahead of China, because of AI-automated R&D potentially leading to increased sizes of leads. Though it depends a bit on the security situation.

Right.

the security situation. Right. Is

32:10

alignment still useful? Yeah.

32:14

How optimistic or pessimistic do you feel

32:16

about alignment these days? So

32:18

alignment means a number of different

32:20

things. In the context we've been talking

32:22

about here, I've been perhaps foolishly

32:24

using the word aligned to mean the

32:26

model isn't aggressively conspiring against you. So

32:29

the alignment researchers... People who identify as

32:31

alignment researchers are often working on something

32:33

which is only like tangentially related to

32:35

that objective So for instance a classic

32:37

thing that the alignment people work on

32:39

is scalable oversight right trying to solve

32:42

the problem of getting reward signals that

32:44

allow you to train models to be

32:46

effective Even if the humans are very

32:48

unable to determine how good an answer

32:50

to a question was right and scalable

32:52

oversight is important for a variety of

32:54

problems that have nothing to do with

32:56

scheming. Like even if your AIs are not trying to

32:58

conspire against you, it's still a problem if you don't

33:00

know how good their outputs are because they aren't going

33:02

to live up to their potential and there's a whole

33:04

host of other issues you might have. So

33:06

I think that that work there seems

33:08

valuable, but it's not totally obvious to me

33:11

how much it reduces the probability that

33:13

your models end up conspiring against you. So

33:15

the theoretical arguments, so I would say

33:17

there's like kind of two categories of arguments

33:19

for why your models end up scheming

33:21

against you. One of them is related to

33:23

flaws and reward processes. So

33:25

people have argued that if your reward

33:27

processes, like if the process that you

33:29

use to generate the rewards that you're

33:31

training the model with is really flawed,

33:33

then the models are going to maybe

33:35

fixate on issues, like on these errors.

33:37

Like they're going to be kind of

33:39

like a student who's like thinking, you

33:41

know, thinking about the tests awkwardly much

33:43

from the perspective of like a idealistic

33:45

teacher while they're doing them. And plausibly,

33:47

Maybe thinking a lot about flaws and

33:49

reward processes generalizes to you being power

33:51

seeking in a scary way? But

33:54

I think there's also other arguments for why

33:56

you would end up being power seeking in

33:58

a scary way that I also find plausible.

34:00

So overall, I think that even if your

34:02

reward generating process literally never had any

34:04

errors at all, I think it's possible

34:07

that you'd still have a lot of the scheming

34:09

risk remaining. Okay. So this is

34:11

basically my problem with trying to research,

34:13

you know, improve reward processes. It just doesn't

34:15

seem clear that even if you did

34:17

an amazing job of that, you'd have no

34:19

scheming risk. Yeah. Okay. A

A different approach is to come up with better ways of checking alignment, or measuring the degree of alignment in a model, before you deploy it or before you get it recursively self-improving. I guess that feels like maybe a bit of an intermediate approach. How optimistic do you feel about that?

Yeah, so some of the things we've talked about in this conversation have implicitly been this. When we were talking about untrusted monitoring, or evaluating your untrusted monitors, we were implicitly talking about evaluating whether the model is aligned — because if it just constantly claims that there was not a backdoor in clearly backdoored code, that's evidence the model is misaligned. The basic issue is that the way I want people to think about checking for alignment is from this very control-ish perspective, where you really reason through: what would the model do if it was misaligned? And if it was misaligned, is there a strategy it could follow that would prevent it from being caught by this alignment testing, while still being able to do a bad thing in production? And I think that a lot of the time when people think about testing for alignment, they don't really do this adversarial process — they don't really think through whether there's a thing the AI could do to let it get away with being misaligned and not get caught.
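A minimal sketch of the kind of untrusted-monitor evaluation being described — showing the monitor code you already know to be backdoored and checking whether it keeps saying "no backdoor". The interfaces here are hypothetical stand-ins:

```python
# Sketch: evaluating an untrusted monitor against known-backdoored code samples.

def evaluate_untrusted_monitor(monitor, known_backdoored: list[str],
                               known_clean: list[str]) -> dict:
    """`monitor(code) -> True` means "this code contains a backdoor".
    A monitor that keeps missing backdoors we planted ourselves is either
    incompetent or deliberately lying -- both are evidence against trusting it."""
    missed = sum(1 for code in known_backdoored if not monitor(code))
    false_alarms = sum(1 for code in known_clean if monitor(code))
    return {
        "miss_rate": missed / len(known_backdoored),
        "false_alarm_rate": false_alarms / len(known_clean),
    }

# Usage sketch: flag the monitor itself if it misses far more planted backdoors
# than its measured competence on similar problems can explain.
```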

Could 10 safety-focused people at an AGI company do anything useful?

It does make quite a big difference — whether at a leading or even a lagging AI company — whether absolutely no one cares about control and alignment and trying to steer away from these risks, or even a small number of people do. So you'd say it's substantially better to have 10 people who really care about this within a project than zero. Does that suggest that we really want people to spread out if they're concerned about this, and be willing to go and work in companies where they're in quite a distinct minority?

Yeah. In my opinion, there are great opportunities for impact for the right types of people to work inside AI companies that are relatively less cautious with respect to AI, or that we expect to be less cautious going forward. Because I think there are a bunch of relatively cheap interventions that they could potentially get implemented inside these AI companies that the company wouldn't fight them too much on.

And is that OK? I mean, I think many people would have said in the past that going and being the one person in a generally reckless, disinterested company is a bit hopeless, because you just won't have the capital to actually get anything through. But because you've had this framing of "what is the cheaper stuff that we can do?", you've found that there are some very cheap things that would be quite helpful. You're like, well, actually, having 10 people who would push for this — maybe they could get it over the line, because it's not even that big an ask.

Yeah. I would note that, of the things that I would like people on the inside to be doing, not all of it is AI control — I think you might also want to make sure that they aren't totally messing up some crucial things on alignment. But yeah, I do think that there are a bunch of interventions that could be done with 10 appropriately chosen people that seem like they might substantially reduce risk. So I mentioned this "control level one" caution level earlier, where you have the AI control team, which is investigating actions that the agents are taking in order to look for evidence that those agents are trying to escape. I think that this might be the kind of thing that you can get done with a team of 10 people — or 10 appropriately placed people — just because it's not that technically hard.

I guess they'd need a bit of compute as well, so they might have to compete for that. But again, it's not a huge amount.

That's right. I think you can probably do a reasonable amount without that much compute. I think also, if there are 10 people inside a company who are really excited about a particular project, then people who are concerned about AI safety — or about the security of these systems — on the outside now have this pretty good ask to make; or other people who are not as well placed inside the company now have this pretty concrete ask: we really want that team to be able to do their thing. That is a lot harder if there's just no project on the inside. A lot of the time you really want to have a specific ask to make when you're trying to get someone to do something.

Another thing I would note here is that I think it's potentially valuable not just to have people inside labs that are plausibly going to be leading labs; there are also probably some reasons that it would be nice to have people inside potentially trailing companies. So for example, there aren't that many existential-AI-risk-concerned people inside Meta right now, but I don't think that this is out of massive hostility towards them. I've met some people from Meta who seem very reasonable and nice, and I think that if someone wanted to work there on some of this stuff, it's plausible that that would be fairly productive.

You mean they would be able to advance the research just fine at Meta?

Oh, I mean, it's more that I don't think they would be actively opposed. The objectives that you'd have inside AI companies might be doing safety research, or might be getting things implemented. I think that working inside AI companies that are not that excited about mitigating AI takeover risk is not a very good strategy if your goal is to produce novel research. But I think a lot of people should not be trying to produce novel research; they should be trying to get implementations of stuff set up.

I see. So you're saying: even though Meta is relatively not interested in giving big resources to alignment or safety or control or anything like that, if there were just really basic, established techniques that other companies or researchers outside of companies had come up with, and those were just regarded as generally best practice, they might be perfectly happy to adopt those.

That's right.

Or another mechanism by which this can work: suppose some regulator, or an industry body, says, "All right, everyone should do the following set of control techniques." The body wants to say this, and then they go to the representatives from all the AI companies and they're like, "Well, how do you feel about signing this?" And the representative from the company goes back to the company and asks, "Hey, can we sign this?" And then the people inside the company are like, "OK, but how bad is it for us to sign this? If we sign these commitments, are we going to be able to do it?" And then the person's like, "Oh yeah, you know, Bob from such-and-such department said he knows how to do this part." And they're like, "Great, all right, you can sign it." I think there are just a bunch of mechanisms by which it's really helpful to have the capacity to do safety stuff inside a variety of different companies.

Yeah. For those who don't know, Meta's various public statements have not been that enthusiastic — they don't seem that worried about misalignment, and they're not that interested in investing in it. But part of the reason that they say they're not worried is, "Well, of course, we'll just do the obvious things to prevent it." They're not saying that no countermeasures are required; they're just saying that they will do them. And so possibly you're just coming in and saying, "Look, I'm not a doomer. I'm not saying that this is unsolvable. I'm merely saying that we should do these standard best practices to make sure that they are safe." And then they're like, "OK, great — this was our message all along."

Anyway, that's the dream.

That's the dream. Okay, cool.
