Nicholas Carlini (Google DeepMind)

Released Saturday, 25th January 2025

Episode Transcript


0:00

The fact that it can

0:02

make valid moves almost always

0:04

means that it must in

0:06

some sense have something internally

0:08

that is accurately modeling the

0:10

world. I don't like to

0:13

ascribe intentionality or these things,

0:15

these kinds of things. But

0:17

it's doing something that allows

0:19

it to make these moves

0:21

knowing what the current board

0:24

state is and understanding what

0:26

it's supposed to be doing.

0:29

Everyone means something different by reasoning.

0:31

And so the answer to the question,

0:33

is that reasoning, is entirely what you

0:35

define as reasoning. And so you find

0:38

some people who are very much in

0:40

the world of, I don't think models

0:42

are smart, I don't think that they're

0:44

good, they can't solve my problems, and

0:46

so they say, no, it's not reasoning,

0:49

because to me, reasoning means, and then

0:51

they give a definition which excludes language

0:53

models. And then you ask someone who's

0:55

very sort of much on the AGI,

0:57

you know, language models are going to

1:00

solve everything. By 2027, they're going

1:02

to displace all human

1:04

jobs. You ask them, what is

1:06

reasoning? And they say reasoning is.

1:09

Hi. So I'm Nicholas Carlini. I'm

1:11

a research scientist at Google DeepMind.

1:13

And I like to try and make

1:15

models do bad things, and understand the

1:17

security implications of the attacks that we

1:19

can get on these models. I really

1:21

enjoy breaking things and I've been doing

1:23

this for a long time, but I'm

1:26

just very worried that because they're impressive,

1:28

we're going to have them applied in

1:30

all kinds of areas where they ought

1:32

not to be, and that, as a result,

1:34

the attacks that we have on these

1:36

things are going to end up with

1:38

bad security consequences. MLST is

1:40

sponsored by CentML, which is the

1:43

compute platform specifically optimized for AI

1:45

workloads. They support all of the

1:47

latest open source language models out

1:49

of the box, like Llama for

1:51

example. You can just choose the

1:53

pricing points, choose the model that

1:55

you want, it spins up, it

1:57

elastically auto-scales, you can pay on consumption, where you

2:01

can have a model which is

2:03

always working or it can be

2:05

freeze-dried when you're not using it.

2:07

So what are you waiting for?

2:09

Go to centml.ai and sign

2:12

up now. This is a new AI research lab I'm starting in Zurich. It is

2:18

funded from Paz Ventures, involving AI

2:20

as well. We are hiring both

2:22

chief scientists and deep learning engineer

2:24

researchers, and so we are a Swiss version of DeepSeek. And so

2:28

a small group of people, very,

2:31

very motivated, very hardworking, and we

2:33

try to do some research studying

2:35

with LLMs and o1-style models. We

2:37

want to investigate, reverse engineer, and

2:39

explore the techniques ourselves. Nicholas Carlini,

2:41

welcome to MLST. Thank you. Folks

2:43

at home, Nicholas won't need any

2:45

introduction whatsoever, definitely by far the

2:47

most famous security researcher in ML, and working at Google, and it's

2:52

so amazing to have you here

2:54

for the second time. Yeah, the

2:56

first time yeah was a nice

2:58

pandemic one, but no, it was

3:00

great. Yes, MLST is one of

3:02

the few projects that survived the

3:04

pandemic, which is pretty cool. But

3:06

why don't we kick off then?

3:09

So do you think we'll ever

3:11

converge to a state in the

3:13

future where our systems are insecure

3:15

and we're just going to learn

3:17

to live with it? I mean,

3:19

that's what we do right now,

3:21

right? In normal security. There

3:23

is no perfect security for anything.

3:26

If someone really wanted you to

3:28

have something bad happen on your

3:30

computer, like they would win. There's

3:32

very little you could do to

3:35

stop that. We just rely on

3:37

the fact that probably the government

3:39

does not want you in particular

3:41

to have something bad happen. Right?

3:44

Like, if they decided that, like,

3:46

I'm sure that they have something

3:48

that they could do, that they

3:50

would succeed on. Well, we can

3:53

get into a world of is...

3:55

The average person probably can't succeed

3:57

in most cases. This is not

3:59

where we are with machine learning

4:02

yet. With machine learning the average

4:04

person can succeed almost always. So

4:06

I don't think our objective should

4:08

be perfection in some sense, but

4:11

we need to get to somewhere

4:13

where it's at least the case

4:15

that a random person off the

4:17

street can't just really easily run

4:20

some off-the-shelf GitHub code that makes

4:22

it so that some model does

4:24

arbitrary bad things and arbitrary settings.

4:26

Now I think getting there is going to

4:28

be very very hard. We've tried,

4:31

especially in vision, for the

4:33

last 10 years or something, to get

4:35

models that are robust, and we've

4:37

made progress. We've learned a lot, but

4:39

if you look at the objective metrics,

4:41

like they have not gone up by

4:43

very much in like the last four

4:46

or five years at all, and this makes

4:48

it seem somewhat unlikely that

4:50

we're going to get perfect robustness

4:52

here in this foreseeable future.

4:54

But at least... We can still

4:56

hope that we can do research

4:58

and make things better and eventually

5:01

we'll get there. And I think

5:03

we will, but it just is going

5:05

to take a lot of work. So let me ask you this question.

5:09

Do you ever think in the future

5:12

that it will become illegal

5:14

to hack ML systems? I

5:16

have no idea. I mean, it's

5:18

very hard to predict these kinds

5:20

of things. It's very hard to

5:22

know, is it already, especially in the

5:24

United States, the Computer Fraud and Abuse

5:26

Act, covers who knows what in whatever

5:29

settings? I don't know. I think this

5:31

is a question for the policy and

5:33

the lawyer people. And my view on

5:35

policy and law is, as long as

5:38

people are making these decisions, coming from

5:40

a place of what is true in

5:42

the world, they can make their decisions. The

5:44

only thing that I... try and make comments

5:46

on here is like, let's make sure that

5:48

at least we're making decisions based on what

5:50

is true and not decisions based on what

5:53

we think the world should look like. And

5:55

so, you know, if they base their decisions

5:57

around the fact that we can attack these

5:59

models. and various bad things could

6:02

happen, then I'm, they're more experts

6:04

at this than me and they can decide,

6:06

you know, what they should do. But yeah,

6:08

I don't know. In the context

6:10

of ML security, I mean,

6:12

really open-ended questions, just to

6:15

start with. Sure. Can you

6:17

predict the future? What's gonna

6:19

happen? Future for ML security. My,

6:21

okay, let me give you a

6:23

guess. I think the probability of

6:25

this happening is very small, but

6:28

like the... the median prediction, I

6:30

think, in some sense. I think

6:32

models will remain vulnerable

6:34

to fairly simple attacks for

6:37

a very long time, and

6:39

we will have to find ways

6:41

of building systems so that

6:43

we can rely on an

6:45

unreliable model and still have

6:48

a system that remains secure.

6:50

And what this probably means

6:52

is we need to figure

6:54

out... a way to design the rest of the

6:56

world, the thing that operates around the model,

6:59

so that if it decides that it's going

7:01

to just randomly classify something

7:03

completely incorrectly, even if just

7:06

for random chance alone, the system

7:08

is not going to go and perform

7:10

a terribly misguided action, and that

7:12

you can correct for this, but that we're

7:14

going to have to live with a world

7:16

where the models remain very

7:19

vulnerable for, yeah, I don't know. for

7:21

the foreseeable future, at least as far as

7:23

I can see, and you know, especially machine

7:26

learning time, five years is an eternity.

7:28

I have no idea what's going to

7:30

happen with, you know, what the world will

7:32

look like in this machine learning, language models world; who knows, something else might happen. Language models

7:37

are only like, you know, seven years of

7:39

like real significant progress, so like predicting five

7:41

years out is like almost doubling this. So

7:44

I don't know how the world there will

7:46

look, but at least as long as we're

7:48

in this world where things are... fairly

7:50

vulnerable. But then again, language

7:52

models are only, you know, seven years

7:55

and we've only been trying to attack

7:57

them for like really two or three.

7:59

So... give five years, that's twice as

8:02

long as we've been trying to

8:04

attack these language models. Maybe we

8:06

just figure everything out. Maybe language

8:08

models are fundamentally different and

8:10

things aren't this way, but

8:12

my prior just tends to be the

8:14

case of other vision models we've been

8:16

trying to study for 10 years and

8:18

at least there things have been proven

8:20

very hard, and so my expectation is things will be hard, and so we'll

8:25

have to just rely on building systems

8:27

that end up working. And actually, when

8:30

you first put out this article about

8:32

chess playing, I've cited it on the

8:34

show about 10 times. So it's really,

8:37

really interesting. But let me read a

8:39

bit out of it. By the way,

8:41

it's called playing chess with large language

8:43

models. So you said, until this week, in order to be good at chess, a machine learning model had to be designed explicitly to play the game, and then it would win. And you said that this all

8:59

changed at the time on Monday when

9:01

OpenAI released GPT-3.5-turbo-instruct.

9:03

Can you tell me about that? What

9:06

GPT-3.5-turbo-instruct, and later other

9:08

people have done with open source models

9:10

that you can verify they're not doing

9:12

something weird behind the scenes because I

9:14

think some people speculated, well maybe they're

9:17

just cheating in various ways, but like

9:19

there are open source models that replicate

9:21

this now. What you have is you

9:23

have a language model that... can

9:25

play chess to a fairly high degree.

9:28

And yeah, okay, so when you first

9:30

tell someone, I have a machine

9:32

learning system that can play chess.

9:34

The immediate reaction you get is

9:36

like, why should I care? You

9:38

know, we had deep blue, whatever,

9:40

30 years ago that could beat the

9:42

best humans and like isn't that some form

9:45

of like you know a little bit of

9:47

AI at the time like why should I

9:49

be at all surprised by the fact that

9:51

I have some system like this that can play chess? And that, yeah, so the fundamental difference

9:56

here I think is very interesting

9:58

is that the model was trained on

10:00

a sequence of moves. So in

10:02

chess you represent moves, you know,

10:04

1. e4 means, you know, move the king's pawn to e4, and then you have, you know, e5, black responds, and then 2. Nf3, whatever, white plays the knight,

10:14

whatever. You train on these sequences

10:16

of moves, and then you just

10:19

say "6.", language model, do

10:21

your prediction task. It's like just

10:23

a language model, it is being

10:25

trained to predict the next token,

10:28

and it can play a move

10:30

that not only is valid, but

10:32

also is very high quality. And

10:34

this is interesting because it

10:37

means that the model can play

10:39

moves that accurately, like, let's just

10:41

talk about the valid part in the first place. Like, valid is

10:46

interesting in and by itself, because...

10:48

What is a valid chess move

10:50

is like a complicated

10:53

program to write. It's not

10:55

an easy thing to do

10:57

to describe what moves are

10:59

valid in what situations. You

11:01

can't just be dumping out

11:04

random characters and stumble

11:06

upon valid moves. And you

11:09

have this model that makes

11:11

valid moves every time. And

11:13

so I don't like talking a

11:15

lot about what's the model

11:17

doing internally, because I don't

11:19

think that's all that helpful. I

11:22

think, you know, just look at

11:24

the input-out behavior of the system

11:26

as the way to understand these things.

11:28

But the fact that it can

11:30

make valid moves almost always means

11:32

that it must in some sense

11:35

have something internally that is

11:37

accurately modeling the world. I don't

11:39

like to ascribe, you know, intentionality

11:41

or any of these things, these

11:44

kinds of things. But it's doing

11:46

something that allows it to make

11:48

these moves knowing what the current board

11:50

state is and understanding what it's

11:52

supposed to be doing. And this

11:55

by itself I think is interesting.
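To make the setup being described concrete, here is a minimal sketch of the kind of check involved, assuming the python-chess package; the complete() function is a hypothetical stand-in for whatever completion-style model is queried (GPT-3.5-turbo-instruct in the experiments discussed here), hard-coded so the sketch runs on its own:

```python
# Minimal sketch: prompt a completion model with a PGN-style move prefix and
# check whether the continuation it proposes is a legal move in that position.
# Assumes `pip install chess`; `complete()` is a hypothetical stand-in for the
# actual model call, hard-coded here so the example is self-contained.
import chess

def complete(prompt: str) -> str:
    # In practice this would be an API call returning the model's continuation.
    return "Re1 b5 7. Bb3"

moves = ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6", "O-O", "Be7"]

board = chess.Board()
prompt_parts = []
for i, san in enumerate(moves):
    if i % 2 == 0:
        prompt_parts.append(f"{i // 2 + 1}.")   # move numbers: "1.", "2.", ...
    prompt_parts.append(san)
    board.push_san(san)                          # replay the game to get the position
prompt = " ".join(prompt_parts) + " 6."          # "... 5. O-O Be7 6."

proposed = complete(prompt).split()[0]           # first token of the continuation
try:
    board.parse_san(proposed)                    # raises ValueError if illegal here
    print(f"{proposed} is a legal move in this position")
except ValueError:
    print(f"{proposed} is not legal in this position")
```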

11:57

And then not only can it

11:59

do... it could actually play high

12:01

quality moves. And so I

12:03

think, you know, taken together, it

12:06

in some sense tells me

12:08

that the model has a

12:10

relatively good understanding of what

12:12

the actual position looks like.

12:15

Because, you know, okay, so I play

12:17

chess at a modest level, like

12:19

I'm not terrible, I understand, you

12:21

know, more or less what I should

12:24

be doing, but if you just gave

12:26

me a sequence of 40 moves in a

12:28

row, and then said, you know, 41 point,

12:30

like, what's the next move? Like, I

12:32

could not reconstruct in my mind what

12:34

the board looked like at that point

12:36

in time. Somehow the model has figured

12:38

out a way to do this, like, having never

12:40

been told anything about the rules, that, like, they even exist at all; like, it's

12:45

sort of reconstructed all of that, and

12:47

it can put the pieces on the

12:49

board correctly in whatever way that it

12:51

does it internally, who knows how that

12:53

happens, and then it can place the

12:55

valid move, like... It's sort of very

12:57

interesting that this is something you can

13:00

do. And I just like, for me

13:02

it changed the way that I think

13:04

about what models can and can't do

13:06

in like surface level statistics

13:08

or more deeper statistics about like actually

13:11

what's going on. And I don't

13:13

know, this is I guess mainly why

13:15

I think this is an interesting

13:17

thing about the world. Yeah, we

13:19

have this weird form of human

13:22

chauvinism around the abstractness of our

13:24

understanding. and these artifacts have a

13:26

surface level of understanding, but it's

13:28

at such a great scale that

13:30

at some point it becomes a

13:32

weird distinction without a difference. But

13:34

you said something very interesting in

13:36

the article. You said that the model

13:39

was not playing to win, right? And you were

13:41

talking about, and I've said this on the show,

13:43

that the models are a reflection of you. Right,

13:45

so you play like a good chess player and

13:48

it responds like a good chess player and it's

13:50

like that whether you're doing coding, whether you're doing

13:52

it. And it might even explain some of the

13:54

differential experiences people have because you go on LinkedIn

13:57

and those guys over there clearly aren't getting very good responses out of LLMs,

14:01

but then folks like yourself, you're

14:03

using LLMs and you're at the

14:05

sort of the galaxy brain level

14:07

where you're sort of like pushing

14:10

the frontier and people don't even

14:12

know you're using LLMs. So there's

14:14

a very differential experience. Yeah, okay,

14:17

so let me explain what I mean

14:19

when I said that. So if you take a given

14:21

chess board, you can find multiple ways of reaching that.

14:23

You know, you could take a board that happened because

14:25

of a normal game between two chess grandmasters, and you

14:27

can find a sequence of absurd moves that no one

14:29

would ever play that actually brings you to the board

14:31

state. So what you do is like piece by piece,

14:33

you say, well, the knight goes here, and, you know, this piece goes there, some absurd sequence of moves that ends up in the correct board state,

14:56

and then you could ask the model, now play a

14:58

move. Okay, and then what happens? The

15:01

model plays a valid move. Still,

15:03

most of the time, it knows what

15:05

the board state looks like, but the

15:07

move that it plays is very, very

15:09

bizarre. It's like a very weird move. Why?

15:11

Because what has the model been trained

15:14

to do? The model was never told

15:16

to play the game of chess to win. The

15:18

model was told, make things that are like

15:20

what you saw before. It saw a sequence

15:22

of moves that looked like two people who

15:25

were rated like negative 50 playing a game

15:27

of chess. And it's like, well, OK, I

15:29

guess the game is to just make valid

15:31

moves and just see what happens. And

15:33

they're very good at doing this. And you

15:36

can do this both in this

15:38

synthetic way. And also what you can do

15:40

is you can just find some explicit

15:42

cases where you can just get models

15:44

to make terrible move decisions, just

15:47

because that's what people commonly do when they're playing, and, you know, most people fall for this trap, and I'm a model trained to play like whatever the training data looked like, and so I guess I ought to fall for this trap too. And this is one of the

16:01

problems of these models is they're

16:04

not initially trained to do the

16:06

play-to-win thing. Now, as far as

16:08

how this applies to actual language

16:11

models that we use, we

16:13

almost always post-train the models

16:15

with RLHF and SFT instruction

16:17

fine-tuning things. And a big

16:19

part of why we do that is so

16:21

that we... don't have to deal with

16:23

this mismatch between what the model was

16:26

initially trained on and what we actually

16:28

want to use it for. And this

16:30

is why GPT-3 is exceptionally hard to

16:33

use, and the sequence of instruct papers

16:35

was very important, is that it takes

16:37

the capabilities that the model has somewhere

16:39

behind the scenes and makes it much

16:42

easier to reproduce. And so when

16:44

you're using a bunch of the chat models

16:46

today, most of the time, you don't have

16:48

to worry nearly as much about exactly how

16:50

you frame the question because of this,

16:52

you know, they were designed to give

16:55

you the right answer even when you

16:57

ask the silly question, but I think

16:59

they still do have some of this,

17:01

but I think it's maybe less than

17:03

if you just have the raw base

17:05

model that was being trained on whatever

17:07

data happened to be trained on. Yeah, I'd

17:09

love to do a tiny digression on

17:11

RLHF because I was speaking with Max

17:14

from Cohere yesterday. They've done some amazing

17:16

research talking all about you know how

17:18

this preference steering works and and they

17:21

say that like humans are actually really

17:23

bad at kind of like distinguishing a

17:25

good thing from another thing you know

17:27

so we like confidence we like verbosity

17:30

we like complexity and for example I

17:32

really hate the ChatGPT model because of

17:34

the style I can't stand the style so

17:36

I even though it's right I think it's

17:38

wrong you know so when we do that

17:40

kind of kind of post training on the

17:42

language models how does that affect the competence

17:45

I don't know. Yeah, I mean, I feel like

17:47

it's very hard to answer some of these

17:49

questions because oftentimes you don't have

17:51

access to the models before they've

17:53

been post-trained. You can look at

17:56

these numbers from the papers, so

17:58

they're like in the GPT-4 technical

18:00

report. One of these reports,

18:03

they have some numbers that

18:05

show that the model before

18:07

it's been post-trained. So just

18:09

the raw base model is very

18:11

well calibrated. And what this means

18:13

is when it gives an answer with some

18:15

probability it's right about that probability of the

18:17

time. So if, you know, you give it a math question and it says the answer is five and the token probability is 30%, it's right about 30% of the time. But then when you do the post-training process, the calibration gets all messed up and it doesn't have this behavior anymore.
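As a rough illustration of what "well calibrated" means operationally (the numbers below are made up purely for illustration): group the answers by the confidence the model assigned to them and compare that with how often the answers were actually right.

```python
# Toy calibration check: for groups of answers the model gave with similar
# confidence (the token probability of its answer), compare the average stated
# confidence with how often the answer was actually right.
# The (confidence, correct) pairs are made-up illustrative data.
groups = {
    "low":  [(0.31, False), (0.28, True), (0.33, False), (0.30, False)],
    "mid":  [(0.72, True), (0.68, True), (0.74, False), (0.70, True)],
    "high": [(0.95, True), (0.91, True), (0.97, True), (0.93, True)],
}

for name, pairs in groups.items():
    mean_conf = sum(c for c, _ in pairs) / len(pairs)
    accuracy = sum(ok for _, ok in pairs) / len(pairs)
    # For a well-calibrated model these two numbers track each other;
    # the claim here is that RLHF-style post-training makes them drift apart.
    print(f"{name:>4}: stated confidence {mean_conf:.2f}, observed accuracy {accuracy:.2f}")
```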

18:34

So, like, some things change. You know, you

18:36

can often have the models that just like

18:38

get fantastically better when you do post-training because

18:40

now they follow instructions much better. You haven't

18:42

really taught them all that much new, but it

18:44

looks like it's much smarter. Yeah, I think

18:46

this is all a very confusing thing. I

18:48

don't have a good understanding of how all

18:50

of these things fit together. I mean, given

18:52

you know, these models, they make valid

18:55

moves, they appear to be competent, but

18:57

sometimes they have these catastrophic weird failure

18:59

modes. Yes. Do we call that process

19:01

reasoning or not? I'm

19:06

very big on not

19:08

ascribing intentionality

19:10

or I don't want to,

19:12

everyone means something different

19:15

by reasoning. And so the

19:17

answer to the question, is that

19:19

reasoning, is entirely what you define

19:21

as reasoning. And so you find

19:23

some people who are very much

19:25

in the world of, I don't

19:27

think models are smart, I don't

19:29

think that they're good, they can't

19:31

solve my problems, and so they

19:33

say, no, it's not reasoning, because

19:35

to me, reasoning means, and then

19:38

they give a definition, which excludes

19:40

language models. And then you ask

19:42

someone who's very much on the

19:44

AGI, you know, language models are

19:46

going to solve everything, by 2027,

19:48

they're going to displace all human jobs. You ask them what reasoning is, and they say reasoning is whatever the process is that the model is doing, and then they tell you, yes, they're reasoning. And

19:56

so I think, you know, it's very

19:58

hard to talk about whether it's actually

20:01

reasoning or not, I think the

20:03

thing that we can talk about is

20:05

like, what is the input-output

20:07

behavior? And, you know, does the model do

20:09

the thing that answers the question,

20:11

solves the task, and was challenging

20:14

in some way, and like, did it

20:16

get it right? And then we can

20:18

go from there, and I think this

20:20

is an easier way to try and

20:22

answer these questions than to ascribe intentionality to something. Like,

20:27

I don't know, it's just really

20:29

hard to have these debates with

20:31

people when you start off without

20:33

having the same definitions. I know, I'm really

20:35

torn on this because, as you say,

20:37

the deflationary methodology is it's an input-output

20:39

mapping. You could go one step up,

20:42

so Bengio said that the reasoning

20:44

is basically knowledge plus inference, you know,

20:46

and some probabilistic sense. I think it's

20:48

about knowledge acquisition or the recombination of

20:50

knowledge and then it's the same thing

20:53

with agency, right? You know, the simplistic

20:55

form is that it's just like, you

20:57

know, an automata. It's just like a,

20:59

you know, you have like an environment

21:01

and you have some computation and you

21:04

have an action space and it's just

21:06

this thing. You know, but it feels necessary

21:08

to me to have things like autonomy and

21:10

emergence and intentionality in the definition. But you

21:12

could just argue, well, why are you saying

21:14

all of these words, then it does the

21:16

thing, then it does the thing. Yeah, and this

21:19

is sort of how I feel. I mean, I

21:21

think it's very interesting to

21:23

consider this, like, is it reasoning?

21:25

If you have a background in

21:27

philosophy and that's what you're going

21:29

for. I don't have that. So I don't

21:31

feel like I have any qualification to

21:33

tell you whether or not the model

21:36

is reasoning. I feel like the thing

21:38

that I can do is say, here is how

21:40

you're using the model, you want

21:42

it to perform this behavior, let's

21:44

just check. Like, did it

21:46

perform the behavior yes or no?

21:49

And if it turns out that it's

21:51

doing the right thing in all of

21:53

the cases, I don't know that

21:55

I care too much about whether

21:57

or not the model reasoned...

22:00

its way there or it used

22:02

a lookup table? Like if it's

22:04

giving me the right answer every

22:06

time, like, let's, I don't know,

22:08

I tend to not focus too

22:10

much on how it got there.

22:12

We have this entrenched sense that

22:14

we have parsimony and robustness, you

22:17

know, for example, in this chess

22:19

notation, if you change the syntax

22:21

of the notation, it probably would

22:23

break, right? Yes. And yeah, if

22:25

you, like, there are multiple chess

22:27

notations. And right, and I have

22:29

tried this, so before there was

22:31

the current notation we use, like you would see in old chess books, the notation was, you know, King's Bishop moves to, like, you know, Queen's something or other; you just number the squares differently.

22:46

If you ask a model in

22:49

this notation, it has no idea

22:51

what's happening, and it will write

22:53

something that looks surface level like.

22:55

a sequence of moves, but has

22:57

nothing to do with the correct

22:59

board state. And of course, yeah,

23:01

a human would not do this

23:03

if you ask them to produce

23:06

the sequence of moves. It would

23:08

take me a long time to

23:10

remember which squares, which things, how

23:12

to write these things down, I

23:14

would have to think harder. But

23:16

like, I understand what the board

23:18

is, and like, I can get

23:21

that correct. And the model doesn't

23:23

do that right now. And

23:25

so maybe this is your definition of reasoning

23:28

and you say the reasoning doesn't happen. But

23:30

like someone else could have said, why should

23:32

you expect the model to generalize this thing

23:34

that's never seen before? Like I know, like,

23:36

it's like interesting to me. We've gone from

23:38

a world where we wrote papers about the

23:40

fact that if you trained a model on

23:42

image net, then like, well, obviously it's going

23:45

to have this failure mode that when you

23:47

corrupt the images, the accuracy goes down or

23:49

you can't like... Suppose I wrote a paper

23:51

seven years ago. I trained my model in

23:53

ImageNet, and I tested it on CIFAR-10 and it didn't work; isn't this model so bad? People

23:57

would like laugh at you. Like, well, of

24:00

course, you trained it on image net, one

24:02

distribution, you tested it under, it's a different

24:04

one, you never asked it to generalize, and

24:06

it didn't do it. Like, good job. Like, of course,

24:08

it didn't solve the problem. But today, what do we

24:10

do with language models? We train them on one distribution.

24:12

We test them on different distribution that it wasn't trained

24:14

on sometimes, and then we laugh at the model, like,

24:16

like, isn't it so dumb? It's like, you didn't train it on the thing. You know, maybe some future model will have the property

24:25

that it could just magically generalize across domains,

24:27

but like we're still using machine learning. Like,

24:29

you need to train it on the kind

24:31

of data that you want to test it on,

24:33

and then the thing will behave much better than

24:35

if you don't do that. So in an email correspondence

24:37

to me, you said something you didn't

24:40

use these exact words, but you said

24:42

that... there are so many instances where

24:44

you kind of feel a bit noobed

24:46

because you made a statement, you know,

24:48

your intuition is you're a bit skeptical,

24:51

you said they're stochastic parrots, and then

24:53

you got proven wrong a bunch of

24:55

times, and it's the same for me.

24:57

Now, one school of thought is, you

24:59

know, Rich Sutton, you just throw more

25:02

data and compute at the thing, and

25:04

the other school of thought is that we

25:06

need to have completely different

25:08

methods. Yeah. Right,

25:11

so there are some people I

25:13

feel like who have good visions

25:15

about the future might look

25:17

like, and then there are people

25:19

like me who just look at

25:22

what the world looks like and

25:24

then try to say, well, let's

25:26

just do interesting work here. I

25:29

feel like this works for me

25:31

because for security in

25:33

particular, it really only matters

25:36

if people are doing the thing to

25:38

attack the thing. And so I'm fine

25:40

just saying, like, let's look at what

25:42

is true about the world and write

25:44

the security papers. And then if the

25:46

world significantly changes, we can try and

25:48

change. And we can try and be

25:50

a couple years ahead looking where things

25:53

are going so that we can do security

25:55

ahead of when we need to. But I tend,

25:57

because of the area that I'm in, not to

25:59

spend a lot of time trying to think

26:01

about where are things going to be in the

26:03

far future. I think a lot of people try

26:05

to do this and some of them are good

26:07

at it and some of them are not and I

26:10

have no evidence that I'm good at it. So I

26:12

try and mostly reason based on what

26:14

I can observe right now. And if what

26:16

I can observe changes, then I ought to

26:18

change what I'm thinking about these things and

26:21

do things differently. And that's the best that

26:23

I can hope for. On this chess thing, has

26:25

anyone studied, you know, like in the headers for the chess notation, you could say this player had an Elo of 2,500 or

26:32

something like that. And I guess the first

26:34

thing is like, do you see some

26:36

commensurate change in performance? But what

26:38

would happen if you said elo

26:40

4,000? Right. Yes, we've actually trained

26:43

some models trying to do this, it

26:45

doesn't work very well. It's like

26:47

you can't, like, you can't trivially.

26:49

At least, yeah, if you just change

26:51

the number, we've trained some models ourselves

26:53

on headers that we expect to have

26:55

an even better chance of doing this,

26:57

and it did not directly give this

26:59

kind of immediate wins, which again is

27:02

maybe just to say that I am not good

27:04

at training models. Someone else who knows what

27:06

they're doing might have been able to

27:08

make it have this behavior, but when

27:10

we trained it and when we tested

27:12

3.5-turbo-instruct, it like, it

27:14

might have a statistically significant

27:16

difference on the outcome. but

27:18

it's nowhere near the case that you tell

27:20

the model is playing like a 1,000 rated

27:23

player and all of a sudden it's 1,000

27:25

rated. People have worked very hard to

27:27

try and train models that will let

27:29

you match the skill to an arbitrary

27:31

level and like it's like research paper

27:33

level thing, not just, like, change a few numbers and hope for the best.
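For context on what "the headers" means here: PGN game records carry tag pairs, including player ratings, before the move text, so a conditioning experiment of the kind being described amounts to varying something like the following prompt prefix (a purely illustrative sketch; the tag values are arbitrary):

```python
# Illustrative sketch of the PGN-style header being discussed: rating tags come
# before the move text, so "conditioning on Elo" means changing WhiteElo /
# BlackElo in the prompt and seeing whether the completion's strength follows.
header = "\n".join([
    '[Event "Example game"]',
    '[White "Player A"]',
    '[Black "Player B"]',
    '[WhiteElo "2500"]',   # the number one would vary (e.g. 1000, 2500, 4000)
    '[BlackElo "2500"]',
    '[Result "*"]',
])

moves = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."
prompt = header + "\n\n" + moves   # feed this prefix to a completion model
print(prompt)
```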

27:38

Right so you wrote another article

27:40

called Why I Attack. Sure. And

27:42

you said that you enjoy attacking systems

27:44

for the fun of solving puzzles

27:46

rather than altruistic reasons. Can you

27:48

tell me more about that, but also

27:50

why did you write that article?

27:52

Yeah, okay. So let me answer them in

27:54

the opposite order to ask them. So why

27:57

do I write the article? Some people

27:59

were mad at me for breaking defenses. Like, they said that I don't care about humanity, I just, I don't know, want to make them look bad or something. And half of that statement is true:

28:20

I don't do security

28:23

because like I

28:25

want to do maximum good and

28:27

therefore I'm going to think

28:29

about, like, what are all of

28:31

the careers that I could do

28:34

and try and find the one

28:36

that's most likely to, like, save

28:38

the most lives. You know, if

28:40

I had done that, I probably

28:42

would, I don't know, be a

28:44

doctor or something like, you know,

28:46

actually, like, immediately helps people, or

28:48

you could research on cancer, like,

28:51

find whatever domain that you

28:53

wanted, where you could, like, measure maximum good. I can't

28:57

motivate myself to do

28:59

them. And so if I was a

29:01

different person, maybe I could do

29:04

that. Maybe I could be

29:06

someone who could meaningfully solve

29:09

challenging problems in biology by

29:11

saying like, I'm waking

29:13

up every morning knowing that

29:15

I'm sort of like saving

29:17

lives or something. But this

29:20

is not how I work, and I feel

29:22

like it's not how lots of people work,

29:24

you know, there are lots of people who

29:26

I feel like are in computer science and

29:28

or you want to go even further in

29:31

like quant fields where like you're clearly brilliant

29:33

and you could be doing something a lot

29:35

better with your life. And some of

29:37

them probably legitimately just would

29:39

just have zero productivity if

29:42

they were doing something that they just really

29:44

did not find any enjoyment in. And

29:46

so I feel like the thing that

29:48

I try and do is, okay, find

29:50

the set of things that you can

29:52

motivate yourself to do and like will

29:54

do a really good job in, and

29:56

then solve those as good as possible,

29:59

subject to the constraint that, like, you're actually

30:01

net positive moving things forwards. And for

30:03

whatever reason, I've always enjoyed attacking things

30:05

and I feel like I'm differentially

30:07

much better at that than at anything

30:10

else. And like, I feel like I'm

30:12

pretty good at doing the adversarial machine

30:14

learning stuff, but I have no evidence

30:16

that I would be at all good

30:18

at the other, you know, 90% of

30:21

things that exist in the world that

30:23

might do better. And so, I don't know, the way that I'd put it, maybe in one sentence, the way I think about this, is: how good you are at the thing, multiplied by

30:34

how much the thing matters, and you're

30:36

trying to sort of maximize that product,

30:38

and if there's something that you're really

30:40

good at, that at least directionally moves things in the right direction, you can have a higher impact than taking

30:47

whatever field happens to be the one

30:49

that is like maximally good and moving

30:51

things forwards by a very small amount.

30:54

And so that's why I do attacks

30:56

is because I feel like generally they

30:58

move things forward and I feel like

31:00

I'm better than most other things that

31:02

I could be doing. Now you also

31:05

said that attacking is often easier than

31:07

defending. Certainly. Tell me more. I mean,

31:09

this is the standard thing in security.

31:11

You need to find one attack that

31:13

works. And you need to fix all

31:16

of the attacks if you're defending. And

31:18

so if you're attacking something. The only

31:20

thing that I have to do is

31:22

find one place where you've forgotten to

31:24

handle some corner case, and I can

31:27

arrange for the adversary to hit that

31:29

as many times as they need until

31:31

they succeed. You know, this is why

31:33

you have normal software security. You can

31:35

have a perfect program in everywhere except

31:38

one line of code, where you forget

31:40

to check the bounds exactly once. And

31:42

what does this mean? The attacker will

31:44

make it so that that happens every

31:47

single time in the security of your

31:49

product is essentially zero. Under random settings,

31:51

this is never going to happen. Like

31:53

it's never going to happen that the

31:55

hash of the file is exactly, like, you know, equal to 2 to the 32, which overflows

32:02

the integer, which causes the bad stuff

32:04

to happen, this is not going to

32:06

happen by random chance, but the attacker

32:09

can just arrange for this happen every

32:11

time, which means that it's much easier

32:13

for the attacker than the defender who

32:15

has to fix all of the things.
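As a toy illustration of that asymmetry (a sketch of the shape of bug being described, not a real exploit): a size check that essentially no random input will ever trip, but that an attacker can hit every single time by choosing a length that wraps a 32-bit counter.

```python
# Toy illustration of an attacker-chosen corner case (not a real exploit).
# The "parser" tracks sizes in a simulated 32-bit unsigned counter; random
# inputs essentially never wrap it, but an attacker can pick a length that does.
MASK32 = 0xFFFFFFFF
HEADER_BYTES = 16

def total_allocation(payload_length: int) -> int:
    # Bug: the sum is reduced modulo 2**32, like a C uint32_t, so a huge
    # declared payload length can wrap around to a tiny allocation size.
    return (payload_length + HEADER_BYTES) & MASK32

def looks_safe(payload_length: int) -> bool:
    # The check passes because the *wrapped* total is small...
    return total_allocation(payload_length) <= 4096

# ...even though the declared payload is enormous. Random inputs almost never
# hit this; an attacker simply always sends exactly this length.
attacker_length = (1 << 32) - HEADER_BYTES + 1024   # wraps to 1024
print(looks_safe(attacker_length), total_allocation(attacker_length))  # True 1024
```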

32:17

And then in machine learning it gets even

32:19

worse, because at least in normal security and

32:22

software security or other areas, like we understand

32:24

the classes of attacks. In machine learning we

32:26

just constantly discover new categories of bad things

32:28

that could happen. And so not only do

32:30

you have to be robust to the things

32:32

that we know about, you have to be

32:34

robust to someone like coming up with a

32:36

new clever just like type of attack that

32:38

we hadn't even thought of before and be

32:40

robust there. And this is not happening because

32:42

of... the way, I mean, it's a very

32:44

new field, and so of course it's just

32:46

much easier for these attacks than

32:48

defenses. Let's talk about disclosure norms,

32:51

how should they change now that

32:53

we're in the ML world? Okay,

32:55

yeah. So in standard software

32:57

security, we've basically figured out

33:00

how things should go.

33:02

So for a very long time,

33:04

you know, for 20 years, there was

33:06

a big back and forth about: when

33:08

someone finds a bug in some

33:10

software that can be exploited, like what

33:13

should they do? And let's say, I don't know,

33:15

late 90s, early 2000s, there were people who were

33:17

on the full disclosure side, which is, they thought: I find

33:19

a bug in some program, what should I do?

33:22

I should tell it to everyone so that

33:24

we can make sure that people don't make

33:26

a similar mistake and we can put pressure

33:28

on the person to fix it and do

33:30

all that stuff. And then there were the

33:32

people who were on the like... don't disclose

33:34

anything. Like, you should report the bug to

33:36

the person who's responsible and wait until they

33:39

fix it, and then you should tell no

33:41

one about it, and because, you know, this

33:43

was a bug that they made, and you

33:45

don't want to give anyone else ideas

33:48

for how to exploit. And in

33:50

software security, we landed on this,

33:52

you know, what was called responsible

33:54

disclosure, and is now coordinated disclosure,

33:57

which is the idea that you should give the

33:59

person, the one responsible person, a reasonable

34:01

heads-up for some amount of time.

34:03

Google Project Zero has a 90-day

34:06

policy, for example. And you have

34:08

that many days to fix your

34:10

thing, and then after that, or

34:12

once it's fixed, then it gets

34:15

published to everyone. And the idea

34:17

here in normal security is that you

34:19

give the person some time to protect

34:21

their users. You don't want to

34:24

immediately disclose a new attack that

34:26

allows people to cause a lot

34:28

of harm. But you put a deadline on

34:30

it and you stick to the deadline to

34:32

put pressure on the company to actually fix

34:34

the thing. Because what often happens if you

34:36

don't say you're going to release things publicly

34:38

is no one else knows about it. You're

34:41

the only one who knows the exploit. They're

34:43

just going to not do it because they're in

34:45

the business of making a product not fixing bugs. And

34:47

so why would they fix it if no one else

34:49

knows about it? And so when you say, like, no,

34:51

this will go live in 90 days, like, you better

34:54

fix it before, then they have the time. It's just

34:56

like, now, if they don't do it, it's on them

34:58

because they just didn't put in the work to fix

35:00

the thing. And there are, of course, exceptions. You

35:02

know, Spectre and Meltdown are two of the most

35:04

common exploits, or like one of the biggest

35:06

attacks in the last 10 or 20 years in software security, and they gave Intel and related people a year to fix this,

35:15

because it was a really important bug. It

35:17

was a hard bug to fix. There were

35:20

like legitimate reasons why you should do this.

35:22

There's good evidence that like it's probably not

35:24

going to be independently discovered by the bad

35:26

people for a very long time. And so

35:28

they gave them a long time to fix

35:31

it. And you know, similarly, Google Project Zero

35:33

also says if they find evidence the bug

35:35

is being actively exploited, they'll give you

35:37

seven days. You know, if there's someone

35:39

actually exploiting it, then you have seven days

35:42

before they'll patch. And so they might as

35:44

well tell everyone about that harm is being

35:46

done because if they don't then it's like

35:48

just going to delay the things. Okay, so

35:50

with that long preamble, how should things change

35:52

for machine learning? The short answer is I

35:55

don't know because on one hand I want

35:57

to say that this is like how things

35:59

are in software security. And sometimes it

36:01

is, where someone has some bug in

36:03

their software, and there exists a way

36:06

that they can patch it and fix

36:08

the problems. And in many cases,

36:10

this happens. So we've written papers

36:12

recently, for example, we've shown how

36:14

to do some like model stealing

36:16

stuff. So OpenAI has a model, and we could query OpenAI's services in a way that allowed us to steal part of their model, only a very

36:24

small part, but we could steal part

36:27

of it. So we disclose this to

36:29

them because there was a way that they

36:31

could fix it. They could make a change

36:33

to the API to prevent this attack from

36:36

working, and then we write the paper and

36:38

put it online. This feels very much

36:40

like software security. On the

36:42

other hand, there are some other kinds

36:45

of problems that are not the kinds

36:47

that you can patch. Let's think in

36:49

the broadest sense, adversarial

36:51

examples. If I disclosed to

36:53

you, here is an adversarial example

36:56

on your image classifier.

36:58

What is the point of doing the

37:00

responsible disclosure period here? Because there

37:02

is nothing you can do to fix

37:04

this in the short term. We have been

37:07

trying to solve this problem for 10 years.

37:09

Another 90 days is not going to help you

37:11

at all. Maybe I'll tell you out of

37:13

courtesy to let you know this thing that

37:16

I'm doing. I'm going to write this paper

37:18

here, so I'm going to describe it. Do

37:20

you want to put in place a couple

37:22

of filters ahead of time to make this

37:25

particular attack not work? But you're not going

37:27

to solve the underlying problem. Or, like, biology things: the argument they make is, you know, suppose someone came up with a way to create some novel pathogen or something; a disclosure period

37:36

doesn't help you here and so is it

37:38

more like that or is it more like

37:40

software security I don't know I mean I'm

37:42

more biased a little bit towards the software

37:44

security it's because that's what I came from

37:47

but it's yeah hard to say exactly which

37:49

one we should be modeling things after I

37:51

think we do probably need to come up

37:53

with new norms for how we handle this

37:55

There are a lot of people I know who are talking

37:57

about this trying to write these things down and I think

38:00

I think in a year or two, if you ask

38:02

me this again, we will have set processes

38:04

in place, we will have established norms for

38:06

how to handle these things now, I think

38:08

this is just like very early and right

38:11

now we're just looking for analogies in other

38:13

areas and trying to come up with what

38:15

sounds most likely to be good, but I

38:17

don't have a good answer for you immediately

38:20

now. Are there any vulnerabilities that you've

38:22

decided not to pursue for

38:24

ethical reasons? No,

38:30

not that I can think of,

38:33

but I think mostly because

38:35

I tend to only try

38:37

and think of the

38:39

exploits that would be ethical

38:41

in the first place. So

38:43

I just like, it may

38:45

happen that I, like, I

38:47

stumble upon this, but I tend

38:50

to, like, I think research

38:52

ideas, you know, some, in some very small fraction of the time, research ideas happen just by random inspiration.

39:00

Most of the time, though,

39:02

research ideas is not something

39:05

that just happens. Like, you

39:07

have spent conscious effort trying

39:09

to figure out what new

39:11

thing I'm going to try and do.

39:13

And I think it's pretty easy

39:15

to just, like, not think about

39:17

the things that seem morally fraught

39:19

and just focus on the ones

39:21

that seem like they actually have

39:24

potential to be good and useful.

39:26

But... It very well may happen at

39:28

some point that this is something

39:30

that happens, but this is not a

39:33

thing that I... I can't think of any

39:35

examples of attacks that we've

39:37

found that we've decided not to

39:40

publish because of the harms

39:42

that they would cause, but I can

39:44

imagine that this might happen; I can't rule it out. But I

39:50

tend to just like bias my search

39:52

of problems in the direction of... things

39:54

that I think are actually beneficial. I

39:56

mean, maybe going back to like the

39:58

why I attack things. You

40:02

want the product of how good you

40:04

are and how much good it does

40:06

for humanity to be maximally positive. You

40:08

can choose what problems you work on to

40:11

not be the ones that are negative.

40:13

And so I don't have lots of

40:15

respect for people where the direction of

40:17

the goodness of the world is like

40:19

just a negative number. Because you can

40:21

choose to make that at the very least zero,

40:23

just like don't do anything. And so

40:26

I try and pick the problems that

40:28

I think are generally positive and do

40:30

as good as possible on those ones.

40:32

So you work on traditional security

40:34

and ML security. What are the

40:37

significant differences? Yeah, okay, so I

40:39

don't work too much on traditional

40:41

security anymore. So I started my

40:43

PhD in the traditional security.

40:45

Yeah, I did very, very low-level return-oriented programming. I was at

40:49

Intel for a summer on some

40:52

hardware level defense stuff. And then I

40:54

started machine learning shortly after that.

40:56

So I haven't worked on the

40:59

very traditional security in like the

41:01

last, let's say, seven or eight years. But yeah, I still

41:06

follow it very closely. I still

41:08

go to the system security conferences

41:10

all the time because I think it's

41:12

like a great community. But yeah,

41:15

what are the similarities and differences?

41:17

I feel like the systems security

41:20

people are very good at really

41:22

trying to make sure that what

41:24

they're doing is like a very

41:26

rigorous thing and like evaluated it

41:28

really thoroughly and properly. You know,

41:30

you see this even in like the

41:32

length of the papers. So a system

41:35

security paper is like 13, 14 pages long, two-column, and a paper that's a submission for ICLR is like seven or eight or something, one column. Like, you know, the system

41:43

security papers will all start with like a

41:46

very long explanation of exactly what's happening. The

41:48

results are expected to be really rigorously done.

41:50

A machine learning paper often is here is

41:52

a new cool idea, maybe it works. And

41:55

like this is good for like, you know,

41:57

move fast and break things. This is not

41:59

good for... like really systematic studies,

42:01

you know, when I

42:04

was doing system security

42:06

papers, I would get like, you know,

42:08

one, one and a half, two a year.

42:10

And now, like, a similar kind

42:12

of thing, I could, of machine

42:15

learning papers, like, you know, you

42:17

could probably do five or six

42:19

or something, like, to the same

42:21

level of rigor. And so I

42:23

feel like this is maybe the biggest difference that I see in my mind. And I think it's

42:30

worked empirically in the machine learning space. Like

42:32

it would not be good if every research

42:34

result in machine learning needed to have the

42:37

kind of rigor you would have expected

42:39

for a systems paper, because we would have

42:41

had like five iteration cycles in total.

42:44

And, you know, at machine learning conferences, you often see the paper, the paper

42:46

that improved upon the paper, and the paper that improved upon that paper, all at the same conference, because the first person put it on arXiv, the next person found a tweak that made it better, and a third person found another tweak that made it even better. And so I think,

43:13

yeah, having some kind of balance and

43:15

mix between the two is useful. And

43:17

this, I think, is maybe the biggest

43:19

difference that I see. And this is, I guess,

43:21

maybe if there's some differential advantage that

43:24

I have in the machine learning space,

43:26

I think some of it comes from

43:28

this where in systems, you know, you

43:30

were trained very heavily on this kind

43:32

of rigorous thinking and how to do

43:35

attacks very thoroughly, look at all of

43:37

the details. And when you're doing security,

43:39

this is what you need to do. And

43:41

so I think some of this training

43:43

has been very beneficial for me in

43:45

writing machine learning papers, thinking about

43:48

all of the little details to get

43:50

these points right, because I had a

43:52

paper recently where the way that I broke

43:54

some defense and the way that the thing

43:56

broke is because there was a negative sign

43:58

in the wrong spot. And like, it's

44:01

like, this is not the kind of

44:03

thing that like, I could have reasoned

44:05

from first principles about

44:07

the code, like if I had been

44:09

advising someone, like, I don't know how

44:11

I would have told them, check all

44:14

the negative signs. It's like,

44:16

you don't know, like, you know,

44:18

like, you just, like, what you

44:20

should be doing is, like, you

44:22

should be, like, understanding everything that's

44:24

going on and find the one

44:26

part where the mistake was made so that you can... You wrote another article a couple of months ago; it was called Why I Use AI.

44:35

And you say that you've been

44:37

using language models, you find them

44:39

very useful, they improve your programming

44:41

productivity by about 50%. I can

44:44

say the same myself. Maybe let's

44:46

start there. I mean, can you

44:48

break down specifically like the kind

44:50

of tasks where it's really uplifted

44:53

your productivity? So I

44:55

am not someone who like believes in

44:57

these kinds of things. You know, I

44:59

don't, there are some people who

45:01

their job is to hype things

45:04

up and their job is to

45:06

get attention on these kinds

45:08

of things. And I feel

45:10

like the thing that I was annoyed about is that these people, the same people who were,

45:17

you know, Bitcoin is going to

45:19

change the world, whatever, whatever, whatever.

45:22

As soon as language models come

45:24

about, they all go language models

45:26

are going to change the world,

45:28

they're very useful, whatever, whatever, whatever.

45:31

And the problem is that if you're

45:33

just looking at this from afar,

45:35

it looks like you have the people

45:37

who are the grifters just finding the

45:39

new thing. And they are, right? Like,

45:41

this is what they're doing. But at the same time,

45:46

I think that the models that we

45:48

have now are actually useful. And

45:50

they're not useful for nearly as

45:52

many things as people like to say

45:55

that they are, but for a particular

45:57

kind of person, the person who understands what is going on in these

46:01

models and knows how to code and

46:03

can review the output, they're useful. And

46:05

so I wanted to say is like, I'm

46:07

not going to try and argue that they're

46:10

good for everyone, but I want to

46:12

say like here is an n equals

46:14

one, me anecdote, that I think they're

46:16

useful for me, and if you have

46:19

a background similar to me, then maybe

46:21

they're useful for you too. And you

46:23

know, I've got a number of people

46:26

who... are like, you know, security-style people

46:28

who have contacted me, it's like, you

46:30

know, thanks for writing this, like, you

46:32

know, they have been useful for me,

46:35

and yeah, now there's a question of,

46:37

does my experience generalize to anyone else?

46:39

I don't know, this is not

46:42

my job to try and understand

46:44

this, but at least what I

46:46

wanted to say was, yeah, they're

46:48

useful for people who behave like

46:50

I do. Okay, now, why are they

46:52

useful? The current models

46:55

we have now are good enough

46:57

that, for the kinds of things

46:59

where I want an answer

47:01

to some question, whether it's

47:03

write this function for me or

47:05

whatever, it's like I know how

47:07

to check it, I know that I

47:09

could get the answer. It's like something

47:11

I know how to do, I just

47:14

don't want to do it. The

47:16

analogy I think is maybe

47:18

most useful is... Imagine

47:20

that you had to write all of

47:23

your programs in C or in Assembly.

47:25

Would this make it so that you

47:27

couldn't do anything that you can do

47:29

now? No, probably not. You could do

47:31

all of the same research results in

47:33

C instead of Python if you really

47:36

had to. It would take you a lot

47:38

longer because you have an idea in your

47:40

mind. I want to implement, you

47:43

know, something trivial, you know, some binary

47:45

search thing. And then in C

47:47

you have to start reasoning about pointers

47:49

and memory allocation and all these little

47:51

details that are at a much lower

47:53

level than the problem you want to

47:55

solve. And the thing I think is

47:57

useful for language models is that if you

47:59

know... the problem you want to solve and you

48:02

can check that the answer is right, then

48:04

you can just ask the model to implement

48:06

for you the thing that you want in

48:08

the words that you want to just type

48:10

them in, which are not terribly well defined,

48:12

and then it will give you the answer and

48:14

you could just check that it's correct and

48:17

then put it in your code and

48:19

then continue solving the problem you want

48:21

to be solving, and not the lower-level problem

48:23

of having to

48:25

actually type out all the details.

48:27

That's maybe the biggest class, I think,

48:30

of things that I find useful.
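
A rough sketch of that workflow, assuming Python: ask for a routine you already know how to check, then verify it against a trusted reference with randomized tests. The model_binary_search function below is a hypothetical stand-in for model-generated code, not anything produced in this conversation.

```python
import bisect
import random

def model_binary_search(xs, target):
    """Stand-in for code a model might hand back: return the index of
    target in the sorted list xs, or -1 if it is absent."""
    lo, hi = 0, len(xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def reference_search(xs, target):
    """Trusted check built on the standard library."""
    i = bisect.bisect_left(xs, target)
    return i if i < len(xs) and xs[i] == target else -1

# The checking code is the part you write and trust.
for _ in range(10_000):
    xs = sorted(random.sample(range(1000), random.randint(0, 50)))
    t = random.randint(0, 1000)
    assert model_binary_search(xs, t) == reference_search(xs, t)
print("agrees with the reference on 10,000 random inputs")
```

The point is not the binary search itself; it is that the verification stays in your hands, so the model's output never has to be taken on faith.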

48:32

And the other class of things I

48:34

find useful are the cases where you

48:36

rely on the fact that the model

48:38

has just enormous knowledge about

48:41

the world, and about all kinds

48:43

of things. And if you understand

48:45

the fundamentals, but like, I don't

48:47

know the API to this thing, just

48:50

like... make the thing work under the API

48:52

and I can check that easily, or

48:54

you know I don't understand how to write

48:56

something in some particular language like give

48:58

me the code like if you if

49:00

you give me code in any language

49:02

even if I've never seen it before

49:04

I can basically reason about what it's

49:06

doing like you know I may make

49:08

mistakes around the border but like I

49:11

could never have typed it because I

49:13

don't know the syntax whatever the models

49:15

are very good at giving you the

49:17

correct syntax and just like getting everything

49:19

else out of the way and then I

49:21

can figure out the rest about how

49:23

to do this and you know if

49:25

I if I couldn't ask the model

49:27

I would have had to have learned

49:29

the syntax for the language to type

49:31

out all the things or do what

49:33

people would do you know five years

49:35

ago copy and paste some other person's

49:37

code from Stack Overflow and, you know,

49:39

make adaptations, and it was like

49:41

a strictly worse version of just asking

49:43

the model, because now I'm relying on

49:45

code written for someone else's question. So my view is that for these

49:47

kinds of problems they're currently plenty

49:49

useful. If you already understand and

49:52

by that I mean an abstract understanding, then

49:52

they're a superpower, which explains why, you

49:54

know, the smarter you are, actually, the

49:56

more you can get out of

50:00

a language model. But how has

50:02

your usage evolved over time? And

50:04

just what's your methodology? I mean,

50:07

you know, speaking personally, I know

50:09

that specificity is important. So going

50:11

to source material and constructing the

50:13

prompt, you know, imbuing my understanding

50:15

and reasoning process into the prompt.

50:17

I mean, how do you think

50:19

about that? Yeah. I guess I try and ask

50:22

questions that I think have a reasonable

50:24

probability of working. And I

50:26

don't ask questions where I feel like

50:28

this was going to slow me down.

50:30

But if I think it has, you

50:33

know, a 50% chance of working,

50:35

I'll ask the model first. And

50:37

then I'll look at the output

50:39

and see, like, does this direction

50:41

look correct? And if it seems

50:43

directionally correct,

50:45

great. And then I learn

50:57

from that. Now, there are people who say they

50:59

can't get models to do anything useful

51:01

for them. Yeah, it may be the

51:03

case that models are just really bad

51:06

at a particular kind of problem. It

51:08

may also just be that you don't have

51:10

a good understanding of what the

51:12

models can do yet. You know, if you,

51:14

like, I think most people, you know, today

51:16

have forgotten how much they had to

51:18

learn about how to use Google search.

51:21

You know, like. People today, if I tell

51:23

you to look something up, you implicitly know

51:25

the way that you should look something up

51:27

is to like use the words that appear

51:29

in the answer. You don't ask it as

51:31

a form of a question. There's a way

51:33

that you type things into the search

51:35

engines to get the right answer. And

51:37

this requires some amount of skill and

51:40

understanding of how to reliably

51:42

find answers to something online. I feel

51:44

like it's the same thing for language

51:46

models. They have a natural language

51:49

interface. So like technically you could

51:51

type whatever thing that you wanted,

51:53

there are some ways of doing it that

51:55

are much more useful in others, and I

51:57

don't know how to teach this as a skill,

52:00

other than just saying, like, try

52:02

the thing. And maybe it turns

52:04

out they're not good at your

52:06

task and then just don't use

52:08

them. But if you are able

52:11

to make them useful, then this

52:13

seems like a free productivity

52:15

win. But, you know, this is the

52:17

kind of thing where, yeah, again, I caveat

52:20

it: you have to

52:22

have some understanding what's

52:24

actually going on with these

52:26

things because, you know, there are

52:28

people who don't, who I feel

52:30

will try and

52:32

do these same kinds of things,

52:35

and then I'm worried about them:

52:37

are they going to learn anything,

52:39

will they catch the

52:41

bugs when the bugs happen,

52:44

all kinds of problems that

52:46

I'm worried about from that

52:46

perspective. But, like, for the

52:48

practitioner who wants to

52:50

get work done, I feel

52:54

like, in the same way that I wouldn't

52:56

say you need to use C over Python,

52:58

I wouldn't say you need to use just

53:01

Python or Python plus language models. Yes, yes.

53:03

I agree that, you know, laziness and acquiescence

53:05

is a problem. Vibes and intuition are really

53:07

important. I mean, I consider myself a Jedi

53:09

of using LLMs and sometimes it frustrates me

53:12

because I say to people, oh, you know,

53:14

just use an LLM. I seem to be

53:16

able to get so much more out of LLMs

53:18

than other people and I'm not entirely

53:21

sure why that is. Maybe it's just because I

53:23

understand the thing that I'm prompting or something

53:25

like that, but it seems to be something

53:27

that we need to learn. Yeah, I mean,

53:29

every time a new tool comes about, you have

53:32

to spend some time, you know, I remember

53:34

when people would say, real programmers

53:36

write code in C and don't write

53:38

it in a high-level language. Why would

53:40

you trust the garbage collector to do

53:42

a good job? Real programmers manage their

53:44

own memory. Real

53:47

programmers write their own Python. Why would you

53:50

trust the language model to output code that's

53:52

correct? Why would you trust it to be

53:54

able to have this recall? Real programmers understand

53:56

the API and don't need to look up

53:59

the reference manual. Can we draw the

54:01

same analogies here? And now I

54:03

think this is the case of like

54:05

when the tools change and make it

54:08

possible for you to be more productive

54:10

in certain settings you should be willing

54:12

to look into the new

54:14

tools. I know I'm always trying

54:16

to rationalize this because it comes

54:19

down to this notion of is

54:21

the intelligence in the eye of the

54:23

prompter? You know, does it matter?

54:25

I think the answer is no, in

54:27

some cases I think the answer is

54:29

yes, but I'm not viewing things

54:32

that way. What matters is: the

54:34

thing makes me more productive

54:36

and solves the task for

54:38

me. Was it the case that

54:40

I put the intelligence in? Maybe?

54:42

I think, in many cases, I

54:44

think the answer is yes, but I

54:47

don't... I'm not going to look

54:49

at it this way. I'm going to

54:51

look at it as, like, is it

54:53

solving the questions that I want

54:55

in a way that's useful for

54:57

me. I think here the answer

55:00

is definitely yes, but yeah,

55:02

I don't know how to answer this

55:04

in some real way. So obviously,

55:06

as a security researcher, how does

55:09

that influence the way that

55:11

you use LLMs? Oh yeah, yeah,

55:13

no, this is why I'm scared about

55:15

the people who are going to use

55:17

them and not understand things because, you

55:19

know, you ask them to write an

55:21

encryption function for you and the answer

55:24

really ought to be, you should not

55:26

do that. You should be calling this

55:28

API. And oftentimes they'll be like, sure,

55:30

you want me to write an encryption function,

55:32

here's an answer to an encryption function,

55:34

and it's going to have all of the

55:36

bugs that everyone normally writes. Or say the code

55:36

is talking to a database. And

55:41

what did the model do? It wrote

55:43

the thing that was vulnerable to SQL

55:45

injection. And this is terrible. If

55:47

someone was not being careful, they

55:49

would not have caught this. And

55:52

now they'll introduce all kinds of bad

55:54

bugs. Because I'm reasonably

55:56

competent at programming, I can read

55:58

the output of the model. and just

56:00

like correct the things where it made

56:02

these mistakes. Like it's not hard to

56:05

fix the SQL injection and replace the

56:07

string concatenation with the, you know, the

56:09

templates. The model just didn't do it

56:11

correctly. And yeah, so I'm very worried about the

56:14

kind of person who's not going to do

56:16

this. There have been a couple of papers

56:18

by people showing that people do write

56:20

very insecure code when using language models

56:22

when they're not being careful for

56:24

these things. And yeah, this is

56:26

something I'm worried about. It looks

56:28

like it might be the case

56:31

that it's differentially more vulnerable when

56:33

people use language models versus when they

56:35

don't. And yeah, this is, I think, a big concern.
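
A minimal sketch of the failure mode being described, using Python's built-in sqlite3; the table and function names are made up for illustration. The first query is built by string concatenation, which is the pattern a model will often reproduce; the second is the parameterized ("template") form the standard library already supports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'hunter2'), ('bob', 'swordfish')")

def lookup_vulnerable(name):
    # String concatenation: the kind of code a model may happily emit.
    return conn.execute(
        "SELECT secret FROM users WHERE name = '" + name + "'"
    ).fetchall()

def lookup_safe(name):
    # Parameterized query: the driver treats the input as data, not SQL.
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "alice' OR '1'='1"
print(lookup_vulnerable(payload))  # dumps every secret in the table
print(lookup_safe(payload))        # returns nothing: no user has that name
```

The one-line difference between the two queries is exactly the kind of detail a careful reviewer catches and a careless one ships.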

56:37

I think the reason why I tend to

56:39

think about this utility question

56:41

is often just from the perspective

56:43

of, yeah, security of things that people

56:45

use actually matters. And so I want

56:48

to know what are the things that

56:50

people are going to do so you

56:52

can then write the papers and study

56:54

what people are actually going to do. So

56:56

I feel like it's important to separate:

56:58

can the model solve the problem for

57:00

me? And the answer for the language

57:02

models, as I use them, is oftentimes, yes, it gives

57:04

you the right answer for the common case.

57:06

And this means most people don't care

57:08

about the security question. And so they'll

57:11

just use the thing anyway, because it

57:13

gave them the ability to do this new

57:15

thing, not understanding the security

57:17

piece. And so that means we should then go

57:19

and do security around this other

57:22

question of, like: we know people are going

57:24

to use these things, we ought to do

57:26

the security to make sure that the security

57:28

is there, so that they can use them

57:30

correctly. And so I often try and use

57:32

things that are at the frontier of what

57:34

people are going to do next, just to

57:37

try and put myself in their frame of

57:39

mind and to understand this. And yeah,

57:41

this worries me quite a lot, because, yeah,

57:43

things could go very bad here. How and

57:45

when do you verify the outputs of LLMs?

57:48

The same way that you check anything else. I

57:50

mean, like, this is the other thing. People

57:52

say, like, you know, maybe the model's gonna

57:54

be wrong, but like, half of the answers

57:56

on Stack Overflow are wrong anyway. So, like,

57:58

it's not the case that this is new. Like,

58:01

if you've been programming for a

58:03

long time, like, you're used to

58:05

the fact that you read code

58:07

that's wrong, I'm not going to

58:09

copy and paste some function on

58:11

stack overflow and just assume that

58:13

it's right, because, like, maybe the

58:15

person asked a question that was

58:17

different than the question that I

58:20

asked, like, whatever, like, so I

58:22

feel like I'm not, I don't

58:24

feel like I'm doing anything terribly

58:26

different, Maybe the only difference

58:28

is that I'm using the models more

58:30

often, and so I have to be

58:33

more careful in checking, like, you know,

58:35

if you're using something twice as often,

58:37

then if it introduces bugs at some rate,

58:37

you know, you're going to have twice

58:39

as many bugs, because you're using it

58:41

twice as much, and so you have

58:46

to be a little more careful. But

58:48

I don't feel like it's anything I'm

58:50

doing. 95% solutions are still 95% solutions.

58:52

You take the thing, it does almost

58:55

everything that you wanted, then like, it

58:57

maxed out its capability, it's good. You're

58:59

an intelligent person, now you finish the

59:01

last 5% fix what the problem is,

59:03

and then there, you have a 20%

59:06

performance increase there. Yeah, you touched on

59:08

something very interesting here, because actually most

59:10

of us are wrong most of the

59:12

time. And that's why it's really good

59:14

to have at least one very smart

59:16

friend, because they constantly point out all

59:19

of the ways in which your stuff

59:21

is wrong. Most code is wrong. It's

59:23

your job to point out how things

59:25

are wrong. And I guess we're always

59:27

just kind of on the boundary of

59:30

wrongness unwittingly. And that's just the way

59:32

the world works anyway. Yeah, right. And

59:34

so I think that... there's a potential

59:36

for massive increases in quantity of wrongness,

59:38

you know, with language models. Like,

59:41

this is, I think... there

59:43

are lots of things that could go

59:45

very wrong, that will go very bad,

59:47

with language models. You know,

59:49

previously the amount

59:52

of bad code that could be written

59:54

was, like, limited to the number

59:54

of humans who could write

59:56

bad code, because, like... there's

59:58

only so many people who could write software, and,

1:00:01

like, you had to have at least some training,

1:00:03

and so there was some bounded

1:00:05

amount of bad code. One of the other things

1:00:09

I'm worried about is you know You have people

1:00:11

who look at these people saying models can

1:00:13

solve all your problems for you and now you

1:00:16

have ten times as much code. Which is great

1:00:18

from one perspective because isn't it

1:00:20

fantastic that anyone in the world can

1:00:22

go and write whatever software they need

1:00:25

to solve their particular problem? That's fantastic

1:00:27

But at the same time, the security person in

1:00:29

me is kind of scared about this because

1:00:31

now you have 10 times as much stuff

1:00:33

that is probably very insecure. And you are

1:00:36

not going to be able to have, you

1:00:38

don't have 10 times as many security experts

1:00:40

to study all of this. Like you're going

1:00:42

to have a massive increase in this and

1:00:44

some potential futures. And this is one of

1:00:46

the many things that I'm, I think, I'm worried about

1:00:48

and like is why I try and use these

1:00:50

things to understand, like: does this seem like something

1:00:53

people will try and do? It seems to me

1:00:55

the answer is yes right now and yeah this

1:00:57

worries me. So I spoke with some Google

1:00:59

guys yesterday and they've been studying some

1:01:01

of the failure modes of LLM so

1:01:03

like just really crazy stuff that people

1:01:05

don't know about like they can't copy

1:01:07

they can't count you know because of

1:01:09

the softmax and the topological representation

1:01:11

squashing and so on, loads and loads

1:01:13

of stuff they can't do. In your

1:01:16

experience have you noticed some kind of

1:01:18

tasks that LLMs just really struggle

1:01:20

on? I'm sure that there are many

1:01:22

of them. I have sort of learned to

1:01:24

just not ask those questions. And so

1:01:26

I have a hard time, like coming

1:01:28

up like, you know, like in the same

1:01:31

sense, like, what are the things that

1:01:33

search engines are bad for? You

1:01:35

know, I'm sure that there are a

1:01:37

million things that search engines are, like,

1:01:39

completely the wrong answer for, but if

1:01:41

I sort of pressed you for a

1:01:44

question they're the wrong answer for, you know, you'd have

1:01:46

a little bit of a hard time, because of

1:01:48

the way that you use them. All of

1:01:50

these things, like, whenever you want

1:01:53

correctness in some sense, the model

1:01:55

is not the thing for you. Like,

1:01:57

in terms of like specific tasks that

1:01:59

they're particularly bad at? I mean, of course,

1:02:01

you can say anything that requires some kind

1:02:04

of, if it would take you more than,

1:02:06

you know, 20 minutes to write the program,

1:02:08

probably the model can't get that. But the

1:02:11

problem with this is, like, this is changing. You

1:02:13

know, like, I, so, okay, so this is,

1:02:15

like, the other thing, like, there are things

1:02:17

that, like, I thought, would be hard that

1:02:20

end up becoming easier. So there was a

1:02:22

random problem that I wanted that I wanted

1:02:24

for unrelated reasons that it's a hard dynamic

1:02:26

programming problem to solve. It took me like,

1:02:29

I don't know, two or three hours to

1:02:31

solve it, the first time that I had

1:02:33

to do it. And so o1 just launched

1:02:35

a couple days ago, I gave the problem

1:02:38

to o1, and it gave me an implementation

1:02:40

that was 10 times faster than one I

1:02:42

wrote in like two minutes. And so I

1:02:44

can test it because I have a reference

1:02:47

solution, and like it's correct. It's like, okay,

1:02:49

so now I've learned, like, here's a thing

1:02:51

that I previously would have thought, like, I

1:02:53

would never ask models to solve something, because

1:02:56

this was, like, a challenging enough algorithmic problem

1:02:58

for me, that I would have no hope

1:03:00

for the model solving, and now I can.

1:03:02

But there are other things that, you know,

1:03:05

seem trivial to me that the models get

1:03:07

wrong, but I mostly have just, like, not

1:03:09

asked. Whereas people who can't tell whether the answers are right or wrong will

1:03:12

just apply the wrong answer as many

1:03:14

times as they can, and that seems concerning.

1:03:16

Yeah, I mean this is part of the

1:03:18

anthropomorphization process because I find it fascinating that

1:03:21

I think, you know, we have vibes, we

1:03:23

have intuitions, and we've

1:03:25

learned to skirt around the failure modes, you

1:03:27

know the long tail of failure modes and

1:03:30

we just smooth it over in our supervised

1:03:32

usage of language models and the amazing thing

1:03:34

is we don't seem to be consciously

1:03:36

aware of it. Yeah, but like programmers do

1:03:39

this all the time, right? Like you have

1:03:41

a language, the language has some, like, okay,

1:03:43

so, let's suppose you're

1:03:45

someone who writes Rust.

1:03:48

Rust has a very,

1:03:50

very weird model of

1:03:52

memory. If you go

1:03:54

to someone who's very

1:03:57

good at writing Rust,

1:03:59

they will structure the

1:04:01

program differently so they

1:04:04

don't encounter all of

1:04:06

the problems because of

1:04:08

the fact that you

1:04:10

have this weird memory

1:04:13

model. But if I

1:04:15

were to do it, like I'm not very good

1:04:17

at Rust, like I try and use it and like

1:04:19

I try and write my C code in Rust

1:04:21

and like the borrow checker just like yells at me

1:04:23

to no end and I can't write my program. And

1:04:25

like I look at Rust and go like, I

1:04:28

see that this could be very good but I just don't

1:04:30

know how to get my code right because I haven't done

1:04:32

it enough. And so I look at the language and go, okay,

1:04:35

if I was not being charitable,

1:04:37

I would say, why would anyone use this? It's

1:04:39

impossible to write my C code in Rust. Like you're supposed

1:04:41

to have all these nice guarantees but like no, you have

1:04:43

to change the way you write your code in order to

1:04:45

get them. Change your frame of mind and then

1:04:47

the problems will just go away. Like it's sort of like you

1:04:49

can do all of the nice things if you just

1:04:52

accept the paradigm you're supposed to

1:04:54

be operating in and the thing goes very

1:04:56

well. I

1:04:58

see the same kind of analogy for some of

1:05:00

these kinds of things here where the models are not

1:05:02

very good in certain ways and you're trying to

1:05:04

imagine that the thing is a human and ask it

1:05:06

the things you would ask another person but it's

1:05:08

not. And you need to ask it

1:05:10

in the right way or ask the right kinds

1:05:12

of questions and then you can get the value and

1:05:15

if you don't do this then you'll end up

1:05:17

very disappointed because it's not super human. What

1:05:20

are your thoughts on benchmarks? Okay,

1:05:23

yes, I have thoughts here. This

1:05:27

I guess is the problem with language models

1:05:29

is we used to be in a world

1:05:31

where benchmarking was very easy because we wanted

1:05:33

models to solve exactly one task and

1:05:36

so what you do is you measure it on

1:05:38

that task and you see can it solve the

1:05:40

task and the answer is yes and so great,

1:05:42

you figure it out. The problem with this is

1:05:45

like that task was never the task we actually

1:05:47

cared about and this is why no one used

1:05:49

models. Like no ImageNet models ever made it

1:05:51

out into like the real world

1:05:53

to solve actual problems because we just

1:05:55

don't care about classifying between 200

1:05:57

different breeds of dogs. You know, the

1:05:59

model may be good at this,

1:06:01

but this is not the thing we actually want. We want something different.

1:06:03

And it would have been absurd at the time

1:06:06

to say the image net model can't

1:06:08

solve this actual task I care about

1:06:10

in the real world, because of course

1:06:12

it wasn't trained for that. Language models,

1:06:14

the claim that people make for language

1:06:16

models, and what the people who train them say,

1:06:16

is, I'm going to train this one

1:06:21

general purpose model that can solve arbitrary

1:06:23

tasks. And then they'll go

1:06:25

test it on some small number of

1:06:27

tasks and say, see, it's good because

1:06:29

I can solve these tasks very

1:06:31

well. And the challenge here is that

1:06:34

if I trained a model to solve

1:06:36

any one of those tasks in

1:06:38

particular, I could probably get really

1:06:40

good scores. The challenge is that you

1:06:42

don't want the person who has trained

1:06:44

the model to have done this. You

1:06:47

wanted them to just train a good

1:06:49

model and use this as an independent,

1:06:51

you know, just... Here's a

1:06:53

task that you could evaluate

1:06:55

the model on, completely independent

1:07:01

from the initial training

1:07:03

objective in order to get like

1:07:06

an unbiased view of how well

1:07:08

the model does. But people

1:07:10

who train models are incentivized

1:07:13

to make them do well on benchmarks.

1:07:16

And while in the old

1:07:18

world, you know, I trust researchers

1:07:20

not to cheat. In

1:07:23

principle, I could have trained on

1:07:25

the test set. But this is

1:07:27

actually cheating, so you don't train on

1:07:30

the test set. So I trust

1:07:32

that people don't want to do this.

1:07:34

But suppose that I give you a

1:07:36

language model, and I want to

1:07:38

evaluate it on coding, which I'm

1:07:40

going to use, you know, a

1:07:43

terrible benchmark, but HumanEval, whatever.

1:07:43

I'm going to use MMLU, whatever the

1:07:47

case may be. I may have

1:07:49

actually trained my model in particular

1:07:54

to be good on these benchmarks. And

1:07:56

so you may have a model that

1:07:58

is not very capable in general,

1:08:00

but on these specific 20 benchmarks

1:08:02

that people use, it's fantastic. And

1:08:05

this is what everyone is incentivized

1:08:07

to do, because you want your

1:08:09

model to have maximum scores on

1:08:11

benchmarks. And so I think I

1:08:14

would like to be in a

1:08:16

world where there were a lot

1:08:18

more benchmarks, so that is not

1:08:20

the kind of thing that you

1:08:23

can... that you can easily do

1:08:25

and you can more easily trust

1:08:27

that these models are going to

1:08:29

give answers that accurately reflect

1:08:31

what their skill level is in

1:08:34

some way that is not being

1:08:36

designed by the model trainer to

1:08:38

maximize the scores. So at the

1:08:40

moment, you know, like, the thing with the hyperscalers

1:08:43

is that they put incredible amounts

1:08:45

of work into benchmarking and so

1:08:47

on and... Now we're moving to

1:08:49

a world where we've got, you

1:08:51

know, test time inference, test time

1:08:54

active fine tuning, you know, people

1:08:56

are fine tuning, quantizing, fragmenting and

1:08:58

so on. And a lot of

1:09:00

the people doing this in a

1:09:03

practical sense can't really benchmark in

1:09:05

the same way. How do you

1:09:07

see that playing out? Okay, that,

1:09:09

I don't know. I don't know.

1:09:12

I don't know. It just seems

1:09:14

very hard. because, you know, to

1:09:16

test what these things are, you

1:09:18

can just, you can use the

1:09:20

average benchmarks and hope for the

1:09:23

best, but I don't, I feel

1:09:25

like the thing I'm more worried

1:09:27

about is people who are actively

1:09:29

fine-tuning models to show that they

1:09:32

can make them better on certain

1:09:38

tasks. So you have lots of

1:09:40

fine-tunes of Llama, for example. I

1:09:43

think that's the thing I'm more

1:09:45

worried about. But yeah, for the

1:09:47

other cases, I don't know. I

1:09:49

agree this is hard, but I

1:09:52

don't have any great solutions here.

1:09:54

That's okay. We can't let you

1:09:56

go before talking about one of

1:09:58

your actual papers. I mean,

1:10:01

it has been amazing talking about

1:10:03

general stuff, but I decided

1:10:05

to pick this one, stealing

1:10:07

part of a production language

1:10:10

model, so this is from July,

1:10:12

could you just give us a bit

1:10:14

of an elevator pitch on that? For

1:10:16

a very long time, when we did

1:10:19

papers in security, what we did

1:10:21

was we would think about how a

1:10:23

model might be used in

1:10:25

some hypothetical future, and then

1:10:27

say, well, maybe we have a

1:10:30

certain kinds of attacks that

1:10:32

are possible. Let's try and

1:10:34

show in some theoretical

1:10:36

setting, this is something that

1:10:38

could happen. And so there's

1:10:41

a line of work called model

1:10:43

stealing, which tries to answer

1:10:45

the question, can someone take

1:10:47

the model that you have?

1:10:50

And without, and just like

1:10:52

by making standard queries

1:10:54

to your API, steal a copy

1:10:56

of it. This was started by

1:10:58

Florian Tramer and others in

1:11:00

2016, where they did this

1:11:02

on like very very simple

1:11:05

linear models over APIs. And

1:11:07

then it became a thing

1:11:09

that people started studying

1:11:11

in deep neural networks. And

1:11:13

there were several papers in

1:11:15

a row by a bunch of other

1:11:17

people. And then in 2020, we wrote

1:11:19

a paper that we published at CRYPTO

1:11:22

that said, well, here is a

1:11:24

way to steal an exact copy

1:11:26

of your model. Like,

1:11:29

whatever the model you have is, I

1:11:31

can get an exact copy. As long as...

1:11:33

well, there's a long list of assumptions:

1:11:35

it's only using ReLU activations.

1:11:37

The whole thing is evaluated

1:11:39

in floating point 64. I can

1:11:41

feed floating point 64 values in,

1:11:43

I can see floating point 64

1:11:46

values out, the model is only fully

1:11:48

connected, its depth is no greater than

1:11:50

three, it has like no more than

1:11:52

32 units wide on any given layer,

1:11:54

like it just has a long list

1:11:56

of things that are... Never true

1:11:58

in practice. But it's a

1:12:01

very cool theoretical result. And there

1:12:03

are other papers of this kind that show

1:12:05

how to do this kind of, I steal

1:12:07

an exact copy of your model, but

1:12:09

it only works in these really contrived

1:12:11

settings. This is why we submitted

1:12:13

the paper to crypto, because they

1:12:16

have all these kinds of theoretical

1:12:18

results that are very cool, but

1:12:20

are not immediately practical in many

1:12:22

ways. And then there was a

1:12:24

line of work continuing extending upon

1:12:26

this. And the question that I

1:12:28

wanted to answer is, like, now

1:12:31

we have these language models. And if

1:12:33

I list all of the assumptions, all

1:12:35

of them are false. It's not just

1:12:37

ReLU activations. It's not just fully

1:12:39

connected. I can't send floating point

1:12:41

64 inputs. I can't view floating

1:12:43

point 64 outputs. There are like a

1:12:45

billion neurons, not 500. So like,

1:12:48

all these things that are true.

1:12:50

And so I wanted to answer

1:12:52

the question, like, what's the best

1:12:54

attack that we can come up

1:12:56

with? that actually I can

1:12:58

implement in practice on a real

1:13:01

API. And so this is what we

1:13:03

tried to do. We tried to come

1:13:05

up with the best attack that

1:13:07

works against the most real API

1:13:09

that we have. And so what we

1:13:11

did is we looked at the open

1:13:13

AI and some other companies, Google had

1:13:15

the same kind of things. And

1:13:17

because of the way the API

1:13:19

was set up, it allowed us

1:13:22

to get some degree of control

1:13:24

over the outputs that let us

1:13:26

do some fancy math that would steal

1:13:28

one layer of a model. It's like

1:13:30

among the layers in the model, it's

1:13:33

probably the least interesting, it's a very

1:13:35

small amount of data, but like I

1:13:37

can actually recover one of the layers

1:13:40

of the model. And so it's real

1:13:42

in that sense that I can do it.

1:13:44

It's also real in the sense of I have

1:13:46

the layer correctly. But it's not

1:13:48

everything. And so I think what I

1:13:50

was trying to advocate for in this

1:13:53

paper is... I think we should be pursuing

1:13:55

both directions of research at the same

1:13:57

time. One is write the papers that are

1:13:59

true in some theoretical sense, but are

1:14:02

not the kinds of results that

1:14:04

you can actually implement in any

1:14:06

real system, and likely for the

1:14:08

foreseeable future, are not the kinds

1:14:10

we'll be able to implement in

1:14:12

any real systems. And also, at

1:14:14

the same time, do the thing

1:14:16

that most security researchers do today,

1:14:18

which is look at the systems

1:14:20

as they're deployed and try and

1:14:22

answer, given this system as it

1:14:24

exists right now, what are the

1:14:26

kinds of attacks? that you can

1:14:28

actually really get the model to

1:14:30

do and try and write papers

1:14:32

on that pieces of it. And

1:14:34

I don't know what you're going

1:14:36

to do with the last layer

1:14:38

of the model. You know, we

1:14:40

have the some things you can

1:14:42

do, but one thing that tells

1:14:44

you like the width of the

1:14:46

model, which is not something that

1:14:48

people disclose. So in our paper

1:14:50

we have, I think, the first

1:14:52

public confirmation of the width of,

1:14:54

like, the GPT-3 Ada and

1:14:56

Babbage models, which is not something

1:14:58

that OpenAI ever said publicly.

1:15:00

They had the GPT-3 paper that

1:15:02

gave the width of a couple

1:15:04

of models in the paper, but

1:15:06

then they never really directly said

1:15:08

what the sizes of Ada and

1:15:10

Babbage were. People speculated, but we

1:15:12

could actually confirm it for GPT-3. And

1:15:14

we... correctly stole the last layer,

1:15:16

and I know the size of

1:15:18

the model, and it is correct.

1:15:20

We followed responsible disclosure as closely as we could,

1:15:22

like we talked about at the beginning:

1:15:24

we agreed with them ahead of

1:15:26

time, we were going to do

1:15:28

this. This is a fun conversation

1:15:30

to have with, you know, not

1:15:32

only Google lawyers, but OpenAI lawyers,

1:15:34

like, hi, I would like to

1:15:36

steal your model. May I please

1:15:38

do this? You know, the OpenAI

1:15:40

people were very nice. And they

1:15:42

said yes. The Google lawyers initially

1:15:44

were also very, like, you know...

1:15:46

when I went to the Google lawyers,

1:15:48

like, I would like to steal

1:15:50

OpenAI's data, it was, like, under no

1:15:52

circumstances. But then I said, like,

1:15:54

if I get the OpenAI general

1:15:56

counsel to agree, are you

1:15:58

okay with it? And, you know, we ran

1:16:01

everything, we destroyed the data, whatever.

1:16:03

But, like, part of the agreement was,

1:16:05

like, they would confirm that we got

1:16:07

it right, that we did the right thing,

1:16:09

but they asked us not to release

1:16:11

the actual data we stole. Which, like,

1:16:13

makes sense, right? Like, you want to

1:16:16

make, you want to show here's an

1:16:18

attack that works, but, like, not

1:16:18

actually release the stolen stuff. And so,

1:16:20

yeah, so, you know, if you were to

1:16:22

write down a list of, like, all the people

1:16:24

in the

1:16:26

world who know this, the list includes all current and

1:16:28

former employees of OpenAI, and me. And

1:16:33

so like it sounds like this is like

1:16:35

a very real attack because like this is

1:16:37

like, this is the easiest, like how else

1:16:39

would you learn this? The other way to learn this

1:16:41

would be, like, to, like, hack OpenAI's

1:16:41

servers and try and, like, learn this

1:16:43

thing, or, like, you know, blackmail the

1:16:45

employees, or you can do, like, an actual

1:16:48

machine learning attack and recover the size

1:16:52

of those models and the last layer. And

1:16:54

so that's like the sort

1:16:56

of motivation behind why we

1:16:58

want to write this paper was

1:17:01

to get examples and try

1:17:03

and encourage other people to

1:17:05

get examples of attacks that

1:17:07

even if they don't solve all

1:17:10

of the problems will let

1:17:12

us make them increasingly real in

1:17:14

this sense. And I think this

1:17:16

is something that we'll start to

1:17:19

need to see more of as we start

1:17:21

to get systems deployed into more

1:17:23

and more settings. So that was

1:17:25

like why we did the paper.

1:17:28

I don't know, if you want

1:17:30

to talk about the technical methods

1:17:32

behind how we did it or

1:17:34

something, but it's, yeah. Do you want to

1:17:36

go to that? Okay, sure. I can try.

1:17:38

Yeah, okay. So, for the next,

1:17:40

two minutes, let's assume some level

1:17:42

of linear algebra knowledge. If this

1:17:44

is not you, then I apologize.

1:17:47

I will try and explain it

1:17:49

in a way that makes some

1:17:51

sense. So the way that the

1:17:53

models work is they

1:17:55

have a sequence of layers,

1:17:57

and each layer is a

1:17:59

transformation of the previous layer.

1:18:01

And the layers have some size, some

1:18:03

width. And it turns out that the

1:18:06

last layer of a model goes from

1:18:08

a small dimension to a big dimension.

1:18:10

So this is like the internal dimension

1:18:12

of these models is, I don't know,

1:18:14

let's say 2048 or something. And the

1:18:17

output dimension is the number of tokens

1:18:19

in the vocabulary. This is like 50,000.

1:18:21

And so what this means is that

1:18:23

if you look at the vectors that

1:18:26

are the outputs of the model, even

1:18:28

though it's in this big giant dimensional

1:18:30

space, this 50,000 dimensional space, actually the

1:18:32

vectors, because this was a linear transformation,

1:18:34

are only in this 2,048-dimensional subspace. And

1:18:37

what this means is that if you

1:18:39

look at this space, you can actually

1:18:41

compute what's called the singular value decomposition

1:18:43

to recover how the space was embedded

1:18:46

into this bigger space. This directly, like,

1:18:48

the number of, okay, I'll say a

1:18:50

phrase, the number of non-zero singular values

1:18:52

tells you the size of the model.
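
A toy numpy sketch of that idea, assuming you can observe full logit vectors (in the real attack, coaxing those out of the API is the part that takes work): the logits are hidden states pushed through one final linear map, so their numerical rank, read off the singular values, reveals the hidden width.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab_size, n_queries = 256, 5_000, 512

# Toy final layer: a linear map from the hidden dimension up to the vocabulary.
W = rng.normal(size=(vocab_size, hidden_dim))

# Pretend each API query hands us one full logit vector.
hidden_states = rng.normal(size=(n_queries, hidden_dim))
logits = hidden_states @ W.T            # shape (n_queries, vocab_size)

# The logit vectors span only a hidden_dim-dimensional subspace, so the
# count of non-negligible singular values recovers the model's width.
s = np.linalg.svd(logits, compute_uv=False)
estimated_width = int((s > 1e-6 * s[0]).sum())
print(estimated_width)                  # prints 256, matching hidden_dim
```

With more queries than the model is wide, the singular values past the hidden dimension drop to numerical noise, which is the signal being described.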

1:18:54

Again, like, it's like this, it's not

1:18:57

challenging math. It's like, this is, you

1:18:59

know, the last time I used this

1:19:01

was as an undergrad in math. But, you

1:19:03

know, it's, if you work out the

1:19:06

details, it ends up working out. And

1:19:08

it turns out that, yeah, this is

1:19:10

exciting. It's like a very nice

1:19:12

application of math to these kinds

1:19:14

of things. And I think part of

1:19:17

the reason why I like the details

1:19:19

here is this is like the kind

1:19:21

of thing that, like, doesn't require

1:19:23

an expert in any one area. Like

1:19:26

it's, like, undergrad-level math. Like I

1:19:28

could explain this to anyone who has

1:19:30

completed the first course in linear algebra.

1:19:32

But like you need to be that

1:19:34

person and you need to also understand

1:19:37

how language models work and you need

1:19:39

to also be thinking about the security.

1:19:41

and you need to be thinking about

1:19:43

what the actual API is that it

1:19:46

provides because you can't get the standard

1:19:48

stuff like you have to like be

1:19:50

thinking about all the pieces. This is why

1:19:52

I think, you know,

1:19:54

it was interesting: this is

1:19:57

what a security person does. It's, like,

1:19:59

it's not the case that we're looking

1:20:01

at any one thing in isolation. Like, sometimes you do look at

1:20:03

something far deeper than any one thing

1:20:06

but most often, with these exploits, how

1:20:08

they happen is that you have a

1:20:10

fairly broad level of knowledge and you're

1:20:12

looking at how the details of the

1:20:14

API interacts with how the specific architecture

1:20:17

of the language model set up, using

1:20:19

techniques from linear algebra, and if you

1:20:21

were missing any one of those pieces,

1:20:23

you wouldn't have seen this attack as

1:20:25

possible, which is why the OpenAI

1:20:28

API had this for three years, and

1:20:30

no one else found it first. It's

1:20:32

like they were not looking for this

1:20:34

kind of thing. You don't stumble upon

1:20:37

these kinds of vulnerabilities, like you need

1:20:39

people to actually go look for them,

1:20:41

and then, you know, again, responsible disclosure,

1:20:43

we gave them 90 days to fix

1:20:45

it. They patched it, Google patched it,

1:20:48

a couple of other companies who we

1:20:50

won't, we won't name because they asked

1:20:52

not to, patched it, and it works,

1:20:54

and so that was, it was a

1:20:57

fun paper to write. Amazing. Well, Nicholas

1:20:59

Carlini, thank you so much for joining

1:21:01

us today. It's been an honor having

1:21:03

you on. Thank you.
