Episode Transcript
0:00
The fact that it can
0:02
make valid moves almost always
0:04
means that it must in
0:06
some sense have something internally
0:08
that is accurately modeling the
0:10
world. I don't like to
0:13
ascribe intentionality or these things,
0:15
these kinds of things. But
0:17
it's doing something that allows
0:19
it to make these moves
0:21
knowing what the current board
0:24
state is and understanding what
0:26
it's supposed to be doing.
0:29
Everyone means something different by reasoning.
0:31
And so the answer to the question,
0:33
is that reasoning, is entirely what you
0:35
define as reasoning. And so you find
0:38
some people who are very much in
0:40
the world of, I don't think models
0:42
are smart, I don't think that they're
0:44
good, they can't solve my problems, and
0:46
so they say, no, it's not reasoning,
0:49
because to me, reasoning means, and then
0:51
they give a definition which excludes language
0:53
models. And then you ask someone who's
0:55
very sort of much on the AGI,
0:57
you know, language models are going to
1:00
solve everything. By 2027, they're going
1:02
to displace all human
1:04
jobs. You ask them, what is
1:06
reasoning? And they say reasoning is.
1:09
Hi. So I'm Nicholas Carlini. I'm
1:11
a research scientist at Google Deep Mind.
1:13
And I like to try and make
1:15
models do bad things. and understand the
1:17
security implications of the attacks that we
1:19
can get on these models. I really
1:21
enjoy breaking things and I've been doing
1:23
this for a long time, but I'm
1:26
just very worried that because they're impressive,
1:28
we're going to have them applied in
1:30
all kinds of areas where they ought
1:32
not be, and, as a result,
1:34
the attacks that we have on these
1:36
things are going to end up with
1:38
bad security consequences. MLST is
1:40
sponsored by CentML, which is the
1:43
compute platform specifically optimized for AI
1:45
workloads. They support all of the
1:47
latest open source language models out
1:49
of the box, like Llama for
1:51
example. You can just choose the
1:53
pricing points, choose the model that
1:55
you want, it spins up, it
1:57
elastically auto-scales, you can pay
1:59
on consumption. where you
2:01
can have a model which is
2:03
always working or it can be
2:05
freeze-dried when you're not using it.
2:07
So what are you waiting for?
2:09
Go to centml.ai and sign
2:12
up now. There is also a new AI research lab I'm starting in Zurich. It is funded from Paz Ventures, involved in AI as well. We are hiring both
2:22
chief scientists and deep learning engineers and researchers, and so we are a Swiss version of DeepSeek. And so
2:28
a small group of people, very,
2:31
very motivated, very hardworking, and we
2:33
try to do some research studying LLMs and reasoning models. We
2:37
want to investigate, reverse engineer, and
2:39
explore the techniques ourselves. Nicholas Carlini,
2:41
welcome to MLST. Thank you. Folks
2:43
at home, Nicholas won't need any
2:45
introduction whatsoever, definitely by far the
2:47
most famous security researcher in ML, working at Google, and it's
2:52
so amazing to have you here
2:54
for the second time. Yeah, the
2:56
first time yeah was a nice
2:58
pandemic one, but no, it was
3:00
great. Yes, MLST is one of
3:02
the few projects that survived the
3:04
pandemic, which is pretty cool. But
3:06
why don't we kick off then?
3:09
So do you think we'll ever
3:11
converge to a state in the
3:13
future where our systems are insecure
3:15
and we're just going to learn
3:17
to live with it? I mean,
3:19
that's what we do right now,
3:21
right? In normal security. There
3:23
is no perfect security for anything.
3:26
If someone really wanted you to
3:28
have something bad happen on your
3:30
computer, like they would win. There's
3:32
very little you could do to
3:35
stop that. We just rely on
3:37
the fact that probably the government
3:39
does not want you in particular
3:41
to have something bad happen. Right?
3:44
Like, if they decided that, like,
3:46
I'm sure that they have something
3:48
that they could do, that they
3:50
would succeed on. Well, we can
3:53
get into a world of is...
3:55
The average person probably can't succeed
3:57
in most cases. This is not
3:59
where we are with machine learning
4:02
yet. With machine learning the average
4:04
person can succeed almost always. So
4:06
I don't think our objective should
4:08
be perfection in some sense, but
4:11
we need to get to somewhere
4:13
where it's at least the case
4:15
that a random person off the
4:17
street can't just really easily run
4:20
some off-the-shelf GitHub code that makes
4:22
it so that some model does
4:24
arbitrary bad things in arbitrary settings.
4:26
Now I think getting there is going to
4:28
be very very hard. We've tried,
4:31
especially in vision, for the
4:33
last 10 years or something, to get
4:35
models that are robust, and we've
4:37
made progress. We've learned a lot, but
4:39
if you look at the objective metrics,
4:41
like they have not gone up by
4:43
very much in like the last four
4:46
or five years at all, and this makes
4:48
it seem somewhat unlikely that
4:50
we're going to get perfect robustness
4:52
here in the foreseeable future.
4:54
But at least... We can still
4:56
hope that we can do research
4:58
and make things better and eventually
5:01
we'll get there. And I think
5:03
we will, but it just is going
5:05
to take a lot of work. So, let me ask you this question.
5:09
Do you ever think in the future
5:12
that it will become illegal
5:14
to hack ML systems? I
5:16
have no idea. I mean, it's
5:18
very hard to predict these kinds
5:20
of things. It's very hard to
5:22
know, is it already, especially in the
5:24
United States, the Computer Fraud and Abuse
5:26
Act, covers who knows what in whatever
5:29
settings? I don't know. I think this
5:31
is a question for the policy and
5:33
the lawyer people. And my view on
5:35
policy and law is, as long as
5:38
people are making these decisions, coming from
5:40
a place of what is true in
5:42
the world, they can make their decisions. The
5:44
only thing that I... try and make comments
5:46
on here is like, let's make sure that
5:48
at least we're making decisions based on what
5:50
is true and not decisions based on what
5:53
we think the world should look like. And
5:55
so, you know, if they base their decisions
5:57
around the fact that we can attack these
5:59
models and various bad things could happen, then they're more experts
6:04
at this than me and they can decide,
6:06
you know, what they should do. But yeah,
6:08
I don't know. In the context
6:10
of ML security, I mean,
6:12
really open-ended questions, just to
6:15
start with. Sure. Can you
6:17
predict the future? What's gonna
6:19
happen? Future for ML security. My,
6:21
okay, let me give you a
6:23
guess. I think the probability of
6:25
this happening is very small, but
6:28
like the... the median prediction, I
6:30
think, in some sense. I think
6:32
models will remain vulnerable
6:34
to fairly simple attacks for
6:37
a very long time, and
6:39
we will have to find ways
6:41
of building systems so that
6:43
we can rely on an
6:45
unreliable model and still have
6:48
a system that remains secure.
6:50
And what this probably means
6:52
is we need to figure
6:54
out... a way to design the rest of the
6:56
world, the thing that operates around the model,
6:59
so that if it decides that it's going
7:01
to just randomly classify something
7:03
completely incorrectly, even if just
7:06
for random chance alone, the system
7:08
is not going to go and perform
7:10
a terribly misguided action, and that
7:12
you can correct for this, but that we're
7:14
going to have to live with a world
7:16
where the models remain very
7:19
vulnerable for, yeah, I don't know. for
7:21
the foreseeable future, at least as far as
7:23
I can see, and, you know, especially in machine learning time, five years is an eternity.
7:28
I have no idea what's going to
7:30
happen with, you know, what the world will
7:32
look like with machine learning, language models, who knows, something else might happen. Language models
7:37
are only like, you know, seven years of
7:39
like real significant progress, so like predicting five
7:41
years out is like almost doubling this. So
7:44
I don't know how the world there will
7:46
look, but at least as long as we're
7:48
in this world where things are... fairly
7:50
vulnerable. But then again, language
7:52
models are only, you know, seven years
7:55
and we've only been trying to attack
7:57
them for like really two or three.
7:59
So... give five years, that's twice as
8:02
long as we've been trying to
8:04
attack these language models. Maybe we
8:06
just figure everything out. Maybe language
8:08
models are fundamentally different and
8:10
things aren't this way, but
8:12
my prior just tends to be the
8:14
case of other vision models we've been
8:16
trying to study for 10 years and
8:18
at least there things have been proven
8:20
very hard, and so my expectation is that things will be hard, and so we'll have to just rely on building systems
8:27
that end up working. And actually, when
8:30
you first put out this article about
8:32
chess playing, I've cited it on the
8:34
show about 10 times. So it's really,
8:37
really interesting. But let me read a
8:39
bit out of it. By the way,
8:41
it's called playing chess with large language
8:43
models. So you said, until this week, in order to be good at chess, a machine learning model had to be explicitly designed to play chess, and then it would win. And you said that this all changed on Monday when OpenAI released GPT-3.5 Turbo Instruct.
9:03
Can you tell me about that? What
9:06
GPT-3.5 Turbo Instruct did, and later what other people have done with open-source models, which you can verify are not doing something weird behind the scenes, because I
9:14
think some people speculated, well maybe they're
9:17
just cheating in various ways, but like
9:19
there are open source models that replicate
9:21
this now. What you have is you
9:23
have a language model that... can
9:25
play chess to a fairly high degree.
9:28
And yeah, okay, so when you first
9:30
tell someone, I have a machine
9:32
learning system that can play chess.
9:34
The immediate reaction you get is
9:36
like, why should I care? You
9:38
know, we had Deep Blue, whatever, 30 years ago, that could beat the
9:42
best humans and like isn't that some form
9:45
of like you know a little bit of
9:47
AI at the time like why should I
9:49
be at all surprised by the fact that
9:51
I have some system like this that can play chess. And, yeah, so the fundamental difference
9:56
here I think is very interesting
9:58
is that the model was trained on
10:00
a sequence of moves. So in
10:02
chess you represent moves, you know, 1. e4 means, you know, move the king's pawn to e4, and then you have, you know, e5, black responds, and then 2. Nf3, whatever, white plays the knight out, whatever. You train on these sequences of moves, and then you just say "6.", language model, do
10:21
your prediction task. It's like just
10:23
a language model, it is being
10:25
trained to predict the next token,
10:28
and it can play a move
10:30
that not only is valid, but
10:32
also is very high quality. And
10:34
this is interesting because it
10:37
means that the model can play moves accurately. Like, let's just talk about the valid part in the first place. Valid is interesting in and of itself, because...
10:48
What is a valid chess move
10:50
is like a complicated
10:53
program to write. It's not
10:55
an easy thing to do
10:57
to describe what moves are
10:59
valid in what situations. You
11:01
can't just be dumping out
11:04
random characters and stumble
11:06
upon valid moves. And you
11:09
have this model that makes
11:11
valid moves every time. And
11:13
so I don't like talking a
11:15
lot about what's the model
11:17
doing internally, because I don't
11:19
think that's all that helpful. I
11:22
think, you know, just look at
11:24
the input-output behavior of the system
11:26
as the way to understand these things.
11:28
But the fact that it can
11:30
make valid moves almost always means
11:32
that it must in some sense
11:35
have something internally that is
11:37
accurately modeling the world. I don't
11:39
like to ascribe, you know, intentionality
11:41
or any of these things, these
11:44
kinds of things. But it's doing
11:46
something. that allows it to make
11:48
these moves knowing what the current board
11:50
state is and understanding what it's
11:52
supposed to be doing. And this
11:55
by itself I think is interesting.
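A minimal sketch of that setup, assuming the python-chess library and a hypothetical complete(prompt) helper for whichever model is being queried: replay a game prefix to recover the board, ask the model for the next move, and check whether the move it returns is actually legal in that position.

```python
# Minimal sketch: is the model's predicted next move legal in this position?
# Assumes the python-chess library and a hypothetical complete(prompt) helper.
import chess

prefix = ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6", "O-O", "Be7"]

board = chess.Board()
for san in prefix:
    board.push_san(san)  # replay the prefix to recover the current board state

prompt = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6."
candidate = complete(prompt).split()[0]  # hypothetical model call, e.g. "Re1"

try:
    board.parse_san(candidate)  # raises ValueError if the move is not legal here
    print(f"{candidate} is legal in this position")
except ValueError:
    print(f"{candidate} is not legal; the model lost track of the board")
```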
11:57
And then not only can it
11:59
do... it could actually play high
12:01
quality moves. And so I
12:03
think, you know, taken together, it
12:06
in some sense tells me
12:08
that the model has a
12:10
relatively good understanding of what
12:12
the actual position looks like.
12:15
Because, you know, okay, so I play
12:17
chess at a modest level, like
12:19
I'm not terrible, I understand, you
12:21
know, more or less what I should
12:24
be doing, but if you just gave
12:26
me a sequence of 40 moves in a
12:28
row, and then said, you know, 41 point,
12:30
like, what's the next move? Like, I
12:32
could not reconstruct in my mind what
12:34
the board looked like at that point
12:36
in time. Somehow the model has figured
12:38
out a way to do this, like, having never
12:40
been told anything about the rules, that, like, rules even exist at all. Like, it's sort of reconstructed all of that, and
12:47
it can put the pieces on the
12:49
board correctly in whatever way that it
12:51
does it internally, who knows how that
12:53
happens, and then it can play a valid move. Like... it's sort of very
12:57
interesting that this is something you can
13:00
do. And I just like, for me
13:02
it changed the way that I think
13:04
about what models can and can't do, whether it's, like, surface-level statistics or deeper statistics about what's actually going on. And I don't
13:13
know, this is I guess mainly why
13:15
I think this is an interesting
13:17
thing about the world. Yeah, we
13:19
have this weird form of human
13:22
chauvinism around the abstractness of our
13:24
understanding. And these artifacts have a
13:26
surface level of understanding, but it's
13:28
at such a great scale that
13:30
at some point it becomes a
13:32
weird distinction without a difference. But
13:34
you said something very interesting in
13:36
the article. You said that the model
13:39
was not playing to win, right? And you were
13:41
talking about, and I've said this on the show,
13:43
that the models are a reflection of you. Right,
13:45
so you play like a good chess player and
13:48
it responds like a good chess player and it's
13:50
like that whether you're doing coding, whether you're doing
13:52
it. And it might even explain some of the
13:54
differential experiences people have because you go on LinkedIn
13:57
and those guys over there clearly aren't getting very good responses out of LLMs,
14:01
but then folks like yourself, you're
14:03
using LLMs and you're at the
14:05
sort of the galaxy brain level
14:07
where you're sort of like pushing
14:10
the frontier and people don't even
14:12
know you're using LLMs. So there's
14:14
a very differential experience. Yeah, okay,
14:17
so let me explain what I mean
14:19
when I said that. So if you take a given
14:21
chess board, you can find multiple ways of reaching that.
14:23
You know, you could take a board that happened because
14:25
of a normal game between two chess grandmasters, and you
14:27
can find a sequence of absurd moves that no one
14:29
would ever play that actually brings you to the board
14:31
state. So what you do is, like, piece by piece, you say, well, the knight goes here and there, and you construct some absurd sequence of moves that ends up in the correct board state,
14:56
and then you could ask the model, now play a
14:58
move. Okay, and then what happens? The
15:01
model plays a valid move. Still,
15:03
most of the time, it knows what
15:05
the board state looks like, but the
15:07
move that it plays is very, very
15:09
bizarre. It's like a very weird move. Why?
15:11
Because what has the model been trained
15:14
to do? The model was never told
15:16
the game of chess to win. The
15:18
model was told, make things that are like
15:20
what you saw before. It saw a sequence
15:22
of moves that looked like two people who
15:25
were rated like negative 50 playing a game
15:27
of chess. And it's like, well, OK, I
15:29
guess the game is to just make valid
15:31
moves and just see what happens. And
15:33
they're very good at doing this. And you
15:36
can do this in this bot in the
15:38
synthetic way. And also what you can do
15:40
is you can just find some explicit
15:42
cases where you can just get models
15:44
to make terrible move decisions, just because that's what people do commonly when they're playing. And, you know, most people fall for this trap, and it's like, I was trained to play like whatever the training data looked like, and so I guess I ought to fall for this trap. And this is one of the
16:01
problems of these models is they're
16:04
not initially trained to do the
16:06
play-to-win thing. Now, as far as
16:08
how this applies to actual language
16:11
models that we use, we
16:13
almost always post-train the models
16:15
with RLHF and SFT instruction
16:17
fine-tuning things. And a big
16:19
part of why we do that is so
16:21
that we... don't have to deal with
16:23
this mismatch between what the model was
16:26
initially trained on and what we actually
16:28
want to use it for. And this
16:30
is why GPT-3 is exceptionally hard to
16:33
use, and the sequence of instruct papers
16:35
was very important, is that it takes
16:37
the capabilities that the model has somewhere
16:39
behind the scenes and makes it much
16:42
easier to reproduce. And so when
16:44
you're using a bunch of the chat models
16:46
today, most of the time, you don't have
16:48
to worry nearly as much about exactly how you frame the question, because of this,
16:52
you know, they were designed to give
16:55
you the right answer even when you
16:57
ask the silly question, but I think
16:59
they still do have some of this,
17:01
but I think it's maybe less than
17:03
if you just have the raw base
17:05
model that was being trained on whatever
17:07
data happened to be trained on. Yeah, I'd
17:09
love to do a tiny digression on
17:11
RLHF because I was speaking with Max
17:14
from Cohere yesterday. They've done some amazing
17:16
research talking all about you know how
17:18
this preference steering works, and they
17:21
say that like humans are actually really
17:23
bad at kind of like distinguishing a
17:25
good thing from another thing you know
17:27
so we like confidence we like verbosity
17:30
we like complexity and for example I
17:32
really hate the ChatGPT model because of
17:34
the style I can't stand the style so
17:36
I even though it's right I think it's
17:38
wrong you know so when we do that
17:40
kind of post-training on the language models, how does that affect the competence?
17:45
I don't know. Yeah, I mean, I feel like
17:47
it's very hard to answer some of these
17:49
questions because oftentimes you don't have
17:51
access to the models before they've
17:53
been post-trained. You can look at
17:56
these numbers from the papers, so
17:58
like in the GPT-4 technical report. One of these reports,
18:03
they have some numbers that
18:05
show that the model before
18:07
it's been post-trained. So just
18:09
the raw-based model is very
18:11
well calibrated. And what this means
18:13
is when it gives an answer with some
18:15
probability it's right about that probability of the
18:17
time. So if, you know, you give it a math question and it says the answer is five, and the token probability is 30%, then it's right about 30% of the time.
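A minimal sketch of that calibration check, with hypothetical (confidence, correct) pairs standing in for data you would collect by querying a model on questions with known answers: bucket answers by the model's stated token probability and compare each bucket's accuracy against it.

```python
# Minimal sketch of a calibration check over hypothetical (confidence, correct) pairs.
from collections import defaultdict

results = [(0.31, True), (0.28, False), (0.33, False), (0.91, True), (0.88, True), (0.87, False)]

buckets = defaultdict(list)
for confidence, correct in results:
    buckets[round(confidence, 1)].append(correct)  # group into roughly 10%-wide bins

for conf_bin in sorted(buckets):
    outcomes = buckets[conf_bin]
    accuracy = sum(outcomes) / len(outcomes)
    # A well-calibrated model has accuracy close to conf_bin in every bin.
    print(f"stated ~{conf_bin:.0%} -> correct {accuracy:.0%} of the time (n={len(outcomes)})")
```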
18:28
But then when you do the post-training process, the calibration gets all messed up and it doesn't have this behavior anymore. So, like, some things change. You know, you
18:36
can often have the models that just like
18:38
get fantastically better when you do post-training because
18:40
now they follow instructions much better. You haven't
18:42
really taught them all much new, but it
18:44
looks like it's much smarter. Yeah, I think
18:46
this is all a very confusing thing. I
18:48
don't have a good understanding of how all
18:50
of these things fit together. I mean, given
18:52
you know, these models, they make valid
18:55
moves, they appear to be competent, but
18:57
sometimes they have these catastrophic weird failure
18:59
modes. Yes. Do we call that process
19:01
reasoning or not? I'm
19:06
very big on not
19:08
ascribing intentionality
19:10
or I don't want to,
19:12
everyone means something different by reasoning. And so the answer to the question, is that reasoning, is entirely what you define
19:21
as reasoning. And so you find
19:23
some people who are very much
19:25
in the world of, I don't
19:27
think models are smart, I don't
19:29
think that they're good, they can't
19:31
solve my problems, and so they
19:33
say, no, it's not reasoning, because
19:35
to me, reasoning means, and then
19:38
they give a definition, which excludes
19:40
language models. And then you ask
19:42
someone who's very much on the
19:44
AGI, you know, language models are
19:46
going to solve everything, by 2027,
19:48
they're going to displace all human jobs. You ask them, what is reasoning? And they say reasoning is whatever the process is that the model is doing, and then they tell you, yes, it's reasoning. And
19:56
so I think, you know, it's very
19:58
hard to talk about whether it's actually reasoning or not. I think the thing that we can talk about is, like, what is the input-output
20:07
behavior? And, you know, does the model do
20:09
the thing that answers the question,
20:11
solves the task, and was challenging
20:14
in some way, and like, did it
20:16
get it right? And then we can
20:18
go from there, and I think this
20:20
is an easier way to try and
20:22
answer these questions than to ascribe intentionality to something. Like,
20:27
I don't know, it's just really
20:29
hard to have these debates with
20:31
people when you start off without
20:33
having the same definitions. I know, I'm really
20:35
torn on this because, as you say,
20:37
the deflationary methodology is it's an input-output
20:39
mapping. You could go one step up,
20:42
so Bengio said that reasoning is basically knowledge plus inference, you know, in some probabilistic sense. I think it's
20:48
about knowledge acquisition or the recombination of
20:50
knowledge and then it's the same thing
20:53
with agency, right? You know, the simplistic
20:55
form is that it's just like, you
20:57
know, an automata. It's just like a,
20:59
you know, you have like an environment
21:01
and you have some computation and you
21:04
have an action space and it's just
21:06
this thing. You know, but it feels necessary
21:08
to me to have things like autonomy and
21:10
emergence and intentionality in the definition. But you
21:12
could just argue, well, why are you saying
21:14
all of these words, then it does the
21:16
thing, then it does the thing. Yeah, and this
21:19
is sort of how I feel. I mean, I
21:21
think it's very interesting to
21:23
consider this, like, is it reasoning?
21:25
If you have a background in
21:27
philosophy and that's what you're going
21:29
for. I don't have that. So I don't
21:31
feel like I have any qualification to
21:33
tell you whether or not the model
21:36
is reasoning. I feel like the thing
21:38
that I can do is say, here is how
21:40
you're using the model, you want
21:42
it to perform this behavior, let's
21:44
just check. Like, did it
21:46
perform the behavior yes or no?
21:49
And if it turns out that it's
21:51
doing the right thing in all of
21:53
the cases, I don't know that
21:55
I care too much about whether
21:57
or not the model reasoned...
22:00
its way there or it used
22:02
a lookup table? Like if it's
22:04
giving me the right answer every
22:06
time, like, let's, I don't know,
22:08
I tend to not focus too
22:10
much on how it got there.
22:12
We have this entrenched sense that
22:14
we have parsimony and robustness, you
22:17
know, for example, in this chess
22:19
notation, if you change the syntax
22:21
of the notation, it probably would
22:23
break, right? Yes. And yeah, if
22:25
you, like, there are multiple chess
22:27
notations. And right, and I have
22:29
tried this, so before there was
22:31
the current notation we use, the notation you would see in, like, old chess books was, you know, King's Bishop moves to, like, Queen's three, whatever. You just number the squares differently.
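To make the contrast concrete, here is the same opening written both ways; the descriptive column is an illustrative reconstruction of the older book style, not a quote from the episode.

```python
# The same moves in modern algebraic notation and the older descriptive notation.
opening = [
    ("1. e4 e5",   "1. P-K4 P-K4"),     # king's pawn forward two, for both sides
    ("2. Nf3 Nc6", "2. N-KB3 N-QB3"),   # squares are named from each player's own side
    ("3. Bb5 a6",  "3. B-QN5 P-QR3"),   # Ruy Lopez: bishop to queen's knight five
]
for algebraic, descriptive in opening:
    print(f"{algebraic:<12} | {descriptive}")
```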
22:46
If you ask a model in
22:49
this notation, it has no idea
22:51
what's happening, and it will write
22:53
something that looks, surface level, like a sequence of moves, but has
22:57
nothing to do with the correct
22:59
board state. And of course, yeah,
23:01
a human would not do this
23:03
if you ask them to produce
23:06
the sequence of moves. It would
23:08
take me a long time to
23:10
remember which squares, which things, how
23:12
to write these things down, I
23:14
would have to think harder. But
23:16
like, I understand what the board
23:18
is, and like, I can get
23:21
that correct. And the model doesn't
23:23
do that right now. And
23:25
so maybe this is your definition of reasoning
23:28
and you say the reasoning doesn't happen. But
23:30
like someone else could have said, why should
23:32
you expect the model to generalize this thing
23:34
that's never seen before? Like I know, like,
23:36
it's like interesting to me. We've gone from
23:38
a world where we wrote papers about the
23:40
fact that if you trained a model on
23:42
image net, then like, well, obviously it's going
23:45
to have this failure mode that when you
23:47
corrupt the images, the accuracy goes down or
23:49
you can't like... Suppose I wrote a paper
23:51
seven years ago: I trained my model on ImageNet, I tested it on CIFAR-10, and it didn't work, isn't this model so bad? People would, like, laugh at you. Like, well, of course, you trained it on ImageNet, one
24:02
distribution, you tested it under, it's a different
24:04
one, you never asked it to generalize, and
24:06
it didn't do it. Like, good job. Like, of course,
24:08
it didn't solve the problem. But today, what do we
24:10
do with language models? We train them on one distribution.
24:12
We test them on different distribution that it wasn't trained
24:14
on sometimes, and then we laugh at the model, like, isn't it so dumb. It's like, you didn't train it on the thing. You know, maybe for some future model, you could hope that it would just magically generalize across domains,
24:27
but like we're still using machine learning. Like,
24:29
you need to train it on the kind
24:31
of data that you want to test it on,
24:33
and then the thing will behave much better than
24:35
if you don't do that. So in an email correspondence
24:37
to me, you said something you didn't
24:40
use these exact words, but you said
24:42
that... there are so many instances where
24:44
you kind of feel a bit noobed
24:46
because you made a statement, you know,
24:48
your intuition is you're a bit skeptical,
24:51
you said there's stochastic parrots, and then
24:53
you got proven wrong a bunch of
24:55
times, and it's the same for me.
24:57
Now, one school of thought is, you
24:59
know, Rich Sutton, you just throw more
25:02
data and compute at the thing, and
25:04
the other school of thought is that we
25:06
need to have completely different
25:08
methods. Yeah. Right,
25:11
so there are some people I
25:13
feel like who have good visions
25:15
about the future might look
25:17
like, and then there are people
25:19
like me who just look at
25:22
what the world looks like and
25:24
then try to say, well, let's
25:26
just do interesting work here. I
25:29
feel like this works for me
25:31
because for security in
25:33
particular, it really only matters
25:36
if people are doing the thing to
25:38
attack the thing. And so I'm fine
25:40
just saying, like, let's look at what
25:42
is true about the world and write
25:44
the security papers. And then if the
25:46
world significantly changes, we can try and
25:48
change. And we can try and be
25:50
a couple years ahead looking where things
25:53
are going so that we can do security
25:55
ahead of when we need to. But I tend,
25:57
because of the area that I'm in, not to
25:59
spend a lot of time trying to think
26:01
about where are things going to be in the
26:03
far future. I think a lot of people try
26:05
to do this and some of them are good
26:07
at it and some of them are not and I
26:10
have no evidence that I'm good at it. So I
26:12
try and mostly reason based on what
26:14
I can observe right now. And if what
26:16
I can observe changes, then I ought to
26:18
change what I'm thinking about these things and
26:21
do things differently. And that's the best that
26:23
I can hope for. On this chess thing, has
26:25
anyone studied, you know, like, in the headers for the chess notation, you could say
26:30
this player had an elo of 2,500 or
26:32
something like that. And I guess the first
26:34
thing is like, do you see some
26:36
commensurate change in performance? But what
26:38
would happen if you said elo
26:40
4,000? Right. Yes, we've actually trained
26:43
some models trying to do this, it
26:45
doesn't work very well. It's like
26:47
you can't, like, you can't trivially.
26:49
At least, yeah, if you just change
26:51
the number, we've trained some models ourselves
26:53
on headers that we expect to have
26:55
an even better chance of doing this,
26:57
and it did not directly give this
26:59
kind of immediate win, which, again, maybe is just to say that I am not good at training models. Someone else who knows what
27:06
they're doing might have been able to
27:08
make it have this behavior, but when
27:10
we trained it and when we tested
27:12
3.5 Turbo Instruct, it, like, it might have a statistically significant difference on the outcome, but it's nowhere near the case that you tell
27:20
the model is playing like a 1,000 rated
27:23
player and all of a sudden it's 1,000
27:25
rated. People have worked very hard to
27:27
try and train models that will let
27:29
you match the skill to an arbitrary
27:31
level and like it's like research paper
27:33
level thing not just like change three
27:35
numbers and hope for the best.
27:38
Right so you wrote another article
27:40
called Why I Attack. Sure. And
27:42
you said that you enjoy attacking systems
27:44
for the fun of solving puzzles
27:46
rather than altruistic reasons. Can you
27:48
tell me more about that, but also
27:50
why did you write that article?
27:52
Yeah, okay. So let me answer them in
27:54
the opposite order to ask them. So why
27:57
did I write the article? Some people were mad at me for breaking defenses. Like, they said that I don't care about humanity, I just, I don't know, want to make them look bad or something. And half of that statement is true: I don't do security because, like, I want to do maximum good
28:27
and therefore I'm going to think
28:29
about, like, what are all of
28:31
the careers that I could do
28:34
and try and find the one
28:36
that's most likely to, like, save
28:38
the most lives. You know, if
28:40
I had done that, I probably
28:42
would, I don't know, be a
28:44
doctor or something like, you know,
28:46
actually, like, immediately helps people, or
28:48
you could research on cancer, like,
28:51
find whatever domain that you
28:53
wanted, where you could, like, measure maximum good. I can't motivate myself to do
28:59
them. And so if I was a
29:01
different person, maybe I could do
29:04
that. Maybe I could be
29:06
someone who could meaningfully solve
29:09
challenging problems in biology by
29:11
saying like, I'm waking
29:13
up every morning knowing that
29:15
I'm sort of like saving
29:17
lives or something. But this is not how I work, and I feel
29:22
like it's not how lots of people work,
29:24
you know, there are lots of people who
29:26
I feel like are in computer science and
29:28
or you want to go even further in
29:31
like quant fields where like you're clearly brilliant
29:33
and you could be doing something a lot
29:35
better with your life. And some of
29:37
them probably legitimately just would
29:39
just have zero productivity if
29:42
they were doing something that they just really
29:44
did not find any enjoyment in. And
29:46
so I feel like the thing that
29:48
I try and do is, okay, find
29:50
the set of things that you can
29:52
motivate yourself to do and like will
29:54
do a really good job in, and
29:56
then solve those as good as possible,
29:59
subject to the... that like you're actually
30:01
net positive moving things forwards. And for
30:03
whatever reason, I've always enjoyed attacking things
30:05
that I'm, I feel like I'm differentially
30:07
much better at that than at anything
30:10
else. And like, I feel like I'm
30:12
pretty good at doing the adversarial machine
30:14
learning stuff, but I have no evidence
30:16
that I would be at all good
30:18
at the other, you know, 90% of
30:21
things that exist in the world that
30:23
might do better. And so, I don't know, the way that I'd put it, maybe in one sentence, is like: it's how good you are at the thing, multiplied by
30:34
how much the thing matters, and you're
30:36
trying to sort of maximize that product,
30:38
and if there's something that you're really
30:40
good at, that at least direction, moves
30:43
things in the right direction, you can
30:45
have a better, higher impact than taking
30:47
whatever field happens to be the one
30:49
that is like maximally good and moving
30:51
things forwards by a very small amount.
30:54
And so that's why I do attacks
30:56
is because I feel like generally they
30:58
move things forward and I feel like
31:00
I'm better than most other things that
31:02
I could be doing. Now you also
31:05
said that attacking is often easier than
31:07
defending. Certainly. Tell me more. I mean,
31:09
this is the stand-up thing in security.
31:11
You need to find one attack that
31:13
works. And you need to fix all
31:16
of the attacks if you're defending. And
31:18
so if you're attacking something. The only
31:20
thing that I have to do is
31:22
find one place where you've forgotten to
31:24
handle some corner case, and I can
31:27
arrange for the adversary to hit that
31:29
as many times as they need until
31:31
they succeed. You know, this is why
31:33
you have normal software security. You can
31:35
have a perfect program everywhere except
31:38
one line of code, where you forget
31:40
to check the bounds exactly once. And
31:42
what does this mean? The attacker will
31:44
make it so that that happens every
31:47
single time, and the security of your product is essentially zero. Under random settings,
31:51
this is never going to happen. Like
31:53
it's never going to happen that the hash of the file is exactly, like, you know, equal
32:00
to 2 to the 32, which overflows
32:02
the integer, which causes the bad stuff
32:04
to happen, this is not going to
32:06
happen by random chance, but the attacker
32:09
can just arrange for this to happen every
32:11
time, which means that it's much easier
32:13
for the attacker than the defender who
32:15
has to fix all of the things.
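A toy sketch of that asymmetry, not from the episode, with 32-bit wraparound emulated explicitly since Python integers do not overflow: the forgotten corner case essentially never fires on random inputs, but a crafted input trips it every single time.

```python
import os

def checksum32(data: bytes) -> int:
    return sum(data) & 0xFFFFFFFF           # emulate 32-bit wraparound

def process(data: bytes) -> str:
    if len(data) > 0 and checksum32(data) == 0:
        return "forgotten corner case hit"  # the one branch assumed to be unreachable
    return "normal path"

# Random inputs essentially never hit the bad branch.
hits = sum(process(os.urandom(64)) == "forgotten corner case hit" for _ in range(10_000))
print(hits)  # ~0

# The attacker crafts an input whose byte sum is exactly 2**32, so the 32-bit
# checksum wraps to zero and the bad branch fires deterministically.
crafted = bytes([255]) * 16843009 + bytes([1])   # 255 * 16843009 + 1 == 2**32
print(process(crafted))                          # forgotten corner case hit
```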
32:17
And then in machine learning it gets even
32:19
worse, because at least in normal security and
32:22
software security or other areas, like we understand
32:24
the classes of attacks. In machine learning we
32:26
just constantly discover new categories of bad things
32:28
that could happen. And so not only do
32:30
you have to be robust to the things
32:32
that we know about, you have to be
32:34
robust to someone like coming up with a
32:36
new clever just like type of attack that
32:38
we hadn't even thought of before and be
32:40
robust there. And this is not happening because
32:42
of... the way, I mean, it's a very
32:44
new field, and so of course it's just
32:46
much easier for these attacks than
32:48
defenses. Let's talk about disclosure norms,
32:51
how should they change now that
32:53
we're in the ML world? Okay,
32:55
yeah. So in standard software
32:57
security, we've basically figured out
33:00
how things should go.
33:02
So for a very long time,
33:04
you know, for 20 years, there was
33:06
a big back and forth about: when someone finds a bug in some
33:10
software that can be exploited, like what
33:13
should they do? And let's say, I don't know,
33:15
late 90s, early 2000s, there were people who were
33:17
on the full-disclosure side, which thought: I find
33:19
a bug in some program, what should I do?
33:22
I should tell it to everyone so that
33:24
we can make sure that people don't make
33:26
a similar mistake and we can put pressure
33:28
on the person to fix it and do
33:30
all that stuff. And then there were the
33:32
people who were on the like... don't disclose
33:34
anything. Like, you should report the bug to
33:36
the person who's responsible and wait until they
33:39
fix it, and then you should tell no
33:41
one about it, and because, you know, this
33:43
was a bug that they made, and you
33:45
don't want to give anyone else ideas
33:48
for how to exploit. And in
33:50
software security, we landed on this,
33:52
you know, what was called responsible
33:54
disclosure, and is now coordinated disclosure,
33:57
which is the idea that you should give the person, the one person responsible, a reasonable heads-up for some amount of time.
34:03
Google Project Zero has a 90-day
34:06
policy, for example. And you have
34:08
that many days to fix your
34:10
thing, and then after that, or
34:12
once it's fixed, then it gets
34:15
published to everyone. And the idea
34:17
here in normal security is that you
34:19
give the person some time to protect
34:21
their users. You don't want to
34:24
immediately disclose a new attack that
34:26
allows people to cause a lot
34:28
of harm. But you put a deadline on
34:30
it and you stick to the deadline to
34:32
put pressure on the company to actually fix
34:34
the thing. Because what often happens if you
34:36
don't say you're going to release things publicly
34:38
is no one else knows about it. You're
34:41
the only one who knows the exploit. They're
34:43
just going to not do it because they're in
34:45
the business of making a product not fixing bugs. And
34:47
so why would they fix it if no one else
34:49
knows about it? And so when you say, like, no,
34:51
this will go live in 90 days, like, you better
34:54
fix it before, then they have the time. It's just
34:56
like, now, if they don't do it, it's on them
34:58
because they just didn't put in the work to fix
35:00
the thing. And there are, of course, exceptions. You
35:02
know, Spectre and Meltdown are two of the most famous exploits, like some of the biggest attacks in the last 10, 20 years in software security, and they gave Intel and related people a year to fix this,
35:15
because it was a really important bug. It
35:17
was a hard bug to fix. There were
35:20
like legitimate reasons why you should do this.
35:22
There's good evidence that like it's probably not
35:24
going to be independently discovered by the bad
35:26
people for a very long time. And so
35:28
they gave them a long time to fix
35:31
it. And you know, similarly, Google Project Zero
35:33
also says if they find evidence the bug
35:35
is being actively exploited, they'll give you
35:37
seven days. You know, if there's someone
35:39
actually exploiting it, then you have seven days
35:42
before they'll patch. And so they might as
35:44
well tell everyone, given that harm is being done, because if they don't, then it's just going to delay things. Okay, so
35:50
with that long preamble, how should things change
35:52
for machine learning? The short answer is I
35:55
don't know because on one hand I want
35:57
to say that this is like how things are in software security. And sometimes it
36:01
is, where someone has some bug in
36:03
their software, and there exists a way
36:06
that they can patch it and fix
36:08
the problems. And in many cases,
36:10
this happens. So we've written papers
36:12
recently, for example, we've shown how
36:14
to do some like model stealing
36:16
stuff. So OpenAI has a model, and we could query OpenAI's services in a way that allowed us to
36:22
steal part of their model, only very
36:24
small part, but we could steal part
36:27
of it. So we disclose this to
36:29
them because there was a way that they
36:31
could fix it. They could make a change
36:33
to the API to prevent this attack from
36:36
working, and then we write the paper and
36:38
put it online. This feels very much
36:40
like software security. On the
36:42
other hand, there are some other kinds
36:45
of problems that are not the kinds
36:47
that you can patch. Let's think in
36:49
the broadest sense, adversarial
36:51
examples. If I disclosed to
36:53
you, here is an adversarial example
36:56
on your image classifier.
36:58
What is the point of doing the
37:00
responsible disclosure period here? Because there
37:02
is nothing you can do to fix
37:04
this in the short term. We have been
37:07
trying to solve this problem for 10 years.
37:09
Another 90 days is not going to help you
37:11
at all. Maybe I'll tell you out of
37:13
courtesy to let you know this thing that
37:16
I'm doing. I'm going to write this paper
37:18
here, so I'm going to describe it. Do
37:20
you want to put in place a couple
37:22
of filters ahead of time to make this
37:25
particular attack not work? But you're not going
37:27
to solve the underlying problem. Like, for biology things, the argument they make is, you know, suppose someone came up with a way to create some novel pathogen or something; a disclosure period
37:36
doesn't help you here and so is it
37:38
more like that or is it more like
37:40
software security I don't know I mean I'm
37:42
more biased a little bit towards the software
37:44
security it's because that's what I came from
37:47
but it's yeah hard to say exactly which
37:49
one we should be modeling things after I
37:51
think we do probably need to come up
37:53
with new norms for how we handle this
37:55
There are a lot of people I know who are talking
37:57
about this trying to write these things down and I think
38:00
I think in a year or two, if you ask
38:02
me this again, we will have set processes
38:04
in place, we will have established norms for
38:06
how to handle these things now, I think
38:08
this is just like very early and right
38:11
now we're just looking for analogies in other
38:13
areas and trying to come up with what
38:15
sounds most likely to be good, but I
38:17
don't have a good answer for you immediately
38:20
now. Are there any vulnerabilities that you've
38:22
decided not to pursue for
38:24
ethical reasons? No,
38:30
not that I can think of,
38:33
but I think mostly because
38:35
I tend to only try
38:37
and think of the
38:39
exploits that would be ethical
38:41
in the first place. So
38:43
I just, like, it may happen that I, like, I stumble upon this, but I tend to, like... I think research ideas, in some very small fraction of the time, happen just by random inspiration.
39:00
Most of the time, though,
39:02
research ideas is not something
39:05
that just happens. Like, you
39:07
have spent conscious effort trying
39:09
to figure out what new
39:11
thing I'm going to try and do.
39:13
And I think it's pretty easy
39:15
to just, like, not think about
39:17
the things that seem morally fraught
39:19
and just focus on the ones
39:21
that seem like they actually have
39:24
potential to be good and useful.
39:26
But... It very well may happen at
39:28
some point that this is something
39:30
that happens, but this is not a
39:33
thing that I... I can't think of any
39:35
examples of attacks that we've
39:37
found that we've decided not to
39:40
publish because of the harms
39:42
that they would cause, but I can
39:44
imagine that this might happen; I can't rule it out. But I
39:50
tend to just like bias my search
39:52
of problems in the direction of... things
39:54
that I think are actually beneficial. I
39:56
mean, maybe going back to like the
39:58
why I attack things. You
40:02
want the product of how good you
40:04
are and how much good it does
40:06
for humanity to be maximally positive. You
40:08
can choose what problems you work on to
40:11
not be the ones that are negative.
40:13
And so I don't have lots of
40:15
respect for people where the direction of
40:17
the goodness of the world is like
40:19
just a negative number. Because you can
40:21
choose to make that a very least zero,
40:23
just like don't do anything. And so
40:26
I try and pick the problems that
40:28
I think are generally positive, and do as well as possible on those ones.
40:32
So you work on traditional security and ML security. What are the
40:37
significant differences? Yeah, okay, so I
40:39
don't work too much on traditional
40:41
security anymore. So I started my
40:43
PhD in the traditional security.
40:45
Yeah, I did very, very low-level return-oriented programming. I was at
40:49
Intel for a summer on some
40:52
hardware level defense stuff. And then I
40:54
started machine learning shortly after that.
40:56
So I haven't worked on the
40:59
very traditional security in, like, the last, let's say, seven or eight years. But yeah, I still
41:06
follow it very closely. I still
41:08
go to the system security conferences
41:10
all the time because I think it's
41:12
like a great community. But yeah,
41:15
what are the similarities and differences?
41:17
I feel like the systems security
41:20
people are very good at really
41:22
trying to make sure that what
41:24
they're doing is like a very
41:26
rigorous thing and, like, evaluated really thoroughly and properly. You know,
41:30
you see this even in like the
41:32
length of the papers. So a system
41:35
security paper is like 13, 14 pages long, two-column; a paper that's a submission for ICLR is like seven or eight or something, one column. Like, you know, the system
41:43
security papers will all start with like a
41:46
very long explanation of exactly what's happening. The
41:48
results are expected to be really rigorously done.
41:50
A machine learning paper often is here is
41:52
a new cool idea, maybe it works. And
41:55
like this is good for like, you know,
41:57
move fast and break things. This is not
41:59
good for... like really systematic studies,
42:01
you know, when I
42:04
was doing system security
42:06
papers, I would get like, you know,
42:08
one, one and a half, two a year.
42:10
And now, for a similar kind of thing with machine learning papers, like, you know, you
42:17
could probably do five or six
42:19
or something, like, to the same
42:21
level of rigor. And so I
42:23
feel like this is, like, maybe the biggest thing I see in my mind. And I think it's
42:30
worked empirically in the machine learning space. Like
42:32
it would not be good if every research
42:34
result in machine learning needed to have the
42:37
kind of rigor you would have expected
42:39
for a systems paper, because we would have
42:41
had like five iteration cycles in total.
42:44
And, you know, at machine learning conferences, you often see the paper, the paper that improved upon the paper, and the paper that improved upon that paper, all at the same conference, because the first person put it on arXiv, the next person found the tweak that made it better, and the third person found the tweak that made that even better. And so I think,
43:13
yeah, having some kind of balance and
43:15
mix between the two is useful. And
43:17
this, I think, is maybe the biggest
43:19
difference that I see. And this is, I guess,
43:21
maybe if there's some differential advantage that
43:24
I have in the machine learning space,
43:26
I think some of it comes from
43:28
this where in systems, you know, you
43:30
were trained very heavily on this kind
43:32
of rigorous thinking and how to do
43:35
attacks very thoroughly, look at all of
43:37
the details. And when you're doing security,
43:39
this is what you need to do. And
43:41
so I think some of this training
43:43
has been very beneficial for me in
43:45
writing machine learning papers, thinking about
43:48
all of the little details to get
43:50
these points right, because I had a
43:52
paper recently where the way that I broke
43:54
some defense and the way that the thing
43:56
broke is because there was a negative sign
43:58
in the wrong spot. And like, it's
44:01
like, this is not the kind of
44:03
thing that like, I could have reasoned
44:05
from first principles about
44:07
the code, like if I had been
44:09
advising someone, like, I don't know how
44:11
I would have told them, check all
44:14
the negative signs. It's like,
44:16
you don't know. Like, what you should be doing is, like, understanding everything that's going on and finding the one
44:26
part where the mistake was made so that you can fix it. So you wrote another article; it was called Why I Use AI, and you wrote it about a couple of months ago.
44:35
And you say that you've been
44:37
using language models, you find them
44:39
very useful, they improve your programming
44:41
productivity by about 50%. I can
44:44
say the same myself. Maybe let's
44:46
start there. I mean, can you
44:48
break down specifically like the kind
44:50
of tasks where it's really uplifted
44:53
your productivity? So I
44:55
am not someone who like believes in
44:57
these kinds of things. You know, I
44:59
don't, there are some people who
45:01
their job is to hype things
45:04
up and their job is to
45:06
get attention on these kinds
45:08
of things. And I feel
45:10
like the thing that I was annoyed about is that these people, the same people who were,
45:17
you know, Bitcoin is going to
45:19
change the world, whatever, whatever, whatever.
45:22
As soon as language models come
45:24
about, they all go language models
45:26
are going to change the world,
45:28
they're very useful, whatever, whatever, whatever.
45:31
And the problem is that if you're
45:33
just looking at this from afar,
45:35
it looks like you have the people
45:37
who are the grifters just finding the
45:39
new thing. And they are, right? Like,
45:41
this is what they're doing. This is
45:44
what they're doing. But at the same time,
45:46
I think that the models that we
45:48
have now are actually useful. And
45:50
they're not useful for merely as
45:52
many things as people like to say
45:55
that they are, but for a particular
45:57
kind of person, the person who understands what is going on in these
46:01
models and knows how to code and
46:03
can review the output, they're useful. And
46:05
so what I wanted to say is like, I'm
46:07
not going to try and argue that they're
46:10
good for everyone, but I want to
46:12
say like here is an n equals
46:14
one, me anecdote, that I think they're
46:16
useful for me, and if you have
46:19
a background similar to me, then maybe
46:21
they're useful for you too. And you
46:23
know, I've got a number of people
46:26
who... are like, you know, security-style people
46:28
who have contacted me, it's like, you
46:30
know, thanks for writing this, like, you
46:32
know, they have been useful for me,
46:35
and yeah, now there's a question of,
46:37
does my experience generalize to anyone else?
46:39
I don't know, this is not
46:42
my job to try and understand
46:44
this, but at least what I
46:46
wanted to say was, yeah, they're
46:48
useful for people who behave like
46:50
I do. Okay, now, why are they
46:52
useful? The current models
46:55
we have now are good enough
46:57
that for the kinds of things where I want an answer to some question, whether it's write this function for me or whatever, it's like I know how
47:07
to check it, I know that I
47:09
could get the answer. It's like something
47:11
I know how to do, I just
47:14
don't want to do it. The
47:16
analogy I think is maybe
47:18
most useful is... Imagine
47:20
that you had to write all of
47:23
your programs in C or in Assembly.
47:25
Would this make it so that you
47:27
couldn't do anything that you can do
47:29
now? No, probably not. You could do
47:31
all of the same research results in
47:33
C instead of Python if you really
47:36
had to. It would take you a lot
47:38
longer because you have an idea in your
47:40
mind. I want to implement, you
47:43
know, something trivial, you know, some binary
47:45
search thing. And then in C
47:47
you have to start reasoning about pointers
47:49
and memory allocation and all these little
47:51
details that are at a much lower
47:53
level than the problem you want to
47:55
solve. And the thing I think is
47:57
useful for language models is that if you
47:59
know... the problem you want to solve and you
48:02
can check that the answer is right, then
48:04
you can just ask the model to implement
48:06
for you the thing that you want in
48:08
the words that you want to just type
48:10
them in, which are not terribly well defined,
48:12
and then it will give you the answer and
48:14
you could just check that it's correct and
48:17
then put it in your code and
48:19
then continue solving the problem you want to be solving, and not the problem of actually typing out all the details.
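A minimal sketch of that ask-then-check workflow, assuming a hypothetical ask_model(prompt) helper for whichever chat model is in use: have the model write the routine, then verify it against a reference before putting it in your code.

```python
import random
from bisect import bisect_left

# ask_model is a hypothetical helper for whatever chat model you use.
code = ask_model("Write a Python function binary_search(xs, target) that returns "
                 "the index of target in the sorted list xs, or -1 if absent.")
namespace = {}
exec(code, namespace)                        # run the model's code...
binary_search = namespace["binary_search"]

for _ in range(1_000):                       # ...then check it against a reference
    xs = sorted(random.sample(range(10_000), k=100))
    target = random.choice(xs + [-1])        # -1 is never present in xs
    i = bisect_left(xs, target)
    expected = i if i < len(xs) and xs[i] == target else -1
    assert binary_search(xs, target) == expected
print("model-written binary_search agrees with the reference on 1,000 cases")
```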
48:27
That's maybe the biggest class, I think,
48:30
of things that I find useful.
48:32
And the other class of things I
48:34
find useful are the cases where you
48:36
rely on the fact that the model
48:38
has just enormous knowledge about
48:41
the world, and about all kinds
48:43
of things. And if you understand
48:45
the fundamentals, but like, I don't
48:47
know the API to this thing, just
48:50
like... make the thing work under the API
48:52
and I can check that easily or
48:54
you know I don't understand how to write
48:56
something in some particular language like give
48:58
me the code like if you if
49:00
you give me code in any language
49:02
even if I've never seen it before
49:04
I can basically reason about what it's
49:06
doing like you know I may make
49:08
mistakes around the border but like I
49:11
could never have typed it because I
49:13
don't know the syntax whatever the models
49:15
are very good at giving you the
49:17
correct syntax and just like getting everything
49:19
else out of the way and then I
49:21
can figure out the rest about how
49:23
to do this and you know if
49:25
I if I couldn't ask the model
49:27
I would have had to have learned
49:29
the syntax for the language to type
49:31
out all the things or do what
49:33
people would do you know five years
49:35
ago copy and paste some other person's
49:37
code from stack overflow and you know
49:39
to make adaptations and it was like
49:41
a strictly worse version of just asking
49:43
the model because now I'm relying on
49:45
me. my view is that for these
49:47
kinds of problems that they're currently plenty
49:49
useful. If you already understand and
49:52
by that I mean an abstract understanding then
49:54
they're a superpower which explains why you
49:56
know like the smarter you are actually the
49:58
more you can get out of
50:00
a language model. But how has
50:02
your usage evolved over time? And
50:04
just what's your methodology? I mean,
50:07
you know, speaking personally, I know
50:09
that specificity is important. So going
50:11
to source material and constructing the
50:13
prompt, you know, imbuing my understanding
50:15
and reasoning process into the prompt.
50:17
I mean, how do you think
50:19
about that? Yeah. I guess I try and ask
50:22
questions that I think have a reasonable
50:24
probability of working. And I
50:26
don't ask questions where I feel like
50:28
this was going to slow me down.
50:30
But if I think it has, you
50:33
know, a 50% chance of working,
50:35
I'll ask the model first. And
50:37
then I'll look at the output
50:39
and see, like, does this direction look correct? And if it seems directionally correct, great. And then I learn from that. Now, there are people who say they
50:59
can't get models to do anything useful
51:01
for them. Yeah, it may be the
51:03
case that models are just really bad
51:06
at a particular kind of problem. It
51:08
may also just be you don't have
51:10
a good understanding what the
51:12
models can do yet. You know, if you,
51:14
like, I think most people, you know, today
51:16
have forgotten how much they had to
51:18
learn about how to use Google search.
51:21
You know, like. People today, if I tell
51:23
you to look something up, you implicitly know
51:25
the way that you should look something up
51:27
is to like use the words that appear
51:29
in the answer. You don't ask it as
51:31
a form of a question. There's a way
51:33
that you type things into the search
51:35
engines to get the right answer. And
51:37
this requires some amount of skill and
51:40
understanding of how to reliably
51:42
find answers to something online. I feel
51:44
like it's the same thing for language
51:46
models. They have a natural language
51:49
interface. So like technically you could
51:51
type whatever thing that you wanted,
51:53
there are some ways of doing it that
51:55
are much more useful than others, and I
51:57
don't know how to teach this as a skill
52:00
other than just saying, like, try
52:02
the thing. And maybe it turns
52:04
out they're not good at your
52:06
task and then just don't use
52:08
them. But if you are able
52:11
to make them useful, then this
52:13
seems like a free productivity
52:15
win. But, you know, this is the
52:17
kind of thing where, yeah, again, it's caveated on: you have to have some understanding of what's actually going on with these things, because, you know, there are people who don't, who try to do these similar kinds of things, and then I'm worried: are you going to learn anything? You won't catch the bugs when the bugs happen. There are all kinds of problems I'm worried about from that
52:48
perspective but like for the
52:50
like practitioner who wants to
52:52
get work done I feel
52:54
like, in the same way that I wouldn't
52:56
say you need to use C over Python,
52:58
I wouldn't say you need to use just
53:01
Python or Python plus language models. Yes, yes.
53:03
I agree that, you know, laziness and acquiescence
53:05
is a problem. Vibes and intuition are really
53:07
important. I mean, I consider myself a Jedi
53:09
of using LLMs and sometimes it frustrates me
53:12
because I say to people, oh, you know,
53:14
just use an LLM. I seem to be
53:16
able to get so much more out of LLMs
53:18
than other people and I'm not entirely
53:21
sure why that is. Maybe it's just because I
53:23
understand the thing that I'm prompting or something
53:25
like that, but it seems to be something
53:27
that we need to learn. Yeah, I mean,
53:29
every time a new tool comes about, you have
53:32
to spend some time, you know, I remember
53:34
when people would say, real programmers
53:36
write code in C and don't write
53:38
it in a high-level language. Why would
53:40
you trust the garbage collector to do
53:42
a good job? Real programmers manage their
53:44
own memory. Real
53:47
programmers write their own Python. Why would you
53:50
trust the language model to output code that's
53:52
correct? Why would you trust it to be
53:54
able to have this recall? Real programmers understand
53:56
the API and don't need to look up
53:59
the reference manual. Can we draw the
54:01
same analogies here? And now I
54:03
think this is the case of like
54:05
when the tools change and make it
54:08
possible for you to be more productive
54:10
in certain settings you should be willing
54:12
to look into the new
54:14
tools. I know I'm always trying
54:16
to rationalize this because it comes
54:19
down to this notion of is
54:21
the intelligence in the eye of the
54:23
prompter? You know, does it matter?
54:25
I think the answer is no, in
54:27
some cases I think the answer is
54:29
yes, but I'm not going to look
54:32
at it that way. The
54:34
thing makes me more productive
54:36
and solves the task for
54:38
me. Was it the case that
54:40
I put the intelligence in? Maybe?
54:42
I think, in many cases, I
54:44
think the answer is yes, but I
54:47
don't... I'm not going to look
54:49
at it this way. I'm going to
54:51
look at it as like, is it
54:53
solving the questions that I want
54:55
in a way that's useful for
54:57
me. I think here the answer
55:00
is definitely yes, but yeah,
55:02
I don't know how to answer this
55:04
in some real way. So obviously,
55:06
as a security researcher, how does
55:09
that influence the way that
55:11
you use LLMs? Oh yeah, yeah,
55:13
no, this is why I'm scared about
55:15
the people who are going to use
55:17
them and not understand things because, you
55:19
know, you ask them to write an
55:21
encryption function for you and the answer
55:24
really ought to be, you should not
55:26
do that. You should be calling this
55:28
API. And oftentimes they'll be like, sure,
55:30
you want me to write an encryption function,
55:32
here's an encryption function,
55:34
and it's going to have all of the
55:36
bugs that everyone normally writes. Or this
55:39
is going to a database. And
55:41
what did the model do? It wrote
55:43
the thing that was vulnerable to SQL
55:45
injection. And this is terrible. If
55:47
someone was not being careful, they
55:49
would not have caught this. And
55:52
now they'll introduce all kinds of bad
55:54
bugs. Because I'm reasonably
55:56
competent at programming, I can read
55:58
the output of the model. and just
56:00
like correct the things where it made
56:02
these mistakes. Like it's not hard to
56:05
fix the SQL injection and replace the
56:07
string concatenation with the, you know, the
56:09
templates. The model just didn't do it
56:11
correctly. And yeah, so I'm very worried about the
56:14
kind of person who's not going to do
56:16
this. There have been a couple of papers
56:18
by people showing that people do write
56:20
very insecure code when using language models
56:22
when they're not being careful for
56:24
these things. And yeah, this is
56:26
something I'm worried about. It looks
56:28
like it might be the case
56:31
that code is differentially more vulnerable when
56:33
people use language models versus when they
56:35
don't. And yeah, this is, I think, a big concern.
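As a concrete illustration of the specific mistake mentioned above, here is a small Python sketch of the vulnerable pattern and the fix; the table, the rows, and the payload are invented for the example and are not from any real system.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")
conn.execute("INSERT INTO users VALUES ('bob', 'bob@example.com')")

user_input = "nobody' OR '1'='1"  # a classic injection payload

# The vulnerable pattern a model (or a person) might produce: the input is
# concatenated straight into the SQL string, so the payload becomes syntax.
vulnerable = conn.execute(
    "SELECT email FROM users WHERE name = '" + user_input + "'"
).fetchall()

# The fix described above: a parameterized (templated) query, where the
# driver treats the input strictly as data.
safe = conn.execute(
    "SELECT email FROM users WHERE name = ?", (user_input,)
).fetchall()

print("concatenated query returned:", vulnerable)   # leaks every row
print("parameterized query returned:", safe)        # returns nothing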
56:37
I think the reason why I tend to
56:39
think about this utility question
56:41
is often just from the perspective
56:43
of, yeah, security of things that people
56:45
use actually matters. And so I want
56:48
to know what are the things that
56:50
people are going to do so you
56:52
can then write the papers and study
56:54
what people are actually going to do. So
56:56
I feel like it's important to separate:
56:58
can the model solve the problem for
57:00
me? And the answer for the language
57:02
models using it is oftentimes, yes, it gives
57:04
you the right answer for the common case.
57:06
And this means most people don't care
57:08
about the security question. And so they'll
57:11
just use the thing anyway, because it
57:13
gave them the ability to do this new
57:15
thing, not understanding the security
57:17
piece. And so that means we should then go
57:19
and do security around this other
57:22
question of like. We know people are going
57:24
to use these things, we ought to do
57:26
the security to make sure that the security
57:28
is there, so that they can use them
57:30
correctly. And so I often try and use
57:32
things that are at the frontier of what
57:34
people are going to do next, just to
57:37
try and put myself in their frame of
57:39
mind and to understand this. And yeah,
57:41
this worries me quite a lot, because, yeah,
57:43
things could go very bad here. How and
57:45
when do you verify the outputs of LLMs?
57:48
The same way you check anything else. I
57:50
mean, like, this is the other thing. People
57:52
say, like, you know, maybe the model's gonna
57:54
be wrong, but like, half of the answers
57:56
on Stack Overflow are wrong anyway. So, like,
57:58
it's not the case that this is new. Like,
58:01
if you've been programming for a
58:03
long time, like, you're used to
58:05
the fact that you read code
58:07
that's wrong, I'm not going to
58:09
copy and paste some function on
58:11
stack overflow and just assume that
58:13
it's right, because, like, maybe the
58:15
person asked a question that was
58:17
different than the question that I
58:20
asked, like, whatever, like, so I
58:22
feel like I'm not, I don't
58:24
feel like I'm doing anything terribly
58:26
different. Maybe the only difference
58:28
is that I'm using the models more
58:30
often, and so I have to be
58:33
more careful in checking, like, you know,
58:35
if you're using something twice as often,
58:37
then if you're introducing bugs with something,
58:39
you know, you're going to have twice
58:41
as many bugs, and you're using it
58:44
twice as much, and so you have
58:46
to be a little more careful. But
58:48
I don't feel like it's anything different I'm
58:50
doing. 95% solutions are still 95% solutions.
58:52
You take the thing, it does almost
58:55
everything that you wanted, then like, it
58:57
maxed out its capability, it's good. You're
58:59
an intelligent person, now you finish the
59:01
last 5%, fix whatever the problem is,
59:03
and then there, you have a 20%
59:06
performance increase there. Yeah, you touched on
59:08
something very interesting here, because actually most
59:10
of us are wrong most of the
59:12
time. And that's why it's really good
59:14
to have at least one very smart
59:16
friend, because they constantly point out all
59:19
of the ways in which your stuff
59:21
is wrong. Most code is wrong. It's
59:23
your job to point out how things
59:25
are wrong. And I guess we're always
59:27
just kind of on the boundary of
59:30
wrongness unwittingly. And that's just the way
59:32
the world works anyway. Yeah, right. And
59:34
so I think that... there's a potential
59:36
for massive increases in quantity of wrongness,
59:38
you know, with language models. I think there are lots of things that could go very wrong with language models. You know, previously the amount of bad code that could be written was limited to the number of humans who could write bad code, because, like... There's
1:00:01
only so many people who could write software and
1:00:03
like you had to have at least some training
1:00:05
and so there was some bounded
1:00:07
amount of bad code. One of the other things
1:00:09
I'm worried about is, you know, you have people
1:00:11
who look at these people saying models can
1:00:13
solve all your problems for you and now you
1:00:16
have ten times as much code Which is great
1:00:18
from one perspective because isn't it
1:00:20
fantastic that anyone in the world can
1:00:22
go and write whatever software they need
1:00:25
to solve their particular problem? That's fantastic
1:00:27
But at the same time, the security person in
1:00:29
me is kind of scared about this because
1:00:31
now you have 10 times as much stuff
1:00:33
that is probably very insecure. And you are
1:00:36
not going to be able to have, you
1:00:38
don't have 10 times as many security experts
1:00:40
to study all of this. Like you're going
1:00:42
to have a massive increase in this in
1:00:44
some potential futures. And this is one of
1:00:46
the many things that I'm, I think, I'm worried about
1:00:48
and like is why I try and use these
1:00:50
things to understand, like: does this seem like something
1:00:53
people will try and do? It seems to me
1:00:55
the answer is yes right now and yeah this
1:00:57
worries me. So I spoke with some Google
1:00:59
guys yesterday and they've been studying some
1:01:01
of the failure modes of LLMs, so
1:01:03
like just really crazy stuff that people
1:01:05
don't know about like they can't copy
1:01:07
they can't count you know because of
1:01:09
the softmax and the topological representation
1:01:11
squashing, and loads and loads
1:01:13
of stuff they can't do. In your
1:01:16
experience have you noticed some kind of
1:01:18
tasks that LLMs just really struggle
1:01:20
on? I'm sure that there are many
1:01:22
of them. I have sort of learned to
1:01:24
just not ask those questions. And so
1:01:26
I have a hard time, like coming
1:01:28
up like, you know, like in the same
1:01:31
sense, like, what are the things that
1:01:33
search engines are bad for? You
1:01:35
know, I'm sure that there are a
1:01:37
million things that search engines are, like,
1:01:39
completely the wrong answer for, but if
1:01:41
I sort of pressed you for an
1:01:44
answer, you know, you'd have
1:01:46
a little bit of a hard time, because
1:01:48
of the way that you use them. All of
1:01:50
these things like whenever you want like
1:01:53
correctness in some senses the model
1:01:55
is not the thing for you like
1:01:57
in terms of like specific tasks that
1:01:59
they're particularly bad at? I mean, of course,
1:02:01
you can say anything that requires some kind
1:02:04
of, if it would take you more than,
1:02:06
you know, 20 minutes to write the program,
1:02:08
probably the model can't get that. But the
1:02:11
problem with this, like, this is changing. You
1:02:13
know, like, I, so, okay, so this is,
1:02:15
like, the other thing, like, there are things
1:02:17
that, like, I thought, would be hard that
1:02:20
end up becoming easier. So there was a
1:02:22
random problem that I wanted
1:02:24
for unrelated reasons; it's a hard dynamic
1:02:26
programming problem to solve. It took me like,
1:02:29
I don't know, two or three hours to
1:02:31
solve it, the first time that I had
1:02:33
to do it. And so o1 just launched
1:02:35
a couple days ago, I gave the problem
1:02:38
to o1, and it gave me an implementation
1:02:40
that was 10 times faster than one I
1:02:42
wrote in like two minutes. And so I
1:02:44
can test it because I have a reference
1:02:47
solution, and like it's correct. It's like, okay,
1:02:49
so now I've learned, like, here's a thing
1:02:51
that I previously would have thought, like, I
1:02:53
would never ask models to solve something, because
1:02:56
this was, like, a challenging enough algorithmic problem
1:02:58
for me, that I would have no hope
1:03:00
for the model solving, and now I can.
1:03:02
But there are other things that, you know,
1:03:05
seem trivial to me that the models get
1:03:07
wrong, but I mostly have just, like, not
1:03:09
asked those. The answers are right or wrong, and
1:03:12
they'll just apply the wrong answer as many
1:03:14
times as they can and that seems concerning
1:03:16
Yeah, I mean this is part of the
1:03:18
anthropomorphization process because I find it fascinating that
1:03:21
I think you know we have vibes We
1:03:23
have intuitions and we actually know and we've
1:03:25
learned to skirt around the failure modes. You
1:03:27
know the long tail of failure modes and
1:03:30
we just smooth it over in our supervised
1:03:32
usage of language models and the amazing thing
1:03:34
is we don't seem to be consciously
1:03:36
aware of it Yeah, but like programmers do
1:03:39
this all the time, right? Like you have
1:03:41
a language, the language has some, like, okay,
1:03:43
so, let's suppose you're
1:03:45
someone who writes Rust.
1:03:48
Rust has a very,
1:03:50
very weird model of
1:03:52
memory. If you go
1:03:54
to someone who's very
1:03:57
good at writing Rust,
1:03:59
they will structure the
1:04:01
program differently so they
1:04:04
don't encounter all of
1:04:06
the problems because of
1:04:08
the fact that you
1:04:10
have this weird memory
1:04:13
model. But if I
1:04:15
were to do it, like I'm not very good
1:04:17
at Rust, like I try and use it and like
1:04:19
I try and write my C code in Rust
1:04:21
and like the borrow checker just like yells at me
1:04:23
to no end and I can't write my program. And
1:04:25
like I look at Rust and go like, I
1:04:28
see that this could be very good but I just don't
1:04:30
know how to get my code right because I haven't done
1:04:32
it enough. And so I look at the language and go, okay,
1:04:35
if I was not being charitable,
1:04:37
I would say, why would anyone use this? It's
1:04:39
impossible to write my C code in Rust. Like you're supposed
1:04:41
to have all these nice guarantees but like no, you have
1:04:43
to change the way you write your code in order to
1:04:45
get, change your frame of mind and then
1:04:47
the problems will just go away. Like it's sort of like you
1:04:49
can do all of the nice things just
1:04:52
accept the paradigm you're supposed to
1:04:54
be operating in and the thing goes very
1:04:56
well. I
1:04:58
see the same kind of analogy for some of
1:05:00
these kinds of things here where the models are not
1:05:02
very good in certain ways and you're trying to
1:05:04
imagine that the thing is a human and ask it
1:05:06
the things you would ask another person but it's
1:05:08
not. And you need to ask it
1:05:10
in the right way or ask the right kinds
1:05:12
of questions and then you can get the value and
1:05:15
if you don't do this then you'll end up
1:05:17
very disappointed because it's not superhuman. What
1:05:20
are your thoughts on benchmarks? Okay,
1:05:23
yes, I have thoughts here. This
1:05:27
I guess is the problem with language models
1:05:29
is we used to be in a world
1:05:31
where benchmarking was very easy because we wanted
1:05:33
models to solve exactly one task and
1:05:36
so what you do is you measure it on
1:05:38
that task and you see can it solve the
1:05:40
task and the answer is yes and so great,
1:05:42
you figure it out. The problem with this is
1:05:45
like that task was never the task we actually
1:05:47
cared about and this is why no one used
1:05:49
models. Like no ImageNet models ever made it
1:05:51
out into like the real world
1:05:53
to solve actual problems because we just
1:05:55
don't care about classifying between 200
1:05:57
different breeds of dogs. You know, the
1:05:59
model may be good at this,
1:06:01
but. this is not the thing we actually want. We want something different.
1:06:03
And it would have been absurd at the time
1:06:06
to say the ImageNet model can't
1:06:08
solve this actual task I care about
1:06:10
in the real world, because of course
1:06:12
it wasn't trained for that. Language models,
1:06:14
the claim that people make for language
1:06:16
models, and what the people who train them say,
1:06:18
is, I'm going to train this one
1:06:21
general purpose model that can solve arbitrary
1:06:23
tasks. And then they'll go
1:06:25
test it on some small number of
1:06:27
tasks and say, see, it's good because
1:06:29
I can solve these tasks very
1:06:31
well. And the challenge here is that
1:06:34
if I trained a model to solve
1:06:36
any one of those tasks in
1:06:38
particular, I could probably get really
1:06:40
good scores. The challenge is that you
1:06:42
don't want the person who has trained
1:06:44
the model to have done this. You
1:06:47
wanted them to just train a good
1:06:49
model and use this as an independent,
1:06:51
you know, just... Here's a
1:06:53
task that you could evaluate the model on, completely independent
1:07:01
from the initial training
1:07:03
objective in order to get like
1:07:06
an unbiased view of how well
1:07:08
the model does. But people
1:07:10
who train models are incentivized
1:07:13
to make them do well on benchmarks.
1:07:16
And while in the old world, you know, I trust researchers not to cheat. In principle, I could have trained on the test set, but that is actually cheating; you're not supposed to train on the test set. So I trust that people don't want to do this.
1:07:34
But suppose that I give you a
1:07:36
language model, and I want to
1:07:38
evaluate it on coding, which I'm
1:07:40
going to use, you know, a
1:07:43
terrible benchmark, but HumanEval, whatever. Or I'm going to use MMLU, whatever the case may be. I may now actually train my model in particular
1:07:54
to be good on these benchmarks. And
1:07:56
so you may have a model that
1:07:58
is not very capable in general,
1:08:00
but on these specific 20 benchmarks
1:08:02
that people use, it's fantastic. And
1:08:05
this is what everyone is incentivized
1:08:07
to do, because you want your
1:08:09
model to have maximum scores on
1:08:11
benchmarks. And so I think I
1:08:14
would like to be in a
1:08:16
world where there were a lot
1:08:18
more benchmarks, so that is not
1:08:20
the kind of thing that you
1:08:23
can... that you can easily do
1:08:25
and you can more easily trust
1:08:27
that the answers these models
1:08:29
give accurately reflect
1:08:31
what their skill level is in
1:08:34
some way that is not being
1:08:36
designed by the model trainer to
1:08:38
maximize the scores. So at the
1:08:40
moment, you know, like the hyperscalers
1:08:43
put incredible amounts
1:08:45
of work into benchmarking and so
1:08:47
on and... Now we're moving to
1:08:49
a world where we've got, you
1:08:51
know, test time inference, test time
1:08:54
active fine tuning, you know, people
1:08:56
are fine tuning, quantizing, fragmenting and
1:08:58
so on. And a lot of
1:09:00
the people doing this in a
1:09:03
practical sense can't really benchmark in
1:09:05
the same way. How do you
1:09:07
see that playing out? Okay, that,
1:09:09
I don't know. I don't know.
1:09:12
I don't know. It just seems
1:09:14
very hard, because, you know, to
1:09:16
test what these things are, you
1:09:18
can just, you can use the
1:09:20
average benchmarks and hope for the
1:09:23
best, but I don't, I feel
1:09:25
like the thing I'm more worried
1:09:27
about is people who are actively
1:09:29
fine-tuning models to show that they
1:09:32
can make them better on certain
1:09:38
tasks. So you have lots of
1:09:40
fine-tunes of llama, for example. I
1:09:43
think that's the thing I'm more
1:09:45
worried about. But yeah, for the
1:09:47
other cases, I don't know. I
1:09:49
agree this is hard, but I
1:09:52
don't have any great solutions here.
1:09:54
That's okay. We can't let you
1:09:56
go before talking about one of
1:09:58
your actual papers. I mean,
1:10:01
this has been amazing talking about
1:10:03
general stuff, but I decided
1:10:05
to pick this one, stealing
1:10:07
part of a production language
1:10:10
model, so this is from July,
1:10:12
could you just give us a bit
1:10:14
of an elevator pitch on that? For
1:10:16
a very long time, when we did
1:10:19
papers in security, what we did
1:10:21
was we would think about how a
1:10:23
model might be used in
1:10:25
some hypothetical future, and then
1:10:27
say, well, maybe we have
1:10:30
certain kinds of attacks that
1:10:32
are possible. Let's try and
1:10:34
show in some theoretical
1:10:36
setting, this is something that
1:10:38
could happen. And so there's
1:10:41
a line of work called model
1:10:43
stealing, which tries to answer
1:10:45
the question, can someone take
1:10:47
the model that you have,
1:10:50
and just like
1:10:52
by making standard queries
1:10:54
to your API, steal a copy
1:10:56
of it. This was started by
1:10:58
Florian Tramer and others in
1:11:00
2016, where they did this
1:11:02
on like very very simple
1:11:05
linear models over APIs. And
1:11:07
then it became a thing
1:11:09
that people started studying
1:11:11
in deep neural networks. And
1:11:13
there were several papers in
1:11:15
a row by a bunch of other
1:11:17
people. And then in 2020, we wrote
1:11:19
a paper that we put at CRYPTO
1:11:22
that said, well, here is a
1:11:24
way to steal an exact copy
1:11:26
of your model. Like,
1:11:29
whatever the model you have is, I
1:11:31
can get an exact copy. As long as
1:11:33
you have a long list of assumptions,
1:11:35
it's only using ReLU activations.
1:11:37
The whole thing is evaluated
1:11:39
in floating point 64. I can
1:11:41
feed floating point 64 of values in,
1:11:43
I can see floating point 64
1:11:46
values out, the model is only fully
1:11:48
connected, its depth is no greater than
1:11:50
three, it has like no more than
1:11:52
32 units wide on any given layer,
1:11:54
like it just has a long list
1:11:56
of things that are never true
1:11:58
in practice. But it's a
1:12:01
very cool theoretical result. And there
1:12:03
are other papers of this kind that show
1:12:05
how to do this kind of, I steal
1:12:07
an exact copy of your model, but
1:12:09
it only works in these really contrived
1:12:11
settings. This is why we submitted
1:12:13
the paper to CRYPTO, because they
1:12:16
have all these kinds of theoretical
1:12:18
results that are very cool, but
1:12:20
are not immediately practical in many
1:12:22
ways. And then there was a
1:12:24
line of work continuing to extend upon
1:12:26
this. And the question that I
1:12:28
wanted to answer is, like, now
1:12:31
we have these language models. And if
1:12:33
I list all of the assumptions, all
1:12:35
of them are false. It's not just
1:12:37
ReLU-only activations. It's not just fully
1:12:39
connected. I can't send floating point
1:12:41
64 inputs. I can't view floating
1:12:43
point 64 outputs. They're like a
1:12:45
billion neurons, not 500. So like,
1:12:48
none of these things are true.
1:12:50
And so I wanted to answer
1:12:52
the question, like, what's the best
1:12:54
attack that we can come up
1:12:56
with? that actually I can
1:12:58
implement in practice on a real
1:13:01
API. And so this is what we
1:13:03
tried to do. We tried to come
1:13:05
up with the best attack that
1:13:07
works against the most real API
1:13:09
that we have. And so what we
1:13:11
did is we looked at the OpenAI
1:13:13
API, and some other companies, Google, had
1:13:15
the same kind of things. And
1:13:17
because of the way the API
1:13:19
was set up, it allowed us
1:13:22
to get some degree of control
1:13:24
over the outputs that let us
1:13:26
do some fancy math that would steal
1:13:28
one layer of a model. It's like
1:13:30
among the layers in the model, it's
1:13:33
probably the least interesting, it's a very
1:13:35
small amount of data, but like I
1:13:37
can actually recover one of the layers
1:13:40
of the model. And so it's real
1:13:42
in that sense that I can do it.
1:13:44
It's also real in the sense of I have
1:13:46
the layer correctly. But it's not
1:13:48
everything. And so I think what I
1:13:50
was trying to advocate for in this
1:13:53
paper is... I think we should be pursuing
1:13:55
both directions of research at the same
1:13:57
time. One is write the papers that are
1:13:59
true in some theoretical sense, but are
1:14:02
not the kinds of results that
1:14:04
you can actually implement in any
1:14:06
real system, and likely for the
1:14:08
foreseeable future, are not the kinds
1:14:10
we'll be able to implement in
1:14:12
any real systems. And also, at
1:14:14
the same time, do the thing
1:14:16
that most security researchers do today,
1:14:18
which is look at the systems
1:14:20
as they're deployed and try and
1:14:22
answer, given this system as it
1:14:24
exists right now, what are the
1:14:26
kinds of attacks that you can
1:14:28
actually really get the model to
1:14:30
do and try and write papers
1:14:32
on those pieces of it. And
1:14:34
I don't know what you're going
1:14:36
to do with the last layer
1:14:38
of the model. You know, we
1:14:40
have some things you can
1:14:42
do, but one thing it tells
1:14:44
you is the width of the
1:14:46
model, which is not something that
1:14:48
people disclose. So in our paper
1:14:50
we have, I think, the first
1:14:52
public confirmation of the width of
1:14:54
like the GPT-3 Ada and
1:14:56
Babbage models, which is not something
1:14:58
that OpenAI ever said publicly.
1:15:00
They had the GPT-3 paper that
1:15:02
gave the width of a couple
1:15:04
of models in the paper, but
1:15:06
then they never really directly said
1:15:08
what the sizes of Ada and
1:15:10
Babbage were. People speculated, but we
1:15:12
could actually run it on GPT-3. And
1:15:14
we... correctly stole the last layer,
1:15:16
and I know the size of
1:15:18
the model, and it is correct.
1:15:20
As close as possible to responsible disclosure,
1:15:22
like we talked about at the beginning,
1:15:24
we agreed with them ahead of
1:15:26
time, we were going to do
1:15:28
this. This is a fun conversation
1:15:30
to have with, you know, not
1:15:32
only Google lawyers, but OpenAI lawyers,
1:15:34
like, hi, I would like to
1:15:36
steal your model. May I please
1:15:38
do this? You know, the OpenAI
1:15:40
people were very nice. And they
1:15:42
said yes. The Google lawyers initially
1:15:44
were also very, like, you know... when I asked even the Google lawyers, like, I would like to steal OpenAI's data, it was, like, under no circumstances. But then I said, like, if I get the OpenAI general counsel to agree, are you okay with it? And, you know, we ran
1:16:01
everything, we destroyed the data, whatever.
1:16:03
But, like, part of the agreement was,
1:16:05
like, they would confirm that we got
1:16:07
it right, that we did the right thing,
1:16:09
but they asked us not to release
1:16:11
the actual data we stole. Which, like,
1:16:13
makes sense, right? Like, you want to
1:16:16
make, you want to show here's an
1:16:18
attack that works, but, like, not
1:16:20
actually release the stolen stuff. And so,
1:16:22
yeah, so, so, you know, if you were to
1:16:24
write down a list of, like, all the people in the world who know this, the list includes all current and former employees of OpenAI, and me. And
1:16:33
so like it sounds like this is like
1:16:35
a very real attack because like this is
1:16:37
like, this is the easiest, like how else
1:16:39
would you learn this? The other way to learn this
1:16:41
would be like to like hack and open
1:16:43
AI servers and try and like learn this
1:16:45
thing or like you know blackmail the
1:16:48
employees or you can do like an actual
1:16:50
machine learning attack and recover the size
1:16:52
of those models and the last layer. And
1:16:54
so that's like the sort
1:16:56
of motivation behind why we
1:16:58
wanted to write this paper was
1:17:01
to get examples and try
1:17:03
and encourage other people to
1:17:05
get examples of attacks that
1:17:07
even if they don't solve all
1:17:10
of the problems will let
1:17:12
us make them increasingly real in
1:17:14
this sense. And I think this
1:17:16
is something that we'll start to
1:17:19
need to see more of as we start
1:17:21
to get systems deployed into more
1:17:23
and more settings. So that was
1:17:25
like why we did the paper.
1:17:28
I don't know, if you want
1:17:30
to talk about the technical methods
1:17:32
behind how we did it or
1:17:34
something, but it's, yeah. Do you want to
1:17:36
go into that? Okay, sure. I can try.
1:17:38
Yeah, okay. So, for the next,
1:17:40
two minutes, let's assume some level
1:17:42
of linear algebra knowledge. If this
1:17:44
is not you, then I apologize.
1:17:47
I will try and explain it
1:17:49
in a way that makes some
1:17:51
sense. So the way that the
1:17:53
models work is they
1:17:55
have a sequence of layers,
1:17:57
and each layer is a
1:17:59
transformation of the previous layer.
1:18:01
And the layers have some size, some
1:18:03
width. And it turns out that the
1:18:06
last layer of a model goes from
1:18:08
a small dimension to a big dimension.
1:18:10
So this is like the internal dimension
1:18:12
of these models is, I don't know,
1:18:14
let's say 2048 or something. And the
1:18:17
output dimension is the number of tokens
1:18:19
in the vocabulary. This is like 50,000.
1:18:21
And so what this means is that
1:18:23
if you look at the vectors that
1:18:26
are the outputs of the model, even
1:18:28
though it's in this big giant dimensional
1:18:30
space, this 50,000 dimensional space, actually the
1:18:32
vectors, because this was a linear transformation,
1:18:34
are only in this 2048-dimensional subspace. And
1:18:37
what this means is that if you
1:18:39
look at this space, you can actually
1:18:41
compute what's called the singular value decomposition
1:18:43
to recover how the space was embedded
1:18:46
into this bigger space. And this directly, like
1:18:48
the number of, okay, I'll say a
1:18:50
phrase, the number of non-zero singular values
1:18:52
tells you the size of the model.
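Here is a minimal NumPy sketch of that idea, with toy sizes standing in for a real model and a local function standing in for the API; none of the numbers or names below come from the paper. The only point is that logit vectors produced by a final linear layer span a subspace whose dimension, read off from the singular values, reveals the hidden width.

import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab_size = 256, 4000              # toy "small" and "big" dimensions

W = rng.standard_normal((vocab_size, hidden_dim))    # the secret final layer

def query_logits(n_queries):
    # Stand-in for querying an API: each query returns one full logit vector,
    # i.e. the final linear layer applied to some hidden state.
    hidden_states = rng.standard_normal((n_queries, hidden_dim))
    return hidden_states @ W.T                       # shape (n_queries, vocab_size)

# Collect somewhat more than hidden_dim logit vectors, then look at the
# singular values of the stack: only about hidden_dim of them are non-zero,
# because every logit vector lies in the image of the final linear map.
logits = query_logits(hidden_dim + 64)
singular_values = np.linalg.svd(logits, compute_uv=False)
estimated_width = int(np.sum(singular_values > 1e-6 * singular_values[0]))
print("estimated hidden width:", estimated_width)    # prints 256

In practice the hard part is getting enough information about the logits out of a restricted API in the first place, which is the part that depends on how the specific API is set up, as he describes.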
1:18:54
Again, like, it's like this, it's not
1:18:57
challenging math. It's like, this is, you
1:18:59
know, the last time I used this
1:19:01
was an undergrad in math. But, you
1:19:03
know, it's, if you work out the
1:19:06
details, it ends up working out. And
1:19:08
it turns out that, yeah, this is
1:19:10
exciting. It's like very nice
1:19:12
math for these kinds
1:19:14
of things. And I think part of
1:19:17
the reason why I like the details
1:19:19
here is this is like the kind
1:19:21
of thing that like it doesn't require
1:19:23
an expert in any one area. Like
1:19:26
it's like undergrad-level math. Like I
1:19:28
could explain this to anyone who has
1:19:30
completed the first course in linear algebra.
1:19:32
But like you need to be that
1:19:34
person and you need to also understand
1:19:37
how language models work and you need
1:19:39
to also be thinking about the security.
1:19:41
and you need to be thinking about
1:19:43
what the actual API is that it
1:19:46
provides because you can't get the standard
1:19:48
stuff like you have to like be
1:19:50
thinking about all the pieces. This is why
1:19:52
I think you know like I think
1:19:54
what was interesting is like this is
1:19:57
what a security person does is like
1:19:59
it's not the case that we're looking
1:20:01
at any one thing deeply. Like, sometimes you do look at
1:20:03
something far deeper than any one thing
1:20:06
but most often how these exploits
1:20:08
happen is that you have a
1:20:10
fairly broad level of knowledge and you're
1:20:12
looking at how the details of the
1:20:14
API interacts with how the specific architecture
1:20:17
of the language model is set up, using
1:20:19
techniques from linear algebra, and if you
1:20:21
were missing any one of those pieces,
1:20:23
you wouldn't have seen this attack as
1:20:25
possible, which is why the OpenAI
1:20:28
API had this for three years, and
1:20:30
no one else found it first. It's
1:20:32
like they were not looking for this
1:20:34
kind of thing. You don't stumble upon
1:20:37
these kinds of vulnerabilities, like you need
1:20:39
people to actually go look for them,
1:20:41
and then, you know, again, responsible disclosure,
1:20:43
we gave them 90 days to fix
1:20:45
it. They patched it, Google patched it,
1:20:48
a couple of other companies who we
1:20:50
won't, we won't name because they asked
1:20:52
not to, patched it, and it works,
1:20:54
and so that was, it was a
1:20:57
fun paper to write. Amazing. Well, Nicholas
1:20:59
Carlini, thank you so much for joining
1:21:01
us today. It's been an honor having
1:21:03
you on. Thank you.