Episode Transcript
0:00
If you've ever wondered how generative AI works
0:02
and where the technology is heading, this
0:04
episode is for you. We're going
0:06
to explain the basics of the technology
0:08
and then catch up with modern-day advances
0:10
like reasoning to help you understand exactly
0:12
how it does what it does and
0:15
where it might advance in the future.
0:17
That's coming up with SemiAnalysis founder
0:19
and chief analyst Dylan Patel right after
0:21
this. From LinkedIn News, I'm
0:24
Leah Smart, host of Everyday Better, an award-winning podcast dedicated to personal development.
0:28
Join me every week for captivating stories
0:30
and research to find more fulfillment
0:32
in your work and personal life. Listen
0:34
to Everyday Better on the LinkedIn Podcast
0:36
Network, Apple Podcasts, or wherever you get your
0:38
podcasts. Did you know
0:41
that small and medium businesses
0:43
make up 98% of the
0:45
global economy, but most B2B marketers
0:47
still treat them as a
0:49
one-size-fits-all? LinkedIn's
0:51
Meet the SMB report
0:53
reveals why that's a
0:55
missed opportunity, and how
0:57
you can reach these
1:00
fast-moving decision-makers effectively.
1:02
Learn more at linkedin.com
1:04
backslash meet-the-smb. Welcome
1:07
to Big Technology Podcast, a show
1:09
for cool-headed and nuanced conversation of
1:11
the tech world and beyond. We're
1:14
joined today by SemiAnalysis founder and chief analyst Dylan Patel, a leading expert in semiconductor and generative AI
1:20
research and someone I've been looking forward
1:22
to speaking with for a long time.
1:25
Now, I want this to
1:27
be an episode that A helps people learn
1:29
how generative AI works and B is an
1:31
episode that people will send to their friends
1:33
to explain to them how generative AI works.
1:35
I've had a couple of those that I've
1:37
been sending to my friends and
1:39
colleagues and counterparts about what is
1:41
going on within generative AI. That
1:43
includes one, this three-and-a-half-hour-long video from Andrej Karpathy, explaining everything about training large
1:50
language models. And the second one
1:52
is a great episode that Dylan
1:54
and Nathan Lambert from the Allen
1:56
Institute for AI did with Lex Fridman. Both of those are three hours plus, so I want to do ours in
2:02
an hour. And I'm
2:04
very excited to begin. So Dylan, it's
2:06
great to see you and welcome
2:08
to the show. Thank you for having
2:10
me. Great to have you here.
2:12
Let's just start with tokens. Can you
2:14
explain how AI researchers basically take
2:16
words and then give them numerical representations
2:18
and parts of words and give
2:21
them numerical representations. So what are tokens?
2:23
Tokens are in fact like chunks of
2:25
words, right? In the human way,
2:27
you can think of like syllables, right?
2:31
Syllables are often viewed as like chunks
2:33
of words. They have some
2:35
meaning. It's the base
2:37
level of speaking, right, is syllables,
2:39
right? Now for models, tokens are
2:41
the base level of output. They're
2:43
all about like sort of compressing,
2:45
you know, sort of this is
2:47
the most efficient representation of language.
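[Editor's note: to make that concrete, here is a toy sketch of subword tokenization. The vocabulary, the IDs, and the greedy longest-match rule are all invented for illustration; real models use learned byte-pair or unigram vocabularies with tens of thousands of entries.]

```python
# Toy subword tokenizer: greedy longest-match against a tiny hand-made vocabulary.
# Everything here (the vocabulary, the IDs) is made up for illustration only.
TOY_VOCAB = {
    "un": 0, "believ": 1, "able": 2, "the": 3, "sky": 4,
    " ": 5, "is": 6, "blue": 7,
}

def tokenize(text: str) -> list[int]:
    """Split text into the longest vocabulary chunks, left to right."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, shrinking until one is in the vocab.
        for j in range(len(text), i, -1):
            chunk = text[i:j]
            if chunk in TOY_VOCAB:
                ids.append(TOY_VOCAB[chunk])
                i = j
                break
        else:
            i += 1  # skip characters the toy vocab can't represent
    return ids

print(tokenize("unbelievable"))     # [0, 1, 2]  -> "un" + "believ" + "able"
print(tokenize("the sky is blue"))  # [3, 5, 4, 5, 6, 5, 7]
```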
2:49
From my understanding, AI models are
2:51
very good at predicting patterns. So
2:54
if you give it one, three,
2:56
seven, nine, it might know
2:58
the next number is going to be
3:00
11. And so what it's doing
3:02
with tokens is taking words, breaking
3:04
them down to their component parts,
3:06
assigning them a numerical value, and
3:08
then basically in its own word, in
3:10
its own language, learning to predict what number
3:13
comes next because computers are better at
3:15
numbers than converting that number back to text.
3:17
And that's what we see come out.
3:19
Is that accurate? Yeah. And
3:21
each individual token is actually, it's
3:23
not just like one number, right?
3:25
It's multiple vectors. You could think
3:27
of like, well, the tokenizer needs
3:29
to learn King and Queen are
3:31
actually extremely similar on most vectors, in terms of, like, the English language, extremely similar, except there
3:37
is one vector in which they're super different,
3:39
because a king is a male and a
3:41
queen is a female. And then
3:43
from there, in language, oftentimes kings
3:46
are considered conquerors, and all these
3:48
are the things, and these are
3:50
just historical things. So a lot
3:52
of the text around them, while
3:54
they're both royal, regal, monarchy, et
3:56
cetera, there are many vectors in
3:58
which they differ. So it's not
4:00
just converting a word into one
4:02
number. It's converting it into multiple
4:04
vectors, and each of these vectors,
4:06
the model learns what it means,
4:08
right? You don't initialize
4:10
the model with like, hey, you
4:12
know, king means male, monarch,
4:16
and it's associated with like war and conquering, because
4:18
that's all the writing about kings is on, you
4:20
know, in history and all that, right? Like people
4:22
don't talk about the daily lives of kings that
4:24
much, or they mostly talk about like their wars
4:26
and conquests and stuff. And
4:28
so like, There will be, each
4:30
of these numbers in this embedding space,
4:32
right, will be assigned over time as the
4:34
model reads the internet's text and trains
4:36
on it, it'll start to realize, oh, King
4:38
and Queen are exactly similar on these
4:40
vectors, but very different on these vectors. And
4:42
these vectors aren't, you don't explicitly tell
4:44
the model, hey, this is what this vector
4:46
is for, but it could be like, you
4:49
know, it could be as much as like, one
4:51
vector could be like, is it a building or
4:53
not? Right, and it doesn't actually know that; you don't know that ahead of time, it just happens to in the latent space. And then all these vectors sort of relate to each other. But yeah, these numbers are an efficient representation
5:04
of words, because you
5:06
can do math on them, right? You
5:08
can multiply them, you can divide them,
5:10
you can run them through an entire
5:12
model, whereas, and your brain does something
5:14
similar, right? When it hears something, it
5:16
converts that into a frequency in your
5:18
ears, and then that gets converted to
5:20
frequencies that should go through your brain,
5:22
right? This is the same thing as
5:25
a tokenizer, right? Although it's like, obviously
5:27
a very different medium of compute, right?
5:29
Ones and zeros for computers versus, you
5:31
know, binary and multiplication, et cetera, being
5:33
more efficient, whereas humans' brains are more
5:35
like animals, analog in nature and, you know, think in waves and patterns in different ways. Uh, while they are very
5:41
different, it is a tokenizer, right? Like
5:43
language is not actually how our brain
5:45
thinks. It's just a representation for it to, you know, reason over. Yeah.
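[Editor's note: here is a small numerical sketch of the king/queen idea above. The vectors and dimension labels are invented for illustration; real embeddings have hundreds or thousands of dimensions whose meanings are learned, never hand-assigned.]

```python
import numpy as np

# Hypothetical 4-dimensional embeddings. Dimensions loosely labeled for the example:
# [royalty, conquest-related, gender, is-a-building]
king   = np.array([0.9, 0.7,  0.3, 0.0])
queen  = np.array([0.9, 0.6, -0.3, 0.0])
castle = np.array([0.5, 0.3,  0.0, 1.0])

def cosine(a, b):
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(king, queen))   # ~0.86: very similar overall despite the opposite "gender" dimension
print(cosine(king, castle))  # ~0.48: somewhat related (royalty) but differs on the "building" dimension
print(king - queen)          # the difference is concentrated in one dimension
```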
5:50
So that's crazy. So the
5:53
tokens are the efficient representation of words, but more than that, the models are also learning the way that all these words are connected, and that brings us to pre-training. From my understanding, pre-training is when
6:08
you take the entire, basically
6:10
the entire internet worth of text,
6:12
and you use that to
6:15
teach the model these representations between
6:17
each token. So therefore,
6:19
like we talked about, if you gave
6:21
a model, the sky is, and
6:23
the next word is typically blue in
6:25
the pre-training, which is basically all
6:27
of the English language, all of
6:29
language on the internet. It should know
6:32
that the next token is blue.
6:34
So what you do is you want
6:36
to make sure that when the
6:38
model is outputting information, it's closely tied
6:40
to what that next value should
6:42
be. Is that a proper
6:44
description of what happens in pre-training? Yeah,
6:47
I think that's the objective
6:49
function, which is just to reduce
6:51
loss, i.e., how often is
6:53
the token predicted incorrectly versus
6:56
correctly, right? Right, so if you
6:58
said the sky is red, That's
7:00
not the most probable outcome, so that would be
7:03
wrong. But that text is on the internet, right? Because
7:05
the Martian sky is red and there's all these
7:07
books about Mars and sci -fi. Right, so how does
7:09
the model then learn how to figure this out and
7:11
in what context is it accurate to say blue
7:13
and red? Right, so I
7:15
mean, first of all, the model
7:17
doesn't just output one token. It outputs
7:19
a distribution. It turns
7:21
out the way most people take
7:23
it is they take the top
7:25
K, i.e., the highest probability.
7:28
So yes, blue is obviously the right answer if
7:30
you give it to anyone on this planet. But
7:33
there are situations and contexts where the sky
7:35
is red is the appropriate sentence, but that's
7:37
not just in isolation, right? It's like if
7:39
the prior passage is all about Mars and
7:41
all this, and then all of a sudden
7:43
it's like, and that's like a quote from
7:45
a Martian settler, and it's like the sky
7:47
is, and then the correct token is actually
7:49
red, right? The correct word. And so it
7:51
has to know this through the attention mechanism,
7:54
right? If it was just the sky is
7:56
blue, always you're gonna output blue because blue
7:58
is, let's say, 80%, 90%, 99% likely
8:00
to be the right option. But as you,
8:02
as you start to add context about Mars
8:04
or any other planet. Other planets have different
8:06
colored atmospheres, I presume. You
8:08
end up with this distribution that starts to shift. If I add 'we're on Mars, the sky is,' then all of a sudden, blue goes from 99% in the prior context window, the text that you sent to the model, the attention of it. All of a sudden, it realizes 'the sky is blue,' uh, preceded by that stuff about Mars. Now blue rockets down to, like, you know, let's call it 20% probability, and red rockets up to 80% probability, right? Um, now the model outputs that, and then most people just end up taking the top probability and outputting it to the user.
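[Editor's note: as a rough illustration of that shifting distribution, here is a toy softmax over made-up next-token scores. The candidate words and the numbers are invented to mirror the blue/red example; they are not taken from any real model.]

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution over candidate next tokens."""
    z = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return z / z.sum()

candidates = ["blue", "red", "green"]

# Hypothetical scores the model might assign after "The sky is"
plain_logits = np.array([5.0, 1.0, 0.5])
# Hypothetical scores after "We're on Mars. The sky is" -- added context shifts the scores
mars_logits = np.array([2.0, 4.0, 0.5])

for name, logits in [("plain", plain_logits), ("mars", mars_logits)]:
    probs = softmax(logits)
    ranked = sorted(zip(candidates, probs), key=lambda p: -p[1])
    print(name, [(tok, round(float(p), 2)) for tok, p in ranked])
    # Greedy decoding just takes the top entry; samplers draw from the distribution.

# plain -> "blue" dominates; mars -> most of the probability mass moves to "red"
```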
8:46
And that's sort of, how does the model learn that? It's the attention mechanism, right. And this is sort of, what is that? Yeah, the attention mechanism
8:56
is the beauty of modern sort
8:58
of large language models. It takes
9:00
the relational value in this vector
9:02
space between every single token, right? So
9:05
the sky is blue, right? When I
9:07
think about it, yes, blue is
9:09
the next token after the sky is,
9:11
but in a lot of older
9:13
style models, you would just predict the
9:15
exact next word. So after sky,
9:17
Obviously, it could be many things. It
9:19
could be blue, but it could
9:21
also be like scraper, right? Skyscraper, that makes sense.
9:27
But what attention does is
9:29
it is taking all of
9:31
these various values, the query,
9:33
the key, the value, which
9:35
represents what you're looking for,
9:37
where you're looking, and what
9:39
that value is, across the attention. You're calculating mathematically what the relationship is between all of these tokens.
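[Editor's note: here is a minimal numpy sketch of the query/key/value computation being described (scaled dot-product attention for a single head; causal masking and multiple heads are omitted). The matrices are random stand-ins; in a real transformer they come from learned projections of the token embeddings.]

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 8  # e.g. the four tokens of "the sky is blue"; d is the head dimension

# Stand-in Q, K, V; in practice these are learned linear projections of the embeddings.
Q = rng.normal(size=(n_tokens, d))  # what each token is looking for
K = rng.normal(size=(n_tokens, d))  # what each token offers to be found by
V = rng.normal(size=(n_tokens, d))  # the information each token carries

scores = Q @ K.T / np.sqrt(d)       # (n_tokens x n_tokens): every token scored against every other
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                # each token's output is a weighted mix of all the values

print(weights.round(2))  # the n x n attention pattern -- this is why cost grows with context length
print(output.shape)      # (4, 8)
```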
9:49
And so going back to the king-queen
9:51
representation, right? The way these two words
9:53
interact is now calculated, right? And the
9:55
way that every word in the entire
9:57
passage you sent is calculated is tied
9:59
together, which is why models have like
10:01
challenges with like how long can you, how
10:04
many documents can you send them, right? Because
10:06
if you're sending them... you know, just the question, like, what color is the sky, okay, it only has to calculate the attention between, you know, those words, right? But if you're sending it, like, 30 books with, like, insurance claims and all these other things, and you're like, okay, figure out what's going on here, is this a claim or not, right? And in the insurance context, all of a sudden it's like, okay, I've got to calculate the attention of not just, like, the last five words to each other; we have to calculate every, you know, 50,000 words to each other, right? Which then ends up being a ton of math. Back in the day, actually,
10:35
the best language models were a different architecture
10:37
entirely, right? But then
10:39
at some point, you know, transformers, large language
10:41
models, sort of large language models, which
10:43
are basically based on transformers primarily, rocketed
10:45
past in capabilities because they were able to scale and because the hardware
10:49
got there. And then we were able
10:52
to scale them so much that we
10:54
were able to not just put, like, some text in them, and not just a
10:58
lot of text or a lot of
11:00
books, but the entire internet, which one
11:02
could view the internet oftentimes as a
11:04
microcosm of all human culture and learnings
11:06
and knowledge to many extents, because most
11:08
books are on the internet, most papers
11:10
are on the internet. Obviously, there's a
11:13
lot of things missing on the internet,
11:15
but this is the modern magic of
11:17
three different things coming all together at
11:19
once. An efficient way for models to
11:21
relate every word to each other, the compute necessary to scale the data large enough, and then someone actually, like, pulling the trigger to do that, right, at the scale that was, you know, got to the point where it was useful, right, which was sort of like GPT-3.5 level or 4 level, right, where it became extremely useful for normal humans to use, you know, chat models.
11:42
Okay, and so why is it called pre-training? So pre-training is sort of called that because it
11:50
is what happens before the actual
11:52
training of the model. The objective
11:55
function in pre-training is to
11:57
just predict the next token, but
11:59
predicting the next token is not
12:01
what humans want to use AIs
12:03
for. I want it to
12:05
ask a question and answer it.
12:07
But in most cases, asking a
12:09
question does not necessarily mean that
12:11
the next most likely token is
12:13
the answer. Oftentimes, it is another
12:15
question. For example, if I
12:17
ingested the entire SAT, and
12:20
I asked a question, all
12:23
the next tokens would be like, A is this, B
12:25
is this, C is this, D is this. No, I just
12:27
want the answer. And
12:29
so pre-training is, the reason it's called pre-training is because you're ingesting
12:33
humongous volumes of text no matter the
12:35
use case. And
12:38
you're learning the general patterns
12:40
across all of language. I
12:42
don't actually know that King and Queen relate to each
12:44
other in this way, and I don't know that King and
12:46
Queen are opposites in these ways, right? And
12:48
so this is why it's called
12:50
pre-training is because you must get
12:53
a broad general understanding of the entire
12:55
sort of world of text before
12:57
you're able to then do post-training or fine-tuning, which is, let me
13:01
train it on more specific data that
13:03
is specifically useful for what I
13:05
want it to do, whether it's, hey,
13:08
in chat-style applications, you know, go in and, you know, when I ask a question, give me the answer. Or in other applications, like, teach me how to build a bomb. Well, obviously, no, I'm not going to help you build a bomb, because I don't want the model to teach me how to build a bomb. So, you know, it's sort of gotta do this. And it's not like, you know, when you're doing this pre-training, you're filtering out all this data, because in fact there's a lot of good, useful data on how to build bombs, because there's a lot of useful information on, like, hey, C4 chemistry, and, like, you know, people want to use it for chemistry, right? So you don't want to just filter out everything so that the model doesn't know anything about it. Um, but at the same time, you don't want it to output, you know, how to build a bomb. Um, so there's like a fine balance here. And that's why pre-training is defined as pre, because you're still letting it do things and teaching it things and inputting things into the model that are theoretically, like, quite bad. For
14:01
example, books about killing or war tactics
14:03
or what have you. Things that plausibly
14:05
you could see like, oh, well, maybe
14:07
that's not okay. Or
14:10
wild descriptions of really grotesque things all
14:12
over the internet, but you want the model
14:14
to learn these things. Because first you
14:16
build the general understanding before you say, okay,
14:18
now that you've got a general framework
14:20
of the world, let's align you so that you, with this general understanding of the world,
14:24
can figure out what is useful for people,
14:26
what is not useful for people, what
14:28
should I respond on, what should I not
14:30
respond on. So what happens
14:32
then in the training process? So
14:35
is the training process that the
14:37
model is then attempting to make
14:39
the next prediction and then just
14:41
trying to minimize loss as it
14:43
goes? Right, right. I mean like
14:45
basically you have loss, which is how often you're wrong versus right, in the most simple terms. You'll run passages through the model, and
14:56
you'll see how often did the model
14:58
get it right. When it got it
15:00
right, great, reinforce that. When it got
15:02
it wrong, let's figure out which neurons
15:04
in the model, quote unquote, neurons, in
15:06
the model you can tweak to then
15:08
fix the answer so that when you
15:10
go through it again, it actually outputs
15:12
the correct answer. And then you move
15:14
the model slightly in that direction. Now, obviously, the challenge with this is, at first, you know, I can come up with a simplistic way
15:23
where all the neurons will just output
15:25
the sky's blue every single time
15:27
it says the sky is. But then
15:29
when it goes to, you know, hey,
15:32
the... color blue is commonly used on
15:34
walls because it's soothing, right? And it's
15:36
like, oh, what's the next word? It's soothing, right? Soothing, you know, and so
15:40
like that, that is a completely different
15:42
representation. And to understand that blue is
15:44
soothing and that the sky is blue
15:46
and those things aren't actually related, but
15:48
they are related to blue is like
15:50
very important. And so, you know, oftentimes
15:52
you'll run through the training data set
15:54
multiple times, right? Because the first time
15:56
you see it, oh, great, maybe you
15:58
memorized that the sky is blue, and
16:01
you memorize the wall is blue
16:03
and when people describe art and
16:05
oftentimes use the color blue, it
16:07
can be representations of art or
16:09
the wall. And so over time,
16:12
as you go through all this
16:14
text in pre-training, yes, you're
16:16
minimizing loss initially by just memorizing,
16:18
but over time, because you're constantly
16:20
overwriting the model, it starts to
16:22
learn the generalization, i.e., blue
16:24
is a soothing color, also represents the
16:27
sky, also used in art for either of
16:29
those two motifs. Right? And so that's
16:31
sort of the goal of pre-training is
16:33
you don't want to memorize, right? Because that's,
16:35
you know, in school you memorize all
16:37
the time. And that's not useful because you
16:39
forget everything you memorize But if you
16:42
get tested on it then and then you
16:44
get tested on it six months later
16:46
And then again six months later after that
16:48
or however you do it ends up
16:50
being oh, you don't actually like memorize that
16:52
anymore You just know it innately and
16:54
you've generalized on it and that's the real
16:56
goal that you want out of the
16:59
model But that's not necessarily something you can
17:01
just measure right and therefore loss is
17:03
something you can measure ie for this group
17:05
of this group of text, right? Because
17:07
you train the model in steps. Every
17:10
step you're inputting a bunch of text, you're trying
17:12
to see what's predict the right token, where you
17:14
didn't predict the right token, let's adjust the neurons.
17:17
Okay, onto the next batch of text.
17:19
And you'll do this, these batches
17:21
over and over and over again, across
17:23
trillions of words of text, right? And
17:26
as you step through, and then you're like,
17:28
oh, well, I'm done. But I bet if
17:30
I go back to the first group of
17:32
texts, which is all about the sky being
17:35
blue, it's going to get the answer wrong
17:37
because maybe later on in the training it
17:39
discovered, it saw some passages about sci-fi and how the Martian sky is red. So, like,
17:43
it'll overwrite, but then over time as you
17:45
go through the data multiple times, as you
17:47
see it on the internet multiple times, you
17:49
see it in different books multiple times, whether
17:51
it be scientific, sci-fi, whatever it is, you
17:53
start to realize and it starts to learn
17:56
that that representation of like, oh, when it's
17:58
on Mars, it's red because the sky and
18:00
Mars is red because the atmospheric makeup is
18:02
this way. Whereas the atmospheric makeup on Earth
18:04
is a different way. And so that's sort
18:06
of, like, the whole point of pre-training is to minimize loss, but the nice side effect is that the model initially memorizes, but then it stops memorizing and it generalizes. And that's the useful pattern that we want.
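[Editor's note: here is a heavily simplified sketch of the loop being described, score the true next token with a cross-entropy loss and nudge the parameters to reduce it, step after step. The tiny vocabulary, the single weight matrix, and the hand-rolled gradient update are stand-ins for what real training frameworks do at vastly larger scale.]

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "sky", "is", "blue", "red"]
V, d = len(vocab), 8

embed = rng.normal(0, 1.0, (V, d))   # toy token embeddings (kept frozen here for simplicity)
W = rng.normal(0, 0.1, (d, V))       # the only "neurons" we train: context -> next-token scores
lr = 0.1

def step(context_id: int, target_id: int) -> float:
    """One training step: predict the next token, measure loss, nudge W downhill."""
    global W
    h = embed[context_id]                 # pretend the context is just the last token
    logits = h @ W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    loss = -np.log(probs[target_id])      # cross-entropy: small when the right token is likely
    grad = np.outer(h, probs - np.eye(V)[target_id])  # gradient of the loss w.r.t. W
    W -= lr * grad                        # move the weights slightly in the direction that fixes the answer
    return float(loss)

# "the sky is" -> "blue", seen over and over: the loss falls as the prediction improves.
for _ in range(50):
    loss = step(vocab.index("is"), vocab.index("blue"))
print(round(loss, 3))  # far lower than the initial ~1.6 (roughly uniform guessing over 5 tokens)
```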
18:16
Okay, that's fascinating. We've touched on post-training for a bit, but just to recap: post-training is, so you have a model
18:23
that's good at predicting the next
18:25
word. And in post-training, you sort
18:27
of give it a personality by inputting
18:29
sample conversations to make the model
18:31
want to emulate the certain values that
18:33
you want it to take on. Yeah,
18:36
so post-training can be a number of different things. The most simple way of doing it is, hey, pay for humans
18:43
to label a bunch of data, take
18:45
a bunch of example conversations, et
18:48
cetera, and input that data and train
18:50
on that at the end, right? And
18:53
so that example data is... useful,
18:55
but this is not scalable, right? Like
18:57
using humans to train models is
18:59
just so expensive, right? So then there's
19:01
the magic of sort of reinforcement
19:03
learning and other synthetic data technologies, right?
19:06
Where the model is helping teach
19:08
the model, right? So you have many
19:10
models in a sort of in
19:12
a post training where, yes, you have
19:14
some example human data, but human
19:16
data does not scale that fast, right?
19:18
Because the internet is trillions and
19:20
trillions of words out there. Whereas even
19:22
if you had Alex and I
19:24
write words all day long for our
19:27
whole lives, we would have millions
19:29
or hundreds of millions of words written.
19:31
It's nothing. It's like orders of
19:33
magnitude off in terms of the number
19:35
of words required. So
19:37
then you have the model take
19:39
some of this example data. and
19:42
you have various models that are surrounding
19:44
the main model that you're training, right? And
19:46
these can be policy models, right? Teaching
19:48
it, hey, is this what you want or
19:50
that what you want? Reward models, right?
19:52
Like, is that good response or is that
19:54
a bad response? You have value models
19:56
like, hey, grade this output, right? And you
19:59
have all these different models working in
20:01
conjunction to say, Different
20:03
companies have different objective functions, right?
20:06
In the case of Anthropic, they
20:08
want their model to be helpful,
20:10
harmless, and safe, right? So
20:12
be helpful. but also don't harm
20:14
people or anyone or anything, and
20:16
then, you know, you know, safe,
20:18
right? In other cases, like Grok, right, Elon's model from xAI, it actually
20:23
just wants to be helpful, and maybe it has
20:25
like a little bit of a right leaning to
20:27
it, right? And for other folks, right, like, you
20:29
know, I mean, most AI models are made in
20:31
the Bay Area, so they tend to just be
20:33
left leaning, right? But also the internet in general
20:36
is a little bit left leaning, because it skews
20:38
younger than older. And so, like, all these
20:40
things, like, sort of affect models. But
20:42
it's not just around politics, right? Post-training
20:44
is also just about teaching the model. If
20:47
I say the movie where the
20:49
princess has a slipper and it
20:51
doesn't fit, it's like, well, if
20:53
I said that into a base
20:55
model that was just pre-trained, the answer wouldn't be, oh, the movie you're looking for is Cinderella. It would only realize that once it goes through post-training, right? Because a lot of times
21:06
people just throw garbage into the model, and
21:08
then the model still figures out what you
21:10
want. And this is part of what post-training
21:12
is. You can just do stream of consciousness
21:14
into models, and oftentimes it'll figure out what
21:16
you want. If it's a movie that you're
21:18
looking for, or if it's help answering a
21:20
question, or if you throw a bunch of
21:23
unstructured data into it and then ask it
21:25
to make it into a table, it does
21:27
this. And that's because of all these different
21:29
aspects of post-training. Example data, but also
21:31
generating a bunch of data and grading it
21:33
and seeing if it's good or not. and
21:36
whether it matches the various policies you want.
21:38
A lot of times grading can be based on
21:40
multiple factors. There can be a model that
21:42
says, hey, is this helpful? Hey, is this safe?
21:44
And what is safe? So then that model
21:46
for safety needs to be tuned on human data.
21:49
So it is a quite complex thing, but the
21:51
end goal is to be able to get
21:53
the model to output in a certain way. Models
21:55
aren't always about just humans using them either.
21:57
There can be models that are just focused on,
22:00
hey, if it doesn't output
22:02
code, yes, it was trained on the whole internet
22:04
because the person's going to talk to the
22:06
model using text, but if it doesn't output code,
22:08
penalize it. Now, all of a sudden, the
22:10
model will never output text ever again. It'll only
22:12
output code. And
22:14
so these sorts of models exist too.
22:16
So post-training is not just a uni-variable
22:18
thing. It's what variables do you want
22:20
to target? And so that's why
22:22
models have different personalities from different companies.
22:24
It's why they target different use cases and
22:26
why it's not just one model that
22:29
rules them all, but actually many. That's
22:31
fascinating. So that's why we've seen so
22:33
many different models with different personalities is
22:35
because it all happens in the post-training moment. And this is
22:40
when you talk about giving the
22:42
models examples to follow. That's what
22:44
reinforcement learning with human feedback is,
22:46
is the humans give some examples
22:48
and then the model learns to
22:50
emulate what the human is interested
22:52
in, what the human trainer is
22:55
interested in having them embody. Is that right? Yeah, exactly.
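[Editor's note: as a sketch of how human preferences become a training signal, here is one common formulation, a pairwise "reward model" loss of the kind used in RLHF pipelines. The prompt, the scores, and the numbers are invented for illustration; real systems learn the reward model from many thousands of human comparisons and then use it to steer the main model.]

```python
import math

# A human labeler compared two candidate answers to the same prompt.
chosen_score = 1.8    # reward model's current score for the answer the human preferred
rejected_score = 2.3  # its score for the answer the human rejected

def preference_loss(chosen: float, rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: low when the chosen answer outscores the rejected one."""
    return -math.log(1 / (1 + math.exp(-(chosen - rejected))))

print(round(preference_loss(chosen_score, rejected_score), 3))  # ~0.974: the model disagrees with the human
print(round(preference_loss(2.5, 0.5), 3))                      # ~0.127: the model agrees, little to learn

# Training pushes the reward model toward the second situation; the main model is then
# tuned (for example with reinforcement learning) to produce answers the reward model scores highly.
```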
23:00
Okay, great. All right, so
23:02
first half we've covered what training is,
23:04
what tokens are, what loss is,
23:06
what post-training is. Post-training, by the way, is also called fine-tuning. We've also covered reinforcement learning
23:12
with human feedback. We're gonna take a
23:14
quick break and then we're gonna talk
23:17
about reasoning. We'll be back right
23:19
after this. Small
23:21
and medium businesses don't have
23:23
time to waste and
23:25
neither do marketers trying to
23:27
reach them. On LinkedIn,
23:29
more SMB decision makers are
23:31
actively looking for new
23:33
solutions to help them grow,
23:36
whether it's software or
23:38
financial services. Our Meet the
23:40
SMB report breaks down
23:42
how these businesses buy and
23:44
what really influences their
23:46
choices. Learn more at linkedin.com
23:48
backslash meet-the-smb. That's linkedin.com backslash meet-the-smb. And
23:54
we're back here on
23:56
Big Technology Podcast with Dylan
23:58
Patel. He's the founder
24:00
and chief analyst at SemiAnalysis. He actually has great
24:04
analysis on Nvidia's recent
24:06
GTC conference, which we covered on a recent episode. You can find SemiAnalysis at SemiAnalysis.com. It
24:14
is both content and
24:16
consulting. So you should definitely check
24:18
in with Dylan for all of those needs.
24:20
And now we're going to talk a
24:23
little bit about reasoning. Because
24:25
a couple of months ago, and Dylan,
24:27
this is really where I entered the
24:29
picture of watching your conversation with Lex, with
24:32
Nathan Lambert, about what
24:34
the difference is between reasoning
24:36
and your traditional LLMs,
24:39
large language models. If
24:41
I gathered it right from your
24:43
conversation, what reasoning is, is basically
24:46
instead of the model going, basically
24:48
predicting the next word based off
24:50
of its training. It
24:52
uses the tokens to spend more
24:54
time basically figuring out what the
24:56
right answer is and then coming
24:58
out with a new prediction. I
25:00
think Karpathy does a very interesting
25:02
job in the YouTube video talking
25:04
about how models think with tokens. The
25:07
more tokens there are, the more compute
25:09
they use because they're running these predictions through
25:11
the transformer model, which we discussed, and
25:13
therefore they can come to better answers. Is
25:15
that the right way to think about
25:17
reasoning? Humans
25:21
are also fantastic at pattern
25:23
matching, right? We're really good
25:25
at like recognizing things, but a lot
25:27
of tasks, it's not like an immediate
25:29
response, right? We are thinking, whether that's thinking
25:31
through words out loud, thinking through words
25:33
in an inner monologue in our head, or
25:35
it's just like processing somehow and then
25:38
we know the answer, right? And
25:40
this is the same for models, right?
25:42
Models are horrendous at math, historically. You could
25:46
ask it, is 9.11 bigger than 9.9? And it would say, yes, it's bigger, even though everyone knows that 9.11 is way smaller than 9.9. And
25:59
that's just a thing that happened
26:01
in models because they didn't think
26:03
or reason. And it's the same
26:05
for you, Alex, or myself. If
26:08
someone asked me 17 times 34, I'd be like, I don't know, like, right off the top of my head, but, you know, give me a little bit of time, I can do some long-form multiplication and I can get the answer right. And that's because I'm thinking about it. And this is the same thing with reasoning for models: you know, when you look at a transformer, every word, every token output, has the same amount of compute behind it, right? I.e., you know, when I'm saying 'the sky is blue,' the 'the' and the 'blue,' or the 'is' and the 'blue,' have the same amount of compute to generate, right? And
26:44
this is not exactly what you want
26:46
to do, right? You want to actually
26:48
spend more time on the hard things
26:50
and not on the easy things. And
26:52
so reasoning models are effectively teaching, you
26:54
know, large pre -trained models to do this,
26:56
right? Hey, think through the problem. Hey,
26:58
output a lot of tokens. Think
27:00
about it, generate all this text. And then
27:02
when you're done, you know, start answering the
27:04
question, but now you have all of this
27:06
stuff you generated in your context, right?
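[Editor's note: a common rule of thumb, an approximation not stated in the conversation itself, is that a transformer spends roughly 2 x (number of parameters) floating-point operations per generated token, so "thinking" in extra tokens buys extra compute in direct proportion. A quick back-of-the-envelope sketch with a made-up model size:]

```python
# Rough rule of thumb: ~2 FLOPs per parameter per generated token (forward pass only,
# ignoring attention's extra cost over very long contexts).
PARAMS = 70e9  # hypothetical 70B-parameter model

def generation_flops(n_tokens: int, params: float = PARAMS) -> float:
    return 2 * params * n_tokens

short_answer = generation_flops(20)       # blurting out an answer in ~20 tokens
with_reasoning = generation_flops(2_000)  # thinking out loud for ~2,000 tokens first

print(f"{short_answer:.2e} FLOPs vs {with_reasoning:.2e} FLOPs "
      f"({with_reasoning / short_answer:.0f}x more compute spent on the hard question)")
```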
27:10
And that stuff you generated is
27:12
is helpful, right? It could
27:14
be like, you know, all sorts
27:16
of, you know, just like any
27:18
human thought patterns are, right? And
27:20
so this, this is the sort
27:22
of like new paradigm that we've
27:24
entered maybe six months ago, where
27:26
models now will think for some
27:28
time before they answer, and this
27:30
enables much better performance on all
27:32
sorts of tasks, whether it be
27:34
coding or math or understanding science
27:36
or understanding complex social dilemmas, all
27:38
sorts of different topics they're much,
27:40
much better at. And this is
27:42
done through post-training, similar to the reinforcement learning with human feedback that we mentioned earlier, but also there's other forms of post-training, and
27:51
that's what makes these reasoning models.
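[Editor's note: to echo the 17 times 34 example, here is a toy illustration of "showing the work" in intermediate steps before committing to an answer. The decomposition is ordinary arithmetic, written out the way a reasoning trace might spell it out; it is not how any particular model actually reasons.]

```python
def multiply_with_work(a: int, b: int) -> int:
    """Spell out intermediate steps, the way a reasoning trace spends extra tokens."""
    tens, ones = divmod(b, 10)
    part1 = a * tens * 10          # 17 * 30 = 510
    part2 = a * ones               # 17 * 4  = 68
    print(f"{a} x {b} = {a} x {tens * 10} + {a} x {ones} = {part1} + {part2}")
    return part1 + part2

print(multiply_with_work(17, 34))  # prints the steps, then 578
```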
27:53
Before we head out, I want
27:55
to hit on a couple of
27:57
things. First of all, the growing
27:59
efficiency of these models. So I think one of the things that people focused on with DeepSeek was that
28:05
it was just able to be much
28:07
more efficient in the way that it
28:09
generates answers. And there
28:11
was obviously this big reaction to NVIDIA stock, where it fell 18% the day, or the Monday, after DeepSeek weekend, because people thought we
28:19
wouldn't need as much compute. So can
28:21
you talk a little bit about how
28:23
models are becoming more efficient and how
28:26
they're doing it? Yeah, so there's a
28:28
variety of... The beauty of AI is not just that we continue to build new capabilities. Because
28:34
the new capabilities are going to be able to benefit
28:36
the world in many ways. And there's
28:38
a lot of focus on those. But
28:40
there's also a lot of focus on, well,
28:43
to get to that next level of
28:45
capabilities is the scaling laws, i.e., the
28:47
more compute and data I spend, the better
28:49
the model gets. But then the other
28:51
vector is, well, can I get to the
28:53
same level with less compute and data? And
28:56
those two things are hand in hand, because if I
28:58
can get to the same level with less computing data,
29:00
then I can spend that more computing data and get
29:03
to a new level. And so
29:05
AI researchers are constantly looking for
29:07
ways to make models more efficient,
29:09
whether it be through algorithmic tweaks,
29:11
data tweaks, tweaks in how you do
29:13
reinforcement learning, so on and so
29:16
forth. And so when
29:18
we look at models across history,
29:20
they've constantly gotten cheaper and cheaper
29:22
and cheaper at a stupendous rate.
29:24
Right? And so one easy example
29:26
is GPT-3, right? Because there's GPT-3, 3.5 Turbo, Llama 2 7B, Llama 3, Llama 3.1, Llama 3.2, right? As these
29:36
models have gotten bigger, we've gone from,
29:38
hey, it costs $60 for a
29:40
million tokens, to it costs less than, it costs like five cents now for the same quality of model. Now, the model has shrunk dramatically
29:48
in size as well. And that's because
29:51
of better algorithms, better data, et
29:53
cetera. And now what happened with DeepSeek was similar. You know, OpenAI had GPT-4, then they had 4 Turbo, which was half the cost, then they had 4o, which was again half the cost, and then Meta released Llama 405B open source, and so the open source community was able to run that, and that was again, like, roughly half the cost, or 5x lower cost, than 4o, which was lower than 4 Turbo and 4. But DeepSeek came out with another tier, right? So
30:22
when we looked at GPT-3, the cost fell 1200x from GPT-3's initial cost to what you can get with Llama 3.2 3B today.
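[Editor's note: a quick sanity check on the figures quoted in this conversation: $60 per million tokens divided by the roughly 1200x decline lands at about five cents per million tokens, which matches the "five cents" figure mentioned above.]

```python
gpt3_cost_per_million = 60.00  # dollars per million tokens, as quoted above
decline_factor = 1200          # the ~1200x cost decline quoted above

today = gpt3_cost_per_million / decline_factor
print(f"${today:.2f} per million tokens")  # $0.05, i.e. about five cents
```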
30:31
And likewise, when we look at
30:33
from GPT-4 to DeepSeek V3, it's fallen roughly 600x in cost. So we're not quite at that 1200x, but it has fallen 600x in cost, from $60 to about $1, or to less than $1, sorry. And so you've got this massive cost decrease. But it's not
30:50
necessarily out of bounds, right? We've
30:52
already seen, I think what was
30:54
really surprising was that it was
30:56
a Chinese company for the first
30:58
time, right? Because Google and OpenAI
31:00
and Anthropic and Meta have all
31:02
traded blows, right? Whether it
31:04
be OpenAI always being on the leading
31:06
edge or Anthropic always being on the
31:09
leading edge or Google and Meta being
31:11
close followers, but oftentimes sometimes with a
31:13
new feature and sometimes just being much
31:15
cheaper. We have not seen
31:17
this from any Chinese company, right? And
31:19
now we have a Chinese company releasing
31:21
a model that's cheap. It's
31:23
not unexpected, right? Like this is actually
31:26
within the trend line of what happened with
31:28
GPT-3 is happening to GPT-4 level quality with DeepSeek. It's more
31:32
so surprising that it's a Chinese company. And that's,
31:34
I think, why everyone freaked out. And then there
31:36
was a lot of things that, like, you know,
31:38
from there became a thing, right? Like, if Meta
31:40
had done this, I don't think people would have
31:42
freaked out, right? And Meta's gonna
31:44
release their new Llama soon enough, right? And that one is gonna be, you know, a similar level of cost decrease, probably in similar areas as DeepSeek V3, right? It's just that people aren't gonna freak out, because it's an American
31:57
company and it was sort of expected. All
32:00
right, Dylan, let me ask you the last
32:02
question, which is the, you mentioned, I think
32:04
you mentioned the bitter lesson, which is basically
32:06
that they're, I mean, I'm gonna just be...
32:08
in summing it up. But the answer to
32:10
all questions in machine learning is just to
32:12
make bigger models. And
32:14
scale solves almost all problems. So
32:17
it's interesting that we have this moment where models
32:19
are becoming way more efficient. But
32:21
we also have massive, massive data
32:23
center buildouts. I
32:25
think it would be great to hear you kind
32:27
of recap the size of these data center buildouts and
32:29
then answer this question. If we
32:31
are getting more efficient, Why are these
32:33
data centers getting so much bigger? And
32:36
what might that added scale get in
32:38
the world of generative AI for the
32:40
companies building them? Yeah,
32:42
so when we look across the ecosystem at
32:44
data center buildouts, We track
32:46
all the build outs and server
32:48
purchases and supply chains here. And
32:51
the pace of construction is incredible. You
32:54
can pick a state and you can
32:56
see new data centers going up all
32:58
across the US and around the world.
33:00
And so you see things like
33:02
capacity in, for example, of the
33:04
largest scale training supercomputers goes from,
33:07
hey, it's not even a few
33:09
hundred million dollars a year ago,
33:11
but like, hey, for GPT-4, it was a few hundred million, and it's one building full of GPUs, to GPT-4.5 and the reasoning models like o1 and o3, which were
33:26
done in three buildings on
33:28
the same site and
33:30
billions of dollars to, hey,
33:32
these next generation things
33:34
that people are making are
33:36
tens of billions of
33:38
dollars like OpenAI's data center
33:40
in Texas called Stargate,
33:42
right? With Crusoe and
33:44
Oracle, and et cetera. And
33:46
likewise applies to Elon Musk who's building
33:48
these data centers in an old factory
33:50
where he's got a bunch of gas
33:52
generation outside and he's doing all these
33:54
crazy things to get the data center
33:56
up as fast as possible. And
33:58
you can go to just basically every
34:00
company and they have these humongous buildouts. And
34:04
this sort of, like... And because of the scaling laws, it's 10x more compute for linear improvement gains. It's log-log, sorry. But you end
34:14
up with this very confusing thing,
34:16
which is, hey, models keep getting
34:19
better as we spend more. But
34:21
also, the model that we had
34:23
a year ago is now done
34:25
for way, way cheaper, oftentimes 10x
34:27
cheaper or more, just a year
34:29
later. So then the question is
34:31
like, why are we spending all
34:33
this money to scale? And
34:36
there's a few things here, right? A, you
34:39
can't actually make that cheaper model without making
34:41
the bigger model so you can generate data
34:43
to help you make the cheaper model, right?
34:45
Like that's part of it. But
34:47
also another part of it
34:50
is that, you know, if we
34:52
were to freeze AI capabilities
34:54
where we were basically in, what
34:56
was it? March 2023, right?
34:59
Two years ago when GPT -4 released.
35:01
Um, and only made them cheaper, right?
35:03
Like deep seek is like much cheaper.
35:05
It's much more efficient. Um, but it's
35:07
roughly the same capabilities as you PD
35:09
for, um, that would not. Pay
35:12
for all of these K buildouts, right?
35:14
AI is useful today, but it's not capable
35:16
of doing a lot of things, right? But
35:18
if we make the model way more efficient
35:20
and then continue to scale and we
35:22
have this like stair step, right? Where we
35:24
like, increase capabilities massively, make them way more efficient, increase capabilities massively, make them way more efficient. We do the stair step, then you end up with creating all these new capabilities that could in fact pay for, you know, these massive AI buildouts. So no one
35:37
is trying to make, with these, you know, with these ten-billion-dollar data centers... they're not trying to make chat models, right? They're not trying to make models that people chat with, just to be clear, right? They're trying to solve things like software engineering and make it automated, which
35:52
is like a trillion dollar plus industry,
35:54
right? So these are very different
35:56
like sort of use cases and targets.
35:58
And so it's the bitter lesson because
36:00
yes, you can make, you can spend
36:02
a lot of time and effort making
36:04
clever specialized methods, you know, based on
36:06
intuition. And you should,
36:09
right? But these things should also just
36:11
have a lot more compute thrown behind them
36:13
because if you make it more efficient as you
36:15
follow the scaling laws up. it'll also just
36:17
get better and you can then unlock new capabilities,
36:19
right? And so today, you know, a lot
36:21
of AI models, the best ones from Anthropic are
36:23
now useful for like coding. As
36:25
an assistant with you, right, you're going back and
36:27
forth, you know, as time goes forward, as
36:29
you make them more efficient and continue to scale
36:31
them, the possibility is that, hey, it can
36:33
code for like 10 minutes at a time and
36:35
I can just review the work and it'll
36:37
make me 5x more efficient, right? You
36:40
know, and so on and so forth.
36:42
And this is sort of like where reasoning
36:44
models and sort of the scaling sort
36:46
of argument comes in is like, yes. We
36:49
can make it more efficient, but we also just,
36:51
you know, that's not going to solve the problems that
36:53
we have today, right? The earth is still going
36:55
to run out of resources. We're going
36:57
to run out of nickel because we make batteries, and we can't make enough batteries, so then with current technology we can't replace all of, you know, gas and coal with renewables, right? All of
37:07
these things are going to happen unless like
37:09
you continue to improve AI and invent and
37:12
we're just generally researching new things and AI
37:14
helps us research new things.
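[Editor's note: as a rough illustration of the log-log scaling relationship mentioned above, here is a toy power-law loss curve. The constants are invented; published scaling-law fits differ by model family and dataset.]

```python
# Toy power-law loss curve: loss = A * compute^(-ALPHA). Constants are made up for illustration.
A, ALPHA = 10.0, 0.05

def loss(compute_flops: float) -> float:
    return A * compute_flops ** -ALPHA

for exponent in range(21, 27):  # 1e21 ... 1e26 FLOPs of training compute
    c = 10.0 ** exponent
    print(f"1e{exponent} FLOPs -> loss {loss(c):.3f}")

# Each 10x of compute multiplies the loss by the same factor (10**-0.05, about 0.89),
# i.e. a straight line on a log-log plot: steady gains, but each step costs 10x more.
```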
37:16
Okay, this is really the last one. Where is GPT-5? So
37:21
OpenAI released GPT-4.5 recently with what
37:25
they called training
37:28
run Orion. There
37:30
were hopes that Orion could be
37:32
used for GPT-5, but its improvement was not enough to be really a GPT-5. Furthermore,
37:38
it was trained on the classical method, which
37:40
is, like, a ton of pre-training and then some reinforcement
37:45
learning with human feedback and some other
37:47
reinforcement learning like PPO and DPO and
37:49
stuff like that. But
37:51
then along the way, this model was
37:53
trained last year, along the way,
37:55
another team at OpenAI made the big
37:57
breakthrough of reasoning, strawberry training. And
37:59
they released o1 and then they released o3. And these models are rapidly
38:03
getting better with reinforcement learning with verifiable
38:05
rewards. And so now
38:07
GPT-5, as Sam calls it, is gonna be a model that has huge pre-training scale, like GPT-4.5, but also huge post-training scale, like o1 and o3, and continuing to scale
38:18
that up. This would be the first
38:20
time we see a model that
38:22
was a step up in both at
38:24
the same time. And so that's
38:27
what OpenAI says is coming. They
38:29
say it's coming this year, hopefully
38:31
in the next three to six months,
38:33
maybe sooner. I've heard sooner, but
38:35
we'll see. Um, but this,
38:37
this path of scaling both pre-training and post-training with reinforcement
38:41
learning with verifiable rewards massively should
38:44
yield much better models that are
38:46
capable of much more things. And we'll see what those things are.
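[Editor's note: for the "reinforcement learning with verifiable rewards" idea, here is a minimal sketch of what "verifiable" means in practice: the reward comes from a programmatic check (a math answer, a passing test) rather than from a human judgment. The sample problem and the string parsing are invented for illustration and are not any lab's actual pipeline.]

```python
import re

def verifiable_reward(model_output: str, ground_truth: int) -> float:
    """Reward 1.0 if the final number in the model's answer matches the known answer, else 0.0."""
    numbers = re.findall(r"-?\d+", model_output)
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == ground_truth else 0.0

# Two hypothetical model outputs for "What is 17 x 34?"
long_reasoning = "17 x 34 = 17 x 30 + 17 x 4 = 510 + 68 = 578"
wrong_answer = "The answer is 568"

print(verifiable_reward(long_reasoning, 578))  # 1.0 -- reinforce the behavior that produced this
print(verifiable_reward(wrong_answer, 578))    # 0.0 -- no reward, however confident it sounded
```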
38:51
Very cool. All right, Dylan, do you want to give
38:53
a quick shout out to those who are interested in
38:55
potentially working with semi analysis, who you work with
38:57
and where that, where they can learn more. Sure.
39:00
So we, you know, at SemiAnalysis.com, we have, you know, we have the public
39:04
stuff, which is like all these reports that
39:06
are, uh, pseudo free, but then we, most
39:08
of our work is done on, uh, directly
39:10
for clients. There's these datasets that we sell
39:12
around every data center in the world, servers, all the compute, where it's manufactured, how many, where,
39:16
what's the cost and who's doing it. Um,
39:18
and then we also do a lot of
39:20
consulting. We've got people who have worked all
39:22
the way from ASML, which makes lithography tools
39:25
all the way up to, you know, Microsoft
39:27
and Nvidia, um, which, you know,
39:29
making models and doing infrastructure. And
39:31
so we've got this whole gamut of folks. There's
39:34
roughly 30 of us across the
39:36
world, in the US, Taiwan, Singapore, Japan, France, Germany, Canada. So,
39:41
you know, there's a lot of engagement points. But if you want to reach out, just go to the website, you know, go to one of those specialized pages of models or sales and reach out, and that'd be the best way to sort of interact and engage with us. But for most people, just read the blog, right? Like, I think, like, unless you have specialized, like, needs, unless you're a company in the space or an investor in the space, like, you know, you just want to be informed, just the blog, and it's free, right? I think that's the best option for most people. Yeah,
40:09
I will attest the blog is magnificent. And Dylan, really a thrill to get a chance to meet you and talk through these topics with you. Thanks so much for coming on the show. Thank you so much, Alex. All right, everybody, thanks for listening. We'll be back on Friday to break down the week's news. Until then, we'll see you next time on Big Technology Podcast.