Episode Transcript
0:00
If you've ever wondered how generative AI works
0:02
and where the technology is heading, this
0:04
episode is for you. We're going
0:06
to explain the basics of the technology
0:08
and then catch up with modern-day advances
0:10
like reasoning to help you understand exactly
0:12
how it does what it does and
0:15
where it might advance in the future.
0:17
That's coming up with SemiAnalysis founder
0:19
and chief analyst Dylan Patel right after
0:21
this. From LinkedIn News, I'm
0:24
Leah Smart, host of Everyday Better, an award-winning podcast dedicated to personal development.
0:28
Join me every week for captivating stories
0:30
and research to find more fulfillment
0:32
in your work and personal life. Listen
0:34
to Everyday Better on the LinkedIn Podcast
0:36
Network, Apple Podcasts, or wherever you get your
0:38
podcasts. Did you know
0:41
that small and medium businesses
0:43
make up 98% of the
0:45
global economy, but most B2B marketers
0:47
still treat them as a
0:49
one-size-fits-all? LinkedIn's
0:51
Meet the SMB report
0:53
reveals why that's a
0:55
missed opportunity, and how
0:57
you can reach these
1:00
fast-moving decision-makers effectively.
1:02
Learn more at linkedin.com
1:04
backslash meet-the-smb. Welcome
1:07
to Big Technology Podcast, a show
1:09
for cool-headed and nuanced conversation of
1:11
the tech world and beyond. We're
1:14
joined today by SemiAnalysis founder and chief analyst Dylan Patel, a leading expert in semiconductor and generative AI
1:20
research and someone I've been looking forward
1:22
to speaking with for a long time.
1:25
Now, I want this to
1:27
be an episode that A helps people learn
1:29
how generative AI works and B is an
1:31
episode that people will send to their friends
1:33
to explain to them how generative AI works.
1:35
I've had a couple of those that I've
1:37
been sending to my friends and
1:39
colleagues and counterparts about what is
1:41
going on within generative AI. That
1:43
includes one, this three-and-a-half-hour-long video from Andrej Karpathy, explaining everything about training large
1:50
language models. And the second one
1:52
is a great episode that Dylan
1:54
and Nathan Lambert from the Allen
1:56
Institute for AI did with Lex Fridman. Both of those are three hours plus, so I want to do ours in
2:02
an hour. And I'm
2:04
very excited to begin. So Dylan, it's
2:06
great to see you and welcome
2:08
to the show. Thank you for having
2:10
me. Great to have you here.
2:12
Let's just start with tokens. Can you
2:14
explain how AI researchers basically take
2:16
words and then give them numerical representations
2:18
and parts of words and give
2:21
them numerical representations. So what are tokens?
2:23
Tokens are in fact like chunks of
2:25
words, right? In the human way,
2:27
you can think of like syllables, right?
2:31
Syllables are often viewed as like chunks
2:33
of words. They have some
2:35
meaning. It's the base
2:37
level of speaking, right, is syllables,
2:39
right? Now for models, tokens are
2:41
the base level of output. They're
2:43
all about like sort of compressing,
2:45
you know, sort of this is
2:47
the most efficient representation of language.
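[Editor's note: to make that concrete, here is a toy sketch of subword tokenization. The vocabulary, the IDs, and the greedy longest-match rule are all invented for illustration; real models use learned byte-pair or unigram vocabularies with tens of thousands of entries.]

```python
# Toy subword tokenizer: greedy longest-match against a tiny hand-made vocabulary.
# Everything here (the vocabulary, the IDs) is made up for illustration only.
TOY_VOCAB = {
    "un": 0, "believ": 1, "able": 2, "the": 3, "sky": 4,
    " ": 5, "is": 6, "blue": 7,
}

def tokenize(text: str) -> list[int]:
    """Split text into the longest vocabulary chunks, left to right."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, shrinking until one is in the vocab.
        for j in range(len(text), i, -1):
            chunk = text[i:j]
            if chunk in TOY_VOCAB:
                ids.append(TOY_VOCAB[chunk])
                i = j
                break
        else:
            i += 1  # skip characters the toy vocab can't represent
    return ids

print(tokenize("unbelievable"))     # [0, 1, 2]  -> "un" + "believ" + "able"
print(tokenize("the sky is blue"))  # [3, 5, 4, 5, 6, 5, 7]
```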
2:49
From my understanding, AI models are
2:51
very good at predicting patterns. So
2:54
if you give it one, three,
2:56
seven, nine, it might know
2:58
the next number is going to be
3:00
11. And so what it's doing
3:02
with tokens is taking words, breaking
3:04
them down to their component parts,
3:06
assigning them a numerical value, and
3:08
then basically in its own word, in
3:10
its own language, learning to predict what number
3:13
comes next because computers are better at
3:15
numbers than converting that number back to text.
3:17
And that's what we see come out.
3:19
Is that accurate? Yeah. And
3:21
each individual token is actually, it's
3:23
not just like one number, right?
3:25
It's multiple vectors. You could think
3:27
of like, well, the tokenizer needs
3:29
to learn King and Queen are
3:31
actually extremely similar on most vectors, in terms of, like, the English language, extremely similar, except there
3:37
is one vector in which they're super different,
3:39
because a king is a male and a
3:41
queen is a female. And then
3:43
from there, in language, oftentimes kings
3:46
are considered conquerors, and all these
3:48
are the things, and these are
3:50
just historical things. So a lot
3:52
of the text around them, while
3:54
they're both royal, regal, monarchy, et
3:56
cetera, there are many vectors in
3:58
which they differ. So it's not
4:00
just converting a word into one
4:02
number. It's converting it into multiple
4:04
vectors, and each of these vectors,
4:06
the model learns what it means,
4:08
right? You don't initialize
4:10
the model with like, hey, you
4:12
know, king means male, monarch,
4:16
and it's associated with like war and conquering, because
4:18
that's all the writing about kings is on, you
4:20
know, in history and all that, right? Like people
4:22
don't talk about the daily lives of kings that
4:24
much, or they mostly talk about like their wars
4:26
and conquests and stuff. And
4:28
so like, There will be, each
4:30
of these numbers in this embedding space,
4:32
right, will be assigned over time as the
4:34
model reads the internet's text and trains
4:36
on it, it'll start to realize, oh, King
4:38
and Queen are exactly similar on these
4:40
vectors, but very different on these vectors. And
4:42
these vectors aren't, you don't explicitly tell
4:44
the model, hey, this is what this vector
4:46
is for, but it could be like, you
4:49
know, it could be as much as like, one
4:51
vector could be like, is it a building or
4:53
not? Right, and it doesn't actually know that; you don't know that ahead of time, it just happens to in the latent space. And then all these vectors sort of relate to each other. But yeah, these numbers are an efficient representation
5:04
of words, because you
5:06
can do math on them, right? You
5:08
can multiply them, you can divide them,
5:10
you can run them through an entire
5:12
model, whereas, and your brain does something
5:14
similar, right? When it hears something, it
5:16
converts that into a frequency in your
5:18
ears, and then that gets converted to
5:20
frequencies that should go through your brain,
5:22
right? This is the same thing as
5:25
a tokenizer, right? Although it's like, obviously
5:27
a very different medium of compute, right?
5:29
Ones and zeros for computers versus, you
5:31
know, binary and multiplication, et cetera, being
5:33
more efficient, whereas humans' brains are more
5:35
like animals, analog in nature and, you know, think in waves and patterns in different ways. Uh, while they are very
5:41
different, it is a tokenizer, right? Like
5:43
language is not actually how our brain
5:45
thinks. It's just a representation for it to, you know, reason over. Yeah.
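[Editor's note: here is a small numerical sketch of the king/queen idea above. The vectors and dimension labels are invented for illustration; real embeddings have hundreds or thousands of dimensions whose meanings are learned, never hand-assigned.]

```python
import numpy as np

# Hypothetical 4-dimensional embeddings. Dimensions loosely labeled for the example:
# [royalty, conquest-related, gender, is-a-building]
king   = np.array([0.9, 0.7,  0.3, 0.0])
queen  = np.array([0.9, 0.6, -0.3, 0.0])
castle = np.array([0.5, 0.3,  0.0, 1.0])

def cosine(a, b):
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(king, queen))   # ~0.86: very similar overall despite the opposite "gender" dimension
print(cosine(king, castle))  # ~0.48: somewhat related (royalty) but differs on the "building" dimension
print(king - queen)          # the difference is concentrated in one dimension
```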
5:50
So that's crazy. So the
5:53
tokens are the efficient representation of words, but more than that, the models are also learning the way that all these words are connected, and that brings us to pre-training. From my understanding, pre-training is when
6:08
you take the entire, basically
6:10
the entire internet worth of text,
6:12
and you use that to
6:15
teach the model these representations between
6:17
each token. So therefore,
6:19
like we talked about, if you gave
6:21
a model, the sky is, and
6:23
the next word is typically blue in
6:25
the pre-training, which is basically all
6:27
of the English language, all of
6:29
language on the internet. It should know
6:32
that the next token is blue.
6:34
So what you do is you want
6:36
to make sure that when the
6:38
model is outputting information, it's closely tied
6:40
to what that next value should
6:42
be. Is that a proper
6:44
description of what happens in pre-training? Yeah,
6:47
I think that's the objective
6:49
function, which is just to reduce
6:51
loss, i.e., how often is
6:53
the token predicted incorrectly versus
6:56
correctly, right? Right, so if you
6:58
said the sky is red, That's
7:00
not the most probable outcome, so that would be
7:03
wrong. But that text is on the internet, right? Because
7:05
the Martian sky is red and there's all these
7:07
books about Mars and sci -fi. Right, so how does
7:09
the model then learn how to figure this out and
7:11
in what context is it accurate to say blue
7:13
and red? Right, so I
7:15
mean, first of all, the model
7:17
doesn't just output one token. It outputs
7:19
a distribution. It turns
7:21
out the way most people take
7:23
it is they take the top
7:25
K, i.e., the highest probability.
7:28
So yes, blue is obviously the right answer if
7:30
you give it to anyone on this planet. But
7:33
there are situations and contexts where the sky
7:35
is red is the appropriate sentence, but that's
7:37
not just in isolation, right? It's like if
7:39
the prior passage is all about Mars and
7:41
all this, and then all of a sudden
7:43
it's like, and that's like a quote from
7:45
a Martian settler, and it's like the sky
7:47
is, and then the correct token is actually
7:49
red, right? The correct word. And so it
7:51
has to know this through the attention mechanism,
7:54
right? If it was just the sky is
7:56
blue, always you're gonna output blue because blue
7:58
is, let's say, 80%, 90%, 99% likely
8:00
to be the right option. But as you,
8:02
as you start to add context about Mars
8:04
or any other planet. Other planets have different
8:06
colored atmospheres, I presume. You
8:08
end up with this distribution that starts to shift. If I add 'we're on Mars, the sky is,' then all of a sudden, blue goes from 99% in the prior context window, the text that you sent to the model, the attention of it. All of a sudden, it realizes 'the sky is blue,' uh, preceded by that stuff about Mars. Now blue rockets down to, like, you know, let's call it 20% probability, and red rockets up to 80% probability, right? Um, now the model outputs that, and then most people just end up taking the top probability and outputting it to the user.
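[Editor's note: as a rough illustration of that shifting distribution, here is a toy softmax over made-up next-token scores. The candidate words and the numbers are invented to mirror the blue/red example; they are not taken from any real model.]

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution over candidate next tokens."""
    z = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return z / z.sum()

candidates = ["blue", "red", "green"]

# Hypothetical scores the model might assign after "The sky is"
plain_logits = np.array([5.0, 1.0, 0.5])
# Hypothetical scores after "We're on Mars. The sky is" -- added context shifts the scores
mars_logits = np.array([2.0, 4.0, 0.5])

for name, logits in [("plain", plain_logits), ("mars", mars_logits)]:
    probs = softmax(logits)
    ranked = sorted(zip(candidates, probs), key=lambda p: -p[1])
    print(name, [(tok, round(float(p), 2)) for tok, p in ranked])
    # Greedy decoding just takes the top entry; samplers draw from the distribution.

# plain -> "blue" dominates; mars -> most of the probability mass moves to "red"
```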
8:46
And that's sort of, how does the model learn that? It's the attention mechanism, right. And this is sort of, what is that? Yeah, the attention mechanism
8:56
is the beauty of modern sort
8:58
of large language models. It takes
9:00
the relational value in this vector
9:02
space between every single token, right? So
9:05
the sky is blue, right? When I
9:07
think about it, yes, blue is
9:09
the next token after the sky is,
9:11
but in a lot of older
9:13
style models, you would just predict the
9:15
exact next word. So after sky,
9:17
Obviously, it could be many things. It
9:19
could be blue, but it could
9:21
also be like scraper, right? Skyscraper, that makes sense.
9:27
But what attention does is
9:29
it is taking all of
9:31
these various values, the query,
9:33
the key, the value, which
9:35
represents what you're looking for,
9:37
where you're looking, and what
9:39
that value is, across the attention. You're calculating mathematically what the relationship is between all of these tokens.
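[Editor's note: here is a minimal numpy sketch of the query/key/value computation being described (scaled dot-product attention for a single head; causal masking and multiple heads are omitted). The matrices are random stand-ins; in a real transformer they come from learned projections of the token embeddings.]

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 8  # e.g. the four tokens of "the sky is blue"; d is the head dimension

# Stand-in Q, K, V; in practice these are learned linear projections of the embeddings.
Q = rng.normal(size=(n_tokens, d))  # what each token is looking for
K = rng.normal(size=(n_tokens, d))  # what each token offers to be found by
V = rng.normal(size=(n_tokens, d))  # the information each token carries

scores = Q @ K.T / np.sqrt(d)       # (n_tokens x n_tokens): every token scored against every other
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                # each token's output is a weighted mix of all the values

print(weights.round(2))  # the n x n attention pattern -- this is why cost grows with context length
print(output.shape)      # (4, 8)
```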
9:49
And so going back to the king-queen
9:51
representation, right? The way these two words
9:53
interact is now calculated, right? And the
9:55
way that every word in the entire
9:57
passage you sent is calculated is tied
9:59
together, which is why models have like
10:01
challenges with like how long can you, how
10:04
many documents can you send them, right? Because
10:06
if you're sending them... you know, just the question, like, what color is the sky, okay, it only has to calculate the attention between, you know, those words, right? But if you're sending it, like, 30 books with, like, insurance claims and all these other things, and you're like, okay, figure out what's going on here, is this a claim or not, right? And in the insurance context, all of a sudden it's like, okay, I've got to calculate the attention of not just, like, the last five words to each other; we have to calculate every, you know, 50,000 words to each other, right? Which then ends up being a ton of math. Back in the day, actually,
10:35
the best language models were a different architecture
10:37
entirely, right? But then
10:39
at some point, you know, transformers, large language
10:41
models, sort of large language models, which
10:43
are basically based on transformers primarily, rocketed
10:45
past in capabilities because they were able to scale and because the hardware
10:49
got there. And then we were able
10:52
to scale them so much that we
10:54
were able to not just put, like, some text in them, and not just a
10:58
lot of text or a lot of
11:00
books, but the entire internet, which one
11:02
could view the internet oftentimes as a
11:04
microcosm of all human culture and learnings
11:06
and knowledge to many extents, because most
11:08
books are on the internet, most papers
11:10
are on the internet. Obviously, there's a
11:13
lot of things missing on the internet,
11:15
but this is the modern magic of
11:17
three different things coming all together at
11:19
once. An efficient way for models to
11:21
relate every word to each other, the compute necessary to scale the data large enough, and then someone actually, like, pulling the trigger to do that, right, at the scale that was, you know, got to the point where it was useful, right, which was sort of like GPT-3.5 level or 4 level, right, where it became extremely useful for normal humans to use, you know, chat models.
11:42
Okay, and so why is it called pre-training? So pre-training is sort of called that because it
11:50
is what happens before the actual
11:52
training of the model. The objective
11:55
function in pre-training is to
11:57
just predict the next token, but
11:59
predicting the next token is not
12:01
what humans want to use AIs
12:03
for. I want it to
12:05
ask a question and answer it.
12:07
But in most cases, asking a
12:09
question does not necessarily mean that
12:11
the next most likely token is
12:13
the answer. Oftentimes, it is another
12:15
question. For example, if I
12:17
ingested the entire SAT, and
12:20
I asked a question, all
12:23
the next tokens would be like, A is this, B
12:25
is this, C is this, D is this. No, I just
12:27
want the answer. And
12:29
so pre-training is, the reason it's called pre-training is because you're ingesting
12:33
humongous volumes of text no matter the
12:35
use case. And
12:38
you're learning the general patterns
12:40
across all of language. I
12:42
don't actually know that King and Queen relate to each
12:44
other in this way, and I don't know that King and
12:46
Queen are opposites in these ways, right? And
12:48
so this is why it's called
12:50
pre-training is because you must get
12:53
a broad general understanding of the entire
12:55
sort of world of text before
12:57
you're able to then do post-training or fine-tuning, which is, let me
13:01
train it on more specific data that
13:03
is specifically useful for what I
13:05
want it to do, whether it's, hey,
13:08
in chat-style applications, you know, go in and, you know, when I ask a question, give me the answer. Or in other applications, like, teach me how to build a bomb. Well, obviously, no, I'm not going to help you build a bomb, because I don't want the model to teach me how to build a bomb. So, you know, it's sort of gotta do this. And it's not like, you know, when you're doing this pre-training, you're filtering out all this data, because in fact there's a lot of good, useful data on how to build bombs, because there's a lot of useful information on, like, hey, C4 chemistry, and, like, you know, people want to use it for chemistry, right? So you don't want to just filter out everything so that the model doesn't know anything about it. Um, but at the same time, you don't want it to output, you know, how to build a bomb. Um, so there's like a fine balance here. And that's why pre-training is defined as pre, because you're still letting it do things and teaching it things and inputting things into the model that are theoretically, like, quite bad. For
14:01
example, books about killing or war tactics
14:03
or what have you. Things that plausibly
14:05
you could see like, oh, well, maybe
14:07
that's not okay. Or
14:10
wild descriptions of really grotesque things all
14:12
over the internet, but you want the model
14:14
to learn these things. Because first you
14:16
build the general understanding before you say, okay,
14:18
now that you've got a general framework
14:20
of the world, let's align you so that you, with this general understanding of the world,
14:24
can figure out what is useful for people,
14:26
what is not useful for people, what
14:28
should I respond on, what should I not
14:30
respond on. So what happens
14:32
then in the training process? So
14:35
is the training process that the
14:37
model is then attempting to make
14:39
the next prediction and then just
14:41
trying to minimize loss as it
14:43
goes? Right, right. I mean like
14:45
basically you have loss, which is how often you're wrong versus right, in the most simple terms. You'll run passages through the model, and
14:56
you'll see how often did the model
14:58
get it right. When it got it
15:00
right, great, reinforce that. When it got
15:02
it wrong, let's figure out which neurons
15:04
in the model, quote unquote, neurons, in
15:06
the model you can tweak to then
15:08
fix the answer so that when you
15:10
go through it again, it actually outputs
15:12
the correct answer. And then you move
15:14
the model slightly in that direction. Now, obviously, the challenge with this is, at first, you know, I can come up with a simplistic way
15:23
where all the neurons will just output
15:25
the sky's blue every single time
15:27
it says the sky is. But then
15:29
when it goes to, you know, hey,
15:32
the... color blue is commonly used on
15:34
walls because it's soothing, right? And it's
15:36
like, oh, what's the next word? It's soothing, right? Soothing, you know, and so
15:40
like that, that is a completely different
15:42
representation. And to understand that blue is
15:44
soothing and that the sky is blue
15:46
and those things aren't actually related, but
15:48
they are related to blue is like
15:50
very important. And so, you know, oftentimes
15:52
you'll run through the training data set
15:54
multiple times, right? Because the first time
15:56
you see it, oh, great, maybe you
15:58
memorized that the sky is blue, and
16:01
you memorize the wall is blue
16:03
and when people describe art and
16:05
oftentimes use the color blue, it
16:07
can be representations of art or
16:09
the wall. And so over time,
16:12
as you go through all this
16:14
text in pre-training, yes, you're
16:16
minimizing loss initially by just memorizing,
16:18
but over time, because you're constantly
16:20
overwriting the model, it starts to
16:22
learn the generalization, i.e., blue
16:24
is a soothing color, also represents the
16:27
sky, also used in art for either of
16:29
those two motifs. Right? And so that's
16:31
sort of the goal of pre-training is
16:33
you don't want to memorize, right? Because that's,
16:35
you know, in school you memorize all
16:37
the time. And that's not useful because you
16:39
forget everything you memorize But if you
16:42
get tested on it then and then you
16:44
get tested on it six months later
16:46
And then again six months later after that
16:48
or however you do it ends up
16:50
being oh, you don't actually like memorize that
16:52
anymore You just know it innately and
16:54
you've generalized on it and that's the real
16:56
goal that you want out of the
16:59
model But that's not necessarily something you can
17:01
just measure right and therefore loss is
17:03
something you can measure ie for this group
17:05
of this group of text, right? Because
17:07
you train the model in steps. Every
17:10
step you're inputting a bunch of text, you're trying
17:12
to see what's predict the right token, where you
17:14
didn't predict the right token, let's adjust the neurons.
17:17
Okay, onto the next batch of text.
17:19
And you'll do this, these batches
17:21
over and over and over again, across
17:23
trillions of words of text, right? And
17:26
as you step through, and then you're like,
17:28
oh, well, I'm done. But I bet if
17:30
I go back to the first group of
17:32
texts, which is all about the sky being
17:35
blue, it's going to get the answer wrong
17:37
because maybe later on in the training it
17:39
discovered, it saw some passages about sci-fi and how the Martian sky is red. So, like,
17:43
it'll overwrite, but then over time as you
17:45
go through the data multiple times, as you
17:47
see it on the internet multiple times, you
17:49
see it in different books multiple times, whether
17:51
it be scientific, sci-fi, whatever it is, you
17:53
start to realize and it starts to learn
17:56
that that representation of like, oh, when it's
17:58
on Mars, it's red because the sky and
18:00
Mars is red because the atmospheric makeup is
18:02
this way. Whereas the atmospheric makeup on Earth
18:04
is a different way. And so that's sort
18:06
of, like, the whole point of pre-training is to minimize loss, but the nice side effect is that the model initially memorizes, but then it stops memorizing and it generalizes. And that's the useful pattern that we want.
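[Editor's note: here is a heavily simplified sketch of the loop being described, score the true next token with a cross-entropy loss and nudge the parameters to reduce it, step after step. The tiny vocabulary, the single weight matrix, and the hand-rolled gradient update are stand-ins for what real training frameworks do at vastly larger scale.]

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "sky", "is", "blue", "red"]
V, d = len(vocab), 8

embed = rng.normal(0, 1.0, (V, d))   # toy token embeddings (kept frozen here for simplicity)
W = rng.normal(0, 0.1, (d, V))       # the only "neurons" we train: context -> next-token scores
lr = 0.1

def step(context_id: int, target_id: int) -> float:
    """One training step: predict the next token, measure loss, nudge W downhill."""
    global W
    h = embed[context_id]                 # pretend the context is just the last token
    logits = h @ W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    loss = -np.log(probs[target_id])      # cross-entropy: small when the right token is likely
    grad = np.outer(h, probs - np.eye(V)[target_id])  # gradient of the loss w.r.t. W
    W -= lr * grad                        # move the weights slightly in the direction that fixes the answer
    return float(loss)

# "the sky is" -> "blue", seen over and over: the loss falls as the prediction improves.
for _ in range(50):
    loss = step(vocab.index("is"), vocab.index("blue"))
print(round(loss, 3))  # far lower than the initial ~1.6 (roughly uniform guessing over 5 tokens)
```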
18:16
Okay, that's fascinating. We've touched on post-training for a bit, but just to recap: post-training is, so you have a model
18:23
that's good at predicting the next
18:25
word. And in post-training, you sort
18:27
of give it a personality by inputting
18:29
sample conversations to make the model
18:31
want to emulate the certain values that
18:33
you want it to take on. Yeah,
18:36
so post-training can be a number of different things. The most simple way of doing it is, hey, pay for humans
18:43
to label a bunch of data, take
18:45
a bunch of example conversations, et
18:48
cetera, and input that data and train
18:50
on that at the end, right? And
18:53
so that example data is... useful,
18:55
but this is not scalable, right? Like
18:57
using humans to train models is
18:59
just so expensive, right? So then there's
19:01
the magic of sort of reinforcement
19:03
learning and other synthetic data technologies, right?
19:06
Where the model is helping teach
19:08
the model, right? So you have many
19:10
models in a sort of in
19:12
a post training where, yes, you have
19:14
some example human data, but human
19:16
data does not scale that fast, right?
19:18
Because the internet is trillions and
19:20
trillions of words out there. Whereas even
19:22
if you had Alex and I
19:24
write words all day long for our
19:27
whole lives, we would have millions
19:29
or hundreds of millions of words written.
19:31
It's nothing. It's like orders of
19:33
magnitude off in terms of the number
19:35
of words required. So
19:37
then you have the model take
19:39
some of this example data. and
19:42
you have various models that are surrounding
19:44
the main model that you're training, right? And
19:46
these can be policy models, right? Teaching
19:48
it, hey, is this what you want or
19:50
that what you want? Reward models, right?
19:52
Like, is that good response or is that
19:54
a bad response? You have value models
19:56
like, hey, grade this output, right? And you
19:59
have all these different models working in
20:01
conjunction to say, Different
20:03
companies have different objective functions, right?
20:06
In the case of Anthropic, they
20:08
want their model to be helpful,
20:10
harmless, and safe, right? So
20:12
be helpful. but also don't harm
20:14
people or anyone or anything, and
20:16
then, you know, you know, safe,
20:18
right? In other cases, like Grok, right, Elon's model from xAI, it actually
20:23
just wants to be helpful, and maybe it has
20:25
like a little bit of a right leaning to
20:27
it, right? And for other folks, right, like, you
20:29
know, I mean, most AI models are made in
20:31
the Bay Area, so they tend to just be
20:33
left leaning, right? But also the internet in general
20:36
is a little bit left leaning, because it skews
20:38
younger than older. And so, like, all these
20:40
things, like, sort of affect models. But
20:42
it's not just around politics, right? Post-training
20:44
is also just about teaching the model. If
20:47
I say the movie where the
20:49
princess has a slipper and it
20:51
doesn't fit, it's like, well, if
20:53
I said that into a base
20:55
model that was just pre-trained, the answer wouldn't be, oh, the movie you're looking for is Cinderella. It would only realize that once it goes through post-training, right? Because a lot of times
21:06
people just throw garbage into the model, and
21:08
then the model still figures out what you
21:10
want. And this is part of what post-training
21:12
is. You can just do stream of consciousness
21:14
into models, and oftentimes it'll figure out what
21:16
you want. If it's a movie that you're
21:18
looking for, or if it's help answering a
21:20
question, or if you throw a bunch of
21:23
unstructured data into it and then ask it
21:25
to make it into a table, it does
21:27
this. And that's because of all these different
21:29
aspects of post-training. Example data, but also
21:31
generating a bunch of data and grading it
21:33
and seeing if it's good or not. and
21:36
whether it matches the various policies you want.
21:38
A lot of times grading can be based on
21:40
multiple factors. There can be a model that
21:42
says, hey, is this helpful? Hey, is this safe?
21:44
And what is safe? So then that model
21:46
for safety needs to be tuned on human data.
21:49
So it is a quite complex thing, but the
21:51
end goal is to be able to get
21:53
the model to output in a certain way. Models
21:55
aren't always about just humans using them either.
21:57
There can be models that are just focused on,
22:00
hey, if it doesn't output
22:02
code, yes, it was trained on the whole internet
22:04
because the person's going to talk to the
22:06
model using text, but if it doesn't output code,
22:08
penalize it. Now, all of a sudden, the
22:10
model will never output text ever again. It'll only
22:12
output code. And
22:14
so these sorts of models exist too.
22:16
So post-training is not just a uni-variable
22:18
thing. It's what variables do you want
22:20
to target? And so that's why
22:22
models have different personalities from different companies.
22:24
It's why they target different use cases and
22:26
why it's not just one model that
22:29
rules them all, but actually many. That's
22:31
fascinating. So that's why we've seen so
22:33
many different models with different personalities is
22:35
because it all happens in the post-training moment. And this is
22:40
when you talk about giving the
22:42
models examples to follow. That's what
22:44
reinforcement learning with human feedback is,
22:46
is the humans give some examples
22:48
and then the model learns to
22:50
emulate what the human is interested
22:52
in, what the human trainer is
22:55
interested in having them embody. Is that right? Yeah, exactly.
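[Editor's note: as a sketch of how human preferences become a training signal, here is one common formulation, a pairwise "reward model" loss of the kind used in RLHF pipelines. The prompt, the scores, and the numbers are invented for illustration; real systems learn the reward model from many thousands of human comparisons and then use it to steer the main model.]

```python
import math

# A human labeler compared two candidate answers to the same prompt.
chosen_score = 1.8    # reward model's current score for the answer the human preferred
rejected_score = 2.3  # its score for the answer the human rejected

def preference_loss(chosen: float, rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: low when the chosen answer outscores the rejected one."""
    return -math.log(1 / (1 + math.exp(-(chosen - rejected))))

print(round(preference_loss(chosen_score, rejected_score), 3))  # ~0.974: the model disagrees with the human
print(round(preference_loss(2.5, 0.5), 3))                      # ~0.127: the model agrees, little to learn

# Training pushes the reward model toward the second situation; the main model is then
# tuned (for example with reinforcement learning) to produce answers the reward model scores highly.
```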
23:00
Okay, great. All right, so
23:02
first half we've covered what training is,
23:04
what tokens are, what loss is,
23:06
what post-training is. Post-training, by the way, is also called fine-tuning. We've also covered reinforcement learning
23:12
with human feedback. We're gonna take a
23:14
quick break and then we're gonna talk
23:17
about reasoning. We'll be back right
23:19
after this. Small
23:21
and medium businesses don't have
23:23
time to waste and
23:25
neither do marketers trying to
23:27
reach them. On LinkedIn,
23:29
more SMB decision makers are
23:31
actively looking for new
23:33
solutions to help them grow,
23:36
whether it's software or
23:38
financial services. Our Meet the
23:40
SMB report breaks down
23:42
how these businesses buy and
23:44
what really influences their
23:46
choices. Learn more at linkedin.com
23:48
backslash meet-the-smb. That's linkedin.com backslash meet-the-smb. And
23:54
we're back here on
23:56
Big Technology Podcast with Dylan
23:58
Patel. He's the founder
24:00
and chief analyst at SemiAnalysis. He actually has great
24:04
analysis on Nvidia's recent
24:06
GTC conference, which we covered on a recent episode. You can find SemiAnalysis at SemiAnalysis.com. It
24:14
is both content and
24:16
consulting. So you should definitely check
24:18
in with Dylan for all of those needs.
24:20
And now we're going to talk a
24:23
little bit about reasoning. Because
24:25
a couple of months ago, and Dylan,
24:27
this is really where I entered the
24:29
picture of watching your conversation with Lex, with
24:32
Nathan Lambert, about what
24:34
the difference is between reasoning
24:36
and your traditional LLMs,
24:39
large language models. If
24:41
I gathered it right from your
24:43
conversation, what reasoning is, is basically
24:46
instead of the model going, basically
24:48
predicting the next word based off
24:50
of its training. It
24:52
uses the tokens to spend more
24:54
time basically figuring out what the
24:56
right answer is and then coming
24:58
out with a new prediction. I
25:00
think Karpathy does a very interesting
25:02
job in the YouTube video talking
25:04
about how models think with tokens. The
25:07
more tokens there are, the more compute
25:09
they use because they're running these predictions through
25:11
the transformer model, which we discussed, and
25:13
therefore they can come to better answers. Is
25:15
that the right way to think about
25:17
reasoning? Humans
25:21
are also fantastic at pattern
25:23
matching, right? We're really good
25:25
at like recognizing things, but a lot
25:27
of tasks, it's not like an immediate
25:29
response, right? We are thinking, whether that's thinking
25:31
through words out loud, thinking through words
25:33
in an inner monologue in our head, or
25:35
it's just like processing somehow and then
25:38
we know the answer, right? And
25:40
this is the same for models, right?
25:42
Models are horrendous at math, historically. You could
25:46
ask it, is 9.11 bigger than 9.9? And it would say, yes, it's bigger, even though everyone knows that 9.11 is way smaller than 9.9. And
25:59
that's just a thing that happened
26:01
in models because they didn't think
26:03
or reason. And it's the same
26:05
for you, Alex, or myself. If
26:08
someone asked me 17 times 34, I'd be like, I don't know, like, right off the top of my head, but, you know, give me a little bit of time, I can do some long-form multiplication and I can get the answer right. And that's because I'm thinking about it. And this is the same thing with reasoning for models: you know, when you look at a transformer, every word, every token output, has the same amount of compute behind it, right? I.e., you know, when I'm saying 'the sky is blue,' the 'the' and the 'blue,' or the 'is' and the 'blue,' have the same amount of compute to generate, right? And
26:44
this is not exactly what you want
26:46
to do, right? You want to actually
26:48
spend more time on the hard things
26:50
and not on the easy things. And
26:52
so reasoning models are effectively teaching, you
26:54
know, large pre -trained models to do this,
26:56
right? Hey, think through the problem. Hey,
26:58
output a lot of tokens. Think
27:00
about it, generate all this text. And then
27:02
when you're done, you know, start answering the
27:04
question, but now you have all of this
27:06
stuff you generated in your context, right?
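[Editor's note: a common rule of thumb, an approximation not stated in the conversation itself, is that a transformer spends roughly 2 x (number of parameters) floating-point operations per generated token, so "thinking" in extra tokens buys extra compute in direct proportion. A quick back-of-the-envelope sketch with a made-up model size:]

```python
# Rough rule of thumb: ~2 FLOPs per parameter per generated token (forward pass only,
# ignoring attention's extra cost over very long contexts).
PARAMS = 70e9  # hypothetical 70B-parameter model

def generation_flops(n_tokens: int, params: float = PARAMS) -> float:
    return 2 * params * n_tokens

short_answer = generation_flops(20)       # blurting out an answer in ~20 tokens
with_reasoning = generation_flops(2_000)  # thinking out loud for ~2,000 tokens first

print(f"{short_answer:.2e} FLOPs vs {with_reasoning:.2e} FLOPs "
      f"({with_reasoning / short_answer:.0f}x more compute spent on the hard question)")
```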
27:10
And that stuff you generated is
27:12
is helpful, right? It could
27:14
be like, you know, all sorts
27:16
of, you know, just like any
27:18
human thought patterns are, right? And
27:20
so this, this is the sort
27:22
of like new paradigm that we've
27:24
entered maybe six months ago, where
27:26
models now will think for some
27:28
time before they answer, and this
27:30
enables much better performance on all
27:32
sorts of tasks, whether it be
27:34
coding or math or understanding science
27:36
or understanding complex social dilemmas, all
27:38
sorts of different topics they're much,
27:40
much better at. And this is
27:42
done through post-training, similar to the reinforcement learning with human feedback that we mentioned earlier, but also there's other forms of post-training, and
27:51
that's what makes these reasoning models.
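[Editor's note: to echo the 17 times 34 example, here is a toy illustration of "showing the work" in intermediate steps before committing to an answer. The decomposition is ordinary arithmetic, written out the way a reasoning trace might spell it out; it is not how any particular model actually reasons.]

```python
def multiply_with_work(a: int, b: int) -> int:
    """Spell out intermediate steps, the way a reasoning trace spends extra tokens."""
    tens, ones = divmod(b, 10)
    part1 = a * tens * 10          # 17 * 30 = 510
    part2 = a * ones               # 17 * 4  = 68
    print(f"{a} x {b} = {a} x {tens * 10} + {a} x {ones} = {part1} + {part2}")
    return part1 + part2

print(multiply_with_work(17, 34))  # prints the steps, then 578
```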
27:53
Before we head out, I want
27:55
to hit on a couple of
27:57
things. First of all, the growing
27:59
efficiency of these models. So I think one of the things that people focused on with DeepSeek was that
28:05
it was just able to be much
28:07
more efficient in the way that it
28:09
generates answers. And there
28:11
was obviously this big reaction to NVIDIA stock, where it fell 18% the day, or the Monday, after DeepSeek weekend, because people thought we
28:19
wouldn't need as much compute. So can
28:21
you talk a little bit about how
28:23
models are becoming more efficient and how
28:26
they're doing it? Yeah, so there's a
28:28
variety of... The beauty of AI is not just that we continue to build new capabilities. Because
28:34
the new capabilities are going to be able to benefit
28:36
the world in many ways. And there's
28:38
a lot of focus on those. But
28:40
there's also a lot of focus on, well,
28:43
to get to that next level of
28:45
capabilities is the scaling laws, i.e., the
28:47
more compute and data I spend, the better
28:49
the model gets. But then the other
28:51
vector is, well, can I get to the
28:53
same level with less compute and data? And
28:56
those two things are hand in hand, because if I
28:58
can get to the same level with less computing data,
29:00
then I can spend that more computing data and get
29:03
to a new level. And so
29:05
AI researchers are constantly looking for
29:07
ways to make models more efficient,
29:09
whether it be through algorithmic tweaks,
29:11
data tweaks, tweaks in how you do
29:13
reinforcement learning, so on and so
29:16
forth. And so when
29:18
we look at models across history,
29:20
they've constantly gotten cheaper and cheaper
29:22
and cheaper at a stupendous rate.
29:24
Right? And so one easy example
29:26
is GPT-3, right? Because there's GPT-3, 3.5 Turbo, Llama 2 7B, Llama 3, Llama 3.1, Llama 3.2, right? As these
29:36
models have gotten bigger, we've gone from,
29:38
hey, it costs $60 for a
29:40
million tokens, to it costs less than, it costs like five cents now for the same quality of model. Now, the model has shrunk dramatically
29:48
in size as well. And that's because
29:51
of better algorithms, better data, et
29:53
cetera. And now what happened with DeepSeek was similar. You know, OpenAI had GPT-4, then they had 4 Turbo, which was half the cost, then they had 4o, which was again half the cost, and then Meta released Llama 405B open source, and so the open source community was able to run that, and that was again, like, roughly half the cost, or 5x lower cost, than 4o, which was lower than 4 Turbo and 4. But DeepSeek came out with another tier, right? So
30:22
when we looked at GPT-3, the cost fell 1200x from GPT-3's initial cost to what you can get with Llama 3.2 3B today.
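[Editor's note: a quick sanity check on the figures quoted in this conversation: $60 per million tokens divided by the roughly 1200x decline lands at about five cents per million tokens, which matches the "five cents" figure mentioned above.]

```python
gpt3_cost_per_million = 60.00  # dollars per million tokens, as quoted above
decline_factor = 1200          # the ~1200x cost decline quoted above

today = gpt3_cost_per_million / decline_factor
print(f"${today:.2f} per million tokens")  # $0.05, i.e. about five cents
```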
30:31
And likewise, when we look at
30:33
from GPT-4 to DeepSeek V3, it's fallen roughly 600x in cost. So we're not quite at that 1200x, but it has fallen 600x in cost, from $60 to about $1, or to less than $1, sorry. And so you've got this massive cost decrease. But it's not
30:50
necessarily out of bounds, right? We've
30:52
already seen, I think what was
30:54
really surprising was that it was
30:56
a Chinese company for the first
30:58
time, right? Because Google and OpenAI
31:00
and Anthropic and Meta have all
31:02
traded blows, right? Whether it
31:04
be OpenAI always being on the leading
31:06
edge or Anthropic always being on the
31:09
leading edge or Google and Meta being
31:11
close followers, but oftentimes sometimes with a
31:13
new feature and sometimes just being much
31:15
cheaper. We have not seen
31:17
this from any Chinese company, right? And
31:19
now we have a Chinese company releasing
31:21
a model that's cheap. It's
31:23
not unexpected, right? Like this is actually
31:26
within the trend line of what happened with
31:28
GPT-3 is happening to GPT-4 level quality with DeepSeek. It's more
31:32
so surprising that it's a Chinese company. And that's,
31:34
I think, why everyone freaked out. And then there
31:36
was a lot of things that, like, you know,
31:38
from there became a thing, right? Like, if Meta
31:40
had done this, I don't think people would have
31:42
freaked out, right? And Meta's gonna
31:44
release their new Llama soon enough, right? And that one is gonna be, you know, a similar level of cost decrease, probably in similar areas as DeepSeek V3, right? It's just that people aren't gonna freak out, because it's an American
31:57
company and it was sort of expected. All
32:00
right, Dylan, let me ask you the last
32:02
question, which is the, you mentioned, I think
32:04
you mentioned the bitter lesson, which is basically
32:06
that they're, I mean, I'm gonna just be...
32:08
in summing it up. But the answer to
32:10
all questions in machine learning is just to
32:12
make bigger models. And
32:14
scale solves almost all problems. So
32:17
it's interesting that we have this moment where models
32:19
are becoming way more efficient. But
32:21
we also have massive, massive data
32:23
center buildouts. I
32:25
think it would be great to hear you kind
32:27
of recap the size of these data center buildouts and
32:29
then answer this question. If we
32:31
are getting more efficient, Why are these
32:33
data centers getting so much bigger? And
32:36
what might that added scale get in
32:38
the world of generative AI for the
32:40
companies building them? Yeah,
32:42
so when we look across the ecosystem at
32:44
data center buildouts, We track
32:46
all the build outs and server
32:48
purchases and supply chains here. And
32:51
the pace of construction is incredible. You
32:54
can pick a state and you can
32:56
see new data centers going up all
32:58
across the US and around the world.
33:00
And so you see things like
33:02
capacity in, for example, of the
33:04
largest scale training supercomputers goes from,
33:07
hey, it's not even a few
33:09
hundred million dollars a year ago,
33:11
but like, hey, for GPT-4, it was a few hundred million, and it's one building full of GPUs, to GPT-4.5 and the reasoning models like o1 and o3, which were
33:26
done in three buildings on
33:28
the same site and
33:30
billions of dollars to, hey,
33:32
these next generation things
33:34
that people are making are
33:36
tens of billions of
33:38
dollars like OpenAI's data center
33:40
in Texas called Stargate,
33:42
right? With Crusoe and
33:44
Oracle, and et cetera. And
33:46
likewise applies to Elon Musk who's building
33:48
these data centers in an old factory
33:50
where he's got a bunch of gas
33:52
generation outside and he's doing all these
33:54
crazy things to get the data center
33:56
up as fast as possible. And
33:58
you can go to just basically every
34:00
company and they have these humongous buildouts. And
34:04
this sort of, like... And because of the scaling laws, it's 10x more compute for linear improvement gains. It's log-log, sorry. But you end
34:14
up with this very confusing thing,
34:16
which is, hey, models keep getting
34:19
better as we spend more. But
34:21
also, the model that we had
34:23
a year ago is now done
34:25
for way, way cheaper, oftentimes 10x
34:27
cheaper or more, just a year
34:29
later. So then the question is
34:31
like, why are we spending all
34:33
this money to scale? And
34:36
there's a few things here, right? A, you
34:39
can't actually make that cheaper model without making
34:41
the bigger model so you can generate data
34:43
to help you make the cheaper model, right?
34:45
Like that's part of it. But
34:47
also another part of it
34:50
is that, you know, if we
34:52
were to freeze AI capabilities
34:54
where we were basically in, what
34:56
was it? March 2023, right?
34:59
Two years ago when GPT -4 released.
35:01
Um, and only made them cheaper, right?
35:03
Like deep seek is like much cheaper.
35:05
It's much more efficient. Um, but it's
35:07
roughly the same capabilities as you PD
35:09
for, um, that would not. Pay
35:12
for all of these K buildouts, right?
35:14
AI is useful today, but it's not capable
35:16
of doing a lot of things, right? But
35:18
if we make the model way more efficient
35:20
and then continue to scale and we
35:22
have this like stair step, right? Where we
35:24
like, increase capabilities massively, make them way more efficient, increase capabilities massively, make them way more efficient. We do the stair step, then you end up with creating all these new capabilities that could in fact pay for, you know, these massive AI buildouts. So no one
35:37
is trying to make, with these, you know, with these ten-billion-dollar data centers... they're not trying to make chat models, right? They're not trying to make models that people chat with, just to be clear, right? They're trying to solve things like software engineering and make it automated, which
35:52
is like a trillion dollar plus industry,
35:54
right? So these are very different
35:56
like sort of use cases and targets.
35:58
And so it's the bitter lesson because
36:00
yes, you can make, you can spend
36:02
a lot of time and effort making
36:04
clever specialized methods, you know, based on
36:06
intuition. And you should,
36:09
right? But these things should also just
36:11
have a lot more compute thrown behind them
36:13
because if you make it more efficient as you
36:15
follow the scaling laws up. it'll also just
36:17
get better and you can then unlock new capabilities,
36:19
right? And so today, you know, a lot
36:21
of AI models, the best ones from Anthropic are
36:23
now useful for like coding. As
36:25
an assistant with you, right, you're going back and
36:27
forth, you know, as time goes forward, as
36:29
you make them more efficient and continue to scale
36:31
them, the possibility is that, hey, it can
36:33
code for like 10 minutes at a time and
36:35
I can just review the work and it'll
36:37
make me 5x more efficient, right? You
36:40
know, and so on and so forth.
36:42
And this is sort of like where reasoning
36:44
models and sort of the scaling sort
36:46
of argument comes in is like, yes. We
36:49
can make it more efficient, but we also just,
36:51
you know, that's not going to solve the problems that
36:53
we have today, right? The earth is still going
36:55
to run out of resources. We're going
36:57
to run out of nickel because we make batteries, and we can't make enough batteries, so then with current technology we can't replace all of, you know, gas and coal with renewables, right? All of
37:07
these things are going to happen unless like
37:09
you continue to improve AI and invent and
37:12
we're just generally researching new things and AI
37:14
helps us research new things.
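[Editor's note: as a rough illustration of the log-log scaling relationship mentioned above, here is a toy power-law loss curve. The constants are invented; published scaling-law fits differ by model family and dataset.]

```python
# Toy power-law loss curve: loss = A * compute^(-ALPHA). Constants are made up for illustration.
A, ALPHA = 10.0, 0.05

def loss(compute_flops: float) -> float:
    return A * compute_flops ** -ALPHA

for exponent in range(21, 27):  # 1e21 ... 1e26 FLOPs of training compute
    c = 10.0 ** exponent
    print(f"1e{exponent} FLOPs -> loss {loss(c):.3f}")

# Each 10x of compute multiplies the loss by the same factor (10**-0.05, about 0.89),
# i.e. a straight line on a log-log plot: steady gains, but each step costs 10x more.
```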
37:16
Okay, this is really the last one. Where is GPT-5? So
37:21
OpenAI released GPT-4.5 recently with what
37:25
they called training
37:28
run Orion. There
37:30
were hopes that Orion could be
37:32
used for GPT-5, but its improvement was not enough to be really a GPT-5. Furthermore,
37:38
it was trained on the classical method, which
37:40
is, like, a ton of pre-training and then some reinforcement
37:45
learning with human feedback and some other
37:47
reinforcement learning like PPO and DPO and
37:49
stuff like that. But
37:51
then along the way, this model was
37:53
trained last year, along the way,
37:55
another team at OpenAI made the big
37:57
breakthrough of reasoning, strawberry training. And
37:59
they released o1 and then they released o3. And these models are rapidly
38:03
getting better with reinforcement learning with verifiable
38:05
rewards. And so now
38:07
GPT-5, as Sam calls it, is gonna be a model that has huge pre-training scale, like GPT-4.5, but also huge post-training scale, like o1 and o3, and continuing to scale
38:18
that up. This would be the first
38:20
time we see a model that
38:22
was a step up in both at
38:24
the same time. And so that's
38:27
what OpenAI says is coming. They
38:29
say it's coming this year, hopefully
38:31
in the next three to six months,
38:33
maybe sooner. I've heard sooner, but
38:35
we'll see. Um, but this,
38:37
this path of scaling both pre-training and post-training with reinforcement
38:41
learning with verifiable rewards massively should
38:44
yield much better models that are
38:46
capable of much more things. And we'll see what those things are.
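[Editor's note: for the "reinforcement learning with verifiable rewards" idea, here is a minimal sketch of what "verifiable" means in practice: the reward comes from a programmatic check (a math answer, a passing test) rather than from a human judgment. The sample problem and the string parsing are invented for illustration and are not any lab's actual pipeline.]

```python
import re

def verifiable_reward(model_output: str, ground_truth: int) -> float:
    """Reward 1.0 if the final number in the model's answer matches the known answer, else 0.0."""
    numbers = re.findall(r"-?\d+", model_output)
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == ground_truth else 0.0

# Two hypothetical model outputs for "What is 17 x 34?"
long_reasoning = "17 x 34 = 17 x 30 + 17 x 4 = 510 + 68 = 578"
wrong_answer = "The answer is 568"

print(verifiable_reward(long_reasoning, 578))  # 1.0 -- reinforce the behavior that produced this
print(verifiable_reward(wrong_answer, 578))    # 0.0 -- no reward, however confident it sounded
```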
38:51
Very cool. All right, Dylan, do you want to give
38:53
a quick shout out to those who are interested in
38:55
potentially working with semi analysis, who you work with
38:57
and where that, where they can learn more. Sure.
39:00
So we, you know, at SemiAnalysis.com, we have, you know, we have the public
39:04
stuff, which is like all these reports that
39:06
are, uh, pseudo free, but then we, most
39:08
of our work is done on, uh, directly
39:10
for clients. There's these datasets that we sell
39:12
around every data center in the world, servers, all the compute, where it's manufactured, how many, where,
39:16
what's the cost and who's doing it. Um,
39:18
and then we also do a lot of
39:20
consulting. We've got people who have worked all
39:22
the way from ASML, which makes lithography tools
39:25
all the way up to, you know, Microsoft
39:27
and Nvidia, um, which, you know,
39:29
making models and doing infrastructure. And
39:31
so we've got this whole gamut of folks. There's
39:34
roughly 30 of us across the
39:36
world, in the US, Taiwan, Singapore, Japan, France, Germany, Canada. So,
39:41
you know, there's a lot of engagement points. But if you want to reach out, just go to the website, you know, go to one of those specialized pages of models or sales and reach out, and that'd be the best way to sort of interact and engage with us. But for most people, just read the blog, right? Like, I think, like, unless you have specialized, like, needs, unless you're a company in the space or an investor in the space, like, you know, you just want to be informed, just the blog, and it's free, right? I think that's the best option for most people. Yeah,
40:09
I will attest the blog is magnificent. And Dylan, really a thrill to get a chance to meet you and talk through these topics with you. Thanks so much for coming on the show. Thank you so much, Alex. All right, everybody, thanks for listening. We'll be back on Friday to break down the week's news. Until then, we'll see you next time on Big Technology Podcast.