ReflectionAI Founder Ioannis Antonoglou: From AlphaGo to AGI

Released Tuesday, 28th January 2025

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.


0:00

Go is a complex game and there was

0:02

always a bit of worry about whether AlphaGo

0:04

was truly as good as we believed. So

0:06

we actually had the conviction that deep

0:08

reinforcement learning is the answer based on

0:10

everything that we could measure and everything

0:13

we could see. But that's, the thing

0:15

about these systems is that they're not

0:17

like classic computers where you just

0:19

like know that they always produce

0:22

the same answer. They're like stochastic.

0:24

they are creative and they have

0:26

like some blind spots, they hallucinate,

0:28

like similarly to how

0:31

LLMs hallucinate. So you need to

0:33

just like really push them and

0:35

just like see exactly where they

0:38

break and the only way you

0:40

could actually do that is by

0:42

having like the best humans

0:44

playing against them. Today

0:58

we're excited to welcome Ioannis Antonoglou,

1:00

a researcher and an engineer who

1:02

has contributed to some of the

1:05

most significant breakthroughs in

1:07

AI. As a founding engineer at

1:09

DeepMind, Ioannis played a crucial

1:11

role in developing Alpha Go, which

1:13

made history by defeating Go World

1:15

Champion Lee Sedol. He later co-led

1:17

the development of Mu Zero,

1:19

which pushed the boundaries even

1:21

further by mastering multiple games

1:24

autonomously. As he embarks in

1:26

his latest venture with reflection,

1:28

he's focused on building the

1:30

next generation of AI agents. We're

1:33

excited to talk to Ioannis about

1:35

the breakthrough moments in AI history

1:37

that he's witnessed firsthand. From

1:39

Alpha Go's famous Move 37 to

1:42

his perspective today on what's

1:44

next for the combination of

1:46

reinforcement learning and large language

1:49

models on the way to AGI. Ioannis,

1:51

thank you so much for joining us

1:53

today. Thank you so much for having me. Ioannis,

1:55

you have an incredible background having worked

1:57

at DeepMind as a founding engineer

2:00

for over a decade, starting with

2:02

some of the most notable projects that

2:04

have really defined the industry. DeepMind,

2:07

quite notably, created this notion of building

2:09

AI within games to start. Can you

2:11

share a little bit more about why

2:13

DeepMind chose to start with games

2:16

at the time? Yeah, so DeepMind

2:18

was the first company to truly embrace

2:20

the concept of artificial general intelligence, or

2:23

AGI, from the outset that had grand

2:25

ambitions aiming to build systems that would

2:27

match or exceed human intelligence. So the

2:29

big question was, and still is, how

2:32

do you build AGI? And more importantly,

2:34

how do you measure intelligence in a

2:36

way that allows for meaningful measurement and

2:39

performance improvements? So the idea of using

2:41

video games as a testing ground came

2:43

naturally to DeepMind's founders,

2:45

Demis Hassabis and Shane Legg. Because

2:48

Demis had a background in the gaming

2:50

industry and Shane's PhD thesis defined AGI

2:52

as a system that could learn to

2:55

complete any task. Video games provided the

2:57

control yet complex environment where these ideas

2:59

could be explored and tested. And to

3:01

what extent, as you mentioned, do games

3:04

provide a very controlled environment? To what

3:06

extent are games representative or not of

3:08

the real world? Like if you have

3:11

a result in games, do you think

3:13

that generalizes naturally to the real world

3:15

or not? So I mean, I guess

3:17

games have indeed been valuable for developing

3:20

AI. And you actually have like a

3:22

few examples of that. So you can

3:24

see that PPO, for example, which is

3:27

currently being used in RLHF, was developed

3:29

using OpenAI Gym, with MuJoCo

3:31

and Atari.
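To make that testbed concrete, the environment loop these algorithms were developed against looks roughly like the sketch below. It is only an illustrative example, using the Gymnasium successor to the original OpenAI Gym API and a CartPole task I picked for brevity rather than anything named in the conversation.

```python
import gymnasium as gym  # successor package to the original OpenAI Gym API

# CartPole is an illustrative choice; PPO was tuned on MuJoCo and Atari tasks.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()      # a trained PPO policy would go here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()
env.close()
print(f"episode return with a random policy: {total_reward:.0f}")
```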

3:33

And similarly, we have MCTS, which stands for

3:36

Monte Carlo tree search, and was developed through

3:38

board games like backgammon and Go. But

3:40

at the same time. games have like

3:43

a number of limitations. So the real

3:45

world is messy, is unbounded, and it's

3:47

a much tougher nut to crack than

3:49

even the most complex games. So even

3:52

though it just gives you an interesting

3:54

testbed to develop new ideas, it's definitely

3:56

limiting and it doesn't really capture all

3:59

the complexity of the real world. Okay,

4:01

interesting though. So a

4:03

lot of the techniques and algorithms

4:05

that you've developed in a game

4:08

environment, PPO, etc., these are

4:10

used in the real world. Yeah,

4:12

so PPO is actually like exactly

4:15

what, you know, is currently used

4:17

for RLHF. And so MCTS, it's

4:19

used in MuZero, and MuZero has

4:21

been used in like the real

4:23

world in things like, you know,

4:26

compression, video compression for YouTube. It

4:28

was part of the self-driving

4:30

system at Tesla at some time.

4:32

And it was also like

4:34

used for developing a

4:37

pilot, like that was

4:39

completely controlled by an AI. So

4:41

yeah, I mean, you can see

4:43

methods like that being used

4:45

in the world to solve

4:48

real problems. So interesting.

4:50

Ioannis, I remember back in 2017,

4:53

when AlphaGo, the movie came out

4:55

and it featured the incredible game

4:57

of AlphaGo against Lee Sedol, can

5:00

you take us back to that

5:02

moment in time and maybe the

5:04

years leading up to it as

5:07

you're building AlphaGo? How was Go

5:09

specifically chosen as the game to focus

5:11

on? So... I'd say games have

5:13

always been a benchmark for AI research.

5:16

So like before go you have chess

5:18

and chess was like a major milestone

5:20

with IBM's Deep Blue defeating Garry Kasparov

5:22

in the late 90s. And I mean even

5:25

though chess and go are completely

5:27

different games and go is definitely

5:29

a different beast, there is like games

5:31

have always been acted as test

5:33

beds for the development, especially

5:35

board games for the development of

5:38

like new AI methods. Actually, even

5:40

going back to the earliest days

5:42

of AI research, Turing and Shannon,

5:44

they both worked on their own

5:46

versions of chess programs. So now,

5:48

the thing about like Go is

5:50

that it's a much harder problem

5:52

than like chess. The reason for

5:54

that is because it's almost

5:57

impossible to define an

5:59

evaluation method, a heuristic. So in chess,

6:01

you can just like take a look

6:03

at the board, you can count the

6:05

number of pawns that like each side

6:07

has, you can see what the ranks

6:09

of these pawns are, and then you

6:11

can just like make some, you can

6:14

draw some conclusions or like who's winning

6:16

and why. But in Go, there's

6:18

nothing like that. Like it's mostly human

6:20

intuition. And if you ask like a

6:22

Go professional player how

6:24

they know whether a position is a

6:26

good one or a bad one they

6:28

would say that like you know after

6:30

having playing the game for so long

6:32

they can just feel it in

6:34

their gut that this is a better position

6:36

than the other one. So now it's

6:38

actually a question of how do you

6:40

encode the feeling in your gut into

6:43

like an AI system right? So this

6:45

is exactly the reason why Solving Go

6:47

was considered the holy grail of AI

6:49

research for a long time. And it

6:51

was a challenge that seemed almost impossible,

6:53

but at the same time it was

6:55

like within reach. People felt that they

6:57

could actually get it cracked. And this

6:59

is exactly what Alpha Go did back

7:01

in 2016. And it kind of like

7:03

showcased two new methods, which are

7:05

like deep learning and reinforcement learning. Because

7:07

back in 2015 and 2016, like... Now

7:09

we kind of think of deep learning

7:11

and reinforcement learning as mature technologies,

7:14

but like back then, we're kind of

7:16

like literally like making the, they're taking

7:18

their first steps and they were kind

7:20

of like the new kid in the

7:22

block. And most people were kind of

7:24

like really skeptical about them. Everyone thought

7:26

that deep learning was another AI fad

7:28

that would just like won't last the

7:30

test of time. So yeah, I mean,

7:32

Go was chosen because it was

7:34

a clear showcase that you actually

7:36

have the most

7:38

performant agent in the world. You could

7:40

actually evaluate it, you can have it

7:43

play with other humans, and at the

7:45

same time, it was within reach, given

7:47

like the latest developments in deep learning

7:49

and reinforcement learning. I remember reading that

7:51

there's more configurations of the go board

7:53

than atoms in the universe,

7:55

and that blew me

7:57

away. I mean, I grew up playing

7:59

go and it felt like such a

8:01

you know it's a very simple in

8:03

terms of the rules but I see why

8:05

it was the holy grail. Maybe can you

8:08

explain how Alpha Go worked technically

8:10

maybe explain to me like I'm

8:12

a fifth grader because that is

8:14

that is effectively my level of

8:17

sophistication understanding these things but how

8:19

did it work and you mentioned

8:21

both reinforcement learning and deep learning

8:23

were involved I'd love to peel that back

8:26

a little bit. Yeah absolutely so

8:28

AlphaGo has two deep neural networks. So

8:30

a neural network is a function that like

8:32

takes something as an input and produces something

8:34

as an output. And it's literally like a

8:37

black box. We don't really know exactly how

8:39

it does it. Just like know that you

8:41

can actually, if you train it on enough

8:43

data, we'll just like learn the mapping, if

8:46

we learn the function from input to the

8:48

output space. So AlphaGo actually had access to

8:50

two. deep neural networks, the policy network and the

8:52

value network. And the policy network suggested the

8:54

most promising move. So it will just take

8:57

a look at the current board position, and

8:59

it will just like say, okay, you know,

9:01

based on the current position, this is the

9:03

list of moves that I would recommend you

9:05

just like consider playing. And it also had access

9:07

to the value network, which will just take a

9:09

look at a board position, and just

9:12

like give you a winning probability. Like what

9:14

are your chances of actually winning the game

9:16

starting from this position? This is exactly the

9:18

gut feeling. It had its own gut feeling

9:20

on whether the position is a good one

9:22

or a bad one.
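As a rough sketch of the two networks being described, assuming a 19x19 board encoded as a single-plane tensor (the real AlphaGo used many input feature planes and much deeper convolutional networks), the shapes might look like this in PyTorch:

```python
import torch
import torch.nn as nn

BOARD = 19

class PolicyNet(nn.Module):
    """Maps a board position to a probability over the 361 possible moves."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64 * BOARD * BOARD, BOARD * BOARD)

    def forward(self, board):                        # board: (N, 1, 19, 19)
        h = self.conv(board).flatten(1)
        return torch.softmax(self.head(h), dim=-1)   # recommended-move probabilities

class ValueNet(nn.Module):
    """Maps a board position to a single winning probability."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Linear(64 * BOARD * BOARD, 1)

    def forward(self, board):
        h = self.conv(board).flatten(1)
        return torch.sigmoid(self.head(h))            # P(current player wins)

position = torch.zeros(1, 1, BOARD, BOARD)             # an empty board
print(PolicyNet()(position).shape, ValueNet()(position).item())
```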

9:24

So once you have access to these two networks, then you

9:27

can actually play in your imagination a

9:29

number of games. You can consider the

9:31

most promising moves. Then you can consider

9:33

your opponents most promising moves. And then

9:35

you can just evaluate its moves, like

9:37

the value network. And then you can

9:40

use a method called minimax. What

9:42

that says is that I want to

9:44

win the game. but I also like know

9:46

that my opponent wants to win the game.

9:48

So I want to just like pick a

9:51

move that will maximize my chances of winning

9:53

knowing that like my opponent will try to

9:55

maximize their chance of winning. So if

9:57

you actually like do that and simulate a...

10:00

of moves, then you can just like get

10:02

the optimal action. And you know, the way

10:04

to just like do this imagination, this planning,

10:06

this search in the most efficient way is

10:08

by using a tree search method called Monte Carlo

10:10

tree search, or MCTS. So, whenever people talk

10:12

about MCTS, they literally just mean this

10:14

heuristic of how do I choose which

10:16

moves to consider so that I can

10:18

make informed decisions.
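A minimal sketch of the planning idea just described: expand only the policy's most promising moves, score positions with the value network's "gut feeling", and back the scores up with minimax. Everything here is a toy stand-in (the game, the policy prior, and the value function are placeholders I made up), and real MCTS replaces this exhaustive backup with guided sampling.

```python
import random

random.seed(0)

# Toy stand-ins: a "position" is the list of moves played so far, the policy
# prior and the value estimate are random but deterministic functions.
def legal_moves(position):
    return [] if len(position) >= 4 else [0, 1, 2, 3]

def play(position, move):
    return position + [move]

def policy_prior(position, move):
    return random.Random(hash((tuple(position), move, "p"))).random()

def value_net(position):
    return random.Random(hash((tuple(position), "v"))).random()

def search(position, depth, maximizing, top_k=2):
    """Depth-limited minimax that only expands the policy's top-k moves
    and evaluates leaf positions with the value network."""
    moves = legal_moves(position)
    if depth == 0 or not moves:
        return value_net(position)
    moves = sorted(moves, key=lambda m: policy_prior(position, m), reverse=True)[:top_k]
    scores = [search(play(position, m), depth - 1, not maximizing, top_k) for m in moves]
    # I pick my best move; my opponent then picks the move that is worst for me.
    return max(scores) if maximizing else min(scores)

print(f"estimated value of the empty position: {search([], depth=3, maximizing=True):.3f}")
```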

10:20

The role of reinforcement learning and deep learning in building AlphaGo was that

10:22

AlphaGo first of all was the success of

10:24

reinforcement learning and deep learning, because this

10:26

is exactly the two methods that powered AlphaGo.

10:28

And the policy network was initially trained on

10:30

a large set of human games. So you

10:32

had like many games played by human professionals

10:34

and you just like consider every position and

10:36

you consider the move they took at this

10:38

position. And then you have a deep neural

10:40

network that tries to predict this move. Then

10:42

once you have the policy network, you need

10:44

to somehow find a way to just like

10:46

obtain a value network. So we did it

10:48

in two ways. First, we just took the

10:50

policy network and we had to play against

10:52

itself. And we used reinforcement learning

10:54

to improve the playing strength of the model.

10:56

So we use a technique called policy gradient.

10:58

So what policy gradient does is that it

11:00

just like looks at the game and then

11:03

it looks at the outcome. This is the

11:05

simplest version, kind of like of policy gradient.

11:07

It looks at the outcome of the game.

11:09

And for all the moves that led to

11:11

a win, they'll just like say, great, you

11:13

know, just increase the probability of choosing this

11:15

move. And for all the moves that led

11:17

to a loss, it says, great, now decrease

11:19

the probability of like this move being selected

11:21

in the future. And if you do it

11:23

like, you know, for many games and for

11:25

long enough, then you just like get an

11:27

improved policy.
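A minimal sketch of policy gradient in the simplest form just described: nudge the probability of every move in a won game up, and every move in a lost game down. The policy network and the self-play routine below are toy stand-ins I invented for illustration, not the actual AlphaGo code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy policy over 361 "moves"; the real network is a deep convolutional net.
policy = nn.Sequential(nn.Linear(361, 128), nn.ReLU(), nn.Linear(128, 361))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

def play_fake_game():
    """Stand-in for self-play: sample a few moves, invent a final outcome."""
    log_probs = []
    for _ in range(5):
        board = torch.randn(361)                      # fake position features
        dist = torch.distributions.Categorical(logits=policy(board))
        move = dist.sample()
        log_probs.append(dist.log_prob(move))
    outcome = 1.0 if torch.rand(()) < 0.5 else -1.0   # +1 win, -1 loss
    return log_probs, outcome

games = [play_fake_game() for _ in range(32)]

# Simplest policy gradient (REINFORCE): increase the probability of every move
# in a won game, decrease it for every move in a lost game.
loss = -sum(outcome * torch.stack(lp).sum() for lp, outcome in games)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"surrogate loss after one policy-gradient step: {loss.item():.2f}")
```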

11:29

Now, once you have this improved policy, you can just generate a new data

11:31

set of games where like the policy plays

11:33

against itself. games where for

11:35

each position you know

11:37

who the final winner

11:39

was. So then you

11:41

can take this network,

11:43

you can take another

11:45

network, a value network

11:47

and have it predict

11:49

the outcome of the

11:51

game based on the

11:53

current position. So what

11:55

the network will learn

11:57

is that if I

11:59

start out this position

12:01

and I play under

12:03

my current policy, on

12:06

average this is the

12:08

player who wins, like

12:10

it's either a black

12:12

player or the white

12:14

player. So this is

12:16

the first version of

12:18

like a value network

12:20

and you can just

12:22

like use it within

12:24

AlphaGo by combining it

12:26

with the policy network.
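That second stage, training a value network on (position, final winner) pairs from self-play games, reduces to standard supervised learning. A minimal sketch, using randomly generated stand-in data in place of real self-play games:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

value_net = nn.Sequential(nn.Linear(361, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Stand-in dataset: each row is a board position, each label is 1.0 if the
# current player eventually won that self-play game, else 0.0.
positions = torch.randn(1024, 361)
winners = torch.randint(0, 2, (1024, 1)).float()

for epoch in range(5):
    pred = value_net(positions)
    loss = loss_fn(pred, winners)          # predict the eventual game outcome
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"value-network loss: {loss.item():.3f}")
```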

12:28

And what were some

12:30

of the biggest challenges

12:32

in building this and

12:34

how did you overcome

12:36

them? Yeah,

12:38

so AlphaGo was not just a

12:40

recent challenge but was mostly I'd

12:42

say an engineering marvel. It was

12:44

the early versions ran on 1,200

12:47

CPUs and 176 GPUs and the

12:49

version that played against Lee Sedol

12:51

used 48 TPUs. TPUs were

12:53

like the first custom

12:55

accelerators. And these were like, these

12:57

accelerators were like really primitive back

12:59

then because literally it was like

13:01

the first version right? Like now

13:03

the later accelerators are much much

13:06

better and much more stable. So

13:08

the system had to be highly

13:10

optimized to minimize latency, maximize throughput.

13:12

We had to build large -scale infrastructure

13:14

for training these networks and it

13:16

was a massive endeavor. Just required

13:18

a lot of coordinated effort from

13:20

many talented individuals working on different

13:22

aspects of the project. But you

13:25

know, I just like walked you

13:27

through a number of steps just

13:29

to contain the policy network and

13:31

the value network and each of

13:33

these steps had to just be

13:35

implemented at the limits of

13:37

what was available and what was

13:39

possible back then in terms of

13:42

scale. And it had to be

13:44

implemented in a way where people

13:46

could just like tinker with it.

13:48

They could just like try the

13:50

resistance idea as fast and get

13:52

results, results fast. So yeah, lots

13:54

of people scale in, you know,

13:56

at levels that hadn't been implemented

13:58

before and it's kind of like

14:01

walking at the forefront of what was possible back then. I

14:03

love your highlight of it being a research marvel

14:05

and an engineering marvel. And I remember

14:07

you sharing one time that part of

14:09

the reason this project came about also

14:11

was because Google had TPUs that they

14:14

needed to, they needed a test customer

14:16

for and that was the spark.

14:18

This alphabet, off-a-go project, so that's

14:20

pretty incredible. How much conviction did the

14:22

DeepMind team have that this is

14:24

going to work? You mentioned that, you

14:26

know, at the time, deep learning and reinforcement

14:28

learning were still relatively novel, but Deep

14:31

Mind was very much founded with that

14:33

belief. But did you guys think that

14:35

you were going to be able to have

14:37

kind of these superhuman level results beating the

14:39

top go player in the world? Like was

14:41

it a crazy idea and maybe they'll work

14:44

or did the team have conviction like this

14:46

is going to work? Yeah, so I'd say

14:48

I would like the team had a cautious

14:50

optimism. So one of AlphaGo's lead developers, Aja

14:52

Huang, is a strong amateur Go player and

14:55

he had been working on Go for like

14:57

a decade before Alpha Go happened. And

14:59

we also had like a leaderboard

15:01

of computer Go players,

15:04

and you could see that Alpha Go

15:06

was significantly stronger than anything that can

15:08

before. But Go is a complex game and

15:10

there was always a bit of worry about

15:12

whether AlphaGo was truly as good

15:14

as we believed. So we actually

15:16

had the conviction that deep reinforcement

15:19

learning is the answer, based on

15:21

everything that we could measure and

15:23

everything we could see. But that's,

15:25

the thing about the systems is

15:27

that, you know, they're not like

15:29

classic computers, where you just like

15:32

know that they always produce the

15:34

same answer. They're like stochastic. They

15:37

are creative. So, and they

15:39

only have like, they have

15:41

like some blind spots. They

15:43

hallucinate, like similarly to how

15:45

LLMs hallucinate. So, you

15:47

need to just like really push them

15:49

and just like see exactly where the

15:51

break and the only way you could

15:53

actually do that is by having like

15:55

the best humans playing against them. Move

15:57

37. Can you tell us what that was?

15:59

it was such a monumental move and

16:02

I think everyone watching it at the

16:04

time it was in least at all

16:06

maybe primarily was confused by that move.

16:09

What was going on in your head

16:11

when that happened? So yeah I mean

16:13

move 37 yeah in game two against

16:16

Lee Sedol was literally just a spectacular

16:18

moment in the sense that it

16:20

showcased to the world that AlphaGo

16:23

has creativity. and it demonstrated that AI

16:25

could come up with strategy that even

16:27

top human players hadn't considered. So at

16:30

first, like I still remember that, like,

16:32

we thought that Alpha Go made an

16:34

error. So that's, it actually, like, hallucinated.

16:37

It did something that like it didn't

16:39

mean to do. But then it turned

16:41

out to be a pretty

16:44

unconventional move that underscored that the system

16:46

had a deep understanding of the game,

16:48

that the system actually had, like, like,

16:51

creativity of things that, like people hadn't

16:53

thought of before before. I want to

16:55

take us to another key move in

16:58

the game. I think it was in

17:00

game four. At this point I was

17:02

rooting for Lee because I was like

17:05

a four guy needs to win a

17:07

game. Move 78. I think I think

17:09

AlphaGo made a mistake and Lee

17:12

Sedol knows it. I guess what was

17:14

the weakness there that Lee found during

17:16

the game? Yeah, exactly. So I mean,

17:19

Lee Sedol's victory in game four was

17:21

literally a testament to human ingenuity. Like,

17:23

Move 78 was unexpected. And this will

17:25

AlphaGo, based on its evaluations,

17:28

misinterpreted as a mistake and thought that

17:30

it was actually like winning. So that's

17:32

why it didn't respond appropriately. And, you

17:35

know, this kind of highlighted the blind

17:37

spot in the system. So, the game

17:39

showed that while systems like AlphaGo

17:42

are extremely powerful at the same time,

17:44

they still have vulnerabilities and there were

17:46

like still areas where it could further

17:49

improve it. But how do you go

17:51

about improving something like that? Do you

17:53

need to show it a lot more

17:56

data of, you know, you know, that

17:58

type of human ingenuity move or how

18:00

do you go about fixing and patching

18:03

those those points? So yeah I mean

18:05

it's actually interesting that by the end

18:07

of the game field like this is

18:10

all we just like put together a

18:12

benchmark where it's just kind of like

18:14

trying to quantify and just have a

18:17

way of measuring the mistakes that like

18:19

AlphaGo makes and you know this kind

18:21

of blind spots let's say and then

18:24

we just write another approaches to just

18:26

like improve the algorithm so that we

18:28

can solve these issues and what happened

18:31

is that Actually, the most effective way

18:33

of getting rid of them was to

18:35

just like do what we were doing,

18:38

just like at a higher scale and

18:40

better. So just like change the architecture

18:42

of the model, we just like switch

18:45

to a deep ResNet with two

18:47

output heads. And we also like, we

18:49

just had a bigger network trained on

18:52

more data, then just like move to

18:54

AlphaZero and better algorithms. And that

18:56

kind of like made it so that

18:59

we didn't have any. hallucinations anymore. So

19:01

in a way, just like scale data,

19:03

you know, things that are always kind

19:06

of the well-known recipe in the field

19:08

of AI, is exactly what solves it

19:10

in our case too. With scale and

19:13

data, how much did higher quality data,

19:15

or maybe specifically data from great professional

19:17

players, the best professional players, make a

19:20

meaningful difference? Or was it just any

19:22

data? Now for us, what matter was

19:24

that... we kind of like solved it

19:27

using self-play. So we actually had access

19:29

to the most competent Go player in

19:31

the world. And we just used it

19:34

to generate the best quality games and

19:36

then we just trained on these games.

19:38

So I guess like, you know, we

19:40

didn't need to have like human experts

19:43

because you had like an expert in

19:45

house. It wasn't human. Interesting. Amazing. Well,

19:47

I'd love to move on to the

19:50

progression from AlphaGo to Alpha Zero. And

19:52

you talked a little bit about this

19:54

notion of self-play just now. Alpha Zero

19:57

was powerful because it learned how to

19:59

play the game from... scratch entirely from

20:01

self-play without any human intervention. Can you

20:04

share more about how that worked and

20:06

why that was important? So Alpha Zero

20:08

was a game changer because it

20:10

learned entirely from scratch through self-play

20:13

without any human data. And this

20:15

was like a major leap from AlphaGo

20:17

because like AlphaGo as I said relied

20:19

heavily on human expert games. So

20:21

two things happened. First of all

20:23

Alpha Zero managed to simplify the

20:25

training process and also like showed

20:28

that AI could literally just

20:30

get from zero to superhuman performance

20:32

just purely by playing against itself

20:34

and that allowed it to just

20:36

be applicable to a whole range of

20:39

like new domains that were out of

20:41

reach because like there wasn't there weren't

20:43

enough like human data for it but

20:45

I think like the more the more

20:47

important thing is that just so that

20:50

Alpha Zero also solved all the issues

20:52

of like Alpha Go Hat in terms

20:54

of hallucinations in terms of you know

20:56

blind spots and robustness. So

20:58

Alpha Zero it was like a better

21:00

method, just you know, fuselage. And you

21:03

explained kind of how AlphaGo worked

21:05

to a fifth grader. What would you

21:07

tell the fifth grader would be the

21:09

key difference technically that you that you

21:12

implemented with Alpha Zero? So Alpha Zero,

21:14

just like AlphaGo, uses a policy

21:16

network and a value network along

21:19

with Monte Carlo tree search. So in that

21:21

respect, it's exactly the same as

21:23

AlphaGo. So the key difference is

21:26

in training. Alpha Zero starts with

21:28

random weights and learns by playing

21:30

games against itself, it iteratively improves

21:33

its performance. But the main idea

21:35

behind Alpha Zero is that whenever you

21:37

take... a set of weights, a

21:39

set of policy and value networks,

21:42

and then you just combine them

21:44

with search, then you just like

21:46

end up with a better player, you

21:48

just like increase your performance,

21:50

you just like become a

21:52

stronger player. So what that meant

21:55

is that we can actually

21:57

use this mechanism to improve

21:59

the model. policy, the role policy.

22:01

So this is what we call in

22:03

the reinforcement learning a policy improvement operator.

22:05

Whenever you can just like take an

22:08

existing policy and then do something, some

22:10

magic, and then just like come up

22:12

with like a better policy and then

22:15

you can just like take this policy

22:17

and distill it back to the initial

22:19

policy and then just repeat this process,

22:21

then you have like a reinforced learning

22:24

algorithm. And I think like you know

22:26

this is exactly what people are trying

22:28

to do today with like you know.

22:30

Q-star or, like, you know, synthetic data.

22:33

This is exactly the idea of, like,

22:35

how can I take a policy, do

22:37

something with it, planning, search, compute, whatever

22:39

it is, and derive a better policy,

22:42

which I can then imitate and just

22:44

like, kind of distill back to the

22:46

original policy. So this is exactly what

22:48

AlphaZero is doing. It uses MCTS

22:51

search to produce a better policy, then...

22:53

It takes these trajectories, it trains its

22:55

policy and value networks on them, generates new better

22:58

trajectories, and it repeats this process until

23:00

it converges to the, you know, to

23:02

an expert-level Go player.
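The overall loop being described, search produces a sharper policy, the network is trained to imitate it, and the improved network makes the next round of search stronger, can be summarized in a few lines. The sketch below is schematic: the "network" is a toy table and `run_mcts` is a stand-in I made up, not DeepMind's implementation.

```python
import random

random.seed(0)
NUM_MOVES = 4

# Toy "network": a table of move preferences we can both act from and train.
policy_table = [1.0] * NUM_MOVES

def run_mcts(policy):
    """Stand-in for MCTS: returns a sharpened distribution over moves.
    Real MCTS uses the value network and many simulated games."""
    best = max(range(NUM_MOVES), key=lambda m: policy[m] + random.random() * 0.1)
    return [0.9 if m == best else 0.1 / (NUM_MOVES - 1) for m in range(NUM_MOVES)]

def train_towards(policy, target, lr=0.5):
    """Distill the search policy back into the 'network' (policy improvement)."""
    return [(1 - lr) * p + lr * t for p, t in zip(policy, target)]

for iteration in range(5):
    search_policy = run_mcts(policy_table)                     # search gives a better policy
    policy_table = train_towards(policy_table, search_policy)  # imitate it, then repeat
    print(iteration, [round(p, 2) for p in policy_table])
```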

23:04

That's fascinating and counterintuitive that kind of like starting

23:07

without the weights that you would have

23:09

from, you know, professional level players is

23:11

actually a better starting place. The epitome

23:13

of AI agents and games has achieved

23:16

I think via MuZero, which was the

23:18

progression even from Alpha Zero itself and

23:20

it's also where you became one of

23:22

the co-leads or one of the leads

23:25

of the game. Alpha Zero was obviously

23:27

impressive because of self-play, but it

23:29

also needed to be told the environment's

23:32

dynamics or the rules of the game.

23:34

And MuZero takes this to the next

23:36

level without needing to be told the

23:38

rules of the game and it mastered

23:41

quite a few different games, Go chess

23:43

and many others. Can you share a

23:45

little bit about how MuZero worked and

23:47

why was this particularly meaningful? Absolutely. So

23:50

AlphaZero, you know, as you said,

23:52

was a massive success in games like

23:54

chess. So in games where we actually

23:56

had access to the game rules, where

23:59

we actually had access to a perfect

24:01

simulator of the world. But this

24:03

reliance on a perfect simulator made

24:05

it challenging to apply to real-world

24:08

problems. And real world problems are often

24:10

messy and they lack the rules and

24:12

it's truly hard to just write a perfect

24:15

simulator of them. So. That's exactly what

24:17

MuZero tried to solve. So MuZero masters

24:19

the games of course like Go chess

24:21

and shogi, but it also masters

24:24

more visually challenging games, arcade games

24:26

like Atari. And it

24:28

does that without giving access to the

24:30

simulator; it just learns how to build

24:33

any internal simulator of the world and

24:35

then just use this internal simulator in

24:37

the way similar to what Alpha Zero

24:39

is doing. So it does that by

24:42

using model-based reinforcement learning, where what that

24:44

means is that you can just take

24:46

a number of trajectories generated by an

24:49

agent and then try and you know

24:51

learn a model, learn a prediction model

24:53

of how the world works. So this

24:55

is actually like quite similar to what

24:58

methods like Sora are trying to do

25:00

now where they just like take YouTube

25:02

videos and they try just like learn

25:04

a world model by just trying to

25:07

predict based on starting from one frame

25:09

what's going to happen in the future

25:11

frames. So MuZero tries to do

25:13

exactly that, but it does it in

25:16

a way different from, you know, generative

25:18

models in the sense that it tries

25:20

to only model things that matter for

25:22

solving the reinforced learning problem. So it

25:25

tries to predict what the rewards can

25:27

be in the future, what's the value

25:29

of like future... what's the policy for

25:32

like future states. So only things that

25:34

you need for planning. But

25:36

you know the fundamental is kind of

25:38

like remain the same. So how do

25:41

you just like learn a model based

25:43

on trajectories and then once you have

25:45

this model you can just combine the

25:47

search and you know get super human

25:50

performance.
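What distinguishes that learned model from a generative world model is the set of prediction targets: it only has to predict the reward, value, and policy that planning needs, not the full next observation. A minimal sketch of a latent dynamics model with those three heads, with hypothetical sizes and no training loop:

```python
import torch
import torch.nn as nn

class TinyMuZeroStyleModel(nn.Module):
    """Latent dynamics model with the three heads planning actually needs."""
    def __init__(self, obs_dim=32, latent_dim=64, num_actions=4):
        super().__init__()
        self.represent = nn.Linear(obs_dim, latent_dim)                  # obs -> latent
        self.dynamics = nn.Linear(latent_dim + num_actions, latent_dim)  # (latent, action) -> next latent
        self.reward_head = nn.Linear(latent_dim, 1)
        self.value_head = nn.Linear(latent_dim, 1)
        self.policy_head = nn.Linear(latent_dim, num_actions)
        self.num_actions = num_actions

    def initial(self, obs):
        return torch.relu(self.represent(obs))

    def step(self, latent, action):
        a = nn.functional.one_hot(action, self.num_actions).float()
        next_latent = torch.relu(self.dynamics(torch.cat([latent, a], dim=-1)))
        return (next_latent,
                self.reward_head(next_latent),    # predicted reward
                self.value_head(next_latent),     # predicted value
                self.policy_head(next_latent))    # predicted policy logits

model = TinyMuZeroStyleModel()
latent = model.initial(torch.randn(1, 32))
latent, reward, value, policy_logits = model.step(latent, torch.tensor([2]))
print(reward.shape, value.shape, policy_logits.shape)
```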

25:52

So of course, you can always decouple the two problems and have

25:54

like the model being trained separately from

25:56

you know data out in the wild

25:59

and then just I combined that with

26:01

MuZero. And we just found

26:03

that back then, given the limitations

26:05

of like our models and the

26:07

smaller sizes, kind of like make

26:10

more sense to just like keep

26:12

those two together and only have

26:14

the model predict things that

26:16

matter for planning. So just

26:19

like not try to model everything,

26:21

because you're kind of hitting

26:23

the limits of what the

26:25

capacity of the model could take.

26:27

Is it right to assume that

26:29

not only Sora takes the same

26:31

approach, but maybe other world models

26:33

or other robotics foundation models? Yeah,

26:35

so anything that tries to just like build

26:37

a model of how the world works, and

26:40

then just like use that for planning,

26:42

it's within MuZero-like methods. So yeah,

26:44

you can just like train it on

26:46

YouTube videos, you can train it on

26:49

like the inputs coming from like

26:51

robots, you can train it on like

26:53

robots, you can train its own... you

26:55

know, any environment. You can even think

26:57

of like large language models as a

26:59

form of models of like text. So

27:01

like the model text. But the thing

27:03

about text is that like the model

27:05

is a bit trivial. Like you don't

27:08

need to just, there aren't many

27:10

artifacts happening when you're trying

27:12

to predict what the next word

27:14

is going to be. So have you seen

27:16

the ideas behind MuZero kind of

27:18

be used outside gameplay or in

27:20

messy real world environments? So

27:23

yeah, I mean, so as I've said, AlphaZero

27:25

and MuZero are quite general methods

27:27

and they were like, there's

27:29

a number of scientific communities

27:31

in chemistry, so there's Alpha-

27:34

Chem in quantum computing, some

27:36

people try to use Alpha-Zero

27:38

in optimization, where they just like

27:40

adopted Alpha-Zero because it was

27:43

really powerful in... really doing

27:45

planning and just like solving this

27:47

optimization problems. At the same time,

27:49

you see it was incorporated in

27:52

a version of like Tesla self-driving

27:54

system, just kind of reported in

27:56

their AI day, and it was

27:58

also used as I think it's

28:00

currently being used within YouTube as

28:02

a custom cooperation algorithm. But I

28:05

think it's early days and takes

28:07

time for this new technology to

28:09

be fully adopted by the industry.

28:11

We'd love to talk a little

28:13

bit more about reinforcement learning and

28:15

agents. You alluded earlier to the

28:17

fact that reinforcement learning and deep

28:19

learning back in 2015 were new,

28:21

nascent ideas. They really grew in

28:23

popularity 2017, 2018, 2019 onwards. And

28:25

then they were overshadowed by LLMs,

28:28

largely because of the GPT and

28:30

everything else that came out. But

28:32

now reinforcement learning is back. Why

28:34

do you think that is the

28:36

case? Yeah, I mean, first of

28:38

all, LLMs and multimodal models have

28:40

indeed got incredible progress to AI.

28:42

So these models are exceptionally powerful

28:44

and can perform some truly impressive

28:46

tasks. But they have like some

28:49

fundamental limitations and one of them

28:51

is the availability of human data.

28:53

People just keep talking about the

28:55

data wall and what happens once

28:57

you run out of like high

28:59

quality data. And this is exactly

29:01

where reinforcement learning shines. So

29:03

reinforcement learning excels because it

29:05

doesn't rely solely on pre-existing human

29:07

data. Instead, reinforcement learning uses experience

29:10

generated by the agent itself to

29:12

improve its performance. So this self-generated

29:14

experience allows reinforcement learning to learn

29:16

and adapt and to even adapt

29:18

to scenarios where human data is

29:20

scarce or like non-existent. So if

29:22

you define the reinforcement learning problem

29:24

in the right setting, in the

29:26

right way, you can literally effectively

29:28

exchange compute for intelligence. You can

29:31

just like get to a point

29:33

similar to where we were with

29:35

Alpha Zero, where we just like,

29:37

the moment we threw more computer

29:39

to it, like we made the

29:41

networks bigger, we just like, you

29:43

know, used more games, we just

29:45

literally got a better player. And

29:47

it was deterministic. You always get

29:49

a better player. So I guess

29:52

this is exactly where we want

29:54

to be with like... like this

29:56

synthetic data pipeline. Currently we have

29:58

that with, you know, the scaling

30:00

laws in LLMs, that if you

30:02

have like more data and bigger

30:04

models, then you get like a,

30:06

you know, you can predict that

30:08

there's going to be an improvement

30:10

to performance. But, you know, once

30:13

you run out of like human

30:15

data, how do you just keep

30:17

going? And synthetic data is like

30:19

the answer to that. And the,

30:21

the only way that, you know,

30:23

you can actually get high quality

30:25

data to

30:27

just like improve your model is

30:29

like via some form of reinforcement

30:31

learning. And just like leaving, I'm

30:33

just like keeping reinforcement learning as

30:36

a really kind of blanket term

30:38

here where I just like define

30:40

it as anything that learns through

30:42

trial and error. How do you

30:44

think reinforcement learning is being brought

30:46

into the kind of like LLM

30:48

world and you mentioned Q-star earlier?

30:50

Like I guess In a closed

30:52

form game you have like a

30:54

pretty clearly defined policy and value

30:57

function. How does that work in

30:59

like a messy kind of real

31:01

world environment or the LLM world?

31:03

I mean I guess like there

31:05

are two different types of like

31:07

messy real world right like there

31:09

is the if you try to

31:11

just like build a controller or

31:13

something that's a really messy environment

31:15

and then if you if you

31:18

operate in the digital space so

31:20

Personally, I believe that digital AGI

31:22

will happen much earlier than, you

31:24

know, robotics AGI. And the reason

31:26

for that is exactly that you

31:28

have control over the environment. And

31:30

the environment is like computers, like

31:32

the digital world. So even though

31:34

it's like messy and noisy, it's

31:36

still contained. It's not like the

31:39

real kind of like world in

31:41

that sense. So now in terms

31:43

of how do you bring like,

31:45

like, reinforcement learning is a... We

31:47

used to say in deep mind

31:49

that you have like the problem

31:51

and you have the solution. And

31:53

the problem setting of reinforcement learning

31:55

is how do I take a

31:57

model, how do I take... policy

32:00

and generate synthetic data, or like

32:02

I learn, I find a way to

32:04

improve this policy via interacting with

32:06

the environment, via trial and error.

32:08

And this is like the reinforced learning

32:10

problem setting, right? And then

32:12

there's like the solution space

32:14

where you have value functions and

32:16

have like reinforced learning methods.

32:19

So I think that there's a lot of

32:21

inspiration to draw from like classical

32:23

reinforced learning methods that were developed

32:25

in the past decade, but have

32:27

just a... you have to adjust

32:29

them to the new world of

32:31

LLMs. So methods like Q-star try

32:33

to do that by just taking

32:35

the idea that if I have

32:37

a policy and then I do

32:39

planning, I consider possible future scenarios,

32:41

and then I have a way

32:43

to evaluate which one is better, then

32:46

I can just take the best

32:48

ones and then ask the model

32:50

to imitate these better ones. And

32:52

this is like a way of

32:54

improving the policy. In the classic

32:57

RL framework you do that by

32:59

using a policy and a value

33:01

network. In the new world you'll

33:03

just do that by asking your

33:06

by having a reward model or

33:08

asking your LLM to just

33:10

give you feedback on an

33:12

output it gave you.
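One way to read that recipe in the LLM setting is a best-of-N loop: sample several candidate outputs from the current policy, score them with a reward model (or the LLM itself as a judge), keep the best ones, and fine-tune on them before repeating. The sketch below is an illustration of that policy-improvement idea, not any particular lab's method; `sample_llm`, `reward_model`, and `finetune_on` are hypothetical stand-ins defined only for this example.

```python
import random

random.seed(0)

def sample_llm(prompt, n):
    """Stand-in for sampling n candidate answers from a language model."""
    return [f"{prompt} -> candidate answer {i}" for i in range(n)]

def reward_model(prompt, answer):
    """Stand-in for a learned reward model or LLM-as-judge score."""
    return random.random()

def finetune_on(examples):
    """Stand-in for a supervised fine-tuning step on the selected answers."""
    print(f"fine-tuning on {len(examples)} improved examples")

prompts = ["prove the identity", "fix the failing test", "plan the trip"]
improved = []
for prompt in prompts:
    candidates = sample_llm(prompt, n=4)                           # current policy
    best = max(candidates, key=lambda a: reward_model(prompt, a))  # evaluate the futures
    improved.append((prompt, best))                                # keep the better output

finetune_on(improved)   # distill the improvement back into the model, then repeat
```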

33:14

So interesting. You also talked a little bit

33:16

about synthetic data earlier. I think some

33:19

folks are very bullish on synthetic data

33:21

and some folks are more skeptical. I

33:23

also believe that synthetic data is more

33:25

useful in some domains where outcomes and

33:27

success is perhaps more deterministic. Can you

33:30

share a little bit about your perspective

33:32

on the role synthetic data and how

33:34

bullish you are on it? Yeah, I mean, I

33:36

think like synthetic data is something that like

33:38

we have to solve one way or another.

33:40

So it's not about like whether, you're bullish

33:42

or not. It's an obstacle that

33:44

we have just find a way around

33:46

it. Like, we will run out of data.

33:48

Like, you know, there is so much

33:50

data that like humans can produce. And

33:52

also, like, it's important that this

33:55

system start taking actions. They

33:57

start learning from their own mistakes.

33:59

So we... need to just find a

34:01

way to make synthetic data work.

34:03

Now, what people have done is

34:05

that they've tried like the most, I

34:08

guess like, naive approach where you just

34:10

like take the models to produce

34:12

something and you try to just like

34:14

train on that. And of course, like,

34:17

you know, they've seen that there's model

34:19

collapse and this just like doesn't

34:21

work out of the box, but you

34:23

know, new methods never work out of

34:26

the box. You just like need to

34:28

invest in it. and just like

34:30

take your time and you know really

34:32

kind of think of what's the best

34:35

way of doing it. So I'm

34:37

really optimistic that we'll just definitely find

34:39

ways to improve these models and I

34:41

think that like actually there is a

34:44

number of methods out there like

34:46

the Q-star and the equivalents that just

34:48

you know in the new world where

34:50

people don't really share their research breakthroughs

34:53

the way they used to is

34:55

probably hidden behind like some company. trade

34:57

secrets. I'm going to ask about reasoning

34:59

and, you know, novel scientific discoveries.

35:01

Do you think that that can kind

35:04

of naturally come out of just scaling

35:06

LLMs if you have enough data? Or

35:08

do you think that kind of

35:10

like the ability of reason and, you

35:13

know, come up with net new ideas

35:15

requires kind of doing reinforcement learning and,

35:17

you know, deeper compute at inference

35:19

time? So I think like you need

35:22

reinforcement learning to get better reasoning because

35:24

the... distribution of like it, it's

35:26

also about the distribution of data, right?

35:28

Like you have like a, you have

35:31

a lot of data out in the

35:33

wild in the internet, but at

35:35

the same time, you don't always have

35:38

like the right type of data. So

35:40

you don't have the data or

35:42

like someone reasons and they just like

35:44

explain the reasoning in detail. You have

35:47

some of it, you have like an

35:49

incredible load of it, and the models

35:51

actually manage to pick it up

35:53

and just imitate it. But if you want

35:56

to just like improve on that capability,

35:58

then you need to do that

36:00

through reinforcement learning. you need to just

36:02

like show the model how this

36:05

reasoning capability can further be improved

36:07

by just like have it generated data

36:09

interacts with the environment you know just

36:11

tell it when it's doing something right

36:14

and when it's doing something wrong.

36:16

So yeah I think that like reinforcement

36:18

learning is definitely part of the answer

36:20

for that. AlphaGo Alpha Zero and Mu

36:23

Zero are the most powerful agents

36:25

we've ever built. Can you share a

36:27

little bit about how some of the

36:29

lessons and learnings unlocked from that

36:31

are relevant to how we're pursuing building

36:34

AI agents today? Yeah, so I think

36:36

like AlphaGo and MuZero, you know,

36:38

they've actually finally transformed our approach

36:40

to AI agents because they highlight the

36:43

importance of planning and scale in my

36:45

opinion that... If you actually look

36:47

at the charts of like different models

36:49

and how they scale, you can see

36:52

that like AlphaGo and Alpha Zero were

36:54

like kind of really ahead of

36:56

the time, like they were kind of

36:58

outliers. You had like this, this curves

37:01

of like how compute scaled and then

37:03

you have like Alpha Zero or

37:05

like somewhere standing on its own. So

37:07

it shows that like if you can

37:10

scale and you can really push

37:12

on that, then you can get like

37:14

incredible, incredible results. At the same time,

37:17

you know, you know, it also like...

37:19

you know, have better performance during

37:21

inference, during a test, during evaluation, but

37:23

just like using planning. And I think

37:26

that this is something that will start

37:28

seeing more and more in the

37:30

near future, or like this method would

37:32

just like start thinking more, like planning

37:35

more before they're just making any

37:37

decisions. So I'd say that like this

37:39

is more of the legacy of AlphaGo

37:41

and AlphaZero and MuZero. It's

37:44

the basic principles and the basic

37:46

principles are of that. scale matters, planning

37:48

matters. These methods can really solve problems

37:50

that we thought that are insanely complex

37:53

or like you know beyond what

37:55

we can solve on our own. Similar

37:57

problems with the ones that you actually

37:59

observe today with these large language

38:01

models are things that we saw back

38:04

then, like back in 2016, we actually

38:06

saw that these models can hallucinate or

38:08

that like at the same time

38:10

they're also creative, that they will just

38:13

come up with solutions that we hadn't

38:15

thought of. But they can also

38:17

like have blind spots or like hallucinate

38:19

or be susceptible to kind of like

38:22

address serial attacks, which I guess like

38:24

everyone knows now that these neural

38:26

networks suffer from. So I think that like

38:28

this are the. the main kind of lessons

38:31

drawn from this line of work.

38:33

What do you think are the biggest

38:35

open questions from this line

38:37

of work for the field dance

38:39

are going forward? So the main

38:42

question is, we had like Alpha Go

38:44

and MuZero and we just like

38:46

had these

38:48

insanely robust and reliable systems

38:50

that will just always play,

38:52

go and at the highest

38:54

possible kind of level and

38:56

they'll just like achieve. consistently,

39:00

they will just like be top of the

39:02

leaderboard, will just like never lose a

39:04

game. So AlphaGo Master actually

39:07

played against 60 people in online

39:09

matches and just like literally won

39:11

in every single one of them.

39:13

So there's like no, there's like,

39:15

this battle's for like a critical

39:17

robust reliable. And I think like

39:19

this is exactly what we're missing

39:21

now with these LLM-based

39:23

agents. Sometimes they get it, sometimes

39:25

they don't. You cannot trust them.

39:27

They will just like a... You know, you

39:29

have like some amazing demos, but like, you

39:32

know, they happen once every two times even,

39:34

or like once every ten times, you have

39:36

like something amazing. And the remaining nine, they

39:38

just lost their way and didn't do

39:40

anything. So I think like what we

39:42

need to do is just find a

39:44

way to just make these LLM-based agents

39:46

equally robust to the ones that we

39:48

had with AlphaGo and MuZero and AlphaZero.

39:50

This is like the new open question of like

39:52

how do you actually do that. We'd love

39:54

to move into some of your thoughts

39:57

on the broader ecosystem today. You've touched

39:59

on a few... really core problems that

40:01

people are working on right now.

40:03

One, the data wall problem that

40:05

will hit eventually perhaps by 2028

40:07

or so. As some folks predict,

40:10

another being the idea of planning

40:12

as an area that AI agents

40:14

need to get better at. And

40:16

then a third idea that you

40:18

just described was around robustness and

40:20

reliability. Can you share a little

40:23

bit about and maybe some of

40:25

these areas that you think the

40:27

whole field needs to solve that

40:29

you are most excited about to

40:31

help us unlock this vision of

40:33

really getting to the AI agents

40:35

that we want? Yeah, I mean,

40:38

I'll just like also add another

40:40

one to the list. So I

40:42

feel like another major. challenge is

40:44

like how to improve the in-context

40:46

learning capabilities of this model. So

40:48

like how do they how do

40:50

you make sure that like these

40:53

systems can learn on the fly

40:55

and how they can adapt to

40:57

new context like with you. So

40:59

this is like another thing that

41:01

I think it's going to be

41:03

really important it's going to happen

41:06

the next few years a couple

41:08

years actually. So Ioannis, what's the

41:10

term they used for that? In

41:12

context learning? In context learning. Yeah.

41:14

So it's the idea that... a

41:16

system can actually learn how to

41:18

do a new task with like

41:21

few short prompting like it kind

41:23

of like sees a few examples

41:25

and on the fly it kind

41:27

of like learns how to adapt

41:29

to the new environment, it learns

41:31

how to use the the new

41:33

tools that were provided to it

41:36

or like it's kind of like

41:38

it learns, it's not just all the

41:40

knowledge it has stored in its weights

41:42

but like it can also like

41:44

acquiring new knowledge by just like

41:46

interacting with the real world, interacting

41:49

with the environment.
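In-context learning in this sense is just conditioning on a handful of worked examples inside the prompt, with no weight update. A minimal illustration of building such a few-shot prompt; the task and examples here are made up for the sketch.

```python
# Few-shot prompt: the model is shown worked examples of a brand-new task
# and is expected to pick up the pattern on the fly, without any training step.
examples = [
    ("flight AB123 delayed 40 minutes", "DELAY"),
    ("gate changed to B7 for flight CD456", "GATE_CHANGE"),
    ("flight EF789 cancelled due to weather", "CANCELLATION"),
]

query = "flight GH012 now departing from gate A2"

prompt_lines = ["Classify each airline notification:"]
for text, label in examples:
    prompt_lines.append(f"Notification: {text}\nLabel: {label}")
prompt_lines.append(f"Notification: {query}\nLabel:")

prompt = "\n\n".join(prompt_lines)
print(prompt)   # this string would be sent to the LLM, which should answer GATE_CHANGE
```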

41:51

So I think that this is like another place

41:53

where there is a lot of

41:55

work happening at the moment and

41:57

going to have like amazing progress.

41:59

in the next couple of years.

42:01

And I'm really excited about that. So

42:04

yeah, I mean, to recap, I think

42:06

that like planning is important. You know,

42:08

in-contact learning is important

42:10

and, you know, reliability. So the

42:12

best way to achieve reliability

42:15

is just like ensure that this model

42:17

somehow know how to recover from their

42:19

mistakes. So if they just like made

42:21

a mistake somewhere, they can just like

42:24

see that and they're like, you know,

42:26

I made a mistake. I'll just like

42:28

work for it. The way that humans,

42:30

you know, make mistakes all the time,

42:32

but like we, you know, you can correct for

42:34

them. So these are like the three

42:36

areas which I'm really, I'm really excited

42:38

to see progress on. Now that

42:41

you've kind of embarked on your

42:43

own entrepreneurial journey, how do you

42:45

think that the areas where startups

42:47

can compete against the big research

42:50

labs and like, how do you kind

42:52

of motivate yourself for that, for

42:54

that journey? Yeah, I mean, it's a

42:56

completely like, it's a new world for me,

42:58

but at the same time, it's not that

43:00

new, because when I joined Deep Mind, it

43:03

was literally a startup. So, and I was

43:05

like literally in the first few

43:07

employees. So I actually like saw that

43:09

first hand. But, you know, one of the

43:11

benefits of like working for a startup

43:14

is that, you know, the agility and

43:16

the focus. So everyone really cares.

43:18

Everyone just moves really fast. And

43:20

there's like a clear focus on what we

43:23

want to beat. So, so. The building is

43:25

like what's the most important kind

43:27

of motivation for people, like just

43:30

like building. And I think that

43:32

like this is one of the

43:34

big advantages that like startups have

43:36

over more established businesses. At the

43:38

same time, you know, it's easier

43:40

to just like default to adapt

43:43

to new findings in technologies. You're

43:45

not kind of like tied to, you

43:47

know, some pre-existing solutions or

43:49

like some products that you don't want

43:51

to... duplicate because like you know they bring a

43:54

lot of revenue to you while if you're a

43:56

startup you know you have like no such chains you

43:58

can just like move fast and you know be innovative

44:00

and just, you know, break conventions.

44:02

And at the same time, just

44:04

like allows you to leverage like

44:07

open source resources, things that are

44:09

out of touch for like the

44:11

big labs. And yeah, and you

44:13

don't have like the red tape

44:16

that like big places tend to

44:18

have. I love the term that

44:20

you sometimes be honest, main quest

44:22

versus side quest. Yeah, it's the

44:24

idea of like having a main

44:27

focus, like you know in big

44:29

places, in big labs, they have

44:31

like many different projects that like

44:33

people are working on. And it

44:35

usually happens that they have like

44:38

the main quest, the main thing

44:40

that like everyone is working on

44:42

and there's like many multiple like

44:44

smaller side quests that the idea

44:46

is just like feed into the

44:49

bigger quest, but like usually they

44:51

don't get... as much, they don't

44:53

get like as many resources or

44:55

like as many as much focus

44:57

on like the leadership. So yeah,

45:00

they tend to, yeah, atrophy.

45:02

In the broader field, what are

45:04

some of the most defining projects

45:06

that you admire the most and

45:08

maybe who are some of the

45:11

most influential researchers that you admire

45:13

the most? Yeah, absolutely. So, so

45:15

I actually like started my AI

45:17

research journey back in 2012. and

45:19

I've actually like seen some milestones.

45:22

So just like I give a

45:24

list of like what I think

45:26

are like the main milestones like

45:28

in AI in the past like

45:30

12 years that I've been around.

45:33

So the first one I'll say

45:35

it's AlexNet. This is

45:37

the first paper that kind of

45:39

like show that deep learning is

45:42

the is the answer. I mean

45:44

back then didn't feel like it.

45:46

It just like felt like you

45:48

know a kind of curiosity. but

45:50

like now I think that most

45:53

people are convinced that like deep

45:55

learning is part of the answer.

45:57

Then it was DQN. I

45:59

had the pleasure to actually work

46:01

on DQN and just like see

46:04

first hand how it started. It

46:06

was actually developed by a friend

46:08

of mine, Vlad Mnih, and it

46:10

was like the first system that

46:12

showed that you can actually combine

46:15

deep learning with reinforcement learning to

46:17

achieve human performance or like super

46:19

human performance in really complex environments.

46:21

Then this was Alpha Go. Again

46:23

I was like really lucky to

46:26

just like work on that and

46:28

it showed that you know scale

46:30

and planning are really important ingredients

46:32

and if you just like do

46:34

that right then you get huge

46:37

success in an incredible complex environment.

46:39

AlphaFold, another one, this is

46:41

I can buy deep mind, it's

46:43

showed that these methods are

46:45

not just like things that you

46:48

can use to solve games, but

46:50

they have, they actually will make

46:52

this world a better place. It

46:54

will just like ensure that healthcare

46:56

is improved, that scientific discoveries are

46:59

being realized, that we'll just like

47:01

make sure this world is a

47:03

better place by using AI. It

47:05

kind of like brought AI to

47:08

everyone, just like made it accessible

47:10

to the broad audience, like everyone

47:12

knows what AI is now. It's,

47:14

it has made my life of

47:16

explaining my job much easier. So,

47:19

and finally, GPT-4, and

47:21

I think that, yeah, probably, GPT-4

47:23

is like the latest kind

47:25

of big advancement in AI, because

47:27

it kind of like showed that

47:30

you know, Artificial General Intelligence is

47:32

a matter of years. It's within

47:34

reach. Yeah, we are getting there.

47:36

I think that, you know, most

47:38

people now believe that we are

47:41

like a few years away from

47:43

like AGI. And, you know, that's

47:45

because of like the incredible breakthrough

47:47

that GPT-4 was. Now in terms

47:49

of like some people I really

47:52

admire. Before I forget, so I'd

47:54

say first like David Silver, he's,

47:56

he was my PhD supervisor, he

47:58

was my mentor, a deep mind.

48:00

He's an incredible researcher. He worked,

48:03

he led AlphaGo and AlphaZero,

48:05

and he is, you know, he has a

48:07

lot of gilding dedication to the

48:09

field of reinforced learning, and

48:12

he's, you know, probably the

48:14

one of the smartest people,

48:16

or maybe the smartest

48:18

person I know, and an amazing

48:20

reinforcement learning engineer. And

48:23

the second one I'll say

48:25

is Ilya Sutskever, and he was

48:27

a co-founder of OpenAI. I had

48:29

the opportunity to work with him just

48:31

a little bit in the really early

48:33

days of AlphaGo. But I think it's

48:35

like his commitment to scaling AI

48:37

efforts and pushing the boundaries

48:40

of what the systems can achieve

48:42

is remarkable. And you know he

48:44

made sure that GPT-3 and

48:46

GPT-4 happened. So yeah, immense

48:49

respect towards him. Thank you

48:51

for sharing that. Let's close out

48:53

with some rapid fire questions. What do you

48:55

think will be the next big milestones

48:57

in AI would say in the next

48:59

one, five, and ten years? So I

49:02

think like the next five to ten

49:04

years, the world will be a

49:06

different place. I actually really believe

49:08

that. I think that in the

49:10

next few years we'll see models

49:12

becoming powerful and reliable agents that

49:14

can actually independently execute tasks.

49:16

And I think that AI

49:19

agents will be massively adopted

49:21

across industries. especially in

49:23

science and health care. So

49:25

in that sense, I'm really

49:27

excited on what's coming, what's

49:29

coming in AI. And, you know, what

49:31

I'm most excited about is AI

49:34

agents. Systems can actually do

49:36

tasks for you. And, you know,

49:38

this is exactly what we're

49:41

building at reflection. In what

49:43

year do you think we'll pass

49:45

the 50% threshold on SWE-bench?

49:47

So I think we are one to three

49:49

years away from the 50% threshold for

49:51

three agents and three to five years

49:54

from achieving 90% So the reason is

49:56

while progress is amazing. I think

49:58

I'd like we still need reliable agent

50:00

to hit these milestones. And it's really,

50:02

when it comes to research, it's like

50:05

hard to make precise predictions. When do

50:07

you think we'll hit the data wall

50:09

for scaling LLMs? And do you think

50:11

all the research in RL is mature

50:14

enough to keep up our slope of

50:16

progress? Or do you think there will

50:18

be a bit of a lull as

50:20

we try to figure out what happens

50:23

when we hit the wall? So I

50:25

think like the wall, you know, based

50:27

on like what I've read, I think

50:29

like we have at least one more

50:32

year for text for text. just like

50:34

before we hit the wall. And then

50:36

we have like this extra modalities, which

50:38

might actually buy us maybe a year

50:40

extra. And I think we are in

50:43

a really good place to just like

50:45

start using synthetic data. So in the

50:47

next two years, we'll just like figure

50:49

out the synthetic data problem. So I

50:52

think that we won't really hit the

50:54

wall. Just like we'll hit the wall,

50:56

but like no one realized it because

50:58

we have like new methods in place.

51:01

And if so, when? I think it's

51:03

like LLMs had their AlphaGo moment

51:05

with the initial release of ChatGPT, where

51:07

they showcased their power and the

51:10

progress made over the past decade. I

51:12

think like what they hadn't had yet

51:14

is their AlphaZero moment. And that's

51:16

the moment where more compute directly translates

51:19

to increased intelligence without human intervention. And

51:21

I think it's like this breakthrough is

51:23

still on the horizon. When you think

51:25

that will happen? I think this can

51:28

happen in the next five years. Wow.

51:30

Amazing. Janus, thank you so much for

51:32

joining us and taking us through the

51:34

awesome history of Alpha Go, Alpha Zero,

51:37

Mu Zero, your own journey through Deep

51:39

Mind, and then many of the core

51:41

research problems that the whole industry is

51:43

tackling today around data and building for

51:45

reliability and robustness and planning and in

51:48

context learning. We're really excited for the

51:50

future that you're helping us build. and

51:52

that you're pushing forward in the field

51:54

as well. So thank you so much,

51:57

Ioannis. Thank you so much for having

51:59

me.
