Episode Transcript
0:00
Go is a complex game, and there was always a bit of worry about whether AlphaGo was truly as good as we believed. We had the conviction that deep reinforcement learning was the answer, based on everything we could measure and everything we could see. But the thing about these systems is that they're not like classic computers, where you know they always produce the same answer. They're stochastic, they're creative, and they have some blind spots; they hallucinate, similarly to how modern LLMs hallucinate. So you need to really push them and see exactly where they break, and the only way you can actually do that is by having the best humans play against them.
0:58
Today we're excited to welcome Ioannis Antonoglou, a researcher and engineer who has contributed to some of the most significant breakthroughs in AI. As a founding engineer at DeepMind, Ioannis played a crucial role in developing AlphaGo, which made history by defeating Go world champion Lee Sedol. He later co-led the development of MuZero, which pushed the boundaries even further by mastering multiple games autonomously. As he embarks on his latest venture with Reflection, he's focused on building the next generation of AI agents. We're excited to talk to Ioannis about the breakthrough moments in AI history that he's witnessed firsthand, from AlphaGo's famous Move 37 to his perspective today on what's next for the combination of reinforcement learning and large language models on the way to AGI. Ioannis, thank you so much for joining us today.

Thank you so much for having me.
1:55
Ioannis, you have an incredible background, having worked at DeepMind as a founding engineer for over a decade, starting with some of the most notable projects that have really defined the industry. DeepMind, quite notably, created this notion of building AI within games to start. Can you share a little bit more about why DeepMind chose to start with games at the time?

Yeah, so DeepMind was the first company to truly embrace the concept of artificial general intelligence, or AGI, from the outset. It had grand ambitions, aiming to build systems that would match or exceed human intelligence. So the big question was, and still is: how do you build AGI? And, more importantly, how do you measure intelligence in a way that allows for meaningful comparisons and performance improvements? The idea of using video games as a testing ground came naturally to DeepMind's founders, Demis Hassabis and Shane Legg, because Demis had a background in the gaming industry and Shane's PhD thesis defined AGI as a system that could learn to complete any task. Video games provided a controlled yet complex environment where these ideas could be explored and tested.
3:01
And to what extent, you mentioned games provide a very controlled environment, to what extent are games representative or not of the real world? If you have a result in games, do you think that generalizes naturally to the real world or not?

I guess games have indeed been valuable for developing AI, and you actually have a few examples of that. You can see that PPO, for example, which is currently being used in RLHF, was developed using OpenAI Gym, with MuJoCo and Atari. And similarly we have MCTS, which stands for Monte Carlo tree search and was developed through board games like backgammon and Go. But at the same time, games have a number of limitations. The real world is messy, it's unbounded, and it's a much tougher nut to crack than even the most complex games. So even though games give you an interesting testbed to develop new ideas, they're definitely limiting, and they don't really capture all the complexity of the real world.

Okay, interesting though. So a lot of the techniques and algorithms that you've developed in a game environment, PPO, etc., these are used in the real world.

Yeah, PPO is exactly what's currently used for RLHF. And MCTS is used in MuZero, and MuZero has been used in the real world in things like video compression for YouTube. It was part of the self-driving system at Tesla at some point. And it was also used in developing a pilot that was completely controlled by an AI. So yeah, you can see methods like that being used in the world to solve real problems.
4:50
So interesting. Ioannis, I remember back in 2017, when AlphaGo, the movie, came out, and it featured the incredible games of AlphaGo against Lee Sedol. Can you take us back to that moment in time, and maybe the years leading up to it as you were building AlphaGo? How was Go specifically chosen as the game to focus on?

Games have always been a benchmark for AI research. Before Go you have chess, and chess was a major milestone, with IBM's Deep Blue defeating Garry Kasparov in the late 90s. Even though chess and Go are completely different games, and Go is definitely a different beast, games, especially board games, have always acted as testbeds for the development of new AI methods. Actually, even going back to the earliest days of AI research, Turing and Shannon both worked on their own versions of chess programs. Now, the thing about Go is that it's a much harder problem than chess. The reason is that it's almost impossible to define an evaluation method, a heuristic. In chess, you can take a look at the board, count the number of pawns each side has, see what the ranks of those pawns are, and draw some conclusions about who's winning and why. But in Go there's nothing like that; it's mostly human intuition. If you ask a professional Go player how they know whether a position is a good one or a bad one, they'd say that after having played the game for so long, they can just feel it in their gut that this is a better position than the other one. So it becomes a question of how you encode that feeling in your gut into an AI system, right? This is exactly why solving Go was considered the holy grail of AI research for a long time. It was a challenge that seemed almost impossible, but at the same time it was within reach; people felt it could actually be cracked. And this is exactly what AlphaGo did back in 2016, and it showcased two new methods: deep learning and reinforcement learning. Because back in 2015 and 2016, well, now we think of deep learning and reinforcement learning as mature technologies, but back then they were literally taking their first steps. They were the new kid on the block, and most people were really skeptical about them. Everyone thought deep learning was another AI fad that just wouldn't last the test of time. So Go was chosen because it was a clear showcase: you could show that you had the most performant agent in the world, you could actually evaluate it, you could have it play against humans, and at the same time it was within reach given the latest developments in deep learning and reinforcement learning.
7:51
I remember reading that there are more configurations of the Go board than atoms in the universe, and that blew me away. I grew up playing Go, and it's very simple in terms of the rules, but I see why it was the holy grail. Maybe you can explain how AlphaGo worked technically. Explain it to me like I'm a fifth grader, because that is effectively my level of sophistication for understanding these things. How did it work? You mentioned both reinforcement learning and deep learning were involved; I'd love to peel that back a little bit.
8:28
Yeah, absolutely. So AlphaGo has two deep neural networks. A neural network is a function that takes something as an input and produces something as an output. It's literally like a black box; we don't really know exactly how it does it, we just know that if you train it on enough data, it will learn the mapping, the function, from the input space to the output space. So AlphaGo had access to two deep neural networks: the policy network and the value network. The policy network suggested the most promising moves. It would take a look at the current board position and say, okay, based on the current position, this is the list of moves I'd recommend you consider playing. And it also had access to the value network, which would take a look at a board position and give you a winning probability: what are your chances of actually winning the game starting from this position? This is exactly the gut feeling; it had its own gut feeling about whether a position is a good one or a bad one. Once you have access to these two networks, you can play out a number of games in your imagination. You can consider your most promising moves, then consider your opponent's most promising moves, and then evaluate those moves with the value network. Then you can use a method called minimax. What that says is: I want to win the game, but I also know that my opponent wants to win the game. So I want to pick a move that maximizes my chances of winning, knowing that my opponent will try to maximize their chances of winning. If you do that and simulate a number of moves, you can get the optimal action. And the way to do this imagination, this planning, this search, in the most efficient way is by using a tree search method called Monte Carlo tree search, or MCTS. Whenever people talk about MCTS, they literally just mean this heuristic of how do I choose which branches to consider so that I can make informed decisions.
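To make the search idea above concrete, here is a minimal Python sketch of minimax over the moves a policy proposes, scored by a value function. It is only an illustration of the idea described in the conversation, not AlphaGo's code: the policy, value, and apply functions below are toy stand-ins, and a real system would use Monte Carlo tree search to decide which branches to expand rather than plain depth-limited minimax.

```python
import random

# Toy stand-ins for the two networks described above (for illustration only).
def policy(state):
    """Return a few 'promising' candidate moves for the side to play."""
    return random.sample(range(10), k=3)       # pretend moves are integers 0..9

def value(state):
    """Estimated probability that the side to play wins from this state."""
    return random.random()

def apply(state, move):
    """Toy transition: a real implementation would return the next board position."""
    return state + (move,)

def search(state, depth):
    """Depth-limited minimax over the policy's suggestions.

    Returns the win probability for the side to play at `state`. My best move
    maximizes my winning chances, knowing the opponent will then do the same,
    which is why the child's value is flipped with `1.0 - ...`.
    """
    if depth == 0:
        return value(state)
    best = 0.0
    for move in policy(state):                 # only expand 'promising' moves
        best = max(best, 1.0 - search(apply(state, move), depth - 1))
    return best

def best_move(state, depth=2):
    return max(policy(state), key=lambda m: 1.0 - search(apply(state, m), depth - 1))

if __name__ == "__main__":
    print("chosen move:", best_move(state=()))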
10:20
As for the role of reinforcement learning and deep learning in building AlphaGo: AlphaGo was, first of all, a success of reinforcement learning and deep learning, because these are exactly the two methods that powered it. The policy network was initially trained on a large set of human games. You had many games played by human professionals, you consider every position and the move they took at that position, and then you have a deep neural network that tries to predict that move. Then, once you have the policy network, you need to somehow find a way to obtain a value network. We did it in two steps. First, we took the policy network and had it play against itself, and we used reinforcement learning to improve the playing strength of the model. We used a technique called policy gradient. What policy gradient does, in its simplest version, is look at a game and then look at the outcome. For all the moves that led to a win, it says: great, increase the probability of choosing this move. And for all the moves that led to a loss, it says: now decrease the probability of this move being selected in the future. If you do that for many games and for long enough, you get an improved policy. Now, once you have this improved policy, you can generate a new dataset of games where the policy plays against itself, games where, for each position, you know who the final winner was. Then you can take another network, a value network, and have it predict the outcome of the game based on the current position. What that network learns is: if I start from this position and I play under my current policy, on average this is the player who wins, either the black player or the white player. So this is the first version of a value network, and you can use it within AlphaGo by combining it with the policy network.
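As a rough sketch of the self-play recipe just described (increase the probability of moves from won games, decrease it for lost ones, and regress a value estimate toward the final outcome), here is a toy version in Python. The "game", the softmax policy, and the learning rates are invented for illustration; they are not AlphaGo's networks or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MOVES = 5
logits = np.zeros(N_MOVES)      # toy "policy network": one logit per move
value_table = {}                # toy "value network": position -> running win estimate

def move_probs():
    p = np.exp(logits - logits.max())
    return p / p.sum()

def self_play_game():
    """Toy game: play 4 moves; call it a win if the (arbitrary) 'good' move 3 was used twice."""
    moves = [int(rng.choice(N_MOVES, p=move_probs())) for _ in range(4)]
    outcome = 1.0 if moves.count(3) >= 2 else -1.0      # +1 win, -1 loss
    return moves, outcome

LR = 0.1
for _ in range(500):
    moves, z = self_play_game()
    for m in moves:
        # Policy-gradient step: nudge the chosen move's probability up after a win,
        # down after a loss (gradient of log-softmax is onehot(m) - probs).
        grad = -move_probs()
        grad[m] += 1.0
        logits += LR * z * grad
    for t in range(len(moves)):
        # Value-network analogue: move each visited "position" toward the final outcome.
        key = tuple(moves[:t])
        value_table[key] = value_table.get(key, 0.0) + 0.05 * (z - value_table.get(key, 0.0))

print("move probabilities after self-play training:", np.round(move_probs(), 2))
print("estimated value of the empty position:", round(value_table[()], 2))
```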
12:28
And what were some of the biggest challenges in building this, and how did you overcome them?

Yeah, so AlphaGo was not just a research challenge; it was mostly, I'd say, an engineering marvel. The early versions ran on 1,200 CPUs and 176 GPUs, and the version that played against Lee Sedol used 48 TPUs. TPUs were the first custom accelerators, and they were really primitive back then, because it was literally the first version, right? The later accelerators are much better and much more stable. So the system had to be highly optimized to minimize latency and maximize throughput. We had to build large-scale infrastructure for training these networks, and it was a massive endeavor; it required a lot of coordinated effort from many talented individuals working on different aspects of the project. I just walked you through a number of steps to obtain the policy network and the value network, and each of those steps had to be implemented at the limits of what was available and what was possible back then in terms of scale. And it had to be implemented in a way where people could tinker with it, where they could try their latest ideas fast and get results fast. So yeah: lots of people, scale at levels that hadn't been implemented before, and working at the forefront of what was possible back then.
14:03
I love your highlight of it being a research marvel and an engineering marvel. And I remember you sharing one time that part of the reason this project came about was because Google had TPUs that they needed a test customer for, and that was the spark for the AlphaGo project, which is pretty incredible. How much conviction did the DeepMind team have that this was going to work? You mentioned that at the time deep learning and reinforcement learning were still relatively novel, but DeepMind was very much founded with that belief. Did you think you were going to get these superhuman-level results, beating the top Go player in the world? Was it a crazy idea that maybe would work, or did the team have conviction that this was going to work?

Yeah, I'd say the team had a cautious optimism. One of AlphaGo's lead developers, Aja Huang, is a strong amateur Go player, and he had been working on computer Go for a decade before AlphaGo happened. We also had a leaderboard of computer Go players, and you could see that AlphaGo was significantly stronger than anything that came before. But Go is a complex game, and there was always a bit of worry about whether AlphaGo was truly as good as we believed. So we had the conviction that deep reinforcement learning was the answer, based on everything we could measure and everything we could see. But the thing about these systems is that they're not like classic computers, where you know they always produce the same answer. They're stochastic, they're creative, and they have some blind spots; they hallucinate, similarly to how modern LLMs hallucinate. So you need to really push them and see exactly where they break, and the only way you can actually do that is by having the best humans play against them.
15:57
Move 37. Can you tell us what that was? It was such a monumental move, and I think everyone watching at the time, Lee Sedol maybe primarily, was confused by it. What was going on in your head when that happened?

Move 37, in game two against Lee Sedol, was literally just a spectacular moment, in the sense that it showcased to the world that AlphaGo has creativity. It demonstrated that AI could come up with strategies that even top human players hadn't considered. At first, and I still remember this, we thought AlphaGo had made an error, that it had hallucinated, that it did something it didn't mean to do. But then it turned out to be a pretty unconventional move that underscored that the system had a deep understanding of the game, that the system actually had creativity, coming up with things that people hadn't thought of before.
16:55
I want to take us to another key move, I think it was in game four. At this point I was rooting for Lee, because I was thinking, the poor guy needs to win a game. Move 78. I think AlphaGo made a mistake, and Lee Sedol knew it. What was the weakness there that Lee found during the game?

Yeah, exactly. Lee Sedol's victory in game four was literally a testament to human ingenuity. Move 78 was unexpected, and AlphaGo, based on its evaluations, misinterpreted it as a mistake and thought it was actually winning. That's why it didn't respond appropriately. This highlighted a blind spot in the system. The game showed that while systems like AlphaGo are extremely powerful, they still have weaknesses, and there were still areas where it could further improve.

But how do you go about improving something like that? Do you need to show it a lot more data of that type of human-ingenuity move, or how do you go about fixing and patching those points?

It's actually interesting: after the series with Lee Sedol, we put together a benchmark that tried to quantify, to have a way of measuring, the mistakes AlphaGo makes, these kinds of blind spots, let's say, and then we tried a number of approaches to improve the algorithm so that we could solve these issues. What happened is that, actually, the most effective way of getting rid of them was to do what we were already doing, just at higher scale and better. We changed the architecture of the model, switching to a deep ResNet with two output heads; we had a bigger network trained on more data; and then we moved to AlphaZero and better algorithms. That made it so that we didn't have any hallucinations anymore. So in a way, scale and data, the things that are always the well-known recipe in the field of AI, are exactly what solved it in our case too.
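For readers who want to picture the "deep ResNet with two output heads" mentioned here, this is a minimal sketch of that kind of architecture, written with PyTorch for illustration. The input planes, channel counts, and number of blocks are made up and are not the dimensions DeepMind actually used.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(h + x)            # the skip connection that makes it "residual"

class PolicyValueNet(nn.Module):
    """A shared residual trunk with two heads: move logits and a win estimate."""
    def __init__(self, board: int = 19, channels: int = 64, blocks: int = 4):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU())
        self.trunk = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        self.policy_head = nn.Sequential(nn.Conv2d(channels, 2, 1), nn.Flatten(),
                                         nn.Linear(2 * board * board, board * board + 1))
        self.value_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Flatten(),
                                        nn.Linear(board * board, 64), nn.ReLU(),
                                        nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):
        h = self.trunk(self.stem(x))
        return self.policy_head(h), self.value_head(h)   # (move logits, value in [-1, 1])

if __name__ == "__main__":
    net = PolicyValueNet()
    logits, value = net(torch.zeros(1, 3, 19, 19))        # a dummy position with 3 input planes
    print(logits.shape, value.shape)                      # torch.Size([1, 362]) torch.Size([1, 1])
```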
19:13
With scale and data, how much did higher-quality data, maybe specifically data from the best professional players, make a meaningful difference? Or was it just any data?

For us, what mattered was that we solved it using self-play. We actually had access to the most competent Go player in the world, and we used it to generate the best-quality games and then trained on those games. So I guess we didn't need human experts, because we had an expert in house. It just wasn't human.
19:47
Interesting. Amazing. Well, I'd love to move on to the progression from AlphaGo to AlphaZero, and you talked a little bit about this notion of self-play just now. AlphaZero was powerful because it learned how to play the game entirely from scratch, from self-play, without any human intervention. Can you share more about how that worked and why that was important?

So AlphaZero was a game changer because it learned entirely from scratch, through self-play, without any human data. This was a major leap from AlphaGo because, as I said, AlphaGo relied heavily on human expert games. Two things happened. First of all, AlphaZero managed to simplify the training process, and it also showed that AI could literally get from zero to superhuman performance purely by playing against itself. That allowed it to be applicable to a whole range of new domains that were previously out of reach because there wasn't enough human data for them. But I think the more important thing is that AlphaZero also solved all the issues AlphaGo had in terms of hallucinations, blind spots and robustness. AlphaZero was just a better method overall.
21:03
And you explained how AlphaGo worked to a fifth grader. What would you tell the fifth grader is the key technical difference you implemented with AlphaZero?

So AlphaZero, just like AlphaGo, uses a policy network and a value network along with Monte Carlo tree search. In that respect, it's exactly the same as AlphaGo. The key difference is in training. AlphaZero starts with random weights and learns by playing games against itself, iteratively improving its performance. The main idea behind AlphaZero is that whenever you take a set of weights, a set of policy and value networks, and combine them with search, you end up with a better player; you increase your performance, you become a stronger player. What that meant is that we could use this mechanism to improve the raw policy. This is what we call, in reinforcement learning, a policy improvement operator. Whenever you can take an existing policy, do something, some magic, and come up with a better policy, and then take that better policy and distill it back into the initial policy and repeat the process, you have a reinforcement learning algorithm. And I think this is exactly what people are trying to do today with Q-star or with synthetic data. It's exactly the idea of: how can I take a policy, do something with it, planning, search, compute, whatever it is, and derive a better policy, which I can then imitate and distill back into the original policy? This is exactly what AlphaZero is doing. It uses MCTS to produce a better policy, takes those trajectories, trains its policy and value networks on them, generates new, better trajectories, and repeats this process until it converges to an expert-level Go player.
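A skeleton of the self-play improvement loop just described: search produces a stronger policy than the raw network, the search output becomes the training target, and the process repeats. Everything below is a toy stand-in (the "search" only counts random visits and train is a placeholder), so it illustrates the shape of an AlphaZero-style loop under those assumptions, not the real implementation.

```python
import random

def net_policy(params, state):
    """Stand-in for the network's prior over moves (uniform here)."""
    return {m: 0.25 for m in range(4)}

def run_search(params, state, simulations=50):
    """Stand-in for MCTS: returns an 'improved' move distribution.

    The key point from the conversation: network + search plays better than the
    raw network, so the search output can serve as a training target.
    """
    visits = {m: 1 for m in net_policy(params, state)}
    for _ in range(simulations):
        visits[random.choice(list(visits))] += 1    # real MCTS would be guided by priors and values
    total = sum(visits.values())
    return {m: n / total for m, n in visits.items()}

def self_play_game(params):
    trajectory, state = [], ()
    for _ in range(10):                             # a toy 10-move game
        pi = run_search(params, state)
        move = max(pi, key=pi.get)
        trajectory.append((state, pi))
        state = state + (move,)
    outcome = random.choice([+1, -1])               # toy result
    return [(s, pi, outcome) for s, pi in trajectory]

def train(params, batch):
    """Distillation step: in a real system, gradient descent would move the policy
    head toward the search distribution and the value head toward the outcome."""
    return params                                   # placeholder update

params = {}
for iteration in range(3):                          # the policy-improvement loop itself
    batch = [example for _ in range(8) for example in self_play_game(params)]
    params = train(params, batch)
    print(f"iteration {iteration}: distilled {len(batch)} search-improved positions")
```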
23:04
That's fascinating, and counterintuitive, that starting without the weights you would get from professional-level players is actually a better starting place. The epitome of AI agents in games was achieved, I think, via MuZero, which was the progression even from AlphaZero itself, and it's also where you became one of the leads of the project. AlphaZero was obviously impressive because of self-play, but it also needed to be told the environment's dynamics, the rules of the game. MuZero takes this to the next level without needing to be told the rules of the game, and it mastered quite a few different games: Go, chess, and many others. Can you share a little bit about how MuZero worked and why it was particularly meaningful?
23:50
Absolutely. So AlphaZero, as you said, was a massive success in games like chess, in games where we actually had access to the game rules, where we had access to a perfect simulator of the world. But this reliance on a perfect simulator made it challenging to apply to real-world problems. Real-world problems are often messy, they lack clear rules, and it's really hard to write a perfect simulator for them. That's exactly what MuZero tried to solve. MuZero masters games like Go, chess, and shogi, of course, but it also masters more visually challenging games, like the Atari games, and it does that without being given access to the simulator. It learns how to build an internal simulator of the world and then uses that internal simulator in a way similar to what AlphaZero does. It does that by using model-based reinforcement learning, which means you can take a number of trajectories generated by an agent and then try to learn a model, a prediction model, of how the world works. This is actually quite similar to what methods like Sora are trying to do now, where they take YouTube videos and try to learn a world model by trying to predict, starting from one frame, what's going to happen in the future frames. MuZero tries to do exactly that, but it does it in a way different from generative models, in the sense that it tries to model only the things that matter for solving the reinforcement learning problem. It tries to predict what the rewards will be in the future, what the value of future states is, what the policy for future states is: only the things you need within the tree search. But the fundamentals remain the same: you learn a model from trajectories, and then, once you have this model, you can combine it with search and get superhuman performance. Of course, you could always decouple the two problems and have the model trained separately, on data out in the wild, and then combine that with MuZero. We just found that back then, given the limitations of our models and their smaller sizes, it made more sense to keep the two together and only have the model predict things that matter for planning, rather than trying to model everything, because you'd be hitting the limits of what the capacity of the model could take.
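As a schematic of what was just described, here are the three learned functions MuZero uses (representation, dynamics, and prediction) with toy stand-ins in place of trained networks. The point is only that the model predicts rewards, values, and policies in an abstract latent space, and planning happens entirely inside that learned model; none of this is DeepMind's code.

```python
import random

def represent(observation):
    """h: raw observation -> abstract latent state (no pixels or game rules kept)."""
    return [float(hash(observation) % 7)]           # toy latent vector

def dynamics(latent, action):
    """g: (latent state, action) -> (next latent state, predicted reward)."""
    next_latent = [latent[0] + 0.1 * action]
    reward = random.random()                        # a trained model would predict this
    return next_latent, reward

def predict(latent):
    """f: latent state -> (policy over actions, value estimate)."""
    return {0: 0.5, 1: 0.5}, random.random()

def imagined_return(observation, action_sequence, discount=0.99):
    """Plan entirely inside the learned model: the real simulator is never touched."""
    latent = represent(observation)
    total, scale = 0.0, 1.0
    for action in action_sequence:
        latent, reward = dynamics(latent, action)
        total += scale * reward
        scale *= discount
    _, bootstrap_value = predict(latent)            # value of the final imagined state
    return total + scale * bootstrap_value

print(imagined_return("some Atari frame", [0, 1, 1, 0]))
```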
26:27
Is it right to assume that not only Sora takes the same approach, but maybe other world models or robotics foundation models too?

Yeah, anything that tries to build a model of how the world works and then use that for planning is within the family of MuZero-like methods. You can train it on YouTube videos, you can train it on the inputs coming from robots, you can train it on any environment. You can even think of large language models as a form of world model for text; they model text. But the thing about text is that the model is a bit trivial: there aren't many artifacts when you're trying to predict what the next word is going to be.

So have you seen the ideas behind MuZero being used outside gameplay, or in messy real-world environments?
27:23
So yeah, as I've said, AlphaZero and MuZero are quite general methods, and a number of scientific communities have adopted them: in chemistry there's AlphaChem, and in quantum computing and in optimization some people have tried to use AlphaZero, because it's really powerful at planning and at solving these optimization problems. At the same time, MuZero was incorporated in a version of Tesla's self-driving system, as reported at their AI Day, and I think it's currently being used within YouTube as a video compression algorithm. But I think it's early days, and it takes time for this new technology to be fully adopted by the industry.
28:11
We'd love to talk a little bit more about reinforcement learning and agents. You alluded earlier to the fact that reinforcement learning and deep learning were new, nascent ideas back in 2015. They really grew in popularity from 2017, 2018, 2019 onwards, and then they were overshadowed by LLMs, largely because of GPT and everything else that came out. But now reinforcement learning is back. Why do you think that is the case?
28:38
Yeah, first of all, LLMs and multimodal models have indeed brought incredible progress to AI. These models are exceptionally powerful and can perform some truly impressive tasks. But they have some fundamental limitations, and one of them is the availability of human data. People keep talking about the data wall and what happens once you run out of high-quality data. And this is exactly where reinforcement learning shines. Reinforcement learning excels because it doesn't rely solely on pre-existing human data. Instead, it uses experience generated by the agent itself to improve its performance. This self-generated experience allows reinforcement learning to learn and adapt, even in scenarios where human data is scarce or non-existent. If you define the reinforcement learning problem in the right setting, in the right way, you can literally, effectively, exchange compute for intelligence. You can get to a point similar to where we were with AlphaZero, where the moment we threw more compute at it, made the networks bigger, used more games, we simply got a better player. And it was deterministic: you always get a better player. I guess that's exactly where we want to be with this synthetic data pipeline. Currently we have that with the scaling laws in LLMs: if you have more data and bigger models, you can predict that there's going to be an improvement in performance. But once you run out of human data, how do you keep going? Synthetic data is the answer to that, and the only way you can actually get high-quality data to improve your model is via some form of reinforcement learning. And I'm keeping reinforcement learning as a really broad, blanket term here, where I define it as anything that learns through trial and error.
30:44
How do you think reinforcement learning is being brought into the LLM world? You mentioned Q-star earlier. I guess in a closed-form game you have a pretty clearly defined policy and value function. How does that work in a messy real-world environment, or the LLM world?
31:03
I guess there are two different types of messy real world, right? There's the case where you're trying to build a controller or something for a really messy physical environment, and then there's the case where you operate in the digital space. Personally, I believe that digital AGI will happen much earlier than robotics AGI, and the reason is exactly that you have control over the environment. The environment is computers, the digital world. Even though it's messy and noisy, it's still contained; it's not the real world in that sense. Now, in terms of how you bring in reinforcement learning: we used to say at DeepMind that you have the problem and you have the solution. The problem setting of reinforcement learning is: how do I take a model, a policy, and generate synthetic data, how do I find a way to improve this policy by interacting with the environment, via trial and error? That's the reinforcement learning problem setting. And then there's the solution space, where you have value functions and reinforcement learning methods. I think there's a lot of inspiration to draw from the classical reinforcement learning methods that were developed over the past decade, but you have to adjust them to the new world of LLMs. Methods like Q-star try to do that by taking the idea that if I have a policy and I do planning, if I consider possible future scenarios and I have a way to evaluate which one is better, then I can take the best ones and ask the model to imitate those better ones. That's a way of improving the policy. In the classic RL framework you do that by using a policy and a value network. In the new world, you do that by having a reward model, or by asking your LLM to give you feedback on an output it gave you.
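A minimal sketch of that loop in the LLM setting: sample several candidate outputs, score them with a reward model (or the LLM's own feedback), and keep the best ones as imitation targets. Every function name below (generate, reward_model, fine_tune) is a hypothetical placeholder rather than a real API.

```python
import random

def generate(prompt: str, n: int) -> list[str]:
    """Placeholder: a real system would sample n completions from the language model."""
    return [f"candidate answer {i} to: {prompt}" for i in range(n)]

def reward_model(prompt: str, answer: str) -> float:
    """Placeholder: a learned scorer, or the LLM critiquing its own output."""
    return random.random()

def fine_tune(dataset: list[tuple[str, str]]) -> None:
    """Placeholder: ordinary supervised training on the selected pairs."""
    print(f"would fine-tune on {len(dataset)} (prompt, best answer) pairs")

prompts = ["Explain move 37.", "Summarize MuZero in one sentence."]
improvement_set = []
for prompt in prompts:
    candidates = generate(prompt, n=8)                               # consider possible "futures"
    best = max(candidates, key=lambda c: reward_model(prompt, c))    # evaluate which is better
    improvement_set.append((prompt, best))                           # imitate the better ones

fine_tune(improvement_set)
```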
33:14
So interesting. You also talked a little bit about synthetic data earlier. I think some folks are very bullish on synthetic data and some folks are more skeptical. I also believe that synthetic data is more useful in domains where outcomes and success are perhaps more deterministic. Can you share a little bit about your perspective on the role of synthetic data and how bullish you are on it?
33:36
Yeah, I think synthetic data is something we have to solve one way or another. It's not about whether you're bullish or not; it's an obstacle we just have to find a way around. We will run out of data; there's only so much data that humans can produce. And it's also important that these systems start taking actions and start learning from their own mistakes. So we need to find a way to make synthetic data work. Now, what people have done is try the most naive approach, I guess, where you just take what the model produces and try to train on that, and of course they've seen that there's model collapse and that this just doesn't work out of the box. But new methods never work out of the box. You need to invest in them, take your time, and really think about the best way of doing it. So I'm really optimistic that we'll definitely find ways to improve these models, and I think there are actually a number of methods out there, like Q-star and its equivalents, that, in this new world where people don't really share their research breakthroughs the way they used to, are probably hidden behind some companies' trade secrets.
34:59
I'm going to ask about reasoning and novel scientific discoveries. Do you think that can naturally come out of just scaling LLMs if you have enough data? Or do you think the ability to reason and come up with net-new ideas requires doing reinforcement learning and deeper compute at inference time?
35:22
So I think you need reinforcement learning to get better reasoning, because it's also about the distribution of data, right? You have a lot of data out in the wild, on the internet, but at the same time you don't always have the right type of data. You don't have much data where someone reasons and explains their reasoning in detail. You have some of it, enough for the models to pick it up and imitate it. But if you want to improve on that capability, you need to do it through reinforcement learning. You need to show the model how this reasoning capability can be further improved by having it generate data and interact with the environment, and by telling it when it's doing something right and when it's doing something wrong. So yeah, I think reinforcement learning is definitely part of the answer for that.
36:23
AlphaGo, AlphaZero, and MuZero are the most powerful agents we've ever built. Can you share a little bit about how some of the lessons and learnings unlocked from that work are relevant to how we're pursuing building AI agents today?
36:36
Yeah, so I think AlphaGo and MuZero have fundamentally transformed our approach to AI agents, because they highlight the importance of planning and scale, in my opinion. If you look at the charts of different models and how they scale, you can see that AlphaGo and AlphaZero were really ahead of their time; they were outliers. You had these curves of how compute scaled, and then you had AlphaZero somewhere standing on its own. It shows that if you can scale, and you can really push on that, you can get incredible results. At the same time, they also got better performance during inference, at test time, during evaluation, just by using planning. And I think this is something we'll start seeing more and more in the near future: these methods will start thinking more, planning more, before they make any decisions. So I'd say that this is the main heritage of AlphaGo and AlphaZero and MuZero: the basic principles. And the basic principles are that scale matters, planning matters, and these methods can really solve problems we thought were insanely complex, or beyond what we can solve on our own. Problems similar to the ones you observe today with these large language models are things we saw back then. Back in 2016 we already saw that these models can hallucinate, and that at the same time they're also creative, that they'll come up with solutions we hadn't thought of. But they can also have blind spots, or hallucinate, or be susceptible to adversarial attacks, which I guess everyone now knows these neural networks suffer from. So I think those are the main lessons drawn from this line of work.
38:33
What do you think are the biggest open questions from this line of work for the field going forward?

The main question is this: with AlphaGo and MuZero we had these insanely robust and reliable systems that would always play Go at the highest possible level. Consistently, they would be at the top of the leaderboard and essentially never lose a game. AlphaGo Master actually played 60 online matches against top professionals and won every single one of them. That's the bar for that kind of robustness and reliability, and it's exactly what we're missing now with these LLM-based agents. Sometimes they get it, sometimes they don't; you cannot trust them. You have some amazing demos, but they happen once every two times, or once every ten times you get something amazing and the remaining nine times they just lose their way and don't do anything. So I think what we need to do is find a way to make these LLM-based agents as robust as the ones we had with AlphaGo and MuZero and AlphaZero. That's the new open question: how do you actually do that?
39:54
We'd love to move into some of your thoughts on the broader ecosystem today. You've touched on a few really core problems that people are working on right now: one, the data wall that will hit eventually, perhaps by 2028 or so as some folks predict; another being the idea of planning as an area where AI agents need to get better; and a third idea you just described, around robustness and reliability. Can you share a little bit about the areas you think the whole field needs to solve, that you're most excited about, to help us unlock this vision of really getting to the AI agents that we want?
40:38
Yeah, I'll also add another one to the list. I feel like another major challenge is how to improve the in-context learning capabilities of these models: how do you make sure these systems can learn on the fly and adapt to new contexts? This is another thing that I think is going to be really important, and it's going to happen in the next few years, a couple of years actually.

So, Ioannis, what's the term they use for that? In-context learning?

In-context learning, yeah. It's the idea that a system can learn how to do a new task with few-shot prompting: it sees a few examples and, on the fly, it learns how to adapt to the new environment, how to use the new tools that were provided to it. It's not just the knowledge it has stored in its weights; it can also acquire new knowledge by interacting with the real world, interacting with the environment. So I think this is another place where there's a lot of work happening at the moment, and we're going to see amazing progress in the next couple of years, and I'm really excited about that.
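A small illustration of the few-shot prompting idea described here: the "learning" happens entirely inside the prompt, with no weight updates. The task, the example pairs, and call_llm below are all made up for illustration.

```python
examples = [
    ("board game played on a 19x19 grid with black and white stones", "Go"),
    ("board game played on 64 squares with 32 pieces", "chess"),
]

def few_shot_prompt(query: str) -> str:
    """Assemble a prompt whose only 'training data' is the handful of examples above."""
    lines = ["Name the game from its description."]
    for description, answer in examples:
        lines.append(f"Description: {description}\nGame: {answer}")
    lines.append(f"Description: {query}\nGame:")
    return "\n\n".join(lines)

def call_llm(prompt: str) -> str:
    """Placeholder for whatever model API is being used."""
    return "<model completion goes here>"

print(few_shot_prompt("board game where captured pieces can be dropped back onto the board"))
print(call_llm(few_shot_prompt("...")))
```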
42:01
So yeah, to recap: I think planning is important, in-context learning is important, and reliability. The best way to achieve reliability is to ensure that these models somehow know how to recover from their mistakes. If they make a mistake somewhere, they can see it and say: I made a mistake, I'll correct for it, the way humans make mistakes all the time but can correct for them. These are the three areas where I'm really excited to see progress.
42:41
Now that you've embarked on your own entrepreneurial journey, what do you think are the areas where startups can compete against the big research labs, and how do you motivate yourself for that journey?

Yeah, it's a new world for me, but at the same time it's not that new, because when I joined DeepMind it was literally a startup, and I was among the first few employees, so I saw that firsthand. One of the benefits of working at a startup is the agility and the focus. Everyone really cares, everyone moves really fast, and there's a clear focus on what we want to build. Building is the most important motivation for people, just building. I think that's one of the big advantages startups have over more established businesses. At the same time, it's easier to adapt to new findings and technologies. You're not tied to pre-existing solutions, or to products that you don't want to cannibalize because they bring you a lot of revenue, whereas if you're a startup you have no such chains: you can move fast, be innovative, and break conventions. It also allows you to leverage open-source resources, things that are out of reach for the big labs. And you don't have the red tape that big places tend to have.
44:20
I love the term you sometimes use, Ioannis: main quest versus side quest.

Yeah, it's the idea of having a main focus. In big places, in big labs, there are many different projects that people are working on. What usually happens is that there's the main quest, the main thing that everyone is working on, and then there are multiple smaller side quests that are supposed to feed into the bigger quest, but usually they don't get as many resources or as much focus from leadership. So yeah, they tend to atrophy.
45:02
In the broader field, what are some of the most defining projects that you admire the most, and maybe who are some of the most influential researchers you admire the most?

Yeah, absolutely. So I actually started my AI research journey back in 2012, and I've seen some milestones, so let me give a list of what I think are the main milestones in AI over the past 12 years that I've been around. The first one, I'd say, is AlexNet. This is the first paper that showed deep learning is the answer. Back then it didn't feel like it; it just felt like a kind of curiosity, but now I think most people are convinced that deep learning is part of the answer. Then it was DQN. I had the pleasure to actually work on DQN and see firsthand how it started. It was developed by a friend of mine, and it was the first system that showed you can combine deep learning with reinforcement learning to achieve human or superhuman performance in really complex environments. Then it was AlphaGo. Again, I was really lucky to work on that, and it showed that scale and planning are really important ingredients, and if you get them right you can have huge success in an incredibly complex environment. AlphaFold is another one, also by DeepMind. It showed that these methods are not just things you can use to solve games; they will actually make this world a better place, ensuring that healthcare is improved, that scientific discoveries are realized, that we make this world a better place by using AI. And finally, ChatGPT and GPT-4. ChatGPT brought AI to everyone; it made it accessible to a broad audience, everyone knows what AI is now, and it has made my life of explaining my job much easier. And GPT-4 is probably the latest big advancement in AI, because it showed that artificial general intelligence is a matter of years; it's within reach. We are getting there. I think most people now believe we are a few years away from AGI, and that's because of the incredible breakthrough that GPT-4 was.
47:49
of like some people I really
47:52
admire. Before I forget, so I'd
47:54
say first like David Silver, he's,
47:56
he was my PhD supervisor, he
47:58
was my mentor, a deep mind.
48:00
He's an incredibly researcher. He worked,
48:03
he led Afco and Office Zero,
48:05
and he is, you know, he has a
48:07
lot of gilding dedication to the
48:09
field of reinforced learning, and
48:12
he's, you know, probably the
48:14
one of the smartest people,
48:16
or maybe the smartest
48:18
person I know went, amazing,
48:20
reinforced learning engineer. And
48:23
the second one I'll say
48:25
is, I'll ask you. and he was
48:27
a co-founder of opening eye. I had
48:29
the opportunity to work with him just
48:31
a little bit in the really early
48:33
days of a go. But I think it's
48:35
like his commitment to scaling I
48:37
am efforts and pushing the boundaries
48:40
of what the systems can achieve
48:42
is remarkable. And you know he
48:44
gave nature that like Jupiter 3 and
48:46
Jupiter 4 happened. So yeah immense
48:49
respect towards him. Thank you
48:51
Thank you for sharing that. Let's close out with some rapid-fire questions. What do you think will be the next big milestones in AI, say in the next one, five, and ten years?

I think in the next five to ten years the world will be a different place. I really believe that. I think that in the next few years we'll see models becoming powerful and reliable agents that can actually independently execute tasks, and I think AI agents will be massively adopted across industries, especially in science and healthcare. So in that sense, I'm really excited about what's coming in AI. What I'm most excited about is AI agents, systems that can actually do tasks for you, and this is exactly what we're building at Reflection.
49:43
In what year do you think we'll pass the 50% threshold on SWE-bench?

I think we are one to three years away from the 50% threshold, and three to five years from achieving 90%. The reason is that while progress is amazing, I think we still need reliable agents to hit these milestones. And when it comes to research, it's really hard to make precise predictions.
50:07
When do you think we'll hit the data wall for scaling LLMs? And do you think all the research in RL is mature enough to keep up our slope of progress, or do you think there will be a bit of a lull as we try to figure out what happens when we hit the wall?

So, based on what I've read, I think we have at least one more year for text before we hit the wall, and then we have these extra modalities, which might buy us maybe an extra year. And I think we're in a really good place to start using synthetic data, so in the next two years we'll figure out the synthetic data problem. So I think we won't really hit the wall; or rather, we'll hit the wall, but no one will realize it, because we'll have new methods in place.
51:01
And if so, when?

I think LLMs had their AlphaGo moment with the initial release of ChatGPT, where they showed their power and the progress made over the past decade. What they haven't had yet is their AlphaZero moment, the moment where more compute directly translates to increased intelligence without human intervention. I think that breakthrough is still on the horizon.

When do you think that will happen?

I think it can happen in the next five years.
51:30
Wow. Amazing. Ioannis, thank you so much for joining us and taking us through the awesome history of AlphaGo, AlphaZero, and MuZero, your own journey through DeepMind, and many of the core research problems the whole industry is tackling today around data, and building for reliability, robustness, planning, and in-context learning. We're really excited for the future that you're helping us build and that you're pushing forward in the field as well. So thank you so much, Ioannis.

Thank you so much for having me.