Episode Transcript
0:00
This is not just more of
0:02
the same that we've seen in
0:04
the past. We have now an
0:06
existence proof that computers are able
0:08
to do something that they've never
0:10
been able to do before in
0:12
the history of humanity. ARC version 2
0:14
has just been released and even
0:16
the frontier foundation models are failing
0:18
spectacularly. Today we are using ARC-AGI-2, the
0:20
next version of the benchmark. ARC-AGI-2
0:22
is pretty much the only unsaturated
0:25
benchmark that is feasible for regular people.
0:27
And so it's a very good yardstick
0:29
to measure how much true intelligence these
0:31
models have, how close we are to
0:34
it. And alongside that, we're
0:36
really excited to be welcoming everyone to
0:38
Arc Prize 2025. Contest kicks off officially
0:41
now. It's going to run all the
0:43
way through the end of 2025.
0:45
The structure of the contest
0:49
is very similar to last year. We're
0:52
going to have the Kaggle leaderboard running.
0:54
We're going to have this big
0:56
prize. It's unclaimed. In order to get
0:58
the big prize, you have to open
1:00
source your solution, with a high degree of efficiency,
1:02
running on Kaggle. And now we're really,
1:04
really excited to see all the new
1:06
ideas. I think there was a lot
1:08
that came out last year in 2024
1:10
that really pushed the frontier. The next
1:13
version of the benchmark is more
1:15
challenging. It's extremely unsaturated. All frontier
1:17
models are scoring effectively within single-
1:19
digit percentages. It's the first time
1:22
where we've calibrated the human-facing difficulty
1:24
of the tasks. So we actually
1:26
hired roughly 400 people. We tested
1:29
every single task, and every single
1:31
task has been solved by at
1:33
least two people. So we know
1:35
it's very feasible for humans. It's
1:38
extremely out of reach for any
1:40
AI system today. This is the
1:42
frontier. The ARC benchmark forces us
1:44
to confront an uncomfortable truth about
1:46
our pursuit of artificial general intelligence,
1:49
that the field has been overlooking.
1:51
Intelligence is not just about capabilities,
1:53
it's also about the efficiency with
1:55
which you acquire and deploy these
1:57
capabilities. Intelligence is about finding that
1:59
program in very few hops, using
2:01
actually very little compute. Like, look
2:04
at the amount of energy that
2:06
a human expends to solve one
2:08
ARC task over, you know, two, three,
2:11
four minutes. It's almost zero, right?
2:13
And compare that to a model
2:15
like O3 on high-compute settings, for instance,
2:18
which is going to use like
2:20
over 3,000 bucks of compute. So
2:23
it's never just an economic
2:25
problem. Efficiency is actually the question
2:27
we're asking. Efficiency is a problem
2:30
statement. It's not capability. The goal post
2:32
is AGI. That's like what we're here
2:34
to do. That was the whole point
2:36
of launching Arc Prize in the first
2:39
place was to raise awareness that there
2:41
was this really important benchmark that I
2:43
thought showed something important that like
2:45
the sort of research community in AI
2:47
was missing about the sort of nature
2:50
of artificial intelligence. You know, this is
2:52
like one of the things I think
2:54
makes ARC special and very unique and
2:56
important, I would argue. You know, there's
2:58
a lot of benchmarks in the world
3:01
today. And to my knowledge,
3:03
pretty much every other benchmark, you know,
3:05
all the frontier benchmarks, basically are trying
3:07
to test for these like superhuman capabilities,
3:10
right? These like PhD-plus-plus type
3:12
skills that you need to have in
3:14
order to succeed at the benchmark. It's
3:17
not just compute, it's not just scale,
3:19
you have to be scaling the right
3:21
thing, you have to be scaling the
3:23
right ideas, and maybe you have them.
3:26
I personally just keep getting like surprised
3:28
and impressed by the Arc Prize community, how
3:30
much folks are pushing the frontier, and
3:33
I think it was really exciting too,
3:35
because it means that individual people and
3:37
individual teams out there can actually make a difference,
3:40
could actually make a significant
3:42
contribution to the frontier of AGI. So
3:44
if you're going to enter the contest,
3:46
go to arcprize.org and good luck. Good
3:49
luck, see you on the leaderboard. This
4:02
sort of like test time optimization
4:04
techniques or test time search techniques,
4:06
that's the current frontier for AGI,
4:08
right? And there are many ways
4:10
to approach it. Of course, you
4:13
can do just
4:15
test time training, or you can do search in
4:19
token space, or you can do search
4:26
in latent space as well, right. So
4:28
you have many different ways to adapt
4:30
to novelty at test time by recombining
4:32
what you know into some novel structure.
4:34
MLST is sponsored by Tufa AI
4:37
Labs. Now they are the DeepSeek based
4:39
in Switzerland. They have an amazing team. You've
4:41
seen many of the folks on the team.
4:43
They acquired MindsAI, of course. They did
4:45
a lot of great work on ARC. They're
4:48
now working on O1-style models and reasoning and
4:50
thinking and test time computation. The reason you
4:52
want to work for them is you get
4:54
loads of autonomy, you get visibility, you can
4:57
publish your research. And also they are hiring,
4:59
as well as ML engineers, they're hiring a
5:01
chief scientist. They really, really want to find
5:03
the best possible person for this role. And
5:06
they're prepared to pay top dollar as
5:08
a joining bonus. So if you're interested
5:10
in working for them as an ML
5:12
engineer or their chief scientist, get in
5:15
touch with Benjamin Crouzier. Go to
5:17
tufalabs.ai and see what happens.
5:19
Well, Mike, it's amazing to have you
5:21
on MLST. Welcome. Yeah, thank you so
5:23
much. We're very excited to
5:25
be here today. Mike, I hear
5:27
that you guys have got some
5:29
very exciting news today. Tell me
5:31
about it. Yeah, super excited today.
5:33
We're back. We're really excited
5:35
to be launching both ARC-
5:37
AGI-2 alongside an updated Arc
5:40
Prize, Arc Prize 2025 contest.
5:42
Both are going to be
5:44
launching today. ARC-AGI-1 was designed to
5:46
sort of challenge deep learning.
5:50
And ARC-AGI-2 in contrast is really a
5:53
benchmark that's designed to challenge these new AI
5:55
reasoning systems that we're starting to see from
5:57
pretty much all of the frontier labs.
6:02
And one of the really cool
6:04
things about ARC-AGI-2 is we're
6:06
basically seeing, you know, models, AI
6:08
systems that are purely based on
6:10
pre-training, effectively scoring 0%. And some of
6:12
the frontier AI reasoning systems, we're
6:14
in the process of testing them right
6:16
now, and we're sort of expecting
6:18
single-digit performance. So a really big
6:20
update over ARC-AGI-
6:22
1 from 2024. So the original
6:25
version of Arc was very much
6:27
aimed at these kinds of foundation
6:29
models that didn't do reasoning. Version
6:31
2 is tuned for the reasoning
6:33
models. What would you say to the
6:35
charge that you're moving the goalposts? I mean,
6:38
how is ARC v2 sort of like meaningfully
6:40
an evolution of the benchmark? Yeah, I mean,
6:42
I think the way that I think about
6:44
it is that the goalpost is AGI.
6:46
That's like what we're here to do. That
6:49
was the whole point of launching Ark Prize
6:51
in the first place was to raise awareness
6:53
that there was this really important benchmark that
6:56
I thought showed something important that like the
6:58
sort of research community in AI was missing
7:00
about the sort of nature of artificial
7:02
intelligence. So that's kind of our goalpost. And
7:04
you know the definition that I use for
7:06
AGI and the one that the foundation adopts
7:09
is assessing this capability gap between humans and
7:11
computers. And the Arc Prize Foundation's mission is really to
7:13
drive that gap to zero. I think it
7:15
would be hard to argue that we don't
7:17
have AGI. If you look around and you
7:19
can't find any more tasks that are very
7:22
straightforward, simple and easy for humans, that computers
7:24
can't do as well. And the fact is
7:26
that we were able to find still lots
7:28
of those tasks. In fact, all of the
7:30
tasks in the ARC-AGI-2 data set sort of
7:32
fit into this category of things that are
7:35
relatively easy and simple and straightforward for
7:37
humans, and comparatively very difficult and hard
7:39
for AI today. Okay cool so I
7:41
know you guys have done loads of
7:43
human calibration and we'll talk about that
7:45
in a minute but the fundamental philosophy
7:47
of the of the arc challenge is
7:49
focusing on human gaps but at the
7:51
same time AI models are becoming superhuman
7:53
in so many respects so is the
7:56
big story the human gaps or is
7:58
the big story the expansion of capabilities,
8:08
right?
8:10
These
8:12
like
8:14
PhD
8:18
plus-plus type skills that you
8:21
need to have in order to
8:23
succeed at the benchmark. Humans can't
8:25
solve the problems that are in
8:27
these benchmarks. So you have to
8:29
be very, very, like, have a
8:31
lot of experience, a lot of
8:33
education, a lot of training in
8:35
order to be able to sort
8:37
of even get close to sort
8:40
of solving the benchmarks as a
8:42
human. And I think those are
8:44
important. Those are useful. But I
8:46
think it's actually more illustrative of
8:48
something that like we're missing about the
8:50
nature of artificial
8:53
intelligence by looking at this gap. That's much
8:55
more of an inspiring story. I think
8:57
it's one where it's actually necessary to
8:59
target this in order to actually get
9:02
AGI that is capable of innovation. You
9:04
know I think this is one of
9:06
the main reasons I got into AI
9:09
and AGI in the first place was
9:11
being really inspired and excited about trying
9:13
to build these systems that would be capable
9:15
of like compressing science timelines. And if all
9:18
we have is AI... that looks like what we
9:20
had at the beginning of 2024, right, based
9:22
on pre-training, based on a memorization regime, you're
9:24
never going to get to that because these
9:27
are systems that are really going to reflect
9:29
back the experience and the knowledge that humanity
9:31
has sort of gained over the last, you
9:33
know, 10,000 generations as opposed to being ones
9:36
that are capable of like producing new knowledge.
9:38
new technology, adding to sort of humanity's like
9:40
colossus of sort of knowledge and technology. If
9:42
we want systems that can actually do that,
9:44
we need AGI. And this definition that we've
9:47
sort of used for this foundation of easy
9:49
for humans and hard for AI, I
9:51
think if we can close that gap,
9:53
we'll actually get technology that's capable of doing
9:55
that. I wonder whether you think we are
9:57
just about five discoveries away from AGI because...
10:00
there's going to be version three of
10:02
the ARC challenge, presumably there'll be a
10:04
version four. Intelligence is multi-dimensional and I
10:06
can see this both ways right because
10:08
you know many critics of AI they
10:10
are almost gaslighting us, they're saying that
10:12
this amazing technology that you're using it doesn't
10:15
work and I'm like well yeah it does
10:17
work and would it be the case that
10:19
the criticisms will become more and more kind
10:21
of philosophical and they'll say oh you know
10:23
because it's not biological or whatever, it's not
10:26
the same thing, or do you...?
10:28
I think this is why
10:30
benchmarks are important. And I had a similar
10:32
question actually, you know, when I was starting
10:34
to get back into AI in 2022 and
10:37
trying to understand the world. Like, are we
10:39
on track for AGI or not? How far
10:41
off are we? And I find that it's
10:43
really really hard to get a sense of
10:46
understanding of the capabilities of all these systems
10:48
purely by using them. You can certainly get
10:50
a sense by just interacting with them. But
10:52
if you really want to understand what are
10:54
they capable and not capable of, you really
10:56
need a benchmark to discern this fact. This
10:58
is one of the interesting things that I
11:00
picked up from building AI products at Zapier
11:02
as well. It's very different building products with
11:05
AI than it is classic software. One of
11:07
the big differences is when you're building classic
11:09
software, you can build and test with five
11:11
users and know, OK, hey, this is going
11:13
to be like, this product can scale to
11:15
millions. It's going to work the exact same
11:17
way. And that's fundamentally just not the case
11:19
with AI technology. You really have to deploy
11:21
to a large scale in order to assess how
11:23
it works. You need a benchmark alongside that scaling in
11:25
order to tell you, hey, is the system working or
11:28
not? What were the main lessons that you learned from
11:30
version one that you moved into version two? I
11:32
think, so ARC-AGI-2 has been in the works
11:34
actually for several years, François started working on it,
11:36
crowdsourcing some tasks for years and years and years
11:38
ago. There was a bunch of sort of inherent
11:41
flaws we ran into with it that we learned
11:43
as we sort of started popularizing the benchmark over
11:45
the last year or so. You know, one of
11:47
the things we learned was that a lot of
11:50
the tasks were very susceptible to brute force search.
11:52
That's something that has zero intelligence at
11:52
all, and we wanted to minimize the sort of
11:54
incidence rate of tasks that were sort of susceptible
11:56
to that. And we hadn't calibrated it. We anecdotally,
12:01
we relied on some anecdotes to say
12:03
that hey, ARC-AGI-1 is easy for
12:05
humans. We had you know a couple
12:07
of STEM folks, two STEM folks, who had
12:09
taken the whole data set including the
12:11
private set and were able to solve
12:13
you know, 98, 99%, but we were
12:15
relying on anecdote. We didn't have that
12:17
calibrated across the sort of three different
12:19
data sets that we had. And then we
12:21
had all these frontier AI reasoning systems come
12:23
out over the last you know three or four
12:26
months and we've got a chance to
12:28
study which of our tasks remain
12:30
very very challenging for these AI reasoning
12:32
systems, which we can get into if
12:34
you're curious. And so those are the
12:36
main sort of insights and learnings that
12:38
we took from ARC-AGI-1 to try
12:40
and produce an ARC-AGI-2 benchmark that
12:42
I think will be a useful sort
12:44
of signal for development this year in
12:46
artificial intelligence. Can we quickly touch on
12:48
the OpenAI situation? So in December...
12:50
they didn't launch but they gave you
12:52
access to O3 and it got incredible
12:55
performance on ARC v1, human level performance,
12:57
something that we just didn't think really
12:59
would be possible so quickly. Yeah, it's
13:01
surprising. It came out of nowhere. I
13:03
mean can you just tell me the
13:06
story behind that? Yeah, yeah, this is,
13:08
you know, one of the reasons why
13:10
I'm always hesitant to make predictions in
13:12
AI, about timelines. You know, I
13:14
think it's very easy to sort of
13:16
make predictions along smooth scaling curves. But
13:19
the nature of innovation is it's a
13:21
step function, right? And step functions are
13:23
really really hard to predict when they're
13:25
going to come out. I think the
13:27
best thing that I can say, having
13:30
spent some time with O3 and looking
13:32
at how it performs at ARC, is
13:34
that systems like O3 demand serious study.
13:36
This is not just... you know more of the
13:38
same that we've seen in the past. This is,
13:41
you know, we have now an existence proof that
13:43
computers are able to do something that they've never
13:45
been able to do before in the history of humanity,
13:47
which is I think really, really exciting. I think
13:49
there's still a long way to go to get
13:52
to AGI, but I do think that these things
13:54
are important to understand and sort of even discern
13:56
how they work from a capability standpoint in order
13:58
to make sure that future systems that we're
14:00
developing and building look more like this and not
14:03
like the sort of pre-training pure scaling regime
14:05
that we've had in the past. So I
14:07
still remember the, to sort of give you
14:09
the anecdote, I still remember the two-week period,
14:11
the sprint we had on testing O3. It
14:14
was right at the end of the contest.
14:16
We had wrapped up Arc Prize 2024 in
14:18
I think early November last year, and we
14:20
had a three-week, four-week period where we were
14:22
really, really busy on judging all the final
14:24
submissions, the papers, getting together the technical report.
14:27
And we were dropping all of the
14:29
sort of results on a Friday. And I
14:31
was really hoping and anticipating that I was
14:33
going to have a nice relaxing holiday
14:35
period in December. And the day that
14:37
we dropped the technical report, we had
14:39
outreach from one of the
14:42
folks at OpenAI who said, hey,
14:44
we'd really love you to test this
14:46
new thing that we're working on. We
14:48
think we've got some impressive results on
14:50
ARC-AGI-1. And so that kicked
14:52
off a very, very hectic, fast, frantic,
14:54
two-week period to try and understand, okay,
14:56
what is this system? Like, does it
14:59
reproduce the claims that OpenAI had
15:01
on testing it? And what does this
15:03
mean for the benchmark? What does this
15:05
mean for AGI? And I think we're
15:07
able to show the final result was
15:10
that O3 on its sort of high
15:12
efficiency setting, which fit within the sort
15:14
of budget constraints that we'd set out
15:16
for our public leaderboard, got about 75%
15:18
or so. And then they had a
15:21
high-compute version which used I think
15:23
like maybe 200x more compute than the
15:25
low compute setting which was able to
15:27
score even 85% and these are really
15:29
impressive and I think this shows a
15:32
system like O3 has this you know
15:34
sort of binary switch. We've gone from
15:36
a regime where these AI models have
15:38
no ability to adapt to novelty to
15:40
something like O3, which is an
15:42
existence proof of now an AI system
15:44
that can adapt to novelty in a
15:46
small way. Breaking this down a little
15:48
bit, there were some interesting caveats that
15:50
you just alluded to. So first of all,
15:52
they did some kind of fine tuning and
15:54
people at the time joked that, you know,
15:56
isn't it scandalous that they were training on
15:59
the training set? Yeah, this is like a
16:01
very bad description. This is a very poor
16:03
critique. I think it misses the point of
16:05
the benchmark. I think the folks who feel
16:08
this way, it's just because they're so used
16:10
to thinking about benchmarks and AI from the
16:12
pre-training scaling regime, where like, hey, if I
16:14
trained on the data. that's cheating then to
16:17
test on the data, right? And that's true
16:19
in the pre-training regime, but ARC is a
16:21
very, very special different benchmark, where it explicitly
16:24
makes a training set available with the intention
16:26
to train on it. This is very explicit.
16:28
This is like what the benchmark expects
16:30
you to do. We expect AI researchers
16:32
to use the training set in order
16:35
to teach their AI systems about the
16:37
domain of ARC. And then what's special is,
16:39
we've got a private data set that
16:41
very few humans have ever seen. It
16:44
requires you to generalize and abstract the
16:46
core knowledge concepts that you learn through
16:48
the training set at test time. Fundamentally,
16:50
you cannot solve like the ARC AGI
16:52
1 or 2 private data sets purely
16:55
by memorizing what's in the training set.
16:57
This would be like, you know, maybe
16:59
a crude analogy would be, you know,
17:01
if I was going to teach an
17:04
AI system on grade school math and
17:06
then test it on calculus. This is
17:08
very similar to the type of thing
17:10
that we do with ARC where, you
17:12
know, the training set is a much
17:15
simpler, easier curriculum to learn on, and
17:17
then the test is a much more
17:19
difficult one where you actually have to
17:21
express true intelligence.
17:23
...per task or more,
17:26
that means they were probably doing sampling.
17:28
They were doing a ridiculous amount of
17:30
completions. They were doing a solution space prediction,
17:32
which is very interesting. But the main thing,
17:34
Mike, just deep in your bones, deep in
17:36
your bones, do you think that they were
17:38
training on API data or surely they were
17:41
training on a whole bunch of data to
17:43
do that well? And the extension of the
17:45
question is, when they released the vanilla version
17:47
of it, what performance would it
17:49
get compared to their tweaked version?
17:52
We will test that as soon as it comes out and I
17:54
would love to report the results on that. They told us
17:56
all they did was training on the training set and I believe
17:58
that's what they did. Okay, very interesting. Just to comment on the
18:01
solution space prediction I mean I
18:03
was amazed that just predicting the
18:05
output space directly they could do
18:07
so well. I mean, doesn't
18:09
that almost take away from the
18:11
idea that we need to have discrete
18:14
code DSL type approaches if you
18:16
can just predict the solution space
18:18
so well? Effectively what O3 is
18:20
doing is it's able to use its
18:22
pre-trained experience and recombine it on the
18:24
fly in the face of a novel task.
18:26
It does this through a regime called
18:29
chain of thought. This is all informed
18:31
speculation, by the way. We don't
18:33
have confirmed details. This is just
18:35
my sort of personal assessment of
18:37
how these systems work, particularly things
18:39
like O1-Pro and O3. If you compare them
18:41
with systems like R1 or O1, these
18:43
are systems that basically spit out a
18:45
single chain of thought and then used
18:48
that chain of thought in order to
18:50
ground a final answer. That's distinct from how
18:52
systems like O1Pro and O3 work, where
18:54
they actually have the ability to do
18:56
multi-sampling and recomposition at test time of
18:58
that chain of thought. This allows them
19:00
to build novel CoTs that don't show
19:02
up anywhere in the pre-training, not in
19:04
the existing experience, and allows these systems
19:06
to reach sort of effectiveness over
19:08
more situations, effectively based on what
19:10
was in the original pre-training. Fundamentally, these
19:12
systems are a combination of a deep
19:14
learning model and a synthesis engine that
19:16
is put on top, and I think
19:18
the right way to think of them
19:21
is these are really AI systems, not
19:23
single models anymore. Yeah, and I agree
19:25
with you. It's really funny, though, how
19:27
you see the critique in the community,
19:29
because, you know, Gary Marcus is now
19:31
saying, oh, it can't draw pictures of
19:34
bicycles and labels and label the parts,
19:36
whereas we see O1 Pro and O3,
19:38
and it really does seem like a
19:40
dramatic improvement. So the real work that you
19:42
guys did was you got a whole
19:44
bunch of human subjects and you had
19:47
the, I think you had, was it
19:49
400 test subjects and they all needed,
19:51
like at least two people needed to
19:53
solve every single task and you had
19:55
to do this experiment design and you
19:57
had to balance complexity of the tasks.
20:00
and so on. How did you do all of that? Yeah, this was,
20:02
so this was one of the biggest things we
20:04
wanted to fix with ARC-AGI-1. We never had
20:06
a formal human calibration study on how do humans
20:08
actually... do on these things, you know, we relied
20:10
on anecdote. So we had to set up a
20:13
testing center down in San Diego. We recruited tons
20:15
of just folks from the local community all the
20:17
way from, you know, Uber drivers to single moms
20:19
to UCSD students and brought these folks out to
20:22
go take ARC puzzles. It was really cool, like,
20:24
we'll have to share some of the photos, like,
20:26
these like, testing shots where you have, like, you
20:28
know, dozens and hundreds of people, like, taking our
20:31
tasks on laptops. Our goal, originally with the
20:33
data set, was to ensure that every single
20:35
task that we put in ARC-AGI-2 was
20:37
solvable by at least one human. And what
20:39
we actually found was something of an even higher
20:42
standard, I think, which was that we found
20:44
that every single task in the new V2
20:46
data set is solvable by at least two
20:49
humans under two attempts. And these are the
20:51
same rules that we give to AI systems
20:53
on the benchmark, both on the contest, as
20:55
well as the public leaderboards. I think this
20:58
is a pretty good sort of assertion
21:00
of... a straightforward comparison we can actually use
21:02
now between, hey, are these tasks easy and
21:04
straightforward for humans? Yes, are they hard for
21:07
AI? Yes, like I said before, frontier systems
21:09
generally are getting close to zero or
21:11
single digit percentages on these tasks now.
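For readers who want the scoring rule he's describing in concrete terms, here is a minimal sketch, not the official evaluation harness, of a two-attempt, exact-match benchmark score; the function and type names are illustrative:

from typing import List

Grid = List[List[int]]  # a grid is a list of rows of small integer colour codes

def task_solved(attempts: List[Grid], truth: Grid) -> bool:
    # A task counts as solved if either of (at most) two predicted output grids
    # exactly matches the hidden test output -- the same two-attempt rule applied
    # to the human testers and to AI systems on the leaderboard.
    return any(a == truth for a in attempts[:2])

def benchmark_score(all_attempts: List[List[Grid]], all_truths: List[Grid]) -> float:
    solved = sum(task_solved(a, t) for a, t in zip(all_attempts, all_truths))
    return solved / len(all_truths)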
21:13
Okay, but the idea though is this
21:15
Moravec's paradox, right, which is
21:17
that, you know, basically while we can select
21:19
problems that are easy for humans and hard
21:21
for AIs we haven't got AGI yet But
21:23
I was looking through some of your challenges
21:25
and I felt that some of them were
21:27
very difficult like it would have taken me
21:30
Five or six minutes of deep thought to
21:32
get it like are you finding that it's
21:34
still easy to you know to find these
21:36
things that are easy for humans and hard
21:38
for AIs or are you kind of scraping
21:40
the barrel a little bit? So I think
21:42
easy for humans hard for AI is a
21:44
relative statement. The fact is, these V2 tasks
21:46
were solvable by humans on roughly a $5-
21:48
per-task budget. They were solvable
21:50
in five minutes or so. And AI
21:53
cannot solve these at all today. And
21:55
so yes, I do think if you
21:57
look at tasks, you have to think
21:59
about them. There's, like, you know, some
22:01
thought you need to put in to
22:03
sort of ascertain the rule. But the
22:05
sort of data, I think, speaks for
22:07
itself that, you know, we've got every
22:09
single task now in the V2 data
22:12
set from the public training set, I'm
22:14
sorry, the public eval set, to the
22:16
semi-private set, to the private eval
22:18
set, every single one is solvable
22:20
by at least two humans
22:25
under two attempts. So, you guys have
22:27
been cooking, you're already working on version
22:29
3 of ARC, what can you tell us
22:31
about that? So, the way I kind of
22:33
think about the multi-versions here, ARC-
22:36
AGI-1 again was designed
22:38
to challenge deep learning as a
22:40
paradigm. ARC-AGI-2 is designed
22:42
to challenge these AI reasoning systems.
22:44
I don't expect that ARC-AGI-2 is going to
22:46
be as durable. ARC-AGI-1 lasted for five
22:49
years. I don't expect ARC-AGI-2 is going
22:51
to be quite as durable as that.
22:53
I hope that will continue to be
22:55
a very useful signal for researchers over
22:57
the next year or two. But yeah,
22:59
we've been working on ARC-AGI-3 and I
23:01
think the pithy way to talk about
23:03
ARC-AGI-3 is it's going to challenge AGI
23:06
systems that don't even exist in the
23:08
world yet today. Can you tell me
23:10
about the foundation that you're setting
23:12
up? This was one of the cool things I think from Arc Prize
23:14
2024. When we launched it it was very
23:16
much an experiment. You know, our ambitions
23:18
were not quite what they are now. I
23:20
think when we went into 2024 our
23:22
main goals were just to raise awareness
23:24
of the fact that this benchmark was,
23:26
and I think what we found was,
23:29
or what I personally found, I just
23:31
kept getting surprised by the community around
23:33
Ark. I remember this really specific moment when
23:35
O1-preview came out and there were
23:37
thousands of people on like Twitter, like demanding
23:39
that we test this like new model on
23:41
ARC. And that was not my mental model
23:44
of like what this benchmark was for or what
23:46
the community was. And that was so cool.
23:48
And that moment happened again when we ended
23:50
the contest. That moment again happened when we
23:53
launched the results on O3. And this kind
23:55
of showed I think, hey, there's a real
23:57
demand for ARC. There's a real demand for
23:59
benchmarks that look like this, that ascertain
24:01
these like capability gaps between humans and
24:03
computers. And so we set up this
24:05
foundation in order to basically be the
24:07
North Star for AGI and continue to
24:10
produce useful, interesting, durable benchmarks in the
24:12
sort of spirit of trying to discern
24:14
like what are the things that are
24:16
simple, straightforward, easy for humans and still
24:18
remain impossible or very very difficult for
24:20
AI. And we're going to carry that
24:22
torch all the way until we get
24:24
to AGI. As you can see now,
24:26
all of the large AI labs, they're
24:28
focusing on reasoning and I'd like to
24:30
think that ARC was at least a
24:32
small part of that. And you folks
24:34
are very focused on open source as
24:37
well. Mark Chen said specifically on the
24:39
OpenAI podcast that they've been thinking
24:41
about ARC v1 for years. There
24:43
you go. Well, yeah, exactly. But just
24:45
tell me a little about that. So
24:48
there's the industry impact, but you guys
24:50
are really focused on open source as
24:52
well. So how do you see those
24:54
two things? I think AGI is the most important technology that
24:56
humanity is going to develop. And if it
24:58
is true that we are in an
25:00
idea constrained environment, we still need new
25:02
ideas to get to AGI, which I think
25:04
ARC-AGI-2 shows is true. If that's
25:06
true about the world, then I think we
25:09
should be designing the most innovative sort
25:11
of ecosystem and environment across the world
25:13
that we possibly can. This was one of the
25:15
reasons why we launched Arc Prize originally internationally
25:17
to reach solar researchers. to inspire researchers again
25:19
to go try and work on these new
25:21
ideas to get past this pre-training regime, try
25:24
something that we knew that it needed to
25:26
be something beyond this and even beyond what
25:28
we have today. And I think if you
25:30
look at like a really healthy, strong innovation
25:32
ecosystem, you're gonna look at one that is
25:34
very open and there's a lot of sharing
25:36
and there's a lot of diversity of approach.
25:39
And this is in contrast to an ecosystem
25:41
that would be very closed, very secretive,
25:43
very dogmatic, very monocultural. And so those
25:45
values of openness, those values of sharing,
25:47
are what the Arc Prize Foundation stands
25:49
for in order to sort of increase
25:51
the chance that we can get to
25:53
AGI soon. So talking about the version
25:55
2 of the Arc Challenge, can you
25:57
just give us the elevator pitch of that?
25:59
Sure. So arc two is basically a new
26:01
version of arc that keeps the same format
26:04
but tries to address the main flaws that
26:06
we saw in arc one. So for
26:08
instance in arc one we knew that
26:10
there was a little bit of redundancy
26:12
across tasks. So we saw that actually
26:14
there early on as early as the
26:16
2020 Kaggle competition. And also arc one
26:18
was way too brute forcible. So back
26:20
in 2020 what we did after the
26:22
Kaggle competition is that we tried to look
26:24
at all tasks that were solved at
26:26
least once by one entry in the
26:28
competition. And we found that half of
26:30
the private data set could be solved, in
26:32
fact, just via the sort of basic
26:35
brute-force program search methods that were
26:37
popular during the first competition.
26:39
And so that means half the
26:41
data set actually doesn't give you
26:43
very good signal about AGI
26:45
at all. So the other half was
26:47
actually good enough, required enough generalization
26:49
that the benchmark overall was still useful,
26:52
and it still lasted quite a few
26:54
years after that. But it told you,
26:56
you know, from the start, that there
26:58
were some pretty significant flaws, which is
27:01
expected, by the way, like, you know,
27:03
when I started creating arc back in
27:05
2018, 2019, I was flying blind, you
27:07
know. I was trying to capture my own
27:09
thoughts, my own intuition about what does
27:12
it mean to generalize, what is abstraction
27:14
and reasoning, and that turned into
27:16
this benchmark. But I could not anticipate
27:18
what kind of AI techniques would be
27:20
used against it. And so yeah, as
27:23
it turns out, a lot of it
27:25
could be brute-forced. So arc two
27:27
completely addresses that. You cannot score, you
27:29
know, higher than one or two percent
27:31
at most using brute force techniques on arc
27:34
two. So that's good news. And other
27:36
than that, we generally try to make
27:38
it a little bit harder. So what
27:40
we saw with arc one is that
27:42
it was very easy to saturate for
27:44
humans. Like if you're, if you're, you
27:47
know, like a STEM grad, for instance,
27:49
you could basically get 100 percent, or within
27:51
noise range of 100%, like something like
27:54
97, 98. And so that means that
27:56
you were not getting a lot of
27:58
useful bandwidth to compare AI
28:00
capabilities with the capabilities of smart
28:03
humans. And if you just make
28:05
it a little bit harder, then
28:07
you get more range, wherein if
28:10
you're not very intelligent you score
28:12
lower, if you're very intelligent you
28:14
score higher, and you're not super
28:17
likely to completely saturate it until you
28:19
are the very top end of
28:21
the distribution. So that's what ARC 2
28:24
is. Same format, same basic rules.
28:26
So we're only using core
28:28
knowledge. You have these input output
28:30
pairs of grids that are at
28:32
most 30 by 30, but the
28:35
content is very different. You're not
28:37
going to find tasks where you
28:39
only have to apply one basic
28:41
rule that could be anticipated in
28:43
advance, like some kind of gravity,
28:46
things falling tasks or symmetry tasks.
28:48
All the tasks are very compositional.
28:50
So you have multiple rules, you
28:52
have more objects, the grids are
28:54
generally bigger, and the rules can
28:57
be chained together or can be
28:59
interacting together, and that makes it
29:01
completely out of reach for brute
29:03
force methods. And as it turns
29:05
out, it also makes it out
29:07
of reach for the base LLM pre-
29:09
training paradigm.
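For context on the format being described, an ARC task is published as JSON with a few demonstration input/output grid pairs plus test inputs; this is a rough sketch of loading one and checking the 30-by-30 size limit, with an illustrative file name:

import json

# Rough sketch of the public ARC task layout: "train" holds demonstration pairs,
# "test" holds the pairs a solver must complete; every grid is a list of rows of ints.
with open("some_task.json") as f:  # illustrative path
    task = json.load(f)

for pair in task["train"]:
    inp, out = pair["input"], pair["output"]
    # Grids are at most 30x30 in both ARC-AGI-1 and ARC-AGI-2.
    assert len(inp) <= 30 and all(len(row) <= 30 for row in inp)
    assert len(out) <= 30 and all(len(row) <= 30 for row in out)

test_input = task["test"][0]["input"]  # the grid a solver has to transform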
29:12
You're saying that you've made them more compositional and iterative
29:14
and harder for humans. That's right.
29:16
Could you give me a little
29:18
bit more detail on that? I
29:20
mean, if you think about it, there are
29:22
different dimensions of things that AI models can
29:24
do, and there are different dimensions of things
29:27
that humans can do. Have you, sort of,
29:29
quite diversely explored that? Or, I mean, could
29:31
you just give me a bit of a
29:33
breakdown of the task characteristics? So, in ARC 1,
29:35
you had many tasks that were very basic.
29:38
We just had one rule. Let's say, for
29:40
instance, you have a few objects. and you
29:42
have to flip them, right? So this is
29:44
an example of a task that's easy to
29:46
brute force, because flipping is something that you
29:49
can acquire via pre-training as a concept, or
29:51
that you could just hard code in a
29:53
brute force program such system. So if that's
29:55
the only rule you have to apply, and
29:58
just apply it once, that's not compositional, that's
30:00
actually pretty easy to anticipate, that's
30:02
easy to brute force. So a
30:04
compositional task is going to be
30:06
a task where you have more than
30:09
one concept and typically they're going
30:11
to be interacting together. Like an
30:13
example of a very simple compositional
30:15
task is let's say you have
30:18
object flipping but also the objects
30:20
are falling. So you have two rules
30:22
to apply to each object at once.
30:24
But that again is a kind of
30:26
task that could still be found via
30:29
brute force program search, if you have
30:31
as key elements in your DSL, gravity
30:33
and flipping, for instance. And so you
30:35
want to create tasks that are where
30:38
the rules are chained to a sufficient
30:40
level of depth that there's no way
30:42
you could find a chain by just
30:44
trying every possible chain, because it will
30:46
become too expensive. Of course, humans can
30:49
still do it, because humans are not
30:51
just trying every possible combination of
30:53
things they know on every problem
30:55
that they see, they just have.
30:57
very efficient, very intuitive way of searching
30:59
for a theory that explains what they
31:01
see. You know, my co-host Keith, he
31:04
had this idea of doing a recursive
31:06
version of arc. But the thoughts occurred
31:08
to me is that even though
31:10
we do this systematic compositional reasoning,
31:12
we still have some kind of
31:14
cognitive limit. So if we nested,
31:16
let's say, four levels of arc
31:19
challenges within the same problem, wouldn't
31:21
you find very quickly that humans
31:23
just can't solve it? Right, so
31:25
if you just like concatenate two arc
31:27
tasks for instance, you get something that's
31:30
much less but forcible, that's much harder
31:32
because there are more rules going on.
31:34
It's not quite what I would call
31:36
compositional though, because even though you have
31:39
two rules at once, they're not interacting
31:41
with each other, right? You can solve
31:43
them separately and they concatenate the solutions.
31:45
And I think it's not a bad
31:47
idea at all, like it will work
31:49
as a way to make arc more
31:51
difficult, with again this caveat
31:54
that you're not actually
31:56
testing for depth of
31:59
compositionality. One issue, though, is that
32:01
it would only really work once, because
32:03
as soon as the person developing the
32:06
AI system notices that the task can
32:08
actually be decomposed into subtasks, then it's
32:10
game over. So I think it's actually
32:13
more interesting to have multiple
32:15
rules at once, but they're actually
32:17
being chained together, or they're
32:19
interacting together. For instance, one
32:21
rule might be writing some information on
32:24
the grid that needs to be read
32:26
by the second rule. Right. What performance
32:28
do the frontier models get on
32:30
ARC v2? So what we saw was
32:32
a big gap between models that
32:34
don't do any kind of test
32:36
time adaptation, like any kind of
32:38
test time search or test time
32:40
training, and models that do. And
32:42
the base LLMs, even models like
32:44
GPT4.5, they're basically scoring zero. I
32:46
think one of them, I think
32:48
it was R1 maybe, scored like
32:50
slightly above zero, something like 1%. But
32:53
you know, it's within noise range of
32:55
zero. So any model that cannot do
32:57
test-time adaptation, that is to say that
32:59
does not possess fluid intelligence, does effectively
33:02
zero. So in that sense, arc two
33:04
is actually a very strong sign that
33:06
you have fluid intelligence. Better than arc
33:08
one. I think arc one could already
33:11
tell you that, but less perfectly. So
33:13
On arc one, if you do not
33:15
do test-time adaptation, you can still do
33:17
up to roughly 10%. So on arc
33:20
two, that's actually zero. So it's a
33:22
better test. Now, when it comes to
33:24
model that do test time adaptation, so
33:26
we tried, for instance, some of the
33:29
top entries from the Kaggle competition last
33:31
year, the models that were
33:33
doing test-time training in particular
33:35
or some kind of program search. And
33:37
the best model, the model that
33:40
actually won the Kaggle competition, can
33:42
do, I believe, 3% on arc
33:44
two. And if you take an
33:46
ensemble of the top entries from
33:48
the competition, you get to 4%.
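As a rough illustration of what an ensemble of entries means in practice, here is a minimal sketch, my own illustrative code rather than any contestant's, that pools one candidate grid per solver and keeps the two most common answers, matching the benchmark's two-attempt rule:

from collections import Counter
from typing import Callable, Dict, List

Grid = List[List[int]]

def ensemble_attempts(solvers: List[Callable[[Dict], Grid]], task: Dict) -> List[Grid]:
    # Pool one candidate output per solver, then keep the two most-voted
    # candidates as the two allowed attempts.
    votes = Counter()
    for solve in solvers:
        candidate = solve(task)
        votes[tuple(map(tuple, candidate))] += 1  # tuples so grids are hashable
    return [list(map(list, g)) for g, _ in votes.most_common(2)]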
33:51
Right. So that's not very high.
33:53
We also estimate that so O3 would
33:55
be the current state of the
33:57
art in terms of an AI model.
34:00
that does exhibit fluid intelligence. And so
34:02
we haven't been able to test O3
34:04
on low-compute settings on all of the
34:06
tasks that we wanted to test it
34:09
on, but we've tested it on a
34:11
subset. And so we can extrapolate what it
34:13
would score on the entire set. And
34:16
it sounds like it is going to
34:18
be about 4%. Right. So not super high.
34:20
There's a lot of room to go higher on
34:22
that. And we haven't been able to
34:24
test O3 on high compute settings at
34:26
all. So you know, that's the model that
34:28
was scoring 88% on ARC v1. So
34:30
I can make a guess. I
34:33
guess, based on what we
34:35
saw from O3 low and
34:37
other models, I think you
34:39
might get up to
34:41
like 15, maybe even
34:43
20% if you were
34:45
maxing out the compute
34:47
setting and spending like
34:50
10K per task, for instance.
34:52
But it would still be
34:54
like far below average
34:56
human performance which would be more
34:58
like 60%. So that 4% that O3
35:00
gets on ARC v2, do you think
35:02
of that as fluid intelligence or
35:05
do you think of that as
35:07
a potential gap in, I mean
35:09
presumably you could have designed ARC v2,
35:11
if you selected the correct sets
35:13
of human calibrated challenges, you could
35:15
have found a set which was
35:17
still 0% for O3. Yeah, absolutely, no,
35:19
you could have, you know, obviously selected
35:21
against O3 and then O3 would do
35:23
zero. It would be very, very easy
35:25
to go from 4% to 0% right?
35:27
It's just a few
35:29
tasks that you need to change. So
35:31
we're not, we're not actually trying to
35:34
do that. Yes, I do believe that
35:36
4% does show that you have non-zero
35:38
fluid intelligence, which is also something that
35:41
you could get as a signal from
35:43
arc one. And I think the sign
35:45
that you see fluid intelligence in
35:48
these models is the performance gap between
35:50
the huge pre-trained only models that don't
35:52
do test-time adaptation, which score effectively zero,
35:54
maybe one. And you could say that
35:57
one percent is in fact a flaw in the
35:59
data set, sure. It should in practice
36:01
be zero. And the models that
36:03
do test-time adaptation do
36:05
non-zero, three percent, four percent, maybe
36:07
five percent, right. And that means that
36:10
there's something like 95 percent of
36:12
the data set that will actually
36:14
give you this useful bandwidth for
36:16
measuring how much fluid intelligence the
36:18
model has. And that's something you
36:21
were not getting with arc one.
36:23
Arc one was more binary, where
36:25
if you don't have fluid intelligence,
36:27
you're going to do very, very
36:29
low, like below 10% roughly. If
36:31
you do, you're going to score
36:34
significantly higher and getting above 50%
36:36
would be very easy. But because
36:38
the measure would saturate very quickly,
36:40
as soon as you start adding
36:43
non-zero fluid intelligence, you did not
36:45
get that useful bandwidth that you're
36:47
getting with ARC 2, so I think
36:50
ARC 2 should allow for answering the question,
36:52
is this model actually as fluidly
36:54
intelligent as the average human, which is
36:56
something you could not get at far
36:58
one. I guess it's just an economics
37:00
thing at this point, so if you
37:02
spent, let's say, a billion dollars or
37:04
half a billion dollars, you could saturate
37:06
ARGV2, I'm not sure if you would
37:08
agree with that, but... If that isn't the
37:10
case, I mean, what do you think
37:12
are the specific things that are missing
37:15
from O3, that are stopping it from
37:17
doing better? So it's never just an
37:19
economics question, because intelligence is not just
37:21
about capabilities, it's also about the efficiency
37:24
with which you acquire and deploy these
37:26
capabilities. And sure, if you spend billions
37:28
and billions of dollars, maybe you can
37:30
saturate arc two, but that would already
37:32
have been true back in 2020, using
37:35
like extremely crude brute-force program search. If
37:37
you have a DSL that's actually
37:39
Turing-complete, then you know that
37:41
for every arc task, there exists a
37:43
program that may not in fact be
37:45
all that long that will solve the
37:47
task. And all you need to do
37:49
is iterate over all possible programs
37:52
in order of length. And then the
37:54
first one you find is the one
37:56
that's going to generalize, right? Because it's
37:58
the shortest. It's the most parsimonious. So
38:00
if you spend unlimited resources, you
38:02
already have AGI in that sense,
38:04
just in a pure skill sense,
38:07
you can always just try every
38:09
possible program until you find one
38:12
that works. But that's not what
38:14
intelligence is. Intelligence is about finding that
38:16
program in very few hops using
38:19
actually very little compute. Like look
38:21
at the amount of energy that
38:23
a human expends to solve
38:26
one ARC task over two or
38:28
three four minutes. It's almost zero,
38:30
right? And compare that to a
38:33
model like O3 on high-compute settings, for
38:35
instance, which is going to use
38:37
like over 3,000 bucks of compute.
38:39
So it's never just an economic
38:41
problem. Efficiency is actually the
38:44
question we're asking. Efficiency is
38:46
a problem statement. It's not capability.
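His point about iterating over all possible programs in order of length can be made concrete; the sketch below uses a toy set of primitives invented for illustration, and shows that unbounded search always recovers the shortest consistent program -- which is exactly why capability at unlimited cost is not the interesting question, and efficiency is:

from itertools import product

# Length-ordered brute-force program search: enumerate every sequence of
# primitives from shortest to longest and return the first one consistent
# with the training examples -- by construction the most parsimonious fit.
def shortest_consistent_program(primitives, train_pairs, max_len=4):
    for length in range(1, max_len + 1):
        for program in product(primitives, repeat=length):
            def run(x, program=program):
                for op in program:
                    x = op(x)
                return x
            if all(run(inp) == out for inp, out in train_pairs):
                return program
    return None

# Illustrative example with integer ops (not an ARC DSL): hidden rule is (x + 1) * 2.
inc = lambda x: x + 1
dbl = lambda x: x * 2
print(shortest_consistent_program([inc, dbl], [(3, 8), (5, 12)]))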
38:48
So intelligence is knowledge
38:50
acquisition efficiency. O3 did very
38:53
well on ARC v1. And now that
38:55
it does so badly on ARC v2,
38:57
the whole point of your definition
38:59
of intelligence is that given some
39:01
basis knowledge, you efficiently recombine, you
39:03
produce new skill programs, you're saying
39:05
that in the absence of the
39:07
base knowledge in V2, there is
39:09
no intelligence. Therefore, is O3 not
39:11
actually as intelligent as we thought
39:13
it was? I think O3 is
39:15
one of the first models, perhaps,
39:17
arguably in the first model that
39:19
does show fluid intelligence. So now,
39:21
what the results on ARC 2 are telling
39:24
you is that it's not human
39:26
level fluid intelligence, right? But still,
39:28
I would consider O3 as a
39:30
kind of prototype, with two
39:32
big flaws, two big caveats. One is
39:34
of course efficiency. Efficiency is
39:37
part of the problem statements, in
39:39
fact, the central point. So as
39:41
long as you're not as efficient
39:43
in terms of, for instance, you know,
39:45
data efficiency, compute efficiency,
39:48
energy efficiency, then it's only
39:50
a temporary solution. We'll find...
39:52
a better solution in the
39:54
future. And also, it's not
39:56
quite human level. If it
39:58
were human level, I would expect it to
40:00
score, you know, something like, O3
40:02
should score like over 60% on ARC
40:05
2. And we don't know what
40:07
the exact number is going to be,
40:09
but you know, probably like
40:11
four to five percent, right? Do you
40:13
think that general intelligence is a category
40:16
or a spectrum? So general through the
40:18
intelligence, it's, I would say it's both,
40:20
because there's a huge difference between just
40:22
having memorized a bunch of skill
40:25
programs that are static, and knowledge,
40:27
facts, et cetera... versus being able
40:30
to adapt to novelty to a non-zero
40:32
extent. So that is a binary
40:34
distinction. Either you have fluid intelligence
40:37
or you don't, right? And arc
40:39
one could answer that question for
40:41
any system. But once you have
40:44
non-zero fluid intelligence, then the question
40:46
is how much do you actually
40:49
have and how it compares to
40:51
humans? And that's related to
40:53
the notion of... recombination
40:55
of the skill programs that
40:57
you have, the knowledge that
40:59
you have, and depth of
41:01
recombination. So if you do
41:03
no recombination at all, you
41:05
don't have fluid intelligence. If
41:07
you do some recombination, you
41:09
do. But then the question
41:11
is how deeply can you
41:14
recombine? Like, for instance, if
41:16
you're using a program synthesis
41:18
analogy, the question is
41:20
how big of a program can you
41:22
write on the fly to adapt to a
41:24
new problem, right? And of course, as well,
41:27
how efficiently, how fast and how efficiently
41:29
you can write it, right? So it
41:31
is a binary, but it's also a
41:33
spectrum. And arc one was trying
41:35
to ask the binary question, does this
41:38
system have any fluid intelligence at all?
41:40
And arc two is more on the
41:42
side of trying to measure how much
41:45
fluid intelligence you actually have compared to
41:47
humans. How long do you think it
41:49
will take for V2 to be saturated
41:52
and do you think it will survive
41:54
until V3 comes up? So that's a
41:56
question where you have to take into
41:59
account resource efficiency. So if you're asking
42:01
how long it will take before
42:03
we have a system that can
42:05
score higher than, let's say, 80% on
42:07
arc two, using less than $10,000
42:09
of compute, for instance, I think
42:11
probably around a couple years. So
42:13
it's very difficult to make predictions
42:15
here. I think if you're just
42:17
looking at current techniques and scaling
42:20
up current techniques I think could
42:22
take a while. I think the
42:24
arc two is actually way out
42:26
of reach of current techniques. But
42:28
of course we are not limited
42:30
to current techniques you know in
42:32
2025 we're probably going to see
42:34
new breakthroughs in the same way
42:36
that we saw new breakthroughs last
42:38
year. And these breakthroughs are actually
42:40
very difficult to predict. I was
42:42
personally very surprised with the performance
42:44
that O3 could get on
42:47
arc one last year. That came
42:49
as a surprise. So maybe we'll
42:51
have new surprises this year. But
42:53
I would be extremely surprised if
42:55
we see an efficient solution
42:58
that's human level on arc two
43:00
by the end of 2025. I
43:02
would basically rule that out. By
43:04
the end of 2026, maybe. Right.
43:06
which is why we have ARC-AGI-3
43:08
coming, of course. So on analysis of
43:10
failure modes, I'm sure you saw the
43:12
blog post that I read where it
43:14
went through all of the different failure
43:16
modes of O3, and of course it
43:19
was solution space prediction which made it
43:21
more surprising to me. My take on
43:23
it was I was really impressed that
43:25
even when it failed it was because
43:27
the solution space got too big or
43:29
it was just getting minor mistakes but
43:31
broadly it got the direction of many
43:34
of the problems quite well. You know
43:36
similarly tell me about the failure modes
43:38
on V2. Right so we were not able
43:40
to test O3 as much on V2 but
43:42
I can tell you a bit about failure modes
43:45
based on what we saw on V1, and well,
43:47
there are many, but generally this
43:49
is a model where
43:51
reasoning abilities can decrease
43:54
exponentially with problem size.
43:56
If you have more objects in
43:58
the scene, if you have more... more
44:00
rules, more concepts, interacting. You see
44:02
this exponential decrease in capabilities. It's
44:04
also, you know, because it's a
44:06
model that needs to, it works
44:08
by writing a kind of natural
44:10
language program that describes what it's
44:12
seeing, that describes the problem and
44:15
the sequence of steps to solve
44:17
it. So in that sense, it's
44:19
100% a natural language program. And
44:21
that means that in order to
44:23
solve a problem it has to talk
44:25
about it using words. And as
44:27
a result, if you have a
44:29
task where the rule is very
44:31
simple to grok for a human,
44:33
but in a non-verbal way, but
44:36
it's very difficult to put it
44:38
into words, it has no verbal
44:40
analogy. That's actually much harder to
44:42
solve for this kind of
44:44
model. So other than that, we
44:46
saw that just one of the
44:48
big challenges is, you know, compositionality,
44:50
having multiple rules interact. There's also,
44:52
it seems there's a bit of
44:54
a... locality bias going on as
44:56
well, where if you have to
44:59
combine together bits of information that
45:01
are spatially co-located together on the
45:03
grid, that's easier for the model
45:05
than if you have to do
45:07
the exact same thing, but the
45:09
two bits of information you have
45:11
to synthesize are pretty distant. So
45:13
having to combine together bits of
45:15
information that are separate, having to...
45:17
It seems as well that the
45:20
model has a trouble simulating the
45:22
execution of a rule and then
45:24
reading the results. Like for instance
45:26
if you're solving our task and
45:28
you grock a certain rule and
45:30
then you start applying it, let's
45:32
say it's like you're continuing your
45:34
line continuation or something, and then
45:36
you have to take another rule.
45:38
and use that rule to read
45:41
a bit of information that you
45:43
have written in the process of
45:45
executing the first rule. That
45:47
sort of thing is completely out
45:49
of reach for the chain of
45:51
thought models. How multi-dimensional do you
45:53
think intelligence is? You know, one school
45:55
of thought, and I think you
45:57
might subscribe to this, is that
45:59
the universe is kind of almost
46:01
made up of platonic rules that
46:04
are disconnected from the world that
46:06
we live in, and then there's
46:08
this kaleidoscope idea you talk about,
46:10
and they get combined together, and
46:12
that's what we see. But another
46:14
school of thought is that there'll
46:16
always be another dimension of intelligence.
46:18
We'll always need Ark V4, V5,
46:20
V6, and there'll always be something
46:22
missing. Each step of generality that
46:25
you cross, you're gaining a nonlinear
46:27
amount of capabilities, right? And so
46:29
after a few steps, you are
46:31
so overwhelmingly superhuman across every possible
46:33
dimension that, yeah, you can say
46:35
without a doubt that you have
46:37
AGI, in fact, you
46:39
have super intelligence. But yeah, intelligence
46:41
is, in a sense, multidimensional. And
46:43
what Ark is trying to capture
46:46
is really just this fluid intelligence
46:48
aspect, this ability to recombine core knowledge
46:50
building blocks. So in my definition
46:52
of intelligence, intelligence is about efficiently
46:54
acquiring skills and knowledge and recombining
46:56
them to, well, again, efficiently recombining
46:58
them to adapt to novel tasks,
47:00
to novel situations that you cannot
47:02
prepare for explicitly. Purely the ability
47:04
to take a bunch of building
47:06
blocks and recombining them, doing kind
47:09
of program synthesis, that's one aspect
47:11
of that. That's probably the most
47:13
central aspect, which is why, you
47:15
know, this is why we're focusing
47:17
on with ARC. But it's not
47:19
the only aspect, because this is
47:21
assuming that you already have this
47:23
pile of knowledge available, so it's
47:25
not, it's overlooking the acquisition of
47:27
information about the task. In ARC,
47:30
you're provided all the information about
47:32
the task at once, but in
47:34
the real world, you have to
47:36
collect that information, you have to
47:38
take actions, set goals, to discover what
47:40
your environment is even about what
47:42
you can do within it. And
47:44
you have to do these things
47:46
efficiently, of course. And that efficiency
47:48
aspect is very important because intelligence
47:51
was developed by evolution as an
47:53
adaptation. And when you're exploring the
47:55
world, you are taking on some
47:57
risk. You might get killed by
47:59
a predator, for instance. And so
48:01
you want to be efficient. You
48:03
want to gain the maximum amount
48:05
of information and thereby power over
48:07
your environment by taking on a
48:09
minimum amount of risk and expending
48:11
a minimum amount of energy. That's
48:14
not something you can measure, not
48:16
something you can capture with ARC
48:18
v1 or v2 alone. Can you
48:20
just expand on the significance of
48:22
the solution space prediction with O3?
48:24
Because that rather suggests to me
48:26
that... it's almost this Rich Sutton
48:28
idea where it's nearly a blank
48:30
slate and it's very empiricist and
48:32
we just take the data in
48:35
and the neural network does all
48:37
of the things. I always imagine
48:39
that we would need to have
48:41
some kind of structured approach which
48:43
took, you know, the core knowledge
48:45
into account. Do you think that,
48:47
you know, it's actually simpler than
48:49
we thought? Trying to directly predict
48:51
the output versus trying to write
48:53
down... the steps to get the
48:56
output. They are not entirely separate
48:58
things. Because of course, once you've
49:00
written down the steps, you can
49:02
do what looks like transduction. And
49:04
O3 is not actually a real
49:06
transduction model, because it's much closer
49:08
to a program synthesis model, where it's
49:10
searching for the right chain of
49:12
thought to describe the task and
49:14
list the sequence of steps to
49:16
solve it. And once you have
49:19
the chain of thought, you can
49:21
just use the model to execute
49:23
it and it gives you the
49:25
output. So from the outside, if
49:27
you treat the entire system as
49:29
a black box, it looks like
49:31
transduction, but the same would be
49:33
true of any program search system.
49:35
What it's actually doing, and the
49:37
reason why it's able to adapt
49:40
to novelty so well, is because
49:42
it's synthesizing this chain of thought,
49:44
which serves as a recombination artifact
49:46
for the knowledge and the skills
49:48
that the model has, a recombination
49:50
artifact that is adapted to the particular
49:52
task at hand. So it's much
49:54
closer to a program synthesis model.
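To make the "program search over chains of thought" framing concrete, here is a minimal, hypothetical sketch in Python: the candidate "programs" are ordinary functions standing in for sampled chains of thought, each candidate is verified against the demonstration pairs, and the first one that reproduces them is applied to the test input. The task and candidate list are made up for illustration and say nothing about how O3 is actually implemented.

```python
import numpy as np

# Candidate "programs" standing in for sampled chains of thought.
CANDIDATES = [
    lambda g: np.rot90(g),           # rotate the grid
    lambda g: np.flipud(g),          # flip vertically
    lambda g: g.T,                   # transpose
    lambda g: np.where(g > 0, 2, 0)  # recolour all non-zero cells to 2
]

def search_program(demos):
    """Return the first candidate that maps every demo input to its output."""
    for program in CANDIDATES:
        if all(np.array_equal(program(np.array(x)), np.array(y)) for x, y in demos):
            return program
    return None

# A made-up task: the hidden rule is "transpose the grid".
demos = [([[1, 0], [0, 0]], [[1, 0], [0, 0]]),
         ([[0, 3], [0, 0]], [[0, 0], [3, 0]])]
program = search_program(demos)
print(program(np.array([[0, 0], [5, 0]])))   # -> [[0 5] [0 0]]
```

From the outside, only the final grid comes back, so the whole loop looks like transduction; internally, it is search plus verification plus execution.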
49:56
This is something that the community found
49:58
very confusing because in the last
50:01
interview you were describing I think
50:03
it was O1 Pro as
50:05
being a kind of explicit search
50:07
process and yeah what seems to
50:09
be the case is that it's
50:11
you know there is some kind
50:13
of reinforcement learning thing in
50:15
the pre-training and then it maybe
50:17
does some sampling at inference time so
50:19
we you know we're doing a
50:21
whole bunch of completions and are
50:24
you saying it's as if it's
50:26
doing a program search? It's searching
50:28
over the space of possible chains of
50:30
thought and finding the one that
50:32
seems most appropriate. So in that
50:34
case it's entirely analogous to a
50:36
program search system where the program
50:38
you're synthesizing is a natural language
50:40
program, right? A program written in
50:42
English. Okay, it just seems a
50:45
bit strange, doing autoregression on
50:47
a language model, how that could
50:49
be characterized as a search process.
50:51
So a model like O1
50:53
Pro, for instance, or O3, is
50:55
not just autoregressive. It actually has
50:57
this test time search step, which
50:59
is why it can adapt to
51:01
novelty much much better than the
51:03
base models that are purely autoregressive. That's
51:06
why, again, you see on the...
51:08
So in general, ARC, even ARC1,
51:10
has completely resisted the purely
51:12
autoregressive pre-training scaling paradigm, like
51:14
from... 2019 to 2025, we scaled
51:16
up these models by 50,000x,
51:18
like from GPT-2 to GPT-4.5. And
51:20
even on Arc 1, you went
51:22
from 0% to something like 10%
51:24
and on Arc 2, you're going
51:26
from 0% to 0%, right? And
51:29
meanwhile, if you have
51:31
any system that's actually capable of
51:33
doing test time adaptation, like test
51:35
time search, like O1 Pro or
51:37
O3, then you're getting much, much
51:39
better performance. There's this huge performance
51:41
gap. So, you can tell the
51:43
difference between the model that does
51:45
not do test time adaptation and
51:47
the model that does by looking
51:50
at this performance gap, this generalization
51:52
gap on ARC, also by looking
51:54
at latency and by looking at
51:56
cost. So, of course, a model
51:58
that does test time search is
52:00
going to give you an answer. It's
52:02
going to take much longer. Like
52:04
if you look at O1 Pro,
52:06
for instance, it's taking 10 minutes
52:08
to answer your queries, and it's
52:11
going to cost you much more
52:13
as well, because of all this
52:15
work it's doing.
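As a crude way to picture the gap being described (this is only an illustrative sketch, not a claim about how O1 Pro or O3 actually work), compare plain greedy decoding with the simplest possible form of test-time search, best-of-n sampling with a majority vote: the search variant spends roughly n times the compute and latency per answer in exchange for better reliability.

```python
import random
from collections import Counter

# Deliberately contrived stand-in for an LLM call (hypothetical): at temperature 0
# it is confidently wrong; at temperature 1, half of its samples hit the right answer.
def llm_sample(prompt, temperature):
    true_answer = 42
    if temperature == 0:
        return true_answer + 3                       # the single greedy completion is off
    return true_answer + random.choice([0, 0, 0, -1, 1, 7])

def greedy_answer(prompt):
    # Purely autoregressive baseline: one completion, fixed cost, fixed latency.
    return llm_sample(prompt, temperature=0)

def test_time_search_answer(prompt, n=32):
    # Best-of-n / self-consistency: sample many completions, keep the most common one.
    # Costs roughly n times the compute and wall-clock time of the greedy call.
    samples = [llm_sample(prompt, temperature=1) for _ in range(n)]
    return Counter(samples).most_common(1)[0][0]

print(greedy_answer("toy task"))            # cheap and fast, wrong by construction here
print(test_time_search_answer("toy task"))  # slower and costlier, almost always 42
```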
52:17
So I could download the DeepSeek R1
52:19
model, and I'm running it on
52:21
my machine, and as far as
52:23
my machine is concerned, it's just
52:25
a normal LLM, it's doing greedy
52:27
sampling, maybe. Oh, so you're saying
52:29
there is something different about...
52:31
O3 is qualitatively different. That's correct.
52:34
It is qualitatively different from all
52:36
the other models that came
52:38
before. It is actually a model
52:40
that has fluid intelligence. It has
52:42
a non-zero amount of fluid intelligence. And
52:44
R1, for instance, does not. Okay,
52:46
so categorically it's doing some kind
52:48
of active search process at inference. That's
52:50
what it looks like. So of
52:52
course, I don't actually know how
52:55
it works, but that's what I
52:57
would speculate it looks like. Yes.
52:59
And you see it in the
53:01
latency, in the cost, and of
53:03
course the ARC performance. Would you
53:05
be shocked and surprised if it
53:07
came to light that it was
53:09
just doing, like, autoregressive greedy sampling?
53:11
Honestly I think it's very very
53:13
unlikely, because it's completely incompatible with
53:16
the characteristics of the system that
53:18
we know of that we were
53:20
exposed to when we tested
53:22
O3. Awesome. And do you think
53:24
that there will always be human
53:26
gaps? Probably not always. Today there
53:28
are very clear, very significant gaps,
53:30
right? Like we're not actually that
53:32
close to AGI right now.
53:34
But eventually, you know, as we
53:36
get closer and closer, there will
53:39
be fewer and fewer gaps.
53:41
At some point, we're going to have AI that is just overwhelmingly superhuman across every possible dimension you could look at. So at that point, I don't think there will be gaps.
All right. Tim, thank you so much for doing this. We're looking forward to seeing you in a couple of weeks.