Episode Transcript
0:00
This is not just more of
0:02
the same that we've seen in
0:04
the past. We have now an
0:06
existence proof that computers are able
0:08
to do something that they've never
0:10
been able to do before in
0:12
the history of humanity. ARC version 2
0:14
has just been released and even
0:16
the frontier foundation models are failing
0:18
spectacularly. Today we are using ARC-AGI-2, the
0:20
next version of the benchmark. ARC-AGI-2
0:22
is pretty much the only unsaturated
0:25
benchmark that is feasible for regular people.
0:27
And so it's a very good yardstick
0:29
to measure how much true intelligence these
0:31
models have, how close we are to
0:34
it. And alongside that, we're
0:36
really excited to be welcoming everyone to
0:38
Arc Prize 2025. Contest kicks off officially
0:41
now. It's going to run all the
0:43
way through the end of 2025.
0:45
The structure of the contest
0:49
is very similar to last year. We're
0:52
going to have the Kaggle leaderboard running.
0:54
We're going to have this big
0:56
prize. It's unclaimed. In order to get
0:58
the big prize, you have to open
1:00
source your solution, with a high degree of efficiency,
1:02
running on Kaggle. And now we're really,
1:04
really excited to see all the new
1:06
ideas. I think there was a lot
1:08
that came out last year in 2024
1:10
that really pushed the frontier. The next
1:13
version of the benchmark is more
1:15
challenging. It's extremely unsaturated. All frontier
1:17
models are scoring effectively within single-
1:19
digit percentages. It's the first time
1:22
where we've calibrated the human-facing difficulty
1:24
of the tasks. So we actually
1:26
hired roughly 400 people. We tested
1:29
every single task, and every single
1:31
task has been solved by at
1:33
least two people. So we know
1:35
it's very feasible for humans. It's
1:38
extremely out of reach for any
1:40
AI system today. This is the
1:42
frontier. The ARC benchmark forces us
1:44
to confront an uncomfortable truth about
1:46
our pursuit of artificial general intelligence,
1:49
that the field has been overlooking.
1:51
Intelligence is not just about capabilities,
1:53
it's also about the efficiency with
1:55
which you acquire and deploy these
1:57
capabilities. Intelligence is about finding that
1:59
program in very few hops, using
2:01
actually very little compute. Like, look
2:04
at the amount of energy that
2:06
a human expends to solve one
2:08
ARC task over, you know, two, three,
2:11
four minutes. It's almost zero, right?
2:13
And compare that to a model
2:15
like O3 on high-compute settings, for instance,
2:18
which is going to use like
2:20
over 3,000 bucks of compute. So
2:23
it's never just an economic
2:25
problem. Efficiency is actually the question
2:27
we're asking. Efficiency is a problem
2:30
statement. It's not capability. The goal post
2:32
is AGI. That's like what we're here
2:34
to do. That was the whole point
2:36
of launching Arc Prize in the first
2:39
place was to raise awareness that there
2:41
was this really important benchmark that I
2:43
thought showed something important that like
2:45
the sort of research community in AI
2:47
was missing about the sort of nature
2:50
of artificial intelligence. You know, this is
2:52
like one of the things I think
2:54
makes ARC special and very unique and
2:56
important, I would argue. You know, there's
2:58
a lot of benchmarks in the world
3:01
today. And to my knowledge,
3:03
pretty much every other benchmark, you know,
3:05
all the frontier benchmarks, basically are trying
3:07
to test for these like superhuman capabilities,
3:10
right? These like PhD-plus-plus type
3:12
skills that you need to have in
3:14
order to succeed at the benchmark. It's
3:17
not just compute, it's not just scale,
3:19
you have to be scaling the right
3:21
thing, you have to be scaling the
3:23
right ideas, and maybe you have them.
3:26
I personally just keep getting like surprised
3:28
and impressed by the Arc Prize community, how
3:30
much folks are pushing the frontier, and
3:33
I think it was really exciting too,
3:35
because it means that individual people and
3:37
individual teams out there can actually make a difference,
3:40
could actually make a significant
3:42
contribution to the frontier of AGI. So
3:44
if you're going to enter the contest,
3:46
go to arcprize.org and good luck. Good
3:49
luck, see you on the leaderboard. This
4:02
sort of like test time optimization
4:04
techniques or test time search techniques,
4:06
that's the current frontier for AGI,
4:08
right? And there are many ways
4:10
to approach it. Of course, you
4:13
can do just
4:15
test time training, or you can do search in
4:19
token space, or you can do search
4:26
in latent space as well, right. So
4:28
you have many different ways to adapt
4:30
to novelty at test time by recombining
4:32
what you know into some novel structure.
4:34
MLST is sponsored by Tufa AI
4:37
Labs. Now they are the DeepSeek based
4:39
in Switzerland. They have an amazing team. You've
4:41
seen many of the folks on the team.
4:43
They acquired MindsAI, of course. They did
4:45
a lot of great work on ARC. They're
4:48
now working on O1-style models and reasoning and
4:50
thinking and test time computation. The reason you
4:52
want to work for them is you get
4:54
loads of autonomy, you get visibility, you can
4:57
publish your research. And also they are hiring,
4:59
as well as ML engineers, they're hiring a
5:01
chief scientist. They really, really want to find
5:03
the best possible person for this role. And
5:06
they're prepared to pay top dollar as
5:08
a joining bonus. So if you're interested
5:10
in working for them as an ML
5:12
engineer or their chief scientist, get in
5:15
touch with Benjamin Crouzier. Go to
5:17
tufalabs.ai and see what happens.
5:19
Well, Mike, it's amazing to have you
5:21
on MLST. Welcome. Yeah, thank you so
5:23
much. We're very excited to
5:25
be here today. Mike, I hear
5:27
that you guys have got some
5:29
very exciting news today. Tell me
5:31
about it. Yeah, super excited today.
5:33
We're back. We're really excited
5:35
to be launching both ARC-
5:37
AGI-2 alongside an updated Arc
5:40
Prize, Arc Prize 2025 contest.
5:42
Both are going to be
5:44
launching today. ARC-AGI-1 was designed to
5:46
sort of challenge deep learning.
5:50
And ARC-AGI-2 in contrast is really a
5:53
benchmark that's designed to challenge these new AI
5:55
reasoning systems that we're starting to see from
5:57
pretty much all of the frontier labs.
6:02
And one of the really cool
6:04
things about ARC-AGI-2 is we're
6:06
basically seeing, you know, models, AI
6:08
systems that are purely based on
6:10
pre-training, effectively scoring 0%. And some of
6:12
the frontier AI reasoning systems, we're
6:14
in the process of testing them right
6:16
now, and we're sort of expecting
6:18
single-digit performance. So a really big
6:20
update over ARC-AGI-
6:22
1 from 2024. So the original
6:25
version of Arc was very much
6:27
aimed at these kinds of foundation
6:29
models that didn't do reasoning. Version
6:31
2 is tuned for the reasoning
6:33
models. What would you say to the
6:35
charge that you're moving the goalposts? I mean,
6:38
how is ARC v2 sort of like meaningfully
6:40
an evolution of the benchmark? Yeah, I mean,
6:42
I think the way that I think about
6:44
it is that the goalpost is AGI.
6:46
That's like what we're here to do. That
6:49
was the whole point of launching Ark Prize
6:51
in the first place was to raise awareness
6:53
that there was this really important benchmark that
6:56
I thought showed something important that like the
6:58
sort of research community in AI was missing
7:00
about the sort of nature of artificial
7:02
intelligence. So that's kind of our goalpost. And
7:04
you know the definition that I use for
7:06
AGI and the one that the foundation adopts
7:09
is assessing this capability gap between humans and
7:11
computers. And the Arc Prize Foundation's mission is really to
7:13
drive that gap to zero. I think it
7:15
would be hard to argue that we don't
7:17
have AGI. If you look around and you
7:19
can't find any more tasks that are very
7:22
straightforward, simple and easy for humans, that computers
7:24
can't do as well. And the fact is
7:26
that we were able to find still lots
7:28
of those tasks. In fact, all of the
7:30
tasks in the ARC-AGI-2 data set sort of
7:32
fit into this category of things that are
7:35
relatively easy and simple and straightforward for
7:37
humans, and comparatively very difficult and hard
7:39
for AI today. Okay cool so I
7:41
know you guys have done loads of
7:43
human calibration and we'll talk about that
7:45
in a minute but the fundamental philosophy
7:47
of the of the arc challenge is
7:49
focusing on human gaps but at the
7:51
same time AI models are becoming superhuman
7:53
in so many respects so is the
7:56
big story the human gaps or is
7:58
the big story the expansion of capabilities,
8:08
right?
8:10
These
8:12
like
8:14
PhD
8:18
plus-plus type skills that you
8:21
need to have in order to
8:23
succeed at the benchmark. Humans can't
8:25
solve the problems that are in
8:27
these benchmarks. So you have to
8:29
be very, very, like, have a
8:31
lot of experience, a lot of
8:33
education, a lot of training in
8:35
order to be able to sort
8:37
of even get close to sort
8:40
of solving the benchmarks as a
8:42
human. And I think those are
8:44
important. Those are useful. But I
8:46
think it's actually more illustrative of
8:48
something that like we're missing about the
8:50
nature of artificial
8:53
intelligence by looking at this gap. That's much
8:55
more of an inspiring story. I think
8:57
it's one where it's actually necessary to
8:59
target this in order to actually get
9:02
AGI that is capable of innovation. You
9:04
know I think this is one of
9:06
the main reasons I got into AI
9:09
and AGI in the first place was
9:11
being really inspired and excited about trying
9:13
to build these systems that would be capable
9:15
of like compressing science timelines. And if all
9:18
we have is AI... that looks like what we
9:20
had at the beginning of 2024, right, based
9:22
on pre-training, based on a memorization regime, you're
9:24
never going to get to that because these
9:27
are systems that are really going to reflect
9:29
back the experience and the knowledge that humanity
9:31
has sort of gained over the last, you
9:33
know, 10,000 generations as opposed to being ones
9:36
that are capable of like producing new knowledge.
9:38
new technology, adding to sort of humanity's like
9:40
colossus of sort of knowledge and technology. If
9:42
we want systems that can actually do that,
9:44
we need AGI. And this definition that we've
9:47
sort of used for this foundation of easy
9:49
for humans and hard for AI, I
9:51
think if we can close that gap,
9:53
we'll actually get technology that's capable of doing
9:55
that. I wonder whether you think we are
9:57
just about five discoveries away from AGI because...
10:00
there's going to be version three of
10:02
the ARC challenge, presumably there'll be a
10:04
version four. Intelligence is multi-dimensional and I
10:06
can see this both ways right because
10:08
you know many critics of AI they
10:10
are almost gaslighting us, they're saying that
10:12
this amazing technology that you're using it doesn't
10:15
work and I'm like well yeah it does
10:17
work and would it be the case that
10:19
the criticisms will become more and more kind
10:21
of philosophical and they'll say oh you know
10:23
because it's not biological or whatever, it's not
10:26
the same thing, or do you...?
10:28
I think this is why
10:30
benchmarks are important. And I had a similar
10:32
question actually, you know, when I was starting
10:34
to get back into AI in 2022 and
10:37
trying to understand the world. Like, are we
10:39
on track for AGI or not? How far
10:41
off are we? And I find that it's
10:43
really really hard to get a sense of
10:46
understanding of the capabilities of all these systems
10:48
purely by using them. You can certainly get
10:50
a sense by just interacting with them. But
10:52
if you really want to understand what are
10:54
they capable and not capable of, you really
10:56
need a benchmark to discern this fact. This
10:58
is one of the interesting things that I
11:00
picked up from building AI products at Zapier
11:02
as well. It's very different building products with
11:05
AI than it is classic software. One of
11:07
the big differences is when you're building classic
11:09
software, you can build and test with five
11:11
users and know, OK, hey, this is going
11:13
to be like, this product can scale to
11:15
millions. It's going to work the exact same
11:17
way. And that's fundamentally just not the case
11:19
with AI technology. You really have to deploy
11:21
to a large scale in order to assess how
11:23
it works. You need a benchmark alongside that scaling in
11:25
order to tell you, hey, is the system working or
11:28
not? What were the main lessons that you learned from
11:30
version one that you moved into version two? I
11:32
think, so ARC-AGI-2 has been in the works
11:34
actually for several years, François started working on it,
11:36
crowdsourcing some tasks for years and years and years
11:38
ago. There was a bunch of sort of inherent
11:41
flaws we ran into with it that we learned
11:43
as we sort of started popularizing the benchmark over
11:45
the last year or so. You know, one of
11:47
the things we learned was that a lot of
11:50
the tasks were very susceptible to brute force search.
11:52
That's something that has zero intelligence at
11:52
all, and we wanted to minimize the sort of
11:54
incidence rate of tasks that were sort of susceptible
11:56
to that. And we hadn't calibrated it. We anecdotally,
12:01
we relied on some anecdotes to say
12:03
that hey, ARC-AGI-1 is easy for
12:05
humans. We had you know a couple
12:07
of STEM folks, two STEM folks, who had
12:09
taken the whole data set including the
12:11
private set and were able to solve
12:13
you know, 98, 99%, but we were
12:15
relying on anecdote. We didn't have that
12:17
calibrated across the sort of three different
12:19
data sets that we had. And then we
12:21
had all these frontier AI reasoning systems come
12:23
out over the last you know three or four
12:26
months and we've got a chance to
12:28
study which of our tasks remain
12:30
very very challenging for these AI reasoning
12:32
systems, which we can get into if
12:34
you're curious. And so those are the
12:36
main sort of insights and learnings that
12:38
we took from ARC-AGI-1 to try
12:40
and produce an ARC-AGI-2 benchmark that
12:42
I think will be a useful sort
12:44
of signal for development this year in
12:46
artificial intelligence. Can we quickly touch on
12:48
the OpenAI situation? So in December...
12:50
they didn't launch but they gave you
12:52
access to O3 and it got incredible
12:55
performance on ARC v1, human level performance,
12:57
something that we just didn't think really
12:59
would be possible so quickly. Yeah, it's
13:01
surprising. It came out of nowhere. I
13:03
mean can you just tell me the
13:06
story behind that? Yeah, yeah, this is,
13:08
you know, one of the reasons why
13:10
I'm always hesitant to make predictions in
13:12
AI, about timelines. You know, I
13:14
think it's very easy to sort of
13:16
make predictions along smooth scaling curves. But
13:19
the nature of innovation is it's a
13:21
step function, right? And step functions are
13:23
really really hard to predict when they're
13:25
going to come out. I think the
13:27
best thing that I can say, having
13:30
spent some time with O3 and looking
13:32
at how it performs at ARC, is
13:34
that systems like O3 demand serious study.
13:36
This is not just... you know more of the
13:38
same that we've seen in the past. This is,
13:41
you know, we have now an existence proof that
13:43
computers are able to do something that they've never
13:45
been able to do before in the history of humanity,
13:47
which is I think really, really exciting. I think
13:49
there's still a long way to go to get
13:52
to AGI, but I do think that these things
13:54
are important to understand and sort of even discern
13:56
how they work from a capability standpoint in order
13:58
to make sure that future systems that we're
14:00
developing and building look more like this and not
14:03
like the sort of pre-training pure scaling regime
14:05
that we've had in the past. So I
14:07
still remember the, to sort of give you
14:09
the anecdote, I still remember the two-week period,
14:11
the sprint we had on testing O3. It
14:14
was right at the end of the contest.
14:16
We had wrapped up Arc Prize 2024 in
14:18
I think early November last year, and we
14:20
had a three-week, four-week period where we were
14:22
really, really busy on judging all the final
14:24
submissions, the papers, getting together the technical report.
14:27
And we were dropping all of the
14:29
sort of results on a Friday. And I
14:31
was really hoping and anticipating that I was
14:33
going to have a nice relaxing holiday
14:35
period in December. And the day that
14:37
we dropped the technical report, we had
14:39
outreach from one of the
14:42
folks at OpenAI who said, hey,
14:44
we'd really love you to test this
14:46
new thing that we're working on. We
14:48
think we've got some impressive results on
14:50
ARC-AGI-1. And so that kicked
14:52
off a very, very hectic, fast, frantic,
14:54
two-week period to try and understand, okay,
14:56
what is this system? Like, does it
14:59
reproduce the claims that OpenAI had
15:01
on testing it? And what does this
15:03
mean for the benchmark? What does this
15:05
mean for AGI? And I think we're
15:07
able to show the final result was
15:10
that O3 on its sort of high
15:12
efficiency setting, which fit within the sort
15:14
of budget constraints that we'd set out
15:16
for our public leaderboard, got about 75%
15:18
or so. And then they had a
15:21
high-compute version which used I think
15:23
like maybe 200x more compute than the
15:25
low compute setting which was able to
15:27
score even 85% and these are really
15:29
impressive and I think this shows a
15:32
system like O3 has this you know
15:34
sort of binary switch. We've gone from
15:36
a regime where these AI models have
15:38
no ability to adapt to novelty to
15:40
something like O3, which is an
15:42
existence proof of now an AI system
15:44
that can adapt to novelty in a
15:46
small way. Breaking this down a little
15:48
bit, there were some interesting caveats that
15:50
you just alluded to. So first of all,
15:52
they did some kind of fine tuning and
15:54
people at the time joked that, you know,
15:56
isn't it scandalous that they were training on
15:59
the training set? Yeah, this is like a
16:01
very bad description. This is a very poor
16:03
critique. I think it misses the point of
16:05
the benchmark. I think the folks who feel
16:08
this way, it's just because they're so used
16:10
to thinking about benchmarks and AI from the
16:12
pre-training scaling regime, where like, hey, if I
16:14
trained on the data. that's cheating then to
16:17
test on the data, right? And that's true
16:19
in the pre-training regime, but ARC is a
16:21
very, very special different benchmark, where it explicitly
16:24
makes a training set available with the intention
16:26
to train on it. This is very explicit.
16:28
This is like what the benchmark expects
16:30
you to do. We expect AI researchers
16:32
to use the training set in order
16:35
to teach their AI systems about the
16:37
domain of ARC. And then what's special is,
16:39
we've got a private data set that
16:41
very few humans have ever seen. It
16:44
requires you to generalize and abstract the
16:46
core knowledge concepts that you learn through
16:48
the training set at test time. Fundamentally,
16:50
you cannot solve like the ARC AGI
16:52
1 or 2 private data sets purely
16:55
by memorizing what's in the training set.
16:57
This would be like, you know, maybe
16:59
a crude analogy would be, you know,
17:01
if I was going to teach an
17:04
AI system on grade school math and
17:06
then test it on calculus. This is
17:08
very similar to the type of thing
17:10
that we do with ARC where, you
17:12
know, the training set is a much
17:15
simpler, easier curriculum to learn on, and
17:17
then the test is a much more
17:19
difficult one where you actually have to
17:21
express true intelligence.
17:23
...per task or more,
17:26
that means they were probably doing sampling.
17:28
They were doing a ridiculous amount of
17:30
completions. They were doing a solution space prediction,
17:32
which is very interesting. But the main thing,
17:34
Mike, just deep in your bones, deep in
17:36
your bones, do you think that they were
17:38
training on API data or surely they were
17:41
training on a whole bunch of data to
17:43
do that well? And the extension of the
17:45
question is, when they released the vanilla version
17:47
of it, what performance would it
17:49
get compared to their tweaked version?
17:52
We will test that as soon as it comes out and I
17:54
would love to report the results on that. They told us
17:56
all they did was training on the training set and I believe
17:58
that's what they did. Okay, very interesting. Just to comment on the
18:01
solution space prediction I mean I
18:03
was amazed that just predicting the
18:05
output space directly they could do
18:07
so well. I mean, doesn't
18:09
that almost take away from the
18:11
idea that we need to have discrete
18:14
code DSL type approaches if you
18:16
can just predict the solution space
18:18
so well? Effectively what O3 is
18:20
doing is it's able to use its
18:22
pre-trained experience and recombine it on the
18:24
fly in the face of a novel task.
18:26
It does this through a regime called
18:29
chain of thought. This is all informed
18:31
speculation, by the way. We don't
18:33
have confirmed details. This is just
18:35
my sort of personal assessment of
18:37
how these systems work, particularly things
18:39
like O1-Pro and O3. If you compare them
18:41
with systems like R1 or O1, these
18:43
are systems that basically spit out a
18:45
single chain of thought and then used
18:48
that chain of thought in order to
18:50
ground a final answer. That's distinct from how
18:52
systems like O1Pro and O3 work, where
18:54
they actually have the ability to do
18:56
multi-sampling and recomposition at test time of
18:58
that chain of thought. This allows them
19:00
to build novel CoTs that don't show
19:02
up anywhere in the pre-training, not in
19:04
the existing experience, and allows these systems
19:06
to reach sort of effectiveness over
19:08
more situations, effectively based on what
19:10
was in the original pre-training. Fundamentally, these
19:12
systems are a combination of a deep
19:14
learning model and a synthesis engine that
19:16
is put on top, and I think
19:18
the right way to think of them
19:21
is these are really AI systems, not
19:23
single models anymore. Yeah, and I agree
19:25
with you. It's really funny, though, how
19:27
you see the critique in the community,
19:29
because, you know, Gary Marcus is now
19:31
saying, oh, it can't draw pictures of
19:34
bicycles and labels and label the parts,
19:36
whereas we see O1 Pro and O3,
19:38
and it really does seem like a
19:40
dramatic improvement. So the real work that you
19:42
guys did was you got a whole
19:44
bunch of human subjects and you had
19:47
the, I think you had, was it
19:49
400 test subjects and they all needed,
19:51
like at least two people needed to
19:53
solve every single task and you had
19:55
to do this experiment design and you
19:57
had to balance complexity of the tasks.
20:00
and so on. How did you do all of that? Yeah, this was,
20:02
so this was one of the biggest things we
20:04
wanted to fix with ARC-AGI-1. We never had
20:06
a formal human calibration study on how do humans
20:08
actually... do on these things, you know, we relied
20:10
on anecdote. So we had to set up a
20:13
testing center down in San Diego. We recruited tons
20:15
of just folks from the local community all the
20:17
way from, you know, Uber drivers to single moms
20:19
to UCSD students and brought these folks out to
20:22
go take ARC puzzles. It was really cool, like,
20:24
we'll have to share some of the photos, like,
20:26
these like, testing shots where you have, like, you
20:28
know, dozens and hundreds of people, like, taking our
20:31
tasks on laptops. Our goal, originally with the
20:33
data set, was to ensure that every single
20:35
task that we put in ARC-AGI-2 was
20:37
solvable by at least one human. And what
20:39
we actually found was something of an even higher
20:42
standard, I think, which was that we found
20:44
that every single task in the new V2
20:46
data set is solvable by at least two
20:49
humans under two attempts. And these are the
20:51
same rules that we give to AI systems
20:53
on the benchmark, both on the contest, as
20:55
well as the public leaderboards. I think this
20:58
is a pretty good sort of assertion
21:00
of... a straightforward comparison we can actually use
21:02
now between, hey, are these tasks easy and
21:04
straightforward for humans? Yes, are they hard for
21:07
AI? Yes, like I said before, frontier systems
21:09
generally are getting close to zero or
21:11
single digit percentages on these tasks now.
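For readers who want the scoring rule he's describing in concrete terms, here is a minimal sketch, not the official evaluation harness, of a two-attempt, exact-match benchmark score; the function and type names are illustrative:

from typing import List

Grid = List[List[int]]  # a grid is a list of rows of small integer colour codes

def task_solved(attempts: List[Grid], truth: Grid) -> bool:
    # A task counts as solved if either of (at most) two predicted output grids
    # exactly matches the hidden test output -- the same two-attempt rule applied
    # to the human testers and to AI systems on the leaderboard.
    return any(a == truth for a in attempts[:2])

def benchmark_score(all_attempts: List[List[Grid]], all_truths: List[Grid]) -> float:
    solved = sum(task_solved(a, t) for a, t in zip(all_attempts, all_truths))
    return solved / len(all_truths)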
21:13
Okay, but the idea though is this
21:15
Moravec's paradox, right, which is
21:17
that, you know, basically while we can select
21:19
problems that are easy for humans and hard
21:21
for AIs we haven't got AGI yet But
21:23
I was looking through some of your challenges
21:25
and I felt that some of them were
21:27
very difficult like it would have taken me
21:30
Five or six minutes of deep thought to
21:32
get it like are you finding that it's
21:34
still easy to you know to find these
21:36
things that are easy for humans and hard
21:38
for AIs or are you kind of scraping
21:40
the barrel a little bit? So I think
21:42
easy for humans hard for AI is a
21:44
relative statement. The fact is, these V2 tasks
21:46
were solvable by humans on roughly a $5-
21:48
per-task budget. They were solvable
21:50
in five minutes or so. And AI
21:53
cannot solve these at all today. And
21:55
so yes, I do think if you
21:57
look at tasks, you have to think
21:59
about them. There's, like, you know, some
22:01
thought you need to put in to
22:03
sort of ascertain the rule. But the
22:05
sort of data, I think, speaks for
22:07
itself that, you know, we've got every
22:09
single task now in the V2 data
22:12
set from the public training set, I'm
22:14
sorry, the public eval set, to the
22:16
semi-private set, to the private eval
22:18
set, every single one is solvable
22:20
by at least two humans
22:25
under two attempts. So, you guys have
22:27
been cooking, you're already working on version
22:29
3 of ARC, what can you tell us
22:31
about that? So, the way I kind of
22:33
think about the multi-versions here, ARC-
22:36
AGI-1 again was designed
22:38
to challenge deep learning as a
22:40
paradigm. ARC-AGI-2 is designed
22:42
to challenge these AI reasoning systems.
22:44
I don't expect that ARC-AGI-2 is going to
22:46
be as durable. ARC-AGI-1 lasted for five
22:49
years. I don't expect ARC-AGI-2 is going
22:51
to be quite as durable as that.
22:53
I hope that will continue to be
22:55
a very useful signal for researchers over
22:57
the next year or two. But yeah,
22:59
we've been working on ARC-AGI-3 and I
23:01
think the pithy way to talk about
23:03
ARC-AGI-3 is it's going to challenge AGI
23:06
systems that don't even exist in the
23:08
world yet today. Can you tell me
23:10
about the foundation that you're setting
23:12
up? This was one of the cool things I think from Arc Prize
23:14
2024. When we launched it it was very
23:16
much an experiment. You know, our ambitions
23:18
were not quite what they are now. I
23:20
think when we went into 2024 our
23:22
main goals were just to raise awareness
23:24
of the fact that this benchmark was,
23:26
and I think what we found was,
23:29
or what I personally found, I just
23:31
kept getting surprised by the community around
23:33
Ark. I remember this really specific moment when
23:35
O1-preview came out and there were
23:37
thousands of people on like Twitter, like demanding
23:39
that we test this like new model on
23:41
ARC. And that was not my mental model
23:44
of like what this benchmark was for or what
23:46
the community was. And that was so cool.
23:48
And that moment happened again when we ended
23:50
the contest. That moment again happened when we
23:53
launched the results on O3. And this kind
23:55
of showed I think, hey, there's a real
23:57
demand for ARC. There's a real demand for
23:59
benchmarks that look like this, that ascertain
24:01
these like capability gaps between humans and
24:03
computers. And so we set up this
24:05
foundation in order to basically be the
24:07
North Star for AGI and continue to
24:10
produce useful, interesting, durable benchmarks in the
24:12
sort of spirit of trying to discern
24:14
like what are the things that are
24:16
simple, straightforward, easy for humans and still
24:18
remain impossible or very very difficult for
24:20
AI. And we're going to carry that
24:22
torch all the way until we get
24:24
to AGI. As you can see now,
24:26
all of the large AI labs, they're
24:28
focusing on reasoning and I'd like to
24:30
think that ARC was at least a
24:32
small part of that. And you folks
24:34
are very focused on open source as
24:37
well. Mark Chen said specifically on the
24:39
OpenAI podcast that they've been thinking
24:41
about ARC v1 for years. There
24:43
you go. Well, yeah, exactly. But just
24:45
tell me a little about that. So
24:48
there's the industry impact, but you guys
24:50
are really focused on open source as
24:52
well. So how do you see those
24:54
two things? I think AGI is the most important technology that
24:56
humanity is going to develop. And if it
24:58
is true that we are in an
25:00
idea constrained environment, we still need new
25:02
ideas to get to AGI, which I think
25:04
ARC-AGI-2 shows is true. If that's
25:06
true about the world, then I think we
25:09
should be designing the most innovative sort
25:11
of ecosystem and environment across the world
25:13
that we possibly can. This was one of the
25:15
reasons why we launched Arc Prize originally internationally
25:17
to reach solar researchers. to inspire researchers again
25:19
to go try and work on these new
25:21
ideas to get past this pre-training regime, try
25:24
something that we knew that it needed to
25:26
be something beyond this and even beyond what
25:28
we have today. And I think if you
25:30
look at like a really healthy, strong innovation
25:32
ecosystem, you're gonna look at one that is
25:34
very open and there's a lot of sharing
25:36
and there's a lot of diversity of approach.
25:39
And this is in contrast to an ecosystem
25:41
that would be very closed, very secretive,
25:43
very dogmatic, very monocultural. And so those
25:45
values of openness, those values of sharing,
25:47
are what the Arc Prize Foundation stands
25:49
for in order to sort of increase
25:51
the chance that we can get to
25:53
AGI soon. So talking about the version
25:55
2 of the Arc Challenge, can you
25:57
just give us the elevator pitch of that?
25:59
Sure. So arc two is basically a new
26:01
version of arc that keeps the same format
26:04
but tries to address the main flaws that
26:06
we saw in arc one. So for
26:08
instance in arc one we knew that
26:10
there was a little bit of redundancy
26:12
across tasks. So we saw that actually
26:14
there early on as early as the
26:16
2020 Kaggle competition. And also arc one
26:18
was way too brute forcible. So back
26:20
in 2020 what we did after the
26:22
Kaggle competition is that we tried to look
26:24
at all tasks that were solved at
26:26
least once by one entry in the
26:28
competition. And we found that half of
26:30
the private data set could be solved, in
26:32
fact, just via the sort of basic
26:35
brute-force program search methods that were
26:37
popular during the first competition.
26:39
And so that means half the
26:41
data set actually doesn't give you
26:43
very good signal about AGI
26:45
at all. So the other half was
26:47
actually good enough, required enough generalization
26:49
that the benchmark overall was still useful,
26:52
and it still lasted quite a few
26:54
years after that. But it told you,
26:56
you know, from the start, that there
26:58
were some pretty significant flaws, which is
27:01
expected, by the way, like, you know,
27:03
when I started creating arc back in
27:05
2018, 2019, I was flying blind, you
27:07
know. I was trying to capture my own
27:09
thoughts, my own intuition about what does
27:12
it mean to generalize, what is abstraction
27:14
and reasoning, and that turned into
27:16
this benchmark. But I could not anticipate
27:18
what kind of AI techniques would be
27:20
used against it. And so yeah, as
27:23
it turns out, a lot of it
27:25
could be brute-forced. So arc two
27:27
completely addresses that. You cannot score, you
27:29
know, higher than one or two percent
27:31
at most using brute force techniques on arc
27:34
two. So that's good news. And other
27:36
than that, we generally try to make
27:38
it a little bit harder. So what
27:40
we saw with arc one is that
27:42
it was very easy to saturate for
27:44
humans. Like if you're, if you're, you
27:47
know, like a STEM grad, for instance,
27:49
you could basically get 100 percent, or within
27:51
noise range of 100%, like something like
27:54
97, 98. And so that means that
27:56
you were not getting a lot of
27:58
useful bandwidth to compare AI
28:00
capabilities with the capabilities of smart
28:03
humans. And if you just make
28:05
it a little bit harder, then
28:07
you get more range, wherein if
28:10
you're not very intelligent you score
28:12
lower, if you're very intelligent you
28:14
score higher, and you're not super
28:17
likely to completely saturate it until you
28:19
are the very top end of
28:21
the distribution. So that's what ARC 2
28:24
is. Same format, same basic rules.
28:26
So we're only using core
28:28
knowledge. You have these input output
28:30
pairs of grids that are at
28:32
most 30 by 30, but the
28:35
content is very different. You're not
28:37
going to find tasks where you
28:39
only have to apply one basic
28:41
rule that could be anticipated in
28:43
advance, like some kind of gravity,
28:46
things falling tasks or symmetry tasks.
28:48
All the tasks are very compositional.
28:50
So you have multiple rules, you
28:52
have more objects, the grids are
28:54
generally bigger, and the rules can
28:57
be chained together or can be
28:59
interacting together, and that makes it
29:01
completely out of reach for brute
29:03
force methods. And as it turns
29:05
out, it also makes it out
29:07
of reach for the base LLM pre-
29:09
training paradigm.
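For context on the format being described, an ARC task is published as JSON with a few demonstration input/output grid pairs plus test inputs; this is a rough sketch of loading one and checking the 30-by-30 size limit, with an illustrative file name:

import json

# Rough sketch of the public ARC task layout: "train" holds demonstration pairs,
# "test" holds the pairs a solver must complete; every grid is a list of rows of ints.
with open("some_task.json") as f:  # illustrative path
    task = json.load(f)

for pair in task["train"]:
    inp, out = pair["input"], pair["output"]
    # Grids are at most 30x30 in both ARC-AGI-1 and ARC-AGI-2.
    assert len(inp) <= 30 and all(len(row) <= 30 for row in inp)
    assert len(out) <= 30 and all(len(row) <= 30 for row in out)

test_input = task["test"][0]["input"]  # the grid a solver has to transform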
29:12
You're saying that you've made them more compositional and iterative
29:14
and harder for humans. That's right.
29:16
Could you give me a little
29:18
bit more detail on that? I
29:20
mean, if you think about it, there are
29:22
different dimensions of things that AI models can
29:24
do, and there are different dimensions of things
29:27
that humans can do. Have you, sort of,
29:29
quite diversely explored that? Or, I mean, could
29:31
you just give me a bit of a
29:33
breakdown of the task characteristics? So, in ARC 1,
29:35
you had many tasks that were very basic.
29:38
We just had one rule. Let's say, for
29:40
instance, you have a few objects. and you
29:42
have to flip them, right? So this is
29:44
an example of a task that's easy to
29:46
brute force, because flipping is something that you
29:49
can acquire via pre-training as a concept, or
29:51
that you could just hard code in a
29:53
brute force program such system. So if that's
29:55
the only rule you have to apply, and
29:58
just apply it once, that's not compositional, that's
30:00
actually pretty easy to anticipate, that's
30:02
easy to brute force. So a
30:04
compositional task is going to be
30:06
a task where you have more than
30:09
one concept and typically they're going
30:11
to be interacting together. Like an
30:13
example of a very simple compositional
30:15
task is let's say you have
30:18
object flipping but also the objects
30:20
are falling. So you have two rules
30:22
to apply to each object at once.
30:24
But that again is a kind of
30:26
task that could still be found via
30:29
brute force program search, if you have
30:31
as key elements in your DSL, gravity
30:33
and flipping, for instance. And so you
30:35
want to create tasks that are where
30:38
the rules are chained to a sufficient
30:40
level of depth that there's no way
30:42
you could find a chain by just
30:44
trying every possible chain, because it will
30:46
become too expensive. Of course, humans can
30:49
still do it, because humans are not
30:51
just trying every possible combination of
30:53
things they know on every problem
30:55
that they see, they just have.
30:57
very efficient, very intuitive way of searching
30:59
for a theory that explains what they
31:01
see. You know, my co-host Keith, he
31:04
had this idea of doing a recursive
31:06
version of arc. But the thoughts occurred
31:08
to me is that even though
31:10
we do this systematic compositional reasoning,
31:12
we still have some kind of
31:14
cognitive limit. So if we nested,
31:16
let's say, four levels of arc
31:19
challenges within the same problem, wouldn't
31:21
you find very quickly that humans
31:23
just can't solve it? Right, so
31:25
if you just like concatenate two arc
31:27
tasks for instance, you get something that's
31:30
much less but forcible, that's much harder
31:32
because there are more rules going on.
31:34
It's not quite what I would call
31:36
compositional though, because even though you have
31:39
two rules at once, they're not interacting
31:41
with each other, right? You can solve
31:43
them separately and they concatenate the solutions.
31:45
And I think it's not a bad
31:47
idea at all, like it will work
31:49
as a way to make arc more
31:51
difficult, with again this caveat
31:54
that you're not actually
31:56
testing for depth of
31:59
compositionality. One issue, though, is that
32:01
it would only really work once, because
32:03
as soon as the person developing the
32:06
AI system notices that the task can
32:08
actually be decomposed into subtasks, then it's
32:10
game over. So I think it's actually
32:13
more interesting to have multiple
32:15
rules at once, but they're actually
32:17
being chained together, or they're
32:19
interacting together. For instance, one
32:21
rule might be writing some information on
32:24
the grid that needs to be read
32:26
by the second rule. Right. What performance
32:28
do the frontier models get on
32:30
ARC v2? So what we saw was
32:32
a big gap between models that
32:34
don't do any kind of test
32:36
time adaptation, like any kind of
32:38
test time search or test time
32:40
training, and models that do. And
32:42
the base LLMs, even models like
32:44
GPT4.5, they're basically scoring zero. I
32:46
think one of them, I think
32:48
it was R1 maybe, scored like
32:50
slightly above zero, something like 1%. But
32:53
you know, it's within noise range of
32:55
zero. So any model that cannot do
32:57
test-time adaptation, that is to say that
32:59
does not possess fluid intelligence, does effectively
33:02
zero. So in that sense, arc two
33:04
is actually a very strong sign that
33:06
you have fluid intelligence. Better than arc
33:08
one. I think arc one could already
33:11
tell you that, but less perfectly. So
33:13
On arc one, if you do not
33:15
do test-time adaptation, you can still do
33:17
up to roughly 10%. So on arc
33:20
two, that's actually zero. So it's a
33:22
better test. Now, when it comes to
33:24
model that do test time adaptation, so
33:26
we tried, for instance, some of the
33:29
top entries from the Kaggle competition last
33:31
year, the models that were
33:33
doing test-time training in particular
33:35
or some kind of program search. And
33:37
the best model, the model that
33:40
actually won the Kaggle competition, can
33:42
do, I believe, 3% on arc
33:44
two. And if you take an
33:46
ensemble of the top entries from
33:48
the competition, you get to 4%.
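As a rough illustration of what an ensemble of entries means in practice, here is a minimal sketch, my own illustrative code rather than any contestant's, that pools one candidate grid per solver and keeps the two most common answers, matching the benchmark's two-attempt rule:

from collections import Counter
from typing import Callable, Dict, List

Grid = List[List[int]]

def ensemble_attempts(solvers: List[Callable[[Dict], Grid]], task: Dict) -> List[Grid]:
    # Pool one candidate output per solver, then keep the two most-voted
    # candidates as the two allowed attempts.
    votes = Counter()
    for solve in solvers:
        candidate = solve(task)
        votes[tuple(map(tuple, candidate))] += 1  # tuples so grids are hashable
    return [list(map(list, g)) for g, _ in votes.most_common(2)]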
33:51
Right. So that's not very high.
33:53
We also estimate that so O3 would
33:55
be the current state of the
33:57
art in terms of an AI model.
34:00
that does exhibit fluid intelligence. And so
34:02
we haven't been able to test O3
34:04
on low-compute settings on all of the
34:06
tasks that we wanted to test it
34:09
on, but we've tested it on a
34:11
subset. And so we can extrapolate what it
34:13
would score on the entire set. And
34:16
it sounds like it is going to
34:18
be about 4%. Right. So not super high.
34:20
There's a lot of room to go higher on
34:22
that. And we haven't been able to
34:24
test O3 on high compute settings at
34:26
all. So you know, that's the model that
34:28
was scoring 88% on ARC v1. So
34:30
I can make a guess. I
34:33
guess, based on what we
34:35
saw from O3 low and
34:37
other models, I think you
34:39
might get up to
34:41
like 15, maybe even
34:43
20% if you were
34:45
maxing out the compute
34:47
setting and spending like
34:50
10K per task, for instance.
34:52
But it would still be
34:54
like far below average
34:56
human performance which would be more
34:58
like 60%. So that 4% that O3
35:00
gets on ARC v2, do you think
35:02
of that as fluid intelligence or
35:05
do you think of that as
35:07
a potential gap in, I mean
35:09
presumably you could have designed ARC v2,
35:11
if you selected the correct sets
35:13
of human calibrated challenges, you could
35:15
have found a set which was
35:17
still 0% for O3. Yeah, absolutely, no,
35:19
you could have, you know, obviously selected
35:21
against O3 and then O3 would do
35:23
zero. It would be very, very easy
35:25
to go from 4% to 0% right?
35:27
It's just a few
35:29
tasks that you need to change. So
35:31
we're not, we're not actually trying to
35:34
do that. Yes, I do believe that
35:36
4% does show that you have non-zero
35:38
fluid intelligence, which is also something that
35:41
you could get as a signal from
35:43
arc one. And I think the sign
35:45
that you see fluid intelligence in
35:48
these models is the performance gap between
35:50
the huge pre-trained only models that don't
35:52
do test-time adaptation, which score effectively zero,
35:54
maybe one. And you could say that
35:57
one percent is in fact a flaw in the
35:59
data set, sure. It should in practice
36:01
be zero. And the models that
36:03
do test-time adaptation do
36:05
non-zero, three percent, four percent, maybe
36:07
five percent, right. And that means that
36:10
there's something like 95 percent of
36:12
the data set that will actually
36:14
give you this useful bandwidth for
36:16
measuring how much fluid intelligence the
36:18
model has. And that's something you
36:21
were not getting with arc one.
36:23
Arc one was more binary, where
36:25
if you don't have fluid intelligence,
36:27
you're going to do very, very
36:29
low, like below 10% roughly. If
36:31
you do, you're going to score
36:34
significantly higher and getting above 50%
36:36
would be very easy. But because
36:38
the measure would saturate very quickly,
36:40
as soon as you start adding
36:43
non-zero fluid intelligence, you did not
36:45
get that useful bandwidth that you're
36:47
getting with ARC 2, so I think
36:50
ARC 2 should allow for answering the question,
36:52
is this model actually as fluidly
36:54
intelligent as the average human, which is
36:56
something you could not get at far
36:58
one. I guess it's just an economics
37:00
thing at this point, so if you
37:02
spent, let's say, a billion dollars or
37:04
half a billion dollars, you could saturate
37:06
ARGV2, I'm not sure if you would
37:08
agree with that, but... If that isn't the
37:10
case, I mean, what do you think
37:12
are the specific things that are missing
37:15
from O3, that are stopping it from
37:17
doing better? So it's never just an
37:19
economics question, because intelligence is not just
37:21
about capabilities, it's also about the efficiency
37:24
with which you acquire and deploy these
37:26
capabilities. And sure, if you spend billions
37:28
and billions of dollars, maybe you can
37:30
saturate arc two, but that would already
37:32
have been true back in 2020, using
37:35
like extremely crude brute-force program search. If
37:37
you have a DSL that's actually
37:39
Turing-complete, then you know that
37:41
for every arc task, there exists a
37:43
program that may not in fact be
37:45
all that long that will solve the
37:47
task. And all you need to do
37:49
is iterate over all possible programs
37:52
in order of length. And then the
37:54
first one you find is the one
37:56
that's going to generalize, right? Because it's
37:58
the shortest. It's the most parsimonious. So
38:00
if you spend unlimited resources, you
38:02
already have AGI in that sense,
38:04
just in a pure skill sense,
38:07
you can always just try every
38:09
possible program until you find one
38:12
that works. But that's not what
38:14
intelligence is. Intelligence is about finding that
38:16
program in very few hops using
38:19
actually very little compute. Like look
38:21
at the amount of energy that
38:23
a human expends to solve
38:26
one ARC task over two or
38:28
three four minutes. It's almost zero,
38:30
right? And compare that to a
38:33
model like O3 on high-compute settings, for
38:35
instance, which is going to use
38:37
like over 3,000 bucks of compute.
38:39
So it's never just an economic
38:41
problem. Efficiency is actually the
38:44
question we're asking. Efficiency is
38:46
a problem statement. It's not capability.
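His point about iterating over all possible programs in order of length can be made concrete; the sketch below uses a toy set of primitives invented for illustration, and shows that unbounded search always recovers the shortest consistent program -- which is exactly why capability at unlimited cost is not the interesting question, and efficiency is:

from itertools import product

# Length-ordered brute-force program search: enumerate every sequence of
# primitives from shortest to longest and return the first one consistent
# with the training examples -- by construction the most parsimonious fit.
def shortest_consistent_program(primitives, train_pairs, max_len=4):
    for length in range(1, max_len + 1):
        for program in product(primitives, repeat=length):
            def run(x, program=program):
                for op in program:
                    x = op(x)
                return x
            if all(run(inp) == out for inp, out in train_pairs):
                return program
    return None

# Illustrative example with integer ops (not an ARC DSL): hidden rule is (x + 1) * 2.
inc = lambda x: x + 1
dbl = lambda x: x * 2
print(shortest_consistent_program([inc, dbl], [(3, 8), (5, 12)]))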
38:48
So intelligence is knowledge
38:50
acquisition efficiency. O3 did very
38:53
well on ARC v1. And now that
38:55
it does so badly on ARC v2,
38:57
the whole point of your definition
38:59
of intelligence is that given some
39:01
basis knowledge, you efficiently recombine, you
39:03
produce new skill programs, you're saying
39:05
that in the absence of the
39:07
base knowledge in V2, there is
39:09
no intelligence. Therefore, is O3 not
39:11
actually as intelligent as we thought
39:13
it was? I think O3 is
39:15
one of the first models, perhaps,
39:17
arguably in the first model that
39:19
does show fluid intelligence. So now,
39:21
what the results on ARC 2 are telling
39:24
you is that it's not human
39:26
level fluid intelligence, right? But still,
39:28
I would consider O3 as a
39:30
kind of prototype, with two
39:32
big flaws, two big caveats. One is
39:34
of course efficiency. Efficiency is
39:37
part of the problem statements, in
39:39
fact, the central point. So as
39:41
long as you're not as efficient
39:43
in terms of, for instance, you know,
39:45
data efficiency, compute efficiency,
39:48
energy efficiency, then it's only
39:50
a temporary solution. We'll find...
39:52
a better solution in the
39:54
future. And also, it's not
39:56
quite human level. If it
39:58
were human level, I would expect it to
40:00
score, you know, something like, O3
40:02
should score like over 60% on ARC
40:05
2. And we don't know what
40:07
the exact number is going to be,
40:09
but you know, probably like
40:11
four to five percent, right? Do you
40:13
think that general intelligence is a category
40:16
or a spectrum? So general through the
40:18
intelligence, it's, I would say it's both,
40:20
because there's a huge difference between just
40:22
having memorized a bunch of skill
40:25
programs that are static, and knowledge,
40:27
facts, et cetera... versus being able
40:30
to adapt to novelty to a non-zero
40:32
extent. So that is a binary
40:34
distinction. Either you have fluid intelligence
40:37
or you don't, right? And arc
40:39
one could answer that question for
40:41
any system. But once you have
40:44
non-zero fluid intelligence, then the question
40:46
is how much do you actually
40:49
have and how it compares to
40:51
humans? And that's related to
40:53
the notion of... recombination
40:55
of the skill programs that
40:57
you have, the knowledge that
40:59
you have, and depth of
41:01
recombination. So if you do
41:03
no recombination at all, you
41:05
don't have fluid intelligence. If
41:07
you do some recombination, you
41:09
do. But then the question
41:11
is how deeply can you
41:14
recombine? Like, for instance, if
41:16
you're using a program synthesis
41:18
analogy, the question is
41:20
how big of a program can you
41:22
write on the fly to adapt to a
41:24
new problem, right? And of course, as well,
41:27
how efficiently, how fast and how efficiently
41:29
you can write it, right? So it
41:31
is a binary, but it's also a
41:33
spectrum. And arc one was trying
41:35
to ask the binary question, does this
41:38
system have any fluid intelligence at all?
41:40
And arc two is more on the
41:42
side of trying to measure how much
41:45
fluid intelligence you actually have compared to
41:47
humans. How long do you think it
41:49
will take for V2 to be saturated
41:52
and do you think it will survive
41:54
until V3 comes up? So that's a
41:56
question where you have to take into
41:59
account resource efficiency. So if you're asking
42:01
how long it will take before
42:03
we have a system that can
42:05
score higher than, let's say, 80% on
42:07
arc two, using less than $10,000
42:09
of compute, for instance, I think
42:11
probably around a couple years. So
42:13
it's very difficult to make predictions
42:15
here. I think if you're just
42:17
looking at current techniques and scaling
42:20
up current techniques I think could
42:22
take a while. I think the
42:24
arc two is actually way out
42:26
of reach of current techniques. But
42:28
of course we are not limited
42:30
to current techniques you know in
42:32
2025 we're probably going to see
42:34
new breakthroughs in the same way
42:36
that we saw new breakthroughs last
42:38
year. And these breakthroughs are actually
42:40
very difficult to predict. I was
42:42
personally very surprised with the performance
42:44
that O3 could get on
42:47
arc one last year. That came
42:49
as a surprise. So maybe we'll
42:51
have new surprises this year. But
42:53
I would be extremely surprised if
42:55
we see an efficient solution
42:58
that's human level on arc two
43:00
by the end of 2025. I
43:02
would basically rule that out. By
43:04
the end of 2026, maybe. Right.
43:06
which is why we have ARC-AGI-3
43:08
coming, of course. So on analysis of
43:10
failure modes, I'm sure you saw the
43:12
blog post that I read where it
43:14
went through all of the different failure
43:16
modes of O3, and of course it
43:19
was solution space prediction which made it
43:21
more surprising to me. My take on
43:23
it was I was really impressed that
43:25
even when it failed it was because
43:27
the solution space got too big or
43:29
it was just getting minor mistakes but
43:31
broadly it got the direction of many
43:34
of the problems quite well. You know
43:36
similarly tell me about the failure modes
43:38
on V2. Right so we were not able
43:40
to test O3 as much on V2 but
43:42
I can tell you a bit about failure modes
43:45
based on what we saw on V1, and well,
43:47
there are many, but generally this
43:49
is a model where
43:51
reasoning abilities can decrease
43:54
exponentially with problem size.
43:56
If you have more objects in
43:58
the scene, if you have more... more
44:00
rules, more concepts, interacting. You see
44:02
this exponential decrease in capabilities. It's
44:04
also, you know, because it's a
44:06
model that needs to, it works
44:08
by writing a kind of natural
44:10
language program that describes what it's
44:12
seeing, that describes the problem and
44:15
the sequence of steps to solve
44:17
it. So in that sense, it's
44:19
100% a natural language program. And
44:21
that means that in order to
44:23
solve a problem it has to talk
44:25
about it using words. And as
44:27
a result, if you have a
44:29
task where the rule is very
44:31
simple to grok for a human,
44:33
but in a non-verbal way, but
44:36
it's very difficult to put it
44:38
into words, it has no verbal
44:40
analogy. That's actually much harder to
44:42
solve for this kind of
44:44
model. So other than that, we
44:46
saw that just one of the
44:48
big challenges is, you know, compositionality,
44:50
having multiple rules interact. There's also,
44:52
it seems there's a bit of
44:54
a... locality bias going on as
44:56
well, where if you have to
44:59
combine together bits of information that
45:01
are spatially co-located together on the
45:03
grid, that's easier for the model
45:05
than if you have to do
45:07
the exact same thing, but the
45:09
two bits of information you have
45:11
to synthesize are pretty distant. So
45:13
having to combine together bits of
45:15
information that are separate, having to...
45:17
It seems as well that the
45:20
model has a trouble simulating the
45:22
execution of a rule and then
45:24
reading the results. Like for instance
45:26
if you're solving our task and
45:28
you grock a certain rule and
45:30
then you start applying it, let's
45:32
say it's like you're continuing your
45:34
line continuation or something, and then
45:36
you have to take another rule.
45:38
and use that rule to read
45:41
a bit of information that you
45:43
have written in the process of
45:45
executing the first rule. That
45:47
sort of thing is completely out
45:49
of reach for the chain of
45:51
thought models. How multi-dimensional do you
45:53
think intelligence is? You know, one school
45:55
of thought, and I think you
45:57
might subscribe to this, is that
45:59
the universe is kind of almost
46:01
made up of platonic rules that
46:04
are disconnected from the world that
46:06
we live in, and then there's
46:08
this kaleidoscope idea you talk about,
46:10
and they get combined together, and
46:12
that's what we see. But another
46:14
school of thought is that there'll
46:16
always be another dimension of intelligence.
46:18
We'll always need Ark V4, V5,
46:20
V6, and there'll always be something
46:22
missing. Each step of generality that
46:25
you cross, you're gaining a nonlinear
46:27
amount of capabilities, right? And so
46:29
after a few steps, you are
46:31
so overwhelmingly superhuman across every possible
46:33
dimension that, yeah, you can say
46:35
without a doubt that you have
46:37
AGI, in fact, you
46:39
have super intelligence. But yeah, intelligence
46:41
is, in a sense, multidimensional. And
46:43
what Ark is trying to capture
46:46
is really just this fluid intelligence
46:48
aspect, this ability to recombine core knowledge
46:50
building blocks. So in my definition
46:52
of intelligence, intelligence is about efficiently
46:54
acquiring skills and knowledge and recombining
46:56
them to, well, again, efficiently recombining
46:58
them to adapt to novel tasks,
47:00
to novel situations that you cannot
47:02
prepare for explicitly. Purely the ability
47:04
to take a bunch of building
47:06
blocks and recombining them, doing kind
47:09
of program synthesis, that's one aspect
47:11
of that. That's probably the most
47:13
central aspect, which is why, you
47:15
know, this is why we're focusing
47:17
on with ARC. But it's not
47:19
the only aspect, because this is
47:21
assuming that you already have this
47:23
pile of knowledge available, so it's
47:25
not, it's overlooking the acquisition of
47:27
information about the task. In ARC,
47:30
you're provided all the information about
47:32
the task at once, but in
47:34
the real world, you have to
47:36
collect that information, you have to
47:38
take actions, set goals, to discover what
47:40
your environment is even about what
47:42
you can do within it. And
47:44
you have to do these things
47:46
efficiently, of course. And that efficiency
47:48
aspect is very important because intelligence
47:51
was developed by evolution as an
47:53
adaptation. And when you're exploring the
47:55
world, you are taking on some
47:57
risk. You might get killed by
47:59
a predator, for instance. And so
48:01
you want to be efficient. You
48:03
want to gain the maximum amount
48:05
of information and thereby power over
48:07
your environment by taking on a
48:09
minimum amount of risk and expending
48:11
a minimum amount of energy. That's
48:14
not something you can measure, not
48:16
something you can capture with ARC
48:18
v1 or v2 alone. Can you
48:20
just expand on the significance of
48:22
the solution space prediction with O3?
48:24
Because that rather suggests to me
48:26
that... it's almost this Rich Sutton
48:28
idea where it's nearly a blank
48:30
slate and it's very empiricist and
48:32
we just take the data in
48:35
and the neural network does all
48:37
of the things. I always imagine
48:39
that we would need to have
48:41
some kind of structured approach which
48:43
took, you know, the core knowledge
48:45
into account. Do you think that,
48:47
you know, it's actually simpler than
48:49
we thought? Trying to directly predict
48:51
the output versus trying to write
48:53
down... the steps to get the
48:56
output. They are not entirely separate
48:58
things. Because of course, once you've
49:00
written down the steps, you can
49:02
do what looks like transduction. And
49:04
O3 is not actually a real
49:06
transduction model, because it's much closer
49:08
to a program synthesis model, where it's
49:10
searching for the right chain of
49:12
thought to describe the task and
49:14
list the sequence of steps to
49:16
solve it. And once you have
49:19
the chain of thought, you can
49:21
just use the model to execute
49:23
it and it gives you the
49:25
output. So from the outside, if
49:27
you treat the entire system as
49:29
a black box, it looks like
49:31
transduction, but the same would be
49:33
true of any program search system.
49:35
What it's actually doing, and the
49:37
reason why it's able to adapt
49:40
to novelty so well, is because
49:42
it's synthesizing this chain of thought,
49:44
which serves as a recombination artifact
49:46
for the knowledge and the skills
49:48
that the model has, a recombination
49:50
artifact that is adapted to the particular
49:52
task at hand. So it's much
49:54
closer to a program synthesis model.
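To make the "program search over chains of thought" framing concrete, here is a minimal, hypothetical sketch in Python: the candidate "programs" are ordinary functions standing in for sampled chains of thought, each candidate is verified against the demonstration pairs, and the first one that reproduces them is applied to the test input. The task and candidate list are made up for illustration and say nothing about how O3 is actually implemented.

```python
import numpy as np

# Candidate "programs" standing in for sampled chains of thought.
CANDIDATES = [
    lambda g: np.rot90(g),           # rotate the grid
    lambda g: np.flipud(g),          # flip vertically
    lambda g: g.T,                   # transpose
    lambda g: np.where(g > 0, 2, 0)  # recolour all non-zero cells to 2
]

def search_program(demos):
    """Return the first candidate that maps every demo input to its output."""
    for program in CANDIDATES:
        if all(np.array_equal(program(np.array(x)), np.array(y)) for x, y in demos):
            return program
    return None

# A made-up task: the hidden rule is "transpose the grid".
demos = [([[1, 0], [0, 0]], [[1, 0], [0, 0]]),
         ([[0, 3], [0, 0]], [[0, 0], [3, 0]])]
program = search_program(demos)
print(program(np.array([[0, 0], [5, 0]])))   # -> [[0 5] [0 0]]
```

From the outside, only the final grid comes back, so the whole loop looks like transduction; internally, it is search plus verification plus execution.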
49:56
This is something that the community found
49:58
very confusing because in the last
50:01
interview you were describing I think
50:03
it was O1 Pro as
50:05
being a kind of explicit search
50:07
process and yeah what seems to
50:09
be the case is that it's
50:11
you know there is some kind
50:13
of reinforcement learning thing in
50:15
the pre-training and then it maybe
50:17
does some sampling at inference time so
50:19
we you know we're doing a
50:21
whole bunch of completions and are
50:24
you saying it's as if it's
50:26
doing a program search? It's searching
50:28
over the space of possible chains of
50:30
thought and finding the one that
50:32
seems most appropriate. So in that
50:34
case it's entirely analogous to a
50:36
program search system where the program
50:38
you're synthesizing is a natural language
50:40
program, right? A program written in
50:42
English. Okay, it just seems a
50:45
bit strange, doing autoregression on
50:47
a language model, how that could
50:49
be characterized as a search process.
50:51
So a model like O1
50:53
Pro, for instance, or O3, is
50:55
not just autoregressive. It actually has
50:57
this test time search step, which
50:59
is why it can adapt to
51:01
novelty much much better than the
51:03
base models that are purely autoregressive. That's
51:06
why, again, you see on the...
51:08
So in general, ARC, even ARC1,
51:10
has completely resisted the purely
51:12
autoregressive pre-training scaling paradigm, like
51:14
from... 2019 to 2025, we scaled
51:16
up these models by 50,000x,
51:18
like from GPT-2 to GPT-4.5. And
51:20
even on Arc 1, you went
51:22
from 0% to something like 10%
51:24
and on Arc 2, you're going
51:26
from 0% to 0%, right? And
51:29
meanwhile, if you have
51:31
any system that's actually capable of
51:33
doing test time adaptation, like test
51:35
time search, like O1 Pro or
51:37
O3, then you're getting much, much
51:39
better performance. There's this huge performance
51:41
gap. So, you can tell the
51:43
difference between the model that does
51:45
not do test time adaptation and
51:47
the model that does by looking
51:50
at this performance gap, this generalization
51:52
gap on ARC, also by looking
51:54
at latency and by looking at
51:56
cost. So, of course, a model
51:58
that does test time search is
52:00
going to give you an answer. It's
52:02
going to take much longer. Like
52:04
if you look at O1 Pro,
52:06
for instance, it's taking 10 minutes
52:08
to answer your queries, and it's
52:11
going to cost you much more
52:13
as well, because of all this
52:15
work it's doing.
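As a crude way to picture the gap being described (this is only an illustrative sketch, not a claim about how O1 Pro or O3 actually work), compare plain greedy decoding with the simplest possible form of test-time search, best-of-n sampling with a majority vote: the search variant spends roughly n times the compute and latency per answer in exchange for better reliability.

```python
import random
from collections import Counter

# Deliberately contrived stand-in for an LLM call (hypothetical): at temperature 0
# it is confidently wrong; at temperature 1, half of its samples hit the right answer.
def llm_sample(prompt, temperature):
    true_answer = 42
    if temperature == 0:
        return true_answer + 3                       # the single greedy completion is off
    return true_answer + random.choice([0, 0, 0, -1, 1, 7])

def greedy_answer(prompt):
    # Purely autoregressive baseline: one completion, fixed cost, fixed latency.
    return llm_sample(prompt, temperature=0)

def test_time_search_answer(prompt, n=32):
    # Best-of-n / self-consistency: sample many completions, keep the most common one.
    # Costs roughly n times the compute and wall-clock time of the greedy call.
    samples = [llm_sample(prompt, temperature=1) for _ in range(n)]
    return Counter(samples).most_common(1)[0][0]

print(greedy_answer("toy task"))            # cheap and fast, wrong by construction here
print(test_time_search_answer("toy task"))  # slower and costlier, almost always 42
```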
52:17
So I could download the DeepSeek R1
52:19
model, and I'm running it on
52:21
my machine, and as far as
52:23
my machine is concerned, it's just
52:25
a normal LLM, it's doing greedy
52:27
sampling, maybe. Oh, so you're saying
52:29
there is something different about...
52:31
O3 is qualitatively different. That's correct.
52:34
It is qualitatively different from all
52:36
the other models that came
52:38
before. It is actually a model
52:40
that has fluid intelligence. It has
52:42
a non-zero amount of fluid intelligence. And
52:44
R1, for instance, does not. Okay,
52:46
so categorically it's doing some kind
52:48
of active search process at inference. That's
52:50
what it looks like. So of
52:52
course, I don't actually know how
52:55
it works, but that's what I
52:57
would speculate it looks like. Yes.
52:59
And you see it in the
53:01
latency, in the cost, and of
53:03
course the ARC performance. Would you
53:05
be shocked and surprised if it
53:07
came to light that it was
53:09
just doing, like, autoregressive greedy sampling?
53:11
Honestly I think it's very very
53:13
unlikely, because it's completely incompatible with
53:16
the characteristics of the system that
53:18
we know of that we were
53:20
exposed to when we tested
53:22
O3. Awesome. And do you think
53:24
that there will always be human
53:26
gaps? Probably not always. Today there
53:28
are very clear, very significant gaps,
53:30
right? Like we're not actually that
53:32
close to AGI right now.
53:34
But eventually, you know, as we
53:36
get closer and closer, there will
53:39
be fewer and fewer gaps.
53:41
At some point, we're going to have AI that is just overwhelmingly superhuman across every possible dimension you could look at. So at that point, I don't think there will be gaps.
All right. Tim, thank you so much for doing this. We're looking forward to seeing you in a couple of weeks.