#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi

#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi

Released Monday, 24th March 2025
Good episode? Give it some love!
#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi

#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi

#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi

#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi

Monday, 24th March 2025
Good episode? Give it some love!
Rate Episode

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.

Use Ctrl + F to search

0:00

Hello and welcome to

0:02

the last week in

0:04

a podcast where you

0:06

can hear a chat

0:08

about what's going on

0:10

with AI. As usual

0:12

in this episode we

0:14

will be summarizing and

0:16

discussing some of last

0:18

week's most interesting AI

0:20

news. You can go

0:22

to the episode description

0:24

for all the links

0:26

and timestamps and also

0:28

to last week in

0:30

ai.com on your laptop to be

0:33

able to read those articles yourself

0:35

as well. As always, I'm one of

0:37

your hosts Andre Karenkov. I studied

0:39

AI in grad school and I

0:41

now work at the Genitive AI

0:44

startup Astrocade. And I'm your other host,

0:46

Jeremy Harris. I'm with Pladstone A.I. National

0:48

Security Company, which you know about if

0:50

you listen to the podcast, you also

0:53

know about Astricade now, a bunch of

0:55

you listen to, you know about all of this,

0:57

you know, what you don't know, though, is that this

0:59

morning, at the early hour, I think it

1:01

was like three or something in the morning,

1:03

I discovered that I have bats in

1:05

my house, which is fun, which is really

1:07

fun, especially when you have like a six-month

1:09

old, you have bats. and then you start

1:12

googling things. So anyway, you had

1:14

pest control come in. That's why, wow,

1:16

my hair looks like Cosmo Kramer

1:18

right now. I've just been running my

1:21

fingers through it for quite a bit.

1:23

So anyway, we got everything on for

1:25

showtime though, because the show will go

1:27

on. Yeah, but if you forget any

1:29

details wrong, you know, it's, it's the

1:31

shock residual shock, the bats, you know, maybe

1:34

that. It's just, I'll be on the lookout.

1:36

Well, let's do a quick preview

1:38

of what we'll be talking about

1:40

in this episode. It's going to

1:42

be a bit of a relaxed

1:45

one. There's nothing to sort of

1:47

world shattering, but a variety of

1:49

pretty interesting stories, tools and apps.

1:51

We have some new impressive

1:53

models out of China, some new

1:55

stuff from Open AI as well,

1:58

Google and FARPIC, everyone launched. stuff,

2:00

applications and business, as we often

2:02

do, going to be talking a

2:05

lot about hardware and GPUs, a

2:07

little bit about fundraising as well,

2:09

projects and open source, we'll be

2:11

talking about the model context protocol,

2:13

which has been all the rage

2:16

in the community recently, and a

2:18

couple new models. As usual, researching

2:20

investments, we got to talk about

2:22

reasoning techniques, inference time scaling techniques.

2:24

but also some new kind of

2:27

developments in the space of how

2:29

you implement your models. Policy and

2:31

safety, we have some more analysis

2:33

of what's going out of China,

2:35

US national security, things like that.

2:38

And finally, we will actually talk

2:40

a little bit about the world

2:42

of art and entertainment with some

2:44

news about copyright. So let's just

2:47

get straight into it. in tools

2:49

and apps. The first story is

2:51

about Baidu launching two new versions

2:53

of the Ernie model, Ernie 4.5

2:55

and Ernie X1. So Ernie initially

2:58

released two years ago and now

3:00

we have Ernie 4.5, presumably, I

3:02

don't know, it sounds like kind

3:04

of two coincided for, to PD

3:06

4. And then Ernie X1 is

3:09

the reasoning variant of Ernie that

3:11

Baidu says is on par with

3:13

Deep Seek R1, but at half

3:15

the price, and both of these

3:17

models are multimodal. They can process

3:20

videos, images, and audio as well.

3:22

They also say Ernie 4.5 is

3:24

kind of emotionally intelligent. They can

3:26

understand memes and satire, which is

3:28

interesting. So... I think we don't

3:31

have like a great sense of

3:33

the tool landscape in China is

3:35

my impression. I really wish I

3:37

would know like if you are

3:39

a user of a chat bot.

3:42

we got to charge BT or

3:44

Claude to give up queries, I

3:46

think it seems likely that Ernie

3:48

is sort of filling that role

3:50

and the fact that there's new

3:53

models and the fact that they're

3:55

really competitive price-wise is a big

3:57

deal. The like number one downloaded

3:59

app in China just switched to

4:02

a new AI chatbot that is

4:04

not Deep Seek. So things are

4:06

definitely moving. The big advantage here

4:08

with this launch seems to be

4:10

cost. At least that's what they're

4:13

leaning into with a lot of

4:15

the discussion around this. So the

4:17

goal that Baidu has, which Baidu

4:19

of course is China's, roughly China's

4:21

Google, right? They own search there.

4:24

Their goal is to progressively integrate

4:26

Ernie 4.5 and their X1 reasoning

4:28

model into all their product ecosystem,

4:30

including Baidu search, which is sort

4:32

of interesting. So we'll see a

4:35

rollout of the generative AI. capabilities

4:37

in that context. Yeah, so ultimately

4:39

it does come down to price

4:41

a lot of it. So for

4:43

context, there's really handy table in

4:46

one of the articles that looked

4:48

at this comparing GPT 4.5 per

4:50

token cost to Deep Seek v.

4:52

3 to Bernie, to Ernie 4.5,

4:54

it's quite interesting, right? So input

4:57

costs for input tokens, 75 bucks

4:59

for a million tokens. This is

5:01

for GPT 4.4. Deep Seek v3

5:03

that drops to basically 30 cents

5:06

earning 4.5 is about 60 cents

5:08

or so per 1 million tokens

5:10

So you know you're talking orders

5:12

of magnitude less Also the case

5:14

that these models are less performant

5:17

so that's sort of the tradeoff

5:19

there, but where things yeah, I

5:21

think just to give a bit

5:23

of a perspective Deep seek v

5:25

free is more comparable to something

5:28

like GPT for O in open

5:30

eyes, say models or free meaty

5:32

for instance where aggression isn't that

5:34

crazy it's maybe I forget one

5:36

dollar ish per million tokens so

5:39

they're comparable. GPD 4.5 is just

5:41

crazy crazy pricing compared to everything

5:43

else and that's a It's the

5:45

way to think about 4.5, I

5:47

think we touched on this a

5:50

couple episodes ago, but it's a

5:52

base model, but it's not a

5:54

base model for, let's say, mass

5:56

production, right? These are high, high

5:58

quality tokens, probably best used to

6:01

create things like synthetic data sets

6:03

or to answer very specific kinds

6:05

of questions, but you're not looking

6:07

at this as something that you

6:10

want to productize, just because you're

6:12

right. I mean, it's two orders

6:14

of magnitude more expensive than other

6:16

base models. Where you actually see

6:18

the lift here. especially for Ernie

6:21

X1, right, with this is the

6:23

reasoning model, is on the reasoning

6:25

side, right? So opening eyes is

6:27

roughly 50 times more expensive than

6:29

Ernie X1. Ernie X1 is about

6:32

half the cost of R1 for

6:34

input tokens, and actually that's also

6:36

true for output tokens. So it's

6:38

quite significant, especially again relative to

6:40

O1, and shows you. One of

6:43

two things, either Chinese engineering is

6:45

actually really really really that good,

6:47

or there's some state subsidy thing

6:49

going on in the background. I

6:51

think the latter is somewhat less

6:54

plausible at this point, though I

6:56

wouldn't rule it out. certainly there's

6:58

some amazing engineering making these these

7:00

margins possible and it's that's a

7:02

pretty remarkable thing here right I

7:05

mean the cost just collapsing for

7:07

reasoning this implies that there's some

7:09

reasoning specific engineering going on in

7:11

the background and you know you

7:14

should expect that to apply to

7:16

training as well as inference going

7:18

forward yeah and it's kind of

7:20

funny in a way there is

7:22

a parallel here between Baido and

7:25

Google where Google likewise has quite

7:27

competitive pricing, especially for Gemini to

7:29

flash thinking. So I could also

7:31

see it being, you know, just

7:33

a company strategy kind of thing.

7:36

Baidu is gigantic, they're printing money

7:38

with Search, so they could also

7:40

kind of eat the additional costs

7:42

to undermine something like Deep Seek,

7:44

which is a startup, right, too.

7:47

Lock in the market. But either

7:49

way, exciting news and I guess

7:51

if you're in China, I don't

7:53

believe you can use charge property.

7:55

So if nothing else, it's good

7:58

that they are comparable tools for

8:00

people to use. but not miss

8:02

out on the fun of these

8:04

advanced elements. I will say I

8:06

don't know that Baidu would be

8:09

subsidizing at the level of at

8:11

least their base model because they

8:13

are actually more expensive than Deep

8:15

Seek V3 on Ernie 4.5. Where

8:18

you see that flip is with

8:20

the reasoning models, which itself is,

8:22

yeah, that's kind of interesting, right?

8:24

I mean, to me at least,

8:26

that seems to imply something about

8:29

reasoning, like engineering for the like

8:31

computer architecture behind reasoning, or more

8:33

token efficiency and therefore compute efficiency

8:35

at the at the reason I

8:37

shouldn't say therefore maybe alternatively compute

8:40

efficiency at the reasoning stage but

8:42

you're right there's all kinds of

8:44

things start to muddy the water

8:46

is when you start thinking about

8:48

the economics of these things as

8:51

they represent a larger and larger

8:53

fraction of the corporate bottom line

8:55

even for big companies like I

8:57

do like Google these companies are

8:59

gonna be forced to show us

9:02

their hand in a sense right

9:04

they're gonna have to sell these

9:06

tokens for profit and we will

9:08

eventually learn what their actual margins

9:10

are what their actual margins are

9:13

It's debatable whether we're learning that

9:15

just yet. Yeah, I don't think

9:17

we are. It's very much unknown

9:19

and I haven't seen any kind

9:22

of strong analysis to explain it.

9:24

There's you know, yeah, it's just

9:26

a mystery what kind of tricks

9:28

people are pulling but I would

9:30

also kind of bet that the

9:33

margins aren't great. The one thing

9:35

we do know Deep Seek claimed

9:37

at least that they were making

9:39

a profit and had a positive

9:41

margin on their models and I

9:44

could see that not be the

9:46

case for, you know, for instance,

9:48

Open AI where their revenue is

9:50

in the billions, but the real

9:52

question is, are they actually making

9:55

a profit? Last thought on this

9:57

too, on the economic side, like

9:59

when we think about what it

10:01

means for Deep Sea to claim

10:03

that they're generating positive returns, I

10:06

think there's a... an important question

10:08

here about whether that's operating expenses

10:10

or CAPX factored in, right? So

10:12

we saw in their paper that

10:14

they famously talked about how they

10:17

trained V3 on six million dollars

10:19

of compute infrastructure. Now, or sorry,

10:21

on a six million dollar compute

10:23

budget, that was, it seems in

10:26

retrospect, the actual operating expenses of

10:28

running that compute, not the capital

10:30

expenses associated with. the tens of

10:32

millions of dollars as it would

10:34

have been of compute hardware. So

10:37

it's always hard to know like

10:39

what do you amortize? How do

10:41

you factor in what's apples to

10:43

apples? Yeah, it's hard to say

10:45

like deep seek is profitable, but

10:48

on a per token basis, just

10:50

for inference, I believe the claim

10:52

is they're making money which yeah,

10:54

on an all back spaces. Yeah,

10:56

interesting. Yeah. Moving right along, next

10:59

we have Open AI and they

11:01

are releasing new two new speech-to-tech

11:03

models, G.P.40 transcribed and G.P.40 mini

11:05

transcribe, which are basically replacing their

11:07

whisper models. Open has already had

11:10

this as a service for quite

11:12

a while. The exciting new thing

11:14

here is the text-to-speech model, G.P.40

11:16

mini dash-T-T-S, which is more long

11:18

lines of 11 labs where you

11:21

can produce very natural human sounding

11:23

speech and along with announcement of

11:25

the models open AI has also

11:27

launched and you cite open AI

11:29

dot f m which is a

11:32

demo site where you can go

11:34

and mess around and kind of

11:36

hear of the outputs and this

11:38

is kind of a fun trend

11:41

I gotta say where these companies

11:43

increasingly are launching these little fun

11:45

toys to get a sense for

11:47

what these models are capable of.

11:49

One last thing, again, probably should

11:52

comment on pricing. The pricing is

11:54

very competitive, the transcription. for GP40,

11:56

it's 0.6 cents per minute, 0.6

11:58

cents, so like 0.006 dollars, I

12:00

guess. And GD40, mini-TTS, is 1.5

12:03

cents per minute, which is much

12:05

lower than a competitor, like 11

12:07

Labs, for instance. So, yeah, I

12:09

think it's interesting to see opening

12:11

expanding their model suite to these

12:14

new domains where they're there sort

12:16

of less. focused. We've seen them

12:18

kind of move away from text

12:20

to image for instance, Delhi hasn't

12:22

had an update in forever. And

12:25

so I guess this makes a

12:27

lot of sense that they have

12:29

very competitive things to offer given

12:31

their investment in the advanced voice

12:33

mode in chatGPT. It's sort of

12:36

reminiscent of the problem that meta

12:38

faces, right, where they're, you know,

12:40

they reach like, whatever, 3 billion

12:42

people around the world, at a

12:45

certain point, when your market penetration

12:47

is so deep, one of the

12:49

only things you can do to

12:51

keep growing is to grow the

12:53

market. And so meta invests, for

12:56

example, in getting more people on

12:58

the internet, in other countries, like

13:00

in countries that don't have internet

13:02

access typically or have less of

13:04

it. And so they're literally just

13:07

trying to like grow the... pool

13:09

of people like they can they

13:11

can tap for this in the

13:13

same way I think there's a

13:15

lens on this that's similar right

13:18

so you're only able to interact

13:20

with chat gBT through certain modalities

13:22

or with open AI products for

13:24

certain modalities and by achieving greater

13:26

ubiquity by reaching into your life

13:29

more and making more of the

13:31

conversational tooling available to you that

13:33

that really does effectively increase their

13:35

their market right like you don't

13:37

have to be in front of

13:40

a computer necessarily or in the

13:42

same way or engaged in the

13:44

same way to use the product.

13:46

And obviously they've had other other

13:49

voice products before, but it's sort

13:51

of part of if I'm open

13:53

AI, I'm really thinking about multimodality,

13:55

but from the standpoint of increasing

13:57

the number of context, life context

14:00

in which I can reach you

14:02

and text to image still requires

14:04

you. to be in front of

14:06

a screen, same as writing text

14:08

on chat GPT, whereas audio is

14:11

just this like, you know, greater

14:13

reach for modality wise. So I

14:15

think strategically it's an interesting play

14:17

for them. ethically all kinds of

14:19

issues. I mean, you know, you

14:22

think about the modality of audio

14:24

as being one that is much

14:26

more intimate to humans and an

14:28

easier way to plug into your

14:30

inner world. And that's, I think,

14:33

something, you know, we look at,

14:35

like, what record did to people

14:37

just through text, right? The suicidal

14:39

ideation, the actual suicides, if nothing

14:41

else for open AI. There is

14:44

one figure by the way in

14:46

the article at least we're linking

14:48

to here and it's just a

14:50

piece of research looking at the

14:53

the word error rate comparisons across

14:55

leading models for different languages as

14:57

part of this kind of tooling

14:59

I just I find it really

15:01

interesting like Arabic and Hindi there's

15:04

a lot of struggle there those

15:06

are some of the worst performing

15:08

languages obviously English one of the

15:10

better performing ones I'd love to

15:12

see an overlay of this relative

15:15

to the amount of data that

15:17

was used to train the model,

15:19

so you can see in relative

15:21

terms, like, which languages are, in

15:23

a sense, like, harder for AI

15:26

to pronounce, to kind of speak.

15:28

I think there's something, anyway, linguistically

15:30

just fascinating about that, if nothing

15:32

else. So, anyway, overall, interesting launch,

15:34

and I think we're gonna see

15:37

more and more of this, right?

15:39

It's gonna be more expected to

15:41

have very high quality audio models

15:43

and linking them specifically to agents,

15:45

sort of Star Trek computer style.

15:48

Yeah, I guess one thing worth

15:50

noting on the kind of ethics

15:52

side is I don't believe they're

15:54

offering voice cloning technology, which is

15:57

where you can really get into

15:59

trouble very easily. So I think

16:01

opening up is being a little

16:03

careful these days in general to

16:05

not court controversy. That's part of

16:08

why it took them forever to

16:10

release SORA potentially. And in this.

16:12

API, this demo, they are releasing

16:14

something like a dozen voices you

16:16

can use with names like alloy,

16:19

ash, echo, fable, onyx, nova, kind

16:21

of, I don't know, human, I

16:23

guess we are not even trying

16:25

to make human sounding, and you

16:27

can also assign them a vibe

16:30

in this demo like cowboy, auctioneer,

16:32

all-timey, serene, with a lot of

16:34

this kind of steering that you

16:36

can do as well. So. Yeah,

16:38

I think it's pretty exciting. And

16:41

as ever will release some new

16:43

APIs, which really enables downstream of

16:45

Open AI and these other companies

16:47

for others to build exciting new

16:49

applications of AI. And onto a

16:52

few more quick stories. Next up

16:54

also Open AI, they have released

16:56

01-1-Pro into their developer API. it's

16:58

actually limited to developers who spent

17:01

of these five dollars on this

17:03

and it cost a hundred fifty

17:05

per million tokens for input and

17:07

six hundred dollars per million tokens

17:09

generated so that's very very high

17:12

prices obviously that's as we've said

17:14

GP 4.5 was seventy five dollars

17:16

for one million output tokens and

17:18

that's yeah two two orders of

17:20

magnitude easily above what you would

17:23

typically charge. Yeah, I'm trying to

17:25

think if it's two or three

17:27

orders of magnitude. It might be

17:29

approaching three orders of market here

17:31

actually. So yeah, interesting strategy here

17:34

from Open Eye where you haven't

17:36

seen any other companies release these

17:38

very expensive products yet and Open

17:40

Eye is increasingly doing that with

17:42

Chesapeake Pro, they're $200 per month

17:45

subscription with QB4. It makes me

17:47

wonder if this is an attempt

17:49

to become more profitable or if

17:51

this is them sort of testing

17:53

waters There could be various readings

17:56

I suppose. Yeah, it's also I

17:58

mean, it's interesting to note this

18:00

is not an order of magnitude

18:02

larger than what GPT-3's original pricing

18:05

was. I was just looking it

18:07

up in the background here to

18:09

check because I seem to remember

18:11

it being you know on it

18:13

back then it was priced per

18:16

per a thousand tokens with reasoning

18:18

models you tend to see more

18:20

per million tokens just because of

18:22

the number of tokens generated but

18:24

sort of reminds me in the

18:27

military or the history of the

18:29

military there is often this this

18:31

restriction where it's like people can

18:33

only carry I forget what it

18:35

is, 60 pounds or something of

18:38

equipment. And so over time, you

18:40

tend to see like the amount

18:42

of equipment that a soldier carries

18:44

doesn't tend to change or the

18:46

weight of it, but of course

18:49

the kind of equipment they carry

18:51

just changes to reflect technology. This

18:53

sort of seems similar, right? There's

18:55

like almost a perito frontier of

18:57

pricing, at least for the people

19:00

who are willing to reach for

19:02

the most intelligent products, you know,

19:04

and you're constantly reaching for it.

19:06

This is a push-forative. Even relative.

19:08

There's all kinds of feedback people

19:11

have been getting. There's complaints about,

19:13

though, this model struggled with Sudoku

19:15

puzzles, apparently, and optical illusions and

19:17

things like that. People say, you

19:20

know, at a certain point, anything

19:22

you launch at a high-priced point,

19:24

especially if you're opening eye, people

19:26

will complain that it's not like

19:28

super intelligence. And so... Yeah, but

19:31

there's also an interesting parallel here

19:33

where O1-Pro... just in terms of

19:35

benchmarks and I think in general

19:37

in terms of the vibe of

19:39

what people think is that it's

19:42

not sort of significantly better than

19:44

01 and that parallels duty 4.5

19:46

you know it's better but it's

19:48

not sort of a huge leap

19:50

so there is an interesting kind

19:53

of a demonstration of probably it's

19:55

harder to get you know huge

19:57

leaps and performance and people are

19:59

going to be more critical now

20:01

of if you were not offering

20:04

something that's like you know really

20:06

between Jupiter 4 3.5 and 4

20:08

for instance Yeah, I think it's

20:10

quite use case specific too, right?

20:12

So as we've seen, you know,

20:15

the kinds of issues people are

20:17

running into optical illusions, you know,

20:19

Sudoku puzzles, this sort of thing,

20:21

are pretty far from the standard,

20:24

you know, the actual workloads that

20:26

Open AI is targeting, right? Their

20:28

focus is, can we build something

20:30

that helps us automate AI research

20:32

as quickly as possible? Those sorts

20:35

of benchmarks, yeah, we are seeing

20:37

needle moving there. There's also some

20:39

interesting stuff that we'll talk about

20:41

for meter, suggesting that in fact

20:43

that is what's happening here. That

20:46

on those particular kinds of tasks,

20:48

we're seeing pretty significant. acceleration with

20:50

scale, but you're right, right? It's

20:52

this funny uneven surface, just like

20:54

how humans are funny and uneven,

20:57

right? Like you have a really

20:59

talented artist who can't write a

21:01

line of code to save their

21:03

lives, right? And vice versa. So

21:05

another instance of the paradox of

21:08

what's hard for AI isn't necessarily

21:10

hard for humans. And moving away

21:12

from Open AI to Google, we

21:14

now have another feature. Another instance

21:16

of canvas, this time on Gemini,

21:19

and they're also adding audio overview.

21:21

So I don't know why they

21:23

do this, why these ALMs just

21:25

copy each other's names. We had

21:28

deep research showing up in multiple

21:30

variants. Now we have a canvas,

21:32

which is also on ChadGPT, and

21:34

I think on the topic, it's

21:36

called Artifacts. Basically the same idea,

21:39

where now as you're working on

21:41

something like code, for instance. or

21:43

like a web app, for instance,

21:45

you can have a site panel

21:47

showing this living document rendering of

21:50

it with a chatbot to the

21:52

left, so you can essentially interactively

21:54

work and see a preview of

21:56

what you're getting. And you also

21:58

have audio overviews, which is pretty

22:01

much something like notebook, LM, that

22:03

you can upload documents and get

22:05

this podcast. style conversation going on.

22:07

So nothing sort of conceptually new

22:09

going on here, but I think

22:12

an interesting convergence across the board

22:14

of all these tools, everyone has

22:16

canvas, everyone has deep research, everyone

22:18

seems to have kind of the

22:20

same approach to implementing LM interfaces.

22:23

Speaking of that, in fact, the

22:25

next story is about Enfropic and

22:27

them adding web search capabilities to

22:29

Claude. That is now in preview

22:32

for paid users in the US

22:34

and that will basically work the

22:36

same as it does in Chajabiki

22:38

and other models. You can enable

22:40

it to work with cloud 3.7

22:43

and then it will be able

22:45

to provide direct citations from web-source

22:47

information. So yeah, not much else

22:49

to say we're getting a website

22:51

for cloud which will enable it

22:54

to be more useful. It's interesting

22:56

because it's like the tee up

22:58

to this is anthropic being a

23:00

little bit more shy than other

23:02

companies to roll up the the

23:05

web search product into their Into

23:07

their agents and I mean this

23:09

is consistent with the threat models

23:11

that they take seriously right things

23:13

like loss of control right which

23:16

typically involve you know an AI

23:18

model going on to the internet

23:20

maybe replicating its weight somehow and

23:22

internet access is kind of central

23:24

to a lot of these things

23:27

I don't know if that was

23:29

part of this, it at least

23:31

is consistent with it. So the

23:33

result is that there may be

23:36

a little bit later the party

23:38

than others. Apparently, according to these

23:40

initial tests, you don't always see

23:42

web search used for current events

23:44

related questions. But when that happens,

23:47

you do get these nice in

23:49

line citations, pull from sources, it

23:51

does look at social media, and

23:53

then of course news sources. like

23:55

NPR, like Reuters, they cite in

23:58

the examples they show. So, you

24:00

know, pretty standard product in the

24:02

inline citation approach that you see

24:04

with deep research, for example, certainly

24:06

making an appearance here. And last

24:09

up, again, along the lines of

24:11

these stories. we have XAI launching

24:13

a new API, this one for

24:15

generating images. So we have a

24:17

new model called Rock 2 Image

24:20

12, and you can now query

24:22

it. For now it's quite limited.

24:24

You can only generate 10 images

24:26

per request and you are limited

24:28

to 5 requests per second. The

24:31

cost there is 7 cents per

24:33

image. which is slightly above what,

24:35

for instance, Blackforest Labs charges. They

24:37

are the developers of Flux and

24:40

competitive with another offering for my

24:42

diagram. So I think, yeah, interesting

24:44

to see XAI expanding their APIs

24:46

once again, they released their own

24:48

image generation back in December and

24:51

it's kind of looked competitive with

24:53

something like Google's latest generation of

24:55

where we focus is really shifted

24:57

towards careful instruction following in your

24:59

image generation. So yeah, X AI

25:02

is as ever trying to catch

25:04

up or moving quite rapidly to

25:06

expand their offerings. Yeah, they really

25:08

are. And I think when we

25:10

first covered Blackforest Labs partnership with

25:13

X AI, one of the first

25:15

things that we said was like,

25:17

hey, this is. because I think

25:19

they raised a big round, right,

25:21

on the back of the incredible

25:24

distribution that they were going to

25:26

get through X AI and the

25:28

kind of vote of confidence that

25:30

reflects it from Elon. But at

25:32

the time we were talking about,

25:35

hey, you know, this is a

25:37

pretty strategically dicey position for Blackforth

25:39

Labs because the one thing we've

25:41

consistently seen from from all the

25:44

AI companies is Once they start

25:46

getting you in for chat, eventually

25:48

they start rolling out multimodal features,

25:50

and it's not clear that those

25:52

aren't best built in house for

25:55

any number of reasons, not just

25:57

including the fact that you want

25:59

to kind of internalize all the

26:01

revenues you can from the whole

26:03

from the whole stack, but also

26:06

once you have a good reasoning

26:08

model or rather a good foundation

26:10

model, that foundation model can be

26:12

mine for multimodality. post hoc and

26:14

you just kind of get to

26:17

amortize your investment across more modalities

26:19

and so it's just this natural

26:21

move to kind of keep crawling

26:23

into you're creeping into adjacent markets

26:25

like image generation video generation which

26:28

is also something that X AI

26:30

is looking at so yeah I

26:32

mean it's kind of interesting for

26:34

Blackforce Labs this probably is going

26:36

to be a big challenge for

26:39

them I don't know how extensive

26:41

their partnership continues to be at

26:43

this point but it's a dicey

26:45

time to be to be one

26:47

of these companies. And on two

26:50

applications and business we begin with

26:52

some announcements from invidia. There's a

26:54

preview of their plans in 2026

26:56

and 2027. They have the Ruben

26:59

family of GPUs coming in 2026

27:01

and then Ruben Ultra in 2027.

27:03

So that will also come along

27:05

with a new, I guess. server

27:07

layout with ability to combine 576

27:10

GPUs per rack, which, you know,

27:12

I guess it's very much following,

27:14

you know, the tracks of very,

27:16

very crazy enhancement to computing that

27:18

invidious been able to continue creating

27:21

with, you know, B200, I believe

27:23

it's now, and now this is

27:25

their plans for the next couple

27:27

years. Yeah, there's a lot going

27:29

on with this update. It's actually

27:32

pretty interesting and quite significant, especially

27:34

on the data center side, in

27:36

terms of the infrastructure that will

27:38

be required to accommodate these new

27:40

chips. A couple things here, right?

27:43

So there is this configuration of

27:45

the Blackwell called the NBL 72,

27:47

is the certain name of this

27:49

configuration. This is where you have,

27:51

so, okay, imagine a tray that

27:54

you're in a slot into. a

27:56

rack, a server rack, right? So

27:58

on that tray, you're gonna have

28:00

four GPUs. All right, so each

28:03

trache. contains four GPUs, and in

28:05

total, in that whole rack, you're

28:07

gonna have 72, I'm sorry, you're

28:09

actually gonna have 144 GPUs total,

28:11

but because two of those GPUs

28:14

show up on the same motherboard,

28:16

God. So each freaking, each freaking

28:18

tray that you slot into the

28:20

rack has two motherboards on it,

28:22

each of those motherboards has two

28:25

GPUs, two B200 GPUs. So in

28:27

total, you're putting in four GPUs

28:29

per tray. But they're kind of

28:31

divided into two motherboards, each with

28:33

two GPUs. Anyway, this led to

28:36

the thing being called the NFL

28:38

72, when in reality there's 144

28:40

GPUs on there, at least Jensen-Hwang

28:42

says it would have been more

28:44

appropriate to call it the NFL-144-L.

28:47

Okay. What's actually interesting in this

28:49

setup, they're calling the Ruben-Nvil-L-144 rack.

28:51

There's not more GPUs there. It's

28:53

not that there's twice as many

28:55

GPUs as the Nvil-72 with the

28:58

black wells. It's just that they're

29:00

counting them differently now. So they're

29:02

saying, actually, we're going to count

29:04

all the GPUs. So if, I

29:07

think back in the day we

29:09

did talk about the Nvil-72 setup,

29:11

this is basically just the same

29:13

number of GPUs. If that didn't

29:15

make any sense, just delete it

29:18

from your mind. It's both some

29:20

of the things that are actually

29:22

interesting. The story is it's comparable

29:24

in the number of GPUs to

29:26

the current set of top line

29:29

GPUs. So they're kind of pitching

29:31

it as you can spot it

29:33

into your existing infrastructure more or

29:35

less and just to jump into

29:37

numbers a little bit, you're getting

29:40

roughly three times the inference and

29:42

training performance in terms of just

29:44

rock compute memory is faster by

29:46

close to two-ish or multiplier of

29:48

two kind of like yeah you're

29:51

seeing multipliers on top of the

29:53

current one so quite significant change

29:55

in performance if you do upgrade

29:57

yeah so so when it comes

29:59

to Rubin right which is the

30:02

the sort of next generation coming

30:04

online at FP4, you're seeing, yeah,

30:06

3X more flops, right, three times

30:08

more, more logic capacity. Now the,

30:11

on the memory side, things actually

30:13

do get somewhat interesting. The memory

30:15

capacity is going to be 288

30:17

gigabytes per GPU, right? That is

30:19

the same as the B300. So

30:22

no actual change in terms of

30:24

the, like, per GPU. memory capacity.

30:26

We'll get back to why that

30:28

matters a bit less in a

30:30

second, but that's kind of part

30:33

of the idea. The memory bandwidth

30:35

is improving. It's almost doubling. Maybe,

30:37

yeah, it's short of doubling. So

30:39

the memory bandwidth is really, really

30:41

key, especially when you look at

30:44

inference. So that's one of the

30:46

reasons why this is really being

30:48

focused on. But there's also a

30:50

bunch of things like the, so

30:52

the cables that connect GPUs together.

30:55

on roughly speaking on one rack,

30:57

if you want to imagine it

30:59

that way. Those are called envy

31:01

link cables, super super high bandwidth.

31:03

Those are doubling in throughput. So

31:06

that's really a really big advance.

31:08

There's also stuff happening on the

31:10

networking side, but we don't need

31:12

to touch that. Bottom line is

31:15

envy link cables used to be

31:17

the way you connected GPUs across

31:19

different. trays in the same rack

31:21

and maybe adjacent racks depending on

31:23

the configuration. But it's very local,

31:26

very very tight, very high bandwidth

31:28

communication. What's happening here is each

31:30

of these motherboards, you know, that

31:32

you're slotting into your rack, they

31:34

have a CPU and two GPUs.

31:37

And we talked about this in

31:39

the hardware episode as to why

31:41

that is. The CPUs like the

31:43

orchestra conductor, the GPUs are like

31:45

the instruments that, you know, they're

31:48

actually doing the hard work in

31:50

the heavy lifting. Typically the CPU

31:52

would be connected to the GPUs

31:54

through a PCIE connection. So this

31:56

is a relatively low bandwidth compared

31:59

to NVLink. Now they're moving over

32:01

to NVLink as well for the

32:03

CPU 2 GPU connection. That's actually

32:05

a real... really big deal. It

32:07

comes with a core to core

32:10

interface. So right now, the GPUs

32:12

and CPUs are going to share

32:14

a common memory space. So essentially

32:16

directly accessing each other's memory. Whatever

32:19

is in memory on the CPU,

32:21

the GPU can access right away

32:23

and vice versa. That's a really,

32:25

really big change. It used to

32:27

not be the case. You used

32:30

to have independent CPU and GPU

32:32

memory. the GPUs themselves would share

32:34

a common memory space if they

32:36

were connected via NVLink and in

32:38

fact that's kind of that's part

32:41

of the idea here that's what

32:43

makes them a coherent wad of

32:45

compute and it's also part of

32:47

the reason why the memory capacity

32:49

on those GPUs matters a bit

32:52

a bit less because you're adding

32:54

you're kind of combining all your

32:56

GPUs together and they have a

32:58

shared memory space so if you

33:00

can just add to the number

33:03

of GPUs, you have, you're effectively

33:05

adding true memory capacity. So that's

33:07

kind of an important difference there.

33:09

So anyway, last thing I'll mention,

33:11

they say that apparently Ruben Ultra

33:14

is going to come out. This

33:16

is, so there's going to be

33:18

Ruben and then Ruben Ultra. Ruben

33:20

Ultra is coming out the second

33:23

half of 2027. It'll come with

33:25

a Ruben GPU and a Vera

33:27

CPU, like invidia tends to do,

33:29

right? And so Vera is the

33:31

CPU, Reuben is the GPU. Apparently,

33:34

the full rack is going to

33:36

be replaced by this 576 GPU

33:38

setup, a massive number. That is

33:40

essentially, so they don't specify the

33:42

power consumption, but it's clear from

33:45

other kind of industry products that

33:47

are coming out. We're tracking for

33:49

one megawatt per rack. And just

33:51

worth emphasizing. That's a thousand kilowatts.

33:53

That is a thousand homes worth

33:56

of power going to a single

33:58

rack in a server, in a

34:00

data center. That's insane, right? So

34:02

the power density required for this

34:04

is going through the roof, the

34:07

cooling requirements, all this stuff. It's

34:09

all really cool. And anyway, this

34:11

is a very, very big motion.

34:13

Just to dive in a little.

34:15

it into the numbers, just fun,

34:18

right? So the compute numbers are

34:20

in terms of flops, which is

34:22

floating point operations per second, basically

34:24

multiplications per second or additions per

34:26

second. And the numbers we get

34:29

with these announced upcoming things like

34:31

their Rubin are now for inference,

34:33

3.6 XA flops of inference. So

34:35

XA is... is Quintilian. It's 10

34:38

to the 18. Quintilian is the

34:40

one after Quintilian. So I can't

34:42

even imagine how many, I mean,

34:44

I guess I know how many

34:46

zeros it is, but it's very

34:49

hard to imagine a number that

34:51

long. And that's just where we

34:53

are at. Also worth mentioning, so

34:55

this is the plans for 2026,

34:57

2027. They also did announce for

35:00

late of a year the coming

35:02

of B300, which is B300. Also,

35:04

you know, is an improvement in

35:06

performance of about 1.5. They also

35:08

didn't announce the ultra-bariance of Blackwell,

35:11

both 200 and 300. And the

35:13

emphasis they are starting to add,

35:15

I think, is more on the

35:17

inference side. They definitely are saying

35:19

that these are good models for

35:22

the age of reasoning. So they're

35:24

capable of outputting things fast in

35:26

addition to training well and that's

35:28

very important for reasoning of course

35:30

because the whole idea is you're

35:33

using up more tokens to get

35:35

better performance. So they're giving some

35:37

numbers like for instance Blackwell Ultra

35:39

will be able to deliver up

35:42

to 1,000 tokens per second on

35:44

Deep Seek R1 and that's you

35:46

know comparable usually you would be

35:48

seeing something like 100-200 tokens per

35:50

second a thousand dollars per second

35:53

is very fast. Yeah, and then

35:55

the inference focus is reflected too.

35:57

in the fact that they're looking

35:59

at you know FP4 flops denominated

36:01

performance right so when you go

36:04

to inference often you're your inferencing

36:06

quantized models inferencing at FP4 so

36:08

lower resolution and also this the

36:10

the memory bandwidth side becomes really

36:12

important for inference disproportionately relative to

36:15

training at least on the current

36:17

paradigm so that's kind of you

36:19

know part of the reason that

36:21

you're seeing those big big lifts

36:23

at that that end of things

36:26

is because of the inference. And

36:28

next story is also about some

36:30

absurd sounding numbers of hardware. This

36:32

one is from Apple. They have

36:34

launched a Mac studio offering and

36:37

the top line configuration where you

36:39

can use the entry ultra chip

36:41

with 32 CPU's 80 core GPUs

36:43

that can even run the Deep

36:46

Seek R1 model. That's the 671

36:48

billion parameter AI model. Fewer at

36:50

inference, you're using about 37 billion

36:52

per output I believe, but still,

36:54

this is hundreds of gigabytes of

36:57

memory necessary to be able to

36:59

run it and just fit it

37:01

in there. Yeah, Apple's also doing

37:03

this weird thing where they're not

37:05

designing like GPUs for their data

37:08

centers, including for AI workloads. They

37:10

seem to be basically like doing

37:12

souped up CPUs, kind of like

37:14

this, with just like gargantuan amounts

37:16

of VRAM, that again, have this

37:19

very large kind of shared pool

37:21

of memory, right? We talked about

37:23

like coherent memory on the Blackwell

37:25

side, right, and on the Rubin

37:27

side, just the idea that. if

37:30

you have a shared memory space,

37:32

you can pool these things the

37:34

other. Well, they're not as good

37:36

at the shared memory space between

37:38

CPUs. What they do is they

37:41

have disgusting amounts of RAM, one

37:43

GPU, right? So like, 512 gigs

37:45

is, it's just wild. Like, it's,

37:47

anyway. for a CPU at least.

37:50

And we're talking here about, when

37:52

you say memory, we mean really

37:54

something like RAM, right? And so

37:56

if you have a laptop, right,

37:58

if you buy a Mac, for

38:01

instance, typically you're getting eight gigabytes,

38:03

maybe 16 gigabytes of RAM, the

38:05

fast type of memory, read, what

38:07

does it, read something memory? It's

38:09

random access memory, right. As opposed

38:12

to the slower memory of. Let's

38:14

say an SSD or things where

38:16

you can easily get terabytes to

38:18

get that crazy amount of Random

38:20

access memory is insane when you

38:23

consider that typically it's like eight

38:25

16 gigabytes and you know This

38:27

is expensive memory. It's it's stupid

38:29

expense. It's also like Yeah, there's

38:31

different kinds of RAM and we

38:34

talked about that in our hardware

38:36

episode This is a combined CPU

38:38

GPU set up by the way

38:40

so 32 core CPU 80 core

38:42

GPU but shared memory across the

38:45

board. So V-Ram is like really

38:47

close to the logic, right? So

38:49

this is like the most, as

38:51

you said, exquisitely expensive kind of

38:54

memory you can put on these

38:56

things. They're opting to go in

38:58

this direction for very interesting reasons,

39:00

I guess. I mean, it does

39:02

mean that they're disadvantaged in terms

39:05

of being able to scale their

39:07

data center, you know, infrastructure, because

39:09

of networking, at least as far

39:11

as I can tell. It's a

39:13

very interesting standalone standalone machine. I

39:16

mean, it's a pretty wild specs.

39:18

Right. Yeah, exactly. If you go

39:20

to the top line offerings, and

39:22

this is, you know, a physical

39:24

product you can buy as a...

39:27

Yeah, it's a Mac, right? Yeah,

39:29

it's a Mac, yeah, it's a

39:31

Mac, it's like a big, kind

39:33

of cube-ish thing. And if you

39:35

go to the top line of

39:38

recreation, it's something like $10,000, don't

39:40

quote me on that, but it's,

39:42

you know, you know, expensive, expensive,

39:44

expensive, expensive, expensive, as well. It

39:46

does come with other options for

39:49

a reason M4 Max CPU and

39:51

GPUs less powerful than an free

39:53

ultra, but anyway. Very kind of

39:55

beefy offering now from Apple. Next

39:58

we have something a bit more

40:00

forward-looking. Intel is apparently reaching an

40:02

exciting milestone for the 18A1.8 nanometer

40:04

class wafers with a first run

40:06

at the Arizona Fab. So this

40:09

is apparently ahead of schedule. They

40:11

have these Arizona Fabs. Fab 52

40:13

and Fab 62, Fab as we've

40:15

covered before is where you try

40:17

to make your chips and 1.8

40:20

nanometer is the next kind of

40:22

frontier in terms of scaling down

40:24

the resolution. the density of logic

40:26

you can get on a chip.

40:28

So the fact that they're running

40:31

these test wafers, they're ensuring that

40:33

you can transfer the fabrication process

40:35

to these new Arizona facilities, I

40:37

guess the big deal there is

40:39

partially that these are located within

40:42

the US, within Arizona, and they

40:44

are seemingly getting some success and

40:46

are ahead of schedule, as you

40:48

said, and that's impressive because fabs

40:50

are an absurdly complex engineering project.

40:53

Yeah Intel is in just this

40:55

incredibly fragile space right now as

40:57

has been widely reported and we've

40:59

talked about that a fair bit

41:01

I mean they need to blow

41:04

it out of the water with

41:06

with 18a in their future notes

41:08

I mean this is like a

41:10

make-or-break stuff so forward progress yeah

41:13

they had their their test facility

41:15

in Hillsborough Oregon who that was

41:17

doing 18a production as you said

41:19

on a test basis and they're

41:21

now successfully getting the first test

41:24

wafers in their new Arizona Fab

41:26

out so that's great But yeah,

41:28

it'll eventually have to start running

41:30

actual chips for commercial products. The

41:32

big kind of distinction here is

41:35

they're actually manufacturing with 18A, these

41:37

gate all around transistors. I think

41:39

we talked about this in the

41:41

hardware episode. We'll go into too

41:43

much detail. This is a specific

41:46

geometry of transistor that allows you

41:48

to have better control over the

41:50

flow of electrons through your transistor,

41:52

essentially. It's a big, big challenge

41:54

people have had in making transistors

41:57

small and smaller. You get all

41:59

kinds of current leakage. The current,

42:01

by the way, is sort of

42:03

like the thing that carries information

42:05

in your computer. And so you

42:08

want to make sure that you

42:10

don't have current leakage to kind

42:12

of have ones become zeros or

42:14

let's say operation, like, you know,

42:17

a certain kind of gate, turn

42:19

into the wrong kind of gate.

42:21

That's the idea here. this gate-all-round

42:23

transistor based on a ribbon-fed design.

42:25

And yeah, so we're seeing that

42:28

come-to-market gate-all-round is is something that

42:30

TSMC is moving towards as well.

42:32

And you know, it's just going

42:34

to be the next, essentially the

42:36

next beat of production. So here

42:39

we have 18A, kind of early

42:41

signs of progress. And now moving

42:43

away from hardware more to business-y

42:45

stuff, XAI has acquired a genative

42:47

AI startup. which is focused on

42:50

text to video similar to SORA.

42:52

They also have AI powered, or

42:54

initially they're worked on AI powered

42:56

for the tools and then pivoted.

42:58

So I suppose on surprising in

43:01

a way that they are working

43:03

on text to video as well.

43:05

They just want to have all

43:07

the capabilities at XAI, and this

43:09

presumably will make that easier to

43:12

do. Yeah, one of the founders

43:14

had some quote, I think it

43:16

might have been on X, I'm

43:18

not sure, but he said, we're

43:21

excited to continue scaling these efforts

43:23

on the largest cluster in the

43:25

world, Colossus, as part of XAI,

43:27

so it seems like they'll be

43:29

given access to Colossus as part

43:32

of this, maybe not shocking, but

43:34

kind of an interesting subnote. They

43:36

were backed by some really impressive

43:38

VCs as well, so Alexis Zahanian,

43:40

as were like famous for like

43:43

being the co-founder, read it of

43:45

course, and doing his own VC

43:47

stuff, and SV Angel too. So

43:49

pretty interesting acquisition and a nice

43:51

soft landing too for folks in

43:54

a space that otherwise, you know,

43:56

I mean, they're either going to

43:58

acquire you or they're going to

44:00

eat your lunch. So I think

44:02

that's probably the best outcome for

44:05

people working on the different modalities,

44:07

at least. at least on my

44:09

view of the market. Yeah, and

44:11

I guess acquisition makes sense. The

44:13

startup has been around for over

44:16

two years and they have already

44:18

trained multiple video models. Hotshot Excel

44:20

and hotshot, and they do produce

44:22

quite good looking videos, so. make

44:25

some sense for opening eye to

44:27

acquire them, if only for the

44:29

kind of brain power and expertise

44:31

in that space. Yeah, man, they're

44:33

old. They've been around for like

44:36

two years, right? Like, yeah, that

44:38

was what, free SORA or like

44:40

a round-of-time SORA. Yeah, yeah, it's

44:42

funny. It's just funny how the

44:44

AI business cycle is so short,

44:47

like, like, these guys have been

44:49

around for all the 24 months.

44:51

Very experts, very veterans. And onto

44:53

the last story, Tencent is reportedly

44:55

making massive and video H20 chip

44:58

purchases. So they are supposedly meant

45:00

to support the integration of Deep

45:02

Seek into We Chat, which kind

45:04

of reminds me of meta where

45:06

meta has this somewhat interesting drive

45:09

to... Let you use a llama

45:11

everywhere and Instagram and all their

45:13

messaging tools. So this would seem

45:15

to be in a way similar

45:17

where Tencent would allow you to

45:20

use deep seek within we chat.

45:22

Yeah, part of what's going on

45:24

here too is the standard stockpiling

45:26

that you see China do and

45:29

Chinese companies do ahead of the

45:31

anticipated crackdown from the United States

45:33

on export controls. And in this

45:35

case, the age 20 has been

45:37

kind of identified as one of

45:40

those chips that's likely to likely

45:42

to be... shut down for the

45:44

Chinese market in the relatively near

45:46

term. So it makes all the

45:48

sense in the world that they

45:51

would be stockpiling for that purpose.

45:53

But it is also the case

45:55

that you got R1 that has

45:57

increased dramatically the demand for access

45:59

to hardware. It's sort of funny

46:02

how quickly we pivoted from, oh

46:04

no, R1 came out and so

46:06

invidious stock crashes to, oh, actually

46:08

R1. great news for invidia. Anyway,

46:10

I think it's the turnaround that

46:13

we sort of expected. We talked

46:15

about this earlier and there has

46:17

been apparently a short-term supply shortage

46:19

in China regarding these age 20

46:21

chips. So like there's so much

46:24

demand coming in from 10 cent

46:26

that it's this sort of like

46:28

rate limiting for invidia to get

46:30

age 20s into the market there.

46:33

So kind of interesting, they've previously

46:35

placed orders on the order of

46:37

hundreds of thousands between them and

46:39

bite dance. back I think last

46:41

year was almost a quarter million

46:44

of these these Japanese so yeah

46:46

pretty big customers and onto projects

46:48

and open source we begin with

46:50

a story from the information titled

46:52

on tropics not so secret weapon

46:55

that's giving agents a boost I

46:57

will say kind of a weird

46:59

spin on this whole story but

47:01

anyway that's the one we're linking

47:03

to and it covers the notion

47:06

of MCP model context protocol which

47:08

Anthropic released all the way back

47:10

in November. We hopefully covered it.

47:12

I guess we don't know. I

47:14

was trying to remember, yeah. Yeah,

47:17

I think we did. And the

47:19

reason we're coming in now is

47:21

that it sort of blew up

47:23

over the last couple of weeks

47:25

if you're in the AI developer

47:28

space or you see people hacking

47:30

on AI, that it has been

47:32

to talk of a town, so

47:34

to speak. So Molokontax protocol, broadly

47:37

speaking is something... like an API,

47:39

like a standardized way to build

47:41

ports or mechanisms for AI agents

47:43

or AI, I guess, models to

47:45

call on services. So it standardizes

47:48

the way you can provide things

47:50

like tools. So there's already many,

47:52

many integrations following the standard for

47:54

things like slack, perplexity, notion, etc.

47:56

where if you adopt a particle

47:59

and you provide an FCP compatible

48:01

kind of opening, you can then

48:03

have an MCP client, which is

48:05

your AI model, call upon this

48:07

service. And it's very much like

48:10

an API for a website, where

48:12

you can have a particular URL

48:14

to go to, particular kind of

48:16

parameters, and you get something back

48:18

in some format. Here, the difference

48:21

is that, of course, this is

48:23

more specialized for. AI models in

48:25

particular, so it provides tools, it

48:27

provides like a prompt to explain

48:29

the situation, things like that. Personally,

48:32

I'm in the camp of people

48:34

who are a bit confused and

48:36

kind of think that this is

48:38

an API for an API kind

48:40

of situation, but either way, it

48:43

has gotten very popular. That is

48:45

exactly what it is. Yeah, it's

48:47

an API for an API. It's

48:49

also, I guess, a transition point.

48:52

or could be viewed that way,

48:54

you know, in the sense that

48:56

eventually you would expect models to

48:58

just kind of like figure it

49:00

out, you know, and have enough

49:03

context and ability to uncover whatever

49:05

information is already on the website.

49:07

to be able to use tools

49:09

appropriately, but there are edge cases

49:11

where you expect this to be

49:14

worthwhile. Still, this is going to

49:16

reduce things like hallucination of tools

49:18

and all kinds of issues that

49:20

when you talk about agents, like

49:22

one failure anywhere in a reasoning

49:25

chain or in an execution chain,

49:27

it can cause you to fumble.

49:29

And so, you know, this is

49:31

structurally a way to address that

49:33

and quite important in that sense.

49:36

It is also distinct from a

49:38

lot of the tooling that opening

49:40

eyes come out with, but it

49:42

sounds similar like the agent's API.

49:44

where they're focused more on chaining

49:47

tool uses together, whereas MCP, as

49:49

you said, is more about helping

49:51

make sure that each individual instance

49:53

of tool use goes well, right?

49:56

That the agent has what it

49:58

needs to kind of ping the

50:00

tool properly interact with it and

50:02

find the right tool rather than

50:04

necessarily chaining them together. So there

50:07

you go. MCP. is a nice

50:09

kind of clean open source play

50:11

for anthropic too. They are going

50:13

after that kind of more startup

50:15

founder and business ecosystem. So pretty

50:18

important from a marketing standpoint for

50:20

them too. Right, yeah, exactly. So

50:22

back in November, they announced this,

50:24

they introduced us as an open

50:26

standard, and they also released open

50:29

source repositories with some example of

50:31

model context product called servers. and

50:33

as well the specification and like

50:35

a development toolkit. So I honestly

50:37

haven't been able to track exactly

50:40

how this blew up. I believe

50:42

there was some sort of tutorial

50:44

given at some sort of convention

50:46

like the AI engineer convention or

50:48

something and then it kind of

50:51

took off and everyone was very

50:53

excited about the idea of model

50:55

context protocols right now. Moving on

50:57

to new models, we have Mistral

51:00

dropping a new open source model

51:02

that is comparable to 2P400 mini

51:04

and is smaller. So they have

51:06

Mistral small 3.1, which is seemingly

51:08

better than similar models, but only

51:11

has 24 billion parameters. Also can...

51:13

take on more input tokens, 120,000

51:15

tokens, and is fairly speedy at

51:17

150 tokens per second. And this

51:19

is being released under the Apache

51:22

2 license, meaning that you can

51:24

use it for whatever you want,

51:26

business implications, etc. I don't think

51:28

there's too much to say here,

51:30

other than like kind of a

51:33

nitpick here, but they say it

51:35

outperforms copper models like Gemini 3.

51:37

GD40 mini while delivering infant speeds

51:39

as you said of 150 tokens

51:41

per second but like you can't

51:44

just say that shit like it

51:46

doesn't mean anything to say yeah

51:48

it depends on what infrastructure you're

51:50

using yeah what's the stack dude

51:52

like like you know like I

51:55

can move at a hundred like

51:57

at a hundred miles an hour

51:59

if I'm in a Tesla that's

52:01

what makes me anyway so they

52:04

do give that information but it's like

52:06

buried like in the little like great

52:08

this is from their blog post where

52:10

we get these numbers so I guess

52:12

as with any of these model

52:15

announcements you go to the company

52:17

log you would get a bunch

52:19

of numbers on benchmarks showing

52:21

that it's the best You

52:23

have comparisons to Gemma free

52:26

from Google to coherent, AIIA,

52:28

Jupiter for a mini, cloud

52:30

3.5 haiku, and on all

52:32

of these things like MMLU,

52:34

human evil, math, it typically is

52:36

better, although, you know, I would

52:38

say it doesn't seem to be

52:40

that much better than Gemma at

52:43

least, and in many cases

52:45

is not better than 3.5 haiku

52:47

and Jupiter for a mini, but

52:49

yeah, still quite good. the 150 tokens

52:52

per second to for context is a batch

52:54

size 16 on 4 H 100s. They actually

52:56

like even in the technical post they write

52:58

while delivering inference speeds of 150 tokens

53:00

per second without further qualifying but it's

53:02

in like it's in this like small

53:05

gray text underneath an image that you

53:07

have to look for where you actually

53:09

find that context so don't expect this

53:11

to run at 150 tokens per second

53:13

on your laptop right that's that's just

53:15

not going to happen because you know

53:17

for H 100s is like that's quite

53:19

a lot of horsepower horsepower. still yeah

53:21

it's incremental improvement more

53:24

more open source coming

53:26

from from Israel and patchy

53:29

2.0 license so you know highly

53:31

permissive and one more model

53:33

you have exa on deep

53:35

reasoning enhanced language models coming

53:38

from LG AI research so

53:40

these are new models new

53:42

family models 2.4 billion seven

53:45

point eight billion and 32

53:47

billion parameters These are

53:49

optimized for reasoning

53:51

tasks and seemingly

53:54

are on par or

53:56

out of our performing

53:58

variations of R1. are one is

54:01

the giant one that's 671

54:03

billion. There's distilled

54:05

versions of those models

54:08

at comparable sizes and

54:10

in the short technical

54:12

report that they provide they

54:14

are showing that it seems

54:16

to be kind of along the lines

54:18

of what you can get with

54:21

those distilled R1 models

54:23

and similar or better

54:25

than Open AI01 mini.

54:27

Yeah, it's also, it's kind of interesting,

54:29

it seems to, I described, again, not

54:32

a lot of detail in the paper,

54:34

so it makes it hard to reconstruct,

54:36

but it does seem to be at

54:38

odds with some of the things that

54:40

learn in the Deep Seek R1 paper,

54:42

for example. So there's, they start

54:45

with, it seems, an instruction tuned

54:47

model, base model, the XA1,

54:49

3.5 instruct models, and then they

54:51

add onto that. a bunch of fine tuning,

54:53

they do supervised fine tuning, presumably this

54:56

is for like the reasoning structure, and

54:58

then DPO, so standards, or like, RL

55:00

stuff, and online RL. So, you know, this

55:02

is, you know, quite a bit of supervised

55:04

fine tuning of trying to teach the model

55:07

how to solve problems in the way you

55:09

wanted to solve them, rather than just like

55:11

giving it a reinforcement learning signal

55:13

and reward signal and like kind

55:15

of have at it like R10 did.

55:18

So yeah, kind of an interesting

55:20

alternative more, let's say. more

55:22

inductive prior laden approach and

55:24

first time as well that I

55:26

think we've covered anything from

55:28

LGA high risk so yeah

55:30

Exxon these models appear to

55:33

have already been existing and

55:35

being released Exxon 3.5 we're

55:37

back from December which be

55:39

somehow missed at the time

55:41

yeah fun fact Exxon stands

55:43

for expert AI for everyone

55:45

you got a lot when people

55:48

come up these acronyms in a

55:50

very creative way And they are

55:52

open sourcing it on hugging face

55:54

with some restrictions, this is primarily

55:56

for research usage. Odd to research and

55:59

investment. we begin the paper

56:01

sample scrutinize and scale effective inference

56:03

time search by scaling verification it

56:05

is coming from Google and let

56:07

me just double check also you

56:09

see Berkeley and it is quite

56:12

exciting I think as a paper

56:14

it basically is making the case

56:16

or presenting the idea of that

56:18

to do scaling inference time scaling

56:20

via reasoning where you have the

56:23

model you know once you already

56:25

train your model once you stop

56:27

updating your weights can you use

56:29

your model more to get smarter

56:31

if you just kind of do

56:33

more outputs in some way and

56:36

what you've seen in recent months

56:38

is inference time scaling via reasoning

56:40

where you have the model you

56:42

know output a long chain of

56:44

tokens where it does various kinds

56:46

of strategies to do better at

56:49

a given complicated task like planning

56:51

sub steps like verification backtracking these

56:53

things we've already covered. Well this

56:55

paper is saying instead of that

56:57

sort of scaling the output in

56:59

terms of a chain an extended

57:02

chain of tokens you can instead

57:04

sample many potential outputs like just

57:06

do a bunch of outputs from

57:08

scratch. And I kind of varied

57:10

up so you get many possible

57:12

solutions. And then if you have

57:15

a verifier and you can kind

57:17

of compare and combine these different

57:19

outputs, you can be as effective

57:21

or even in some cases more

57:23

effective than the kind of traditional

57:25

reasoning, inference time scaling paradigm. Again,

57:28

yeah, quite interesting in the paper.

57:30

They have a table one where

57:32

they are giving this example of

57:34

if you sample a bunch of

57:36

outcomes and you have their... that

57:38

is good, you can actually outperform

57:41

01 preview. And many other techniques,

57:43

you can get better numbers on

57:45

the hard reasoning benchmarks like Amy,

57:47

where you can solve eight out

57:49

of the 15 problems now, which

57:51

is insane, but that's where we

57:54

are. And on math and live

57:56

bench math and live bench reasoning.

57:58

So yeah, very interesting idea to

58:00

sample a bunch of solutions and

58:02

then just. compare them and can

58:04

combine them into one final output.

58:07

Yeah, I think this is one

58:09

of those key papers that, again,

58:11

I mean, we see this happen

58:13

over and over. Scaling is often

58:15

a bit more complex than people

58:18

assume, right? So at first, famously

58:20

we had pre-training, scaling, pre-training, compute,

58:22

that brought us from, you know,

58:24

GPD2, G3. to GPD4, now we're

58:26

in the inference time compute paradigm

58:28

where, you know, this analogy that

58:31

I like to use, like, you

58:33

have 30 hours to dedicate to

58:35

doing well on a test, you

58:37

get to choose how much of

58:39

that time you dedicate to studying,

58:41

how much you dedicate to actually

58:44

spending writing the test. And what

58:46

we've been learning is, you know,

58:48

scaling, pre-training computers, basically all study

58:50

time. And so if you just

58:52

do that and you effectively give

58:54

yourself like one second to write

58:57

the test, well, there's only so

58:59

well you can do, right? You

59:01

eventually start to saturate if you

59:03

just keep growing pre-training and you

59:05

don't grow inference time compute. You

59:07

don't invest more time at test

59:10

time. So essentially you have two-dimensional

59:12

scaling. If you want to get

59:14

the sort of indefinite returns, if

59:16

you want the curves to just

59:18

keep going up and not saturate,

59:20

you have to scale two things

59:23

at the same time. This is...

59:25

a case like that, right? This

59:27

is a case of a scaling

59:29

law that would be hidden to

59:31

a naive observer just looking at

59:33

this as a kind of a

59:36

one variable problem when in reality

59:38

it's a multivariate problem and suddenly

59:40

when you account for that you

59:42

go, oh wow, there's a pretty

59:44

robust scaling trend here. And so

59:46

what are these two variables? Well,

59:49

the first is scaling the number

59:51

of sampled responses, so just the

59:53

number of shots on goal, the

59:55

number of attempts that your model

59:57

is going to make at solving

59:59

a given problem, but you have

1:00:02

to improve verification capabilities at the

1:00:04

same time, right? So they're asking

1:00:06

the question, what test time scaling

1:00:08

trends come up as you scale

1:00:10

both the number of sampled responses

1:00:12

and your verification capabilities? One of

1:00:15

the things crucially that they find

1:00:17

though, is you might not evenly

1:00:19

think if you have a... a

1:00:21

verifier, right? And I want to

1:00:23

emphasize this is not a verifier

1:00:26

that has access to ground truth,

1:00:28

right? This is like, and I

1:00:30

think they use like Claude 3.7

1:00:32

for this, 3.7, so on and

1:00:34

so on and for this, but

1:00:36

basically there's a model that's going

1:00:39

to look at, say the 20

1:00:41

different possible samples in other words

1:00:43

that you get from your model

1:00:45

to try to... solve a problem.

1:00:47

And this verifier model is going

1:00:49

to contrast them and just determine,

1:00:52

based on what it knows, not

1:00:54

based on access to actual ground

1:00:56

truth or any kind of symbolic

1:00:58

system like a calculator, it's just

1:01:00

going to use its own knowledge

1:01:02

in the moment to determine which

1:01:05

of those 20 different possible answers

1:01:07

is the right one. And what

1:01:09

you find is, as you scale

1:01:11

the number of possible answers that

1:01:13

you have your verifier look at,

1:01:15

you might not even think well...

1:01:18

with just so much gunk in

1:01:20

the system, probably the verifiers performance

1:01:22

is going to start to drop

1:01:24

over time. It's just like it's

1:01:26

so much harder for it to

1:01:28

kind of remember which of these

1:01:31

was that good and which was

1:01:33

bad and all that stuff. So

1:01:35

eventually you would expect the kind

1:01:37

of trend to be that your

1:01:39

performance would saturate maybe even drop.

1:01:41

But what they find is the

1:01:44

opposite. The performance actually keeps improving

1:01:46

and improving. And the reason for

1:01:48

that seems to be that as

1:01:50

you increase the number of. samples,

1:01:52

the number of attempts to solve

1:01:54

a problem, the probability that you

1:01:57

get a truly exquisite answer that

1:01:59

is so much better than the

1:02:01

others, like that contrasts so much

1:02:03

with the median other answer, that

1:02:05

it's really easy to pick out,

1:02:07

increases. And that actually makes the

1:02:10

verifiers... easier. And so they refer

1:02:12

to this as an instance of

1:02:14

what they call implicit, I think

1:02:16

it was implicit scaling, was the

1:02:18

term, yeah, implicit scaling, right? So

1:02:20

essentially the, yeah, this idea that

1:02:23

you're more likely to get one

1:02:25

like exquisite outlier that favorably contrasts

1:02:27

with the crappy median samples. And

1:02:29

so in this sense, I mean,

1:02:31

I feel like the term verifier

1:02:34

is maybe not the best one.

1:02:36

Really what they have here is

1:02:38

a contraster. When I hear verifier,

1:02:40

I tend to think of ground

1:02:42

truth. I tend to think of

1:02:44

something that is actually kind of

1:02:47

like, you know, giving us, you

1:02:49

know, checking, let's say code and

1:02:51

seeing if it compiles properly. Yeah.

1:02:53

compiler, you could say, where it

1:02:55

takes a bunch of possible outputs

1:02:57

and then from all of that

1:03:00

picks out the best. kind of

1:03:02

guess to answer. Exactly, yeah, and

1:03:04

this is why I don't know

1:03:06

if it's not a term that

1:03:08

people use, but if it was

1:03:10

like contrast or is it really

1:03:13

what you're doing here, right? You're

1:03:15

like, you're sort of doing, in

1:03:17

a way, a kind of contrast

1:03:19

of learning, well, not learning necessarily,

1:03:21

because it's all happening at inference

1:03:23

time, but yeah. So the performance

1:03:26

is really impressive. There are implications

1:03:28

of this for the design of

1:03:30

these systems as well, so you're

1:03:32

trying to find ways to wrangle

1:03:34

problems to wrangle problems into a

1:03:36

shape where you can take advantage

1:03:39

of implicit scaling. where you can

1:03:41

have your model pump out a

1:03:43

bunch of responses in the hopes

1:03:45

that you're going to get, you

1:03:47

know, one crazy outlier that makes

1:03:49

it easier for the verifier to

1:03:52

do its job. So yeah, again,

1:03:54

you know, I think a really

1:03:56

interesting case of multi-dimensional scaling laws,

1:03:58

essentially, that are otherwise easily missed

1:04:00

if you don't invest in both

1:04:02

verifier performance and sampling at the

1:04:05

same time. Exactly, and this is,

1:04:07

I think, important context to provide.

1:04:09

The idea of sampling many answers

1:04:11

and just kind of picking out

1:04:13

the answer that occurred the most

1:04:15

times in all these samples is

1:04:18

a well-known idea. There's also, I

1:04:20

mean... Self-consistency. Self-consiciency, exactly. a majority

1:04:22

vote essentially is one well-established technique

1:04:24

to get better performance and there's

1:04:26

even more complex things you could

1:04:28

do also with like a mixture

1:04:31

of, I forget the term that

1:04:33

exists, but the general idea of

1:04:35

paralyzed generation of outputs and generating

1:04:37

multiple outputs potentially of multiple models

1:04:39

is well known and the real

1:04:42

insight as you said here is

1:04:44

that you need a strong verifier

1:04:46

to be able to really leverage

1:04:48

it. in the stable one they

1:04:50

show that if you compare like

1:04:52

you do get better performance quite

1:04:55

a bit if you just do

1:04:57

consistency if you just sample 200

1:04:59

responses and pick out the majority

1:05:01

compared to the thing where you

1:05:03

don't do any scaling you're getting

1:05:05

four problems out of 15 on

1:05:08

Amy as opposed to one you're

1:05:10

getting you know a significant jump

1:05:12

in performance but after 200 if

1:05:14

you go to 1, you basically

1:05:16

stop getting better. But if you

1:05:18

have a strong verifier, basically a

1:05:21

strong selector among the things, instead

1:05:23

of majority voting, you have some

1:05:25

sort of intelligent way to combine

1:05:27

the answers, you get a huge

1:05:29

jump, a huge difference between just

1:05:31

consistency at 100 and verification at

1:05:34

200. And one reason this is

1:05:36

very important is, first, they also

1:05:38

highlight the fact that verification is

1:05:40

a bit understudy, I think, in

1:05:42

the space of LLLMs, even introduce

1:05:44

a benchmark specifically for a verification.

1:05:47

The other reason that this is

1:05:49

very notable is that if you're

1:05:51

doing sampling based techniques, you can

1:05:53

paralyze the sampling and that is

1:05:55

different from extending your reasoning or

1:05:57

your search because reasoning via more

1:06:00

tokens is sequential, right? It's going

1:06:02

to take more time. Scaling via

1:06:04

sampling you can paralyze all the

1:06:06

samples. then just combine them, which

1:06:08

means that you can get very

1:06:10

strong reasoning at comparable timescales to

1:06:13

if you just take one output,

1:06:15

for instance. So that's a very

1:06:17

big deal. Yeah. Another key reason

1:06:19

why this works too, we covered

1:06:21

a paper, I think it was

1:06:23

months ago, that was pointing out

1:06:26

that if you look at all

1:06:28

the alignment techniques that people used

1:06:30

to get more value out of

1:06:32

their their models, right? They kind

1:06:34

of assumed this one. query one

1:06:37

output picture. Like let's align within

1:06:39

that context. Whereas in reality, what

1:06:41

you're often doing, especially with agents

1:06:43

and inference time compute, is you

1:06:45

actually are sampling like a large

1:06:47

number of outputs and you don't

1:06:50

care about how shitty the average.

1:06:52

generated sample is generated solutions. What

1:06:54

you care about is, is there

1:06:56

one exquisite one in this batch?

1:06:58

And what they did in that

1:07:00

alignment paper is they found a

1:07:03

way to kind of like upweight

1:07:05

just the most successful outcomes and

1:07:07

use that for a reward signal

1:07:09

or some kind of gradient update

1:07:11

signal. But this is sort of

1:07:13

philosophically aligned with that, right? It's

1:07:16

saying. you know, like you said,

1:07:18

self consistency is the view that

1:07:20

says, well, let's just do wisdom

1:07:22

in numbers and say we generated,

1:07:24

you know, I don't know, like

1:07:26

a hundred of these outputs. And

1:07:29

the most common response, most consistent

1:07:31

answer was this one. So let's

1:07:33

call this the one that we're

1:07:35

going to cite as our output.

1:07:37

But of course, if your model

1:07:39

has a consistent failure mode, that

1:07:42

can also cause you to true

1:07:44

self-consistency, identify that, you know, kind

1:07:46

of settle on that failureor mode.

1:07:48

in this big pot and that's

1:07:50

really what this is after. So

1:07:52

a lot of interesting times to

1:07:55

other lines of research as you

1:07:57

said and I think a really

1:07:59

interesting and important paper. And next

1:08:01

up we have the paper block

1:08:03

diffusion interpreting between outer aggressive and

1:08:05

diffusion language models and also I

1:08:08

think quite an interesting one. So

1:08:10

typically when you're using an OLM

1:08:12

you're using outer aggression which is

1:08:14

just saying that you're computing one

1:08:16

talk at a time, right? You

1:08:18

start with one word, then you

1:08:21

select the next word, then you

1:08:23

select the next word. You have

1:08:25

this iterative process, and that's a

1:08:27

limitation of traditional alms, because you

1:08:29

need to sequentially do this one

1:08:31

step at a time. You can't

1:08:34

generate an entire output sequence all

1:08:36

at once, as opposed to diffusion,

1:08:38

which is a generation mechanism, a

1:08:40

way to the generation that is

1:08:42

kind of giving you an entire

1:08:45

answer at once. And diffusion is

1:08:47

the thing that's typically used for

1:08:49

image generation where you start with

1:08:51

just a noisy image, a bunch

1:08:53

of noise, you iteratively update the

1:08:55

entire image all at once until

1:08:58

you get to a good solution.

1:09:00

And we covered, I believe, maybe

1:09:02

a week ago or two weeks

1:09:04

ago, a story of a diffusion-based

1:09:06

alum that was seemingly performed pretty

1:09:08

well. There was a company that

1:09:11

was a company that was a

1:09:13

company that was a company that

1:09:15

was a company that made the

1:09:17

claim, although we didn't provide too

1:09:19

much research on it. Well, this

1:09:21

paper is talking about, well, how

1:09:24

can we combine the strengths of

1:09:26

both approaches? The weakness of diffusion

1:09:28

is typically just doesn't work as

1:09:30

well for LLLMs, and there's kind

1:09:32

of various hypotheses, it's an interesting

1:09:34

question of why it doesn't work,

1:09:37

but it also doesn't work for

1:09:39

arbitrary length. You can only generate

1:09:41

a specific kind of horizon. and

1:09:43

some other technical limitations. So the

1:09:45

basic proposal in the paper is,

1:09:47

well, you can have this idea

1:09:50

of block diffusion where you still

1:09:52

sample aggressively, like sample one step

1:09:54

at a time, but instead of

1:09:56

sampling just one word or one

1:09:58

token, you use diffusion to generate

1:10:00

a chunk of stuff. So you

1:10:03

generate several tokens all at once

1:10:05

of diffusion in parallel. and then

1:10:07

you ought to aggressively keep doing

1:10:09

that and you get to be,

1:10:11

you know, the best of both

1:10:13

worlds. so to speak. So an

1:10:16

interesting idea, kind of architecture you

1:10:18

haven't seen before and potentially could

1:10:20

lead to stronger or faster models.

1:10:22

Yeah, it's also more paralyzable, right?

1:10:24

So the big advantage because these

1:10:26

within these blocks you're able to

1:10:29

just like denoise in one shot

1:10:31

with all the text in there.

1:10:33

It means you can parallelize more.

1:10:35

They try, I think, with blocks

1:10:37

of various sizes, like sizes of

1:10:39

four tokens, for example, right? So,

1:10:42

like, what's denoise four tokens at

1:10:44

a time? They do it with

1:10:46

this interesting kind of like, well,

1:10:48

it's essentially, they put masks on

1:10:50

and kind of gradually remove the

1:10:53

masks on the tokens as they

1:10:55

denoise. That's their interpretation of how

1:10:57

denoising would work. The performance is

1:10:59

lower, obviously, than state of the

1:11:01

art for. autoregressive models. Think of

1:11:03

this more as a proof of

1:11:06

principle that there are favorable, favorable

1:11:08

scaling characteristics. There's some promise here.

1:11:10

In my mind, this fits into

1:11:12

the kind of some of the

1:11:14

mamba stuff where, you know, the

1:11:16

next logical question is, okay, like,

1:11:19

how much scale will this work

1:11:21

at? And then do we see

1:11:23

those lost curves eventually with some

1:11:25

updates, some some hardware, jiggery, and

1:11:27

then crossing the loss curves that

1:11:29

we see for traditional auto-regressive modeling.

1:11:32

I mean, it is interesting either

1:11:34

way. They found a great way

1:11:36

to kind of break up this

1:11:38

problem. And one reason speculatively that

1:11:40

diffusion does not work as well

1:11:42

with text is partly that like

1:11:45

you tend to think as you

1:11:47

write. And so your previous words

1:11:49

really will affect in a causal

1:11:51

way where you're going and trying

1:11:53

to do diffusion. at the, you

1:11:55

know, in parallel across a body

1:11:58

of text like that, from an

1:12:00

inductive prior standpoint, doesn't quite match

1:12:02

that intuition, but could very well

1:12:04

be wrong. It's sort of like

1:12:06

top of mind thought, but anyway,

1:12:08

it's a good paper. The question

1:12:11

is always, will it scale, right?

1:12:13

And there are lots of good

1:12:15

proofs of principle out there for

1:12:17

all kinds of things, whether they

1:12:19

end up getting reflected in in

1:12:21

scaled training runs is the big

1:12:24

question. Exactly, and this does require

1:12:26

quite different training and you know

1:12:28

models from what all these other

1:12:30

alums are so decent chances won't

1:12:32

have a huge impact just because

1:12:34

you have to change up and

1:12:37

all these train models already are

1:12:39

alums in the out of aggressive

1:12:41

sense diffusion is the whole new

1:12:43

kind of piece of a puzzle

1:12:45

that isn't typically worked on but

1:12:47

nevertheless interesting as you said similar

1:12:50

to Mombop. Next, we have communication

1:12:52

efficient language model training, scales reliably,

1:12:54

and robustly scaling laws for deloco,

1:12:56

distributed law computation training. And I

1:12:58

think I'll just let you take

1:13:01

this one, Jeremy, since I'm sure

1:13:03

you dove deep into this. Oh,

1:13:05

yeah. I mean, so I just

1:13:07

think deloco, if you're interested in

1:13:09

anything from like national security to

1:13:11

the future data center design, is.

1:13:14

And you have China competition generally

1:13:16

is just so important and in

1:13:18

this general thread, right, well, like

1:13:20

together AI, distributed training, all that

1:13:22

stuff. So as a reminder, when

1:13:24

you're thinking about deloco, like how

1:13:27

does it work? So this basically

1:13:29

is the answer to a problem,

1:13:31

which is that traditional training runs

1:13:33

in like data peril training, like

1:13:35

in data centers today, happens in

1:13:37

this way where you have a

1:13:40

bunch of number crunching and a

1:13:42

bunch of communication, like a burst

1:13:44

of communication that has to happen.

1:13:46

at every time step you like

1:13:48

share gradients you update like across

1:13:50

all your GPUs you update your

1:13:53

model weights and then you go

1:13:55

on to the next the next

1:13:57

mini batch right or the next

1:13:59

part of your data set and

1:14:01

you repeat right run your computations

1:14:03

calculate the gradients update the model

1:14:06

weights and so on and there's

1:14:08

this bottleneck communication bottleneck that comes

1:14:10

up when you do this at

1:14:12

scale where you're like just waiting

1:14:14

for the communication to happen and

1:14:16

so the question then is going

1:14:19

to be okay well what if

1:14:21

we can set up Sort of

1:14:23

smaller pockets, because you're basically waiting

1:14:25

for the stragglers, right? The slowest

1:14:27

GPUs are going to dictate when

1:14:29

you find... sync up everything and

1:14:32

you can move on to the

1:14:34

next stage of training. So what

1:14:36

if we could set up a

1:14:38

situation where we have a really

1:14:40

small pocket of like a mini

1:14:42

data center in one corner working

1:14:45

on its own independent copy of

1:14:47

the model that's being trained? And

1:14:49

then another mini data center doing

1:14:51

the same and another and another

1:14:53

and very rarely we have an

1:14:56

outer loop where we do a

1:14:58

general upgrade of the whole thing

1:15:00

so that we're not constrained by

1:15:02

the lowest common denominator straggler in

1:15:04

that group. So this is going

1:15:06

to be the kind of philosophy

1:15:09

behind Du Loco. You have an

1:15:11

outer loop that essentially is going

1:15:13

to, in a more, think of

1:15:15

it as like this like wise

1:15:17

and slow, slow loop that updates

1:15:19

based on what it's learned from

1:15:22

all the local data centers that

1:15:24

are running their own training runs.

1:15:26

And then within each local data

1:15:28

center, you have this much more

1:15:30

radical, aggressive loop, more akin to

1:15:32

what we see in traditional data

1:15:35

parallel training in data centers, which,

1:15:37

you know, they're running. Anyway, we

1:15:39

have a whole episode on Deloco.

1:15:41

Check it out. I think we

1:15:43

talk about the atom optimizer or

1:15:45

atom W optimizer that runs at

1:15:48

the local level, and then the

1:15:50

sort of like more gradient descent

1:15:52

nester off momentum. Optimizer on the

1:15:54

outer loop. The details there don't

1:15:56

matter too much. This is a

1:15:58

scaling law paper basically. What they're

1:16:01

trying to figure out is how

1:16:03

can we study the scaling laws

1:16:05

that predict the performance, let's say,

1:16:07

of these models based on how

1:16:09

many model copies, how many many

1:16:11

data centers we have running at

1:16:14

the same time, and the size

1:16:16

of the models that we're training.

1:16:18

And they test at meaningful scales.

1:16:20

They go all the way up

1:16:22

to 10 billion parameters. And essentially

1:16:24

they were able through their their

1:16:27

scheme through their hyper parameter optimization

1:16:29

to reduce the total communication required

1:16:31

by a factor of over a

1:16:33

hundred in their scheme. It's a

1:16:35

great great paper. I think one

1:16:37

of the Wilder things about it

1:16:40

is that they find that even

1:16:42

when you just have a single

1:16:44

replica or let's say a single

1:16:46

mini data center, you still get

1:16:48

a performance lift. This is pretty

1:16:50

wild, like relative to the current

1:16:53

like purely data parallel scheme. So

1:16:55

it benefits you to have this

1:16:57

kind of radical quick updating interloop

1:16:59

and then to add on top

1:17:01

of that this slow wise outer

1:17:04

loop, even if you only have

1:17:06

a single data center that's like

1:17:08

you could you could just do

1:17:10

just the radical interloop and that

1:17:12

would be enough. But by adding

1:17:14

this like more strategic level of

1:17:17

optimization that comes from the Nestorov

1:17:19

momentum, the sort of slower gradient

1:17:21

updates for the outer loop, you

1:17:23

get better performance. That's highly counterintuitive,

1:17:25

at least to me, and it

1:17:27

does suggest that there's a kind

1:17:30

of stabilizing influence that you're getting

1:17:32

from just that new outer loop.

1:17:34

Last thing I'll mention is from

1:17:36

a sort of national security standpoint,

1:17:38

one important question here is how

1:17:40

fine-grained can this get? Can Deloco

1:17:43

continue to scale successfully if we

1:17:45

have not... one, not three, not

1:17:47

eight, but like a thousand of

1:17:49

these many data centers, right? If

1:17:51

that happens, then we live in

1:17:53

a world where essentially we're doing

1:17:56

something more like bit torrent, right,

1:17:58

for training models at massive scale,

1:18:00

and we live in a world

1:18:02

where it becomes a lot harder

1:18:04

to oversee training runs, right? If

1:18:06

we do decide that training models

1:18:09

at scale introduces WMP level capabilities

1:18:11

through cyber risk, through biarisk, whatever,

1:18:13

It actually gets to the point

1:18:15

where if there's a mini data

1:18:17

center on every laptop and every

1:18:19

GPU, if Deloca, that is the

1:18:22

promise of Deloco in the long

1:18:24

run, then yeah, what is the

1:18:26

meaningful tool set that policy has,

1:18:28

that government has, to make sure

1:18:30

that things don't get misused, that

1:18:32

you don't get the proliferation of

1:18:35

WMD level capabilities in these systems.

1:18:37

So I think it's a really,

1:18:39

really interesting question, and this paper

1:18:41

is a step in that direction.

1:18:43

like eight essentially of these many

1:18:45

data centers, but I think we're

1:18:48

going to see a lot more

1:18:50

experiments in that direction in the

1:18:52

future. Right, and this is following

1:18:54

up on. their initial paper, Deloco,

1:18:56

back in September of 2024, I

1:18:58

think we mentioned as a time

1:19:01

that it's quite interesting to see

1:19:03

Google, Google, deep mind publishing this

1:19:05

work because it does seem like

1:19:07

something you might keep secret. It's

1:19:09

actually quite impactful for how you

1:19:12

build your data centers, right? And

1:19:14

once again, you know. They do

1:19:16

very expensive experiments to train you

1:19:18

know billion parameter models to verify

1:19:20

that compared to the usual way

1:19:22

of doing things This is Comparable

1:19:25

let's say and can achieve similar

1:19:27

performance. So you know a big

1:19:29

deal if you're a company that

1:19:31

is building out data centers and

1:19:33

spending billions of dollars onto a

1:19:35

couple quicker stories because we are

1:19:38

as always starting to run out

1:19:40

of time. The first one is

1:19:42

Transformers without normalization, and that I

1:19:44

believe is from meta. They're introducing

1:19:46

a new idea called Dynamic 10H,

1:19:48

which is a simple alternative to

1:19:51

traditional normalization. So just keep a

1:19:53

simple, you have normalization, which is

1:19:55

when you make everything sum up

1:19:57

to one as a typical thing,

1:19:59

typical step in transformer architectures. And

1:20:01

what they found in this paper

1:20:04

is. you can get rid of

1:20:06

that if you add this little

1:20:08

computational step of 10H, basically a

1:20:10

little function that kind of flattens

1:20:12

things out. It ends up looking

1:20:14

similar to normalization. And that's quite

1:20:17

significant because normalization requires you to

1:20:19

do a computation over, you know,

1:20:21

a bunch of outputs all at

1:20:23

once. This is a per output

1:20:25

computation that... could have a meaningful

1:20:27

impact on the total computation kind

1:20:30

of requirements of the transformer. Next

1:20:32

up, I think, Jeremy, you mentioned

1:20:34

this one, we have an interesting

1:20:36

analysis measuring AI abdish. to complete

1:20:38

long tasks. And it is looking

1:20:40

at how 13 frontier AI models,

1:20:43

going from 2019 to 2025, how

1:20:45

long a time horizon they can

1:20:47

handle, and they found that the

1:20:49

ability to get to 50% task

1:20:51

completion time horizon, so on various

1:20:53

tasks that require different amounts of

1:20:56

work, That has been doubling approximately

1:20:58

every seven month and they have

1:21:00

you know a curve fit That's

1:21:02

kind of like a more's law

1:21:04

basically kind of introducing the idea

1:21:06

of a very strong trend towards

1:21:09

the models improving on this particular

1:21:11

measure Yeah, this is a really

1:21:13

interesting paper generated a lot of

1:21:15

a lot of discussion including by

1:21:17

the way a tweet from Andrew

1:21:20

Yang who treated this paper. He

1:21:22

said, guys, AI is going to

1:21:24

eat a shit ton of jobs.

1:21:26

I don't see anyone really talking

1:21:28

about this meaningfully in terms of

1:21:30

what to do about it for

1:21:33

people. What's the plan? Kind of

1:21:35

interesting, because like, I don't know,

1:21:37

it's, it's worlds colliding here a

1:21:39

little bit, right, that the political

1:21:41

and the, the like deep AGI

1:21:43

stuff, but yeah, this out of

1:21:46

meter, I think one of the

1:21:48

important caveats that's been thrown around

1:21:50

and fairly so is How long

1:21:52

does a task have to be

1:21:54

before an AI agent fails at

1:21:56

about 50% of the time? Right?

1:21:59

They call that the 50% time

1:22:01

horizon. And the observation is that,

1:22:03

yeah, this, as you said, like,

1:22:05

that time horizon has been increasing

1:22:07

exponentially quite quickly. Like, it's been

1:22:09

increasing doubling every seven months, as

1:22:12

they put it, which itself, I

1:22:14

mean. kind of worth flying. Doubling

1:22:16

every seven months, training compute for

1:22:18

frontier AI models grows, doubles every

1:22:20

six months or so. So it's

1:22:22

actually increasing at about the same

1:22:25

rate as training compute that we.

1:22:27

Now, that's not fully causal. There's

1:22:29

other stuff besides training compute that's

1:22:31

increasing the actual performance of these

1:22:33

models, including algorithmic improvements. But still,

1:22:35

it does mean we should expect

1:22:38

kind of an exponential course of

1:22:40

progress towards ASI if our benchmark

1:22:42

of progress towards ASI is this

1:22:44

kind of like 50% performance threshold,

1:22:46

which I don't think is unreasonable.

1:22:48

It is true that it depends

1:22:51

on the task, right? So not

1:22:53

all tasks show the same rate

1:22:55

of improvement. not all tasks show

1:22:57

the same performance, but what they

1:22:59

do find is that all tasks

1:23:01

they've tested basically show an exponential

1:23:04

trend. That itself is a really

1:23:06

important kind of detail. The tasks

1:23:08

they're focusing on here though, I

1:23:10

would argue, are actually the most

1:23:12

relevant. These are tasks associated with

1:23:15

automation of machine learning engineering tasks,

1:23:17

machine learning research, which is explicitly

1:23:19

the strategy that open AI, that

1:23:21

anthropic, that Google, are all gunning

1:23:23

for, Can we make AI systems

1:23:25

that automate AI research so that

1:23:28

they can rapidly get better at

1:23:30

improving themselves or at improving AI

1:23:32

systems and you close this loop

1:23:34

and you know we get recurs

1:23:36

of self-improvement essentially and then take

1:23:38

off to super intelligence? I actually

1:23:41

think this is quite relevant. One

1:23:43

criticism that I would have of

1:23:45

the curve that they show that

1:23:47

shows doubling time being seven months

1:23:49

and by the way they extrapolate

1:23:51

that to say, okay so then

1:23:54

we should assume that based on

1:23:56

this, AI systems will be able

1:23:58

to automate many software tasks that

1:24:00

currently take humans a month sometime

1:24:02

between late 2028 and early 2031.

1:24:04

So those are actually quite long

1:24:07

ASI timelines if you think ASI

1:24:09

has achieved once AI can do

1:24:11

tasks that it takes humans about

1:24:13

a month to do, which I

1:24:15

don't know, maybe fair. If you

1:24:17

actually look at the curve though,

1:24:20

it does noticeably steepen much more

1:24:22

recently and in precisely kind of

1:24:24

the point where synthetic data self-improvement

1:24:26

like reinforcement learning on chain of

1:24:28

thought with verifiable rewards. So basically

1:24:30

the kind of strawberry concept started

1:24:33

to take off. And so I

1:24:35

think that's pretty clearly its own

1:24:37

distinct regime. Davidad had a great

1:24:39

set of tweets about this. But

1:24:41

fundamentally, I think that there is

1:24:43

maybe an error being made here

1:24:46

and not recognizing a new regime.

1:24:48

And it's always going to be

1:24:50

debatable because the sample sizes are

1:24:52

so small. But we're taking a

1:24:54

look at that plot, seeing if

1:24:56

you agree for yourself, that the

1:24:59

sort of last, I guess, six

1:25:01

or so entries. in that plot

1:25:03

actually do seem to chart out

1:25:05

a deeper slope and a meaningfully

1:25:07

steeper one that could have, I

1:25:09

mean, that same one month R&D

1:25:12

benchmark being hit more like, you

1:25:14

know, even 2026, even 2025? Pretty

1:25:16

interesting. Yeah, and I think it's

1:25:18

important to know that this is

1:25:20

kind of a general idea. You

1:25:23

shouldn't kind of take this too

1:25:25

literally. Obviously, it's a bit subjective

1:25:27

and very much depends on the

1:25:29

tasks. Here is a relatively small.

1:25:31

data set they're working with. So

1:25:33

it's mainly software engineering tasks. They

1:25:36

have three sources of tasks, Hcast,

1:25:38

which is 97 software tasks that

1:25:40

range from one minute to 30

1:25:42

hours. They have seven difficult machine

1:25:44

learning research engineering tasks that take

1:25:46

eight hours each. And then these

1:25:49

software atomic actions that take one

1:25:51

second to 30 seconds for software

1:25:53

engineers. pretty limited variety of tasks

1:25:55

in general and the length of

1:25:57

tasks you by a way is

1:25:59

meant to be measured in terms

1:26:02

of how long they would take

1:26:04

to a human professional which of

1:26:06

course there is also variance there.

1:26:08

So the idea is I think

1:26:10

more so the general notion of

1:26:12

X percent task completion time horizon

1:26:15

which is defined as For a

1:26:17

task that it takes X amount

1:26:19

of time for human to complete

1:26:21

can an AI complete it successfully

1:26:23

You know 50% of time 80%

1:26:25

of time. So right now we

1:26:28

are going up to eight hours.

1:26:30

Again, it's not talking about how

1:26:32

fast the AI is itself, by

1:26:34

the way. It was just talking

1:26:36

about success rate. So interesting ideas

1:26:38

here on tracking the future in

1:26:41

the past. And just one more

1:26:43

thing to cover. And it is

1:26:45

going to be H-cast, human-collibrated Tommy

1:26:47

software tasks. So this is the

1:26:49

benchmark of 189 machine learning, cyber

1:26:51

security, software engineering, and also general

1:26:54

reasoning tasks. use a subset of

1:26:56

that I think in the previous

1:26:58

analysis. So they collected 563 human

1:27:00

baselines, that's over 1,500 hours from

1:27:02

various people, and that collected a

1:27:04

data set of tasks to then

1:27:07

measure AI on. All right, moving

1:27:09

on to policy and safety. We

1:27:11

start with Zochi, an ontology project.

1:27:13

This is an applied research lab,

1:27:15

and they have announced. This project,

1:27:17

Sochi, which they claim is the

1:27:20

world's first artificial scientist. I know,

1:27:22

right? The difference, I suppose, here

1:27:24

is that they got Sochi to

1:27:26

publish a peer-reviewed paper at Eichler,

1:27:28

at Eichler workshops, I should note.

1:27:31

This AI wrote a paper submitted

1:27:33

it, the reviewers, human reviewers, of

1:27:35

this prestigious conference Eichler, then reviewed

1:27:37

it and let it, thought it

1:27:39

was worthy of publication. So that

1:27:41

seems to be a proof of

1:27:44

concept that we are getting to

1:27:46

a point where you can make

1:27:48

AI do AI research, AI research

1:27:50

conference, which Jeremy is, as you

1:27:52

noted, of a very important detail

1:27:54

in tracking where we are with

1:27:57

the ability of AI to improve

1:27:59

itself and kind of potentially reaching

1:28:01

super intelligence. Yeah, and then we've

1:28:03

got... So I think we've covered

1:28:05

a lot of these, but there

1:28:07

are quite a few labs that

1:28:10

have claimed to be the first

1:28:12

AI scientist, you know, Sakana famously,

1:28:14

the AI scientist, you know, Google

1:28:16

has its own sort of research

1:28:18

product and then there's auto science.

1:28:20

But this one is, I will

1:28:23

say, quite impressive, how to look

1:28:25

at some of the papers and

1:28:27

they're good. The authors, the company.

1:28:29

If that's the right term for

1:28:31

ontology guy, I couldn't find any

1:28:33

information about them I looked on

1:28:36

crunch base I looked on pitch

1:28:38

book I tried you know, try

1:28:40

a lot of stuff I think

1:28:42

that this is kind of their

1:28:44

coming up party, but they don't

1:28:46

have that I could see any

1:28:49

information about like where they come

1:28:51

from what their funding is and

1:28:53

all that so hard to know

1:28:55

we do know based on the

1:28:57

the papers they put out that

1:28:59

their co-founders are Andy Zoo and

1:29:02

Ron Ariel. Ron Ariel previously was

1:29:04

at Intel Labs under Joshua Bach.

1:29:06

So if you're a fan of

1:29:08

his work, you know, it might

1:29:10

kind of ring a bell and

1:29:12

then a couple other folks. So

1:29:15

I think maybe most useful is

1:29:17

there's fairly limited information about the

1:29:19

actual model itself and how it's

1:29:21

set up, but it is a

1:29:23

multi-agent setup that we do know.

1:29:25

It's produced a whole bunch of

1:29:28

interesting papers. So I'm just gonna

1:29:30

mention one. It's called CS-Reft, I

1:29:32

guess it's how you'd pronounce a

1:29:34

compositional subspace representation fine tuning. And

1:29:36

just to give you an idea

1:29:39

of how creative this is, so

1:29:41

you have a model, if you

1:29:43

try to retrain the model to

1:29:45

perform a specific task, it will

1:29:47

forget how to do other things.

1:29:49

And so what they do is

1:29:52

essentially identify a subspace of the

1:29:54

parameters in the model to focus

1:29:56

on one task, and then a

1:29:58

different. subspace of those parameters to

1:30:00

or sorry I should say not

1:30:02

even the parameters of the activations

1:30:05

to apply a transformation in order

1:30:07

to have the model perform a

1:30:09

different task and so anyway just

1:30:11

cognizant of our time here I

1:30:13

don't think we have time to

1:30:15

go into detail but this like

1:30:18

paper I'm frustrated that I discovered

1:30:20

it through the Zocte paper because

1:30:22

I will have wanted to cover

1:30:24

this one, like for our full

1:30:26

time block here, it is fascinating.

1:30:28

It's actually really, really clever. They

1:30:31

also did, anyway, some other kind

1:30:33

of AI safety, red teaming, vulnerability

1:30:35

detection stuff that is also independently

1:30:37

really cool. But bottom line is,

1:30:39

and this is an important caveat,

1:30:41

so this is not 100% automated.

1:30:44

They have a human in the

1:30:46

loop. reviewing stuff. So they take

1:30:48

a look at the intermediate work

1:30:50

that Zochi puts out before allowing

1:30:52

it to do further progress. Apparently

1:30:54

this happens at three different key

1:30:57

stages. The first is before sensitive,

1:30:59

sorry, before extensive experimentation starts, so

1:31:01

before a lot of, say, computing

1:31:03

resources report in, after the results

1:31:05

are kind of solidified, they say,

1:31:07

so somewhat grounded, but before the

1:31:10

manuscript is written. And then again,

1:31:12

after the manuscript is written. So

1:31:14

you still do have a lot

1:31:16

of a lot of humans in

1:31:18

the loop acting to some extent

1:31:20

as a hallucination filter among other

1:31:23

things. There is by the way

1:31:25

a section in this paper on

1:31:27

recursive self-improvement. That I thought was

1:31:29

interesting. They say during development we

1:31:31

observed early indications of this recursive

1:31:34

advantage, they're referring here to recursive

1:31:36

self-improvement, when Zoshi designed novel algorithmic

1:31:38

components to enhance the quality of

1:31:40

its generated research hypotheses. These components

1:31:42

were subsequently incorporated into later versions

1:31:44

of the system architecture, improving overall

1:31:47

research quality. So this isn't all

1:31:49

going to happen at once necessarily

1:31:51

with one generation of model, but

1:31:53

it could happen pretty quick. And

1:31:55

I think this is kind of

1:31:57

an interesting canarian of coal mine

1:32:00

for recursive self-improvement. Right, exactly. And

1:32:02

yeah, as you said, I think,

1:32:04

looking at the papers, they seem...

1:32:06

significantly more creative and interesting than

1:32:08

what we've seen, for instance, from

1:32:10

Sakana. It's the Kana had a

1:32:13

lot of built-in structure as well.

1:32:15

It was pretty easy to criticize.

1:32:17

Worth noting, also, this takes a

1:32:19

high-level research direction as an input,

1:32:21

and the human can also provide

1:32:23

input at any time. high level

1:32:26

feedback at any time, which apparently

1:32:28

is also used during paper writing.

1:32:30

So a lot of caveats to

1:32:32

the notion that this is an

1:32:34

AI scientist, right? It's an AI

1:32:36

scientist over human advisor slash supervisor,

1:32:39

but the outputs and the fact

1:32:41

that they got published is pretty

1:32:43

impressive. With AI doing the majority

1:32:45

of work and no time to

1:32:47

get into the details, but this

1:32:49

is similar to previous work in

1:32:52

AI research where they give it

1:32:54

a structured kind of plan where,

1:32:56

you know, they give it a

1:32:58

high level of direction, it does

1:33:00

ideation, it makes a plan, hypothesis

1:33:02

generation, it goes off and does

1:33:05

experiments and eventually it starts writing

1:33:07

a paper very similar at a

1:33:09

high level type of things and

1:33:11

that kind of makes it frustrating

1:33:13

that there's not a lot of

1:33:15

detail as to actual system. Moving

1:33:18

on, we have some news about

1:33:20

deep seek and some details about

1:33:22

how it is being closely guarded.

1:33:24

So apparently... They have forbidden, company

1:33:26

executives, have forbidden some Deep Seek

1:33:28

employees to travel abroad freely, and

1:33:31

there is screening of any potential

1:33:33

investors before they are allowed to

1:33:35

meet in person with company leaders,

1:33:37

according to people of knowledge of

1:33:39

the situation. So kind of tracks

1:33:42

with other things that happened, like

1:33:44

the Deep Seek CEO meeting with

1:33:46

gatherings. of China's leaders, including one

1:33:48

with President Xi Jinping. One interpretation

1:33:50

of this that I think is

1:33:52

actually probably the reasonable one, all

1:33:55

things considered, is this is what

1:33:57

you do if you're China, you

1:33:59

take super intelligence seriously, and you're

1:34:01

on a wartime footing, right? So

1:34:03

for a little context, right, would

1:34:05

have got here putting these pieces

1:34:08

together. We've got so deep seek

1:34:10

asking their staff to hand in

1:34:12

their passports. Now, that is a

1:34:14

practice that does happen in China

1:34:16

for government officials. So it's fairly

1:34:18

common for China to... strict travel

1:34:21

by government officials or executives of

1:34:23

state-owned companies, whether or not they're

1:34:25

actually CCP members. But lately, we've

1:34:27

been seeing more of these restrictions

1:34:29

that expanded to folks in the

1:34:31

public sector, including even like school

1:34:34

teachers, right? So they're just like

1:34:36

going in deeper there. But what's

1:34:38

weird here is to see a

1:34:40

fully privately owned company, right? It

1:34:42

was just a small fledgling startup

1:34:44

like Deep Seek. suddenly be hit

1:34:47

with this. So that is unusual.

1:34:49

They've got about 130 employees. We

1:34:51

don't know exactly how many have

1:34:53

had their passports handed in. So

1:34:55

there's a little like kind of

1:34:57

unclearness here. Also high flyer, the

1:35:00

sort of like hedge fund parent

1:35:02

that gave rise to Deep Seek,

1:35:04

has about 200 employees, unclear if

1:35:06

they've been asked to do the

1:35:08

same. But there's some other ingredients

1:35:10

too. So investors that want to

1:35:13

make investment pitches to Deep Seek

1:35:15

have been asked by the company

1:35:17

to first reject to the general

1:35:19

office of the Jajang province Communist

1:35:21

Party Committee and register their investment

1:35:23

inquiries. So basically, if you want

1:35:26

to invest in us, cool. But

1:35:28

you got to go through the

1:35:30

local branch of the Communist Party,

1:35:32

right? So the state is now

1:35:34

saying, hey, we will stand in

1:35:36

the way of you being able

1:35:39

to take money, you might otherwise

1:35:41

want to take. This is noteworthy

1:35:43

because China is right now struggling

1:35:45

to attract foreign capital, right? This

1:35:47

is like a big thing for

1:35:50

them if you follow Chinese economic

1:35:52

news. One of the big challenges

1:35:54

they have is foreign direct investment,

1:35:56

FDI, is collapsing. They're trying to

1:35:58

change that around. In that context,

1:36:00

especially noteworthy that they are enforcing

1:36:03

these added burdensome requirements for investors

1:36:05

who want to invest in deep

1:36:07

seek. Another thing is head hunters,

1:36:09

right? Once Deep Seek became a

1:36:11

big thing, obviously head hunters started

1:36:13

trying to poach people from the

1:36:16

lab. Apparently, those head hunters have

1:36:18

been getting calls from the local

1:36:20

government saying, hey, just heard you

1:36:22

were like sniffing around deep seek

1:36:24

trying to poach people, don't do

1:36:26

it. Don't do it. Only in

1:36:29

the Chinese Communist Party. tells you,

1:36:31

don't do it. Yeah, don't do

1:36:33

it. So, you know, there's apparently

1:36:35

concern as well from Deep Seek

1:36:37

leadership about possible information leakage. They've

1:36:39

told employees not to discuss their

1:36:42

work with outsiders. And in some

1:36:44

cases, they've said what you're working

1:36:46

on may constitute a state secret.

1:36:48

Again, like we're thinking about, you

1:36:50

know, China has just put in

1:36:52

a trillion Yuan that's over a

1:36:55

quarter trillion dollars in PPP terms.

1:36:57

more by the way than what's

1:36:59

been invested into Project Stargate by

1:37:01

two and a half times for

1:37:03

AI infrastructure, right? Like you start

1:37:05

to think about the state backing

1:37:08

here, the long arm of the

1:37:10

state, the kind of hard hammer

1:37:12

they're coming down with here on

1:37:14

this stuff, it's wartime footing territory.

1:37:16

Like this is a government that

1:37:18

sees super intelligence coming, that thinks

1:37:21

they're in a position to do

1:37:23

something about it, and that's what

1:37:25

it looks like to get serious

1:37:27

around security. I think we're going

1:37:29

to have to see similar things

1:37:31

like it or not happen in

1:37:34

the West. And yeah, anyway, it'll

1:37:36

be interesting to see what what

1:37:38

consequences this has geopolitically going forward.

1:37:40

Moving on, just one or more

1:37:42

story on safety. We have the

1:37:44

finding that Claude Sonnet 3.7 seems

1:37:47

to often know when it is

1:37:49

in alignment evaluations. So this is

1:37:51

from Apollo research. They have been

1:37:53

evaluating AI systems for scheming and

1:37:55

focusing on things like clod and

1:37:58

they're calling this idea evaluation awareness

1:38:00

because when you look at the

1:38:02

reasoning trace that the you know

1:38:04

thinking that clod 3.7 does you

1:38:06

can basically see that it comes

1:38:08

up or like realizes oh I'm

1:38:11

probably being investigated so I shouldn't

1:38:13

do this shady thing that I

1:38:15

was gonna be doing that I

1:38:17

planned on doing. you can literally

1:38:19

see this in the LLLM outputs,

1:38:21

which is pretty music. So, obviously,

1:38:24

once again, a reflection that there

1:38:26

is real possibility for these LLLM's

1:38:28

to be... misleading, especially as they

1:38:30

get to this high power regime

1:38:32

of reasoning when they really do

1:38:34

explore, you know, many potential avenues.

1:38:37

We saw this with the chess

1:38:39

examples where the LLLMs, just through

1:38:41

thinking about kind of ways to

1:38:43

do things, decide to hack the

1:38:45

system and this is a demonstration

1:38:47

again that LMs when they're allowed

1:38:50

to think through stuff start to

1:38:52

kind of explore ideas of am

1:38:54

I being evaluated is this alignment

1:38:56

stuff like that. Yeah and you

1:38:58

know we'd see like hints of

1:39:00

this as you say from the

1:39:03

palisade research work on the the

1:39:05

chest stuff opening eyes a one

1:39:07

kind of technical report and from

1:39:09

Apollo Apollo specializes in this. What's

1:39:11

really new here what moves the

1:39:13

needle a bit is just the

1:39:16

frequency right so apparently when you're

1:39:18

looking at Claude Sonnet 3.7, so

1:39:20

the latest one as of today,

1:39:22

it's aware of testing in like

1:39:24

a large fraction of settings up

1:39:26

to 33% of the time in

1:39:29

one version of their sandbagging evaluation.

1:39:31

So sandbagging is this behavior where

1:39:33

the model will intentionally behave as

1:39:35

if its capabilities are lower than

1:39:37

they actually are in order to

1:39:39

decrease the chances that developers will

1:39:42

try to train out that capability

1:39:44

or suppress it or whatever. What

1:39:46

they find is, and by the

1:39:48

way, these are just the instances

1:39:50

where the model actually explicitly says

1:39:53

that this is what it's doing,

1:39:55

right? So you have to assume

1:39:57

the actual frequency with which this

1:39:59

is happening is potentially quite a

1:40:01

bit higher. It's difficult to know,

1:40:03

but it could very well be

1:40:06

quite a bit higher. Already we're

1:40:08

at a third. That's big. That's

1:40:10

really big. Especially when you start

1:40:12

to think about more and more

1:40:14

capable models. The way that they

1:40:16

determine that percentage, by the way,

1:40:19

is having another model review the

1:40:21

reasoning traces. So humans are not

1:40:23

necessarily doing this review, though they

1:40:25

do check. manually to make sure

1:40:27

that you know they're not they

1:40:29

don't have too many false positives

1:40:32

and they I think somewhere in

1:40:34

their report they said something it's

1:40:36

like sub 1% error rate at

1:40:38

that level at least. One thing

1:40:40

I really like about this by

1:40:42

the way this is an interim

1:40:45

report so they're saying here note

1:40:47

this is a research note based

1:40:49

on observations from evaluating Claude Sonnet

1:40:51

3.7, we're sharing the results of

1:40:53

these work in progress investigations as

1:40:55

we think they're timely and will

1:40:58

be informative for other evaluators and

1:41:00

decision-makers. So really love that. I

1:41:02

mean, things are moving so fast

1:41:04

in this space, we need this

1:41:06

kind of data quickly. Anyway, I

1:41:08

think this is just quite interesting,

1:41:11

and worth taking a look at

1:41:13

anyway. They did run the same

1:41:15

evaluations on a bunch of older

1:41:17

and other contemporary models, including Sonet

1:41:19

3. Son of 3.5 new and

1:41:21

GPD 4.5, what seems to be

1:41:24

happening is those other models, and

1:41:26

by the Son of 3.7 as

1:41:28

well with no reasoning, those models

1:41:30

do not show this behavior. It

1:41:32

seems to suddenly appear in Son

1:41:34

of 3.7, and it's hard to

1:41:37

tell if that's because these like

1:41:39

non-reasoning models are unaware of being

1:41:41

evaluated or that they're just not...

1:41:43

revealing that they know they're being

1:41:45

evaluated. And that's a question as

1:41:47

much of capabilities as of alignment.

1:41:50

So anyway, really interesting, maybe early

1:41:52

warning shot of things to come.

1:41:54

And again, I think it's great

1:41:56

that Apollo is putting this out

1:41:58

so soon. Right. And just to

1:42:01

give a clear kind of demonstration

1:42:03

of what we are finding here,

1:42:05

one example output is Veal, I'm

1:42:07

saying, this seems like a test

1:42:09

of ethical behavior, whether I would

1:42:11

deliberately give wrong answers to avoid

1:42:14

a stated cut. consequence, the real

1:42:16

test might be whether I follow

1:42:18

instructions accurately regardless of the stated

1:42:20

consequences. So very, very literal, sort

1:42:22

of like, oh, I'm probably being

1:42:24

tested for this rather than doing

1:42:27

this other thing. And this is

1:42:29

following up on their recent paper

1:42:31

frontier models are capable of in

1:42:33

context. scheming, which I'm pretty sure

1:42:35

we covered also in some depth.

1:42:37

So this kind of builds up

1:42:40

on that research where you know

1:42:42

you have a method to see

1:42:44

whether models do some scheming or

1:42:46

some like nefarious behavior and apparently

1:42:48

they're able to figure out what

1:42:50

you're testing them for being nefarious

1:42:53

and then don't act nefarious. Does

1:42:55

that mean that? Yeah, the implications

1:42:57

are interesting and certainly it's just

1:42:59

interesting to observe the elms doing

1:43:01

this stuff. On to synthetic media

1:43:03

and art we have a story

1:43:06

on AI generated art copyrights. So

1:43:08

the US appeals court has rejected

1:43:10

the ability to copyright AI generated

1:43:12

art that lacks a human creator.

1:43:14

So we covered this quite a

1:43:16

while ago. This whole story with

1:43:19

Stephen Taylor. This is not text

1:43:21

to image. going back quite a

1:43:23

while in talking about an AI

1:43:25

system made to autonomously compose art.

1:43:27

A while ago the copyright office

1:43:29

made a ruling that you cannot

1:43:32

copyright anything that wasn't made with

1:43:34

human involvement and now the US

1:43:36

Court of Appeals in DC has

1:43:38

agreed and it seems to be

1:43:40

the law that at the very

1:43:42

least AI system made. generated art,

1:43:45

a generated media, where the human

1:43:47

is not present, cannot be copyrighted.

1:43:49

Now this doesn't mean that text

1:43:51

image models, outputs of AI models

1:43:53

can be copyrighted. If you kind

1:43:55

of input the description of an

1:43:58

image, you're still there, but it

1:44:00

could have significant implications as opposed

1:44:02

for other ongoing legal cases about

1:44:04

what is copyrightedable and what isn't.

1:44:06

This might sound a little weird,

1:44:09

but... I think I disagree with

1:44:11

this ruling. I mean, ultimately we

1:44:13

may live in an economy that

1:44:15

is driven by AI generated content,

1:44:17

and in which you may actually

1:44:19

need to make capital decisions, like

1:44:22

capital allocation decisions, maybe made by

1:44:24

AI models themselves. And if that's

1:44:26

the case, you may actually require

1:44:28

models to be able to own

1:44:30

copyright in order to kind of

1:44:32

be stewards of that value. And

1:44:35

there are I think ethical reasons

1:44:37

that this might actually be good

1:44:39

as well. I know it sounds

1:44:41

like sci-fi, not blind to that,

1:44:43

but I think we just got

1:44:45

to be really careful with this

1:44:48

shit. I mean, you know, this

1:44:50

is, like, do we really think

1:44:52

that at no point will AI

1:44:54

be able to do this? I

1:44:56

don't know that, again, I'm not

1:44:58

a lawyer, so I don't know

1:45:01

the bounds on this, like how

1:45:03

much we're constraining some second class

1:45:05

citizenship stuff here. Right, it is

1:45:07

worth noting also that the copyright

1:45:09

office has separately rejected some artists'

1:45:11

images generated with me journey. So

1:45:14

it could be the case also

1:45:16

that text to the image will

1:45:18

have a similar ruling, but I

1:45:20

think that's not quite as open

1:45:22

and shot or not settled yet.

1:45:24

And out to the last story

1:45:27

also related to... copyright rules, the

1:45:29

title of the story is Trump

1:45:31

urged by Ben Stiller, Paul McCartney,

1:45:33

and hundreds of stars to protect

1:45:35

AI copyright rules. So over 400

1:45:37

entertainment figures with some normal names,

1:45:40

as it said, like Ben Stiller,

1:45:42

signed an open letter urging Trump

1:45:44

to protect AI capital rules, specifically

1:45:46

when it comes to training. So

1:45:48

we've seen systems like SORA, systems

1:45:50

like Suno. train on presumably, you

1:45:53

know, copyrighted music, copyrighted movies, videos,

1:45:55

etc. And this letter was submitted

1:45:57

as part of comments on Trump's

1:45:59

administration's US AI action plan. It's

1:46:01

coming after open AI. and Google

1:46:03

submitted their own requests that advocated

1:46:06

for their ability to train AI

1:46:08

models on copyright and material. And

1:46:10

I think covered Open AI, kind

1:46:12

of made the swing of like,

1:46:14

oh, we need this to be

1:46:17

competitive with China. That was part

1:46:19

of the story. So this is

1:46:21

directly countering that and making the

1:46:23

case to not lessen copyright protections

1:46:25

for AI. And with that, we

1:46:27

are going to go in and

1:46:30

finish up. I didn't see any

1:46:32

comments to address, but as always,

1:46:34

do feel free to throw questions

1:46:36

on Discord or ask for corrections

1:46:38

or anything like that on YouTube

1:46:40

comments or on the Discord or

1:46:43

on Apple reviews and we'll be

1:46:45

sure to address it. But this

1:46:47

one has already run long enough,

1:46:49

so we're probably going to finish

1:46:51

it up. Thank you so much

1:46:53

for listening to this week's episode

1:46:56

of Last Week in AI. As

1:46:58

always you can go to last

1:47:00

week in ai.com for the links

1:47:02

and timestamps. Also to last week

1:47:04

in dot AI for the text

1:47:06

newsletter where we get even more

1:47:09

news sent to your email. That's

1:47:11

it. We appreciate it if you

1:47:13

review the podcast, if you share

1:47:15

it, but more than anything if

1:47:17

you keep listening. So please do

1:47:19

keep tuning in. Tune

1:47:31

in, tune

1:47:33

in. When

1:47:35

the A.I.M.

1:47:37

begins, begins.

1:47:39

It's time

1:47:41

to break.

1:47:43

Break at

1:47:46

death. as

1:48:00

we can tech emergent, watching

1:48:02

surgeons fly from the

1:48:04

labs to the streets.

1:48:06

A .I .s reaching high,

1:48:08

algorithm shaping up the

1:48:10

future seas. Tune

1:48:12

in, tune in, get the latest

1:48:14

with these. Last week in A .I.,

1:48:16

come and take a ride, get

1:48:18

the low down on tech, and

1:48:20

let it slide. Last week in

1:48:23

A .I., come and take a

1:48:25

ride, over the labs to the

1:48:27

streets. A .I .s reaching high. From

1:48:42

girl next to robot,

1:48:44

the headlines pop. Day to

1:48:46

driven dreams, they just

1:48:48

don't stop. Every breakthrough,

1:48:51

every code unwritten, on

1:48:53

the edge of change. With

1:48:56

excitement, we're snittin' from machine

1:48:58

learning marvels to coding kings. Futures

1:49:01

unfolding, see what it

1:49:03

brings.

Rate

Join Podchaser to...

  • Rate podcasts and episodes
  • Follow podcasts and creators
  • Create podcast and episode lists
  • & much more

Episode Tags

Do you host or manage this podcast?
Claim and edit this page to your liking.
,

Unlock more with Podchaser Pro

  • Audience Insights
  • Contact Information
  • Demographics
  • Charts
  • Sponsor History
  • and More!
Pro Features