Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements may have changed.
Use Ctrl + F to search
0:00
Hello and welcome to
0:02
the last week in
0:04
a podcast where you
0:06
can hear a chat
0:08
about what's going on
0:10
with AI. As usual
0:12
in this episode we
0:14
will be summarizing and
0:16
discussing some of last
0:18
week's most interesting AI
0:20
news. You can go
0:22
to the episode description
0:24
for all the links
0:26
and timestamps and also
0:28
to last week in
0:30
ai.com on your laptop to be
0:33
able to read those articles yourself
0:35
as well. As always, I'm one of
0:37
your hosts Andre Karenkov. I studied
0:39
AI in grad school and I
0:41
now work at the Genitive AI
0:44
startup Astrocade. And I'm your other host,
0:46
Jeremy Harris. I'm with Pladstone A.I. National
0:48
Security Company, which you know about if
0:50
you listen to the podcast, you also
0:53
know about Astricade now, a bunch of
0:55
you listen to, you know about all of this,
0:57
you know, what you don't know, though, is that this
0:59
morning, at the early hour, I think it
1:01
was like three or something in the morning,
1:03
I discovered that I have bats in
1:05
my house, which is fun, which is really
1:07
fun, especially when you have like a six-month
1:09
old, you have bats. and then you start
1:12
googling things. So anyway, you had
1:14
pest control come in. That's why, wow,
1:16
my hair looks like Cosmo Kramer
1:18
right now. I've just been running my
1:21
fingers through it for quite a bit.
1:23
So anyway, we got everything on for
1:25
showtime though, because the show will go
1:27
on. Yeah, but if you forget any
1:29
details wrong, you know, it's, it's the
1:31
shock residual shock, the bats, you know, maybe
1:34
that. It's just, I'll be on the lookout.
1:36
Well, let's do a quick preview
1:38
of what we'll be talking about
1:40
in this episode. It's going to
1:42
be a bit of a relaxed
1:45
one. There's nothing to sort of
1:47
world shattering, but a variety of
1:49
pretty interesting stories, tools and apps.
1:51
We have some new impressive
1:53
models out of China, some new
1:55
stuff from Open AI as well,
1:58
Google and FARPIC, everyone launched. stuff,
2:00
applications and business, as we often
2:02
do, going to be talking a
2:05
lot about hardware and GPUs, a
2:07
little bit about fundraising as well,
2:09
projects and open source, we'll be
2:11
talking about the model context protocol,
2:13
which has been all the rage
2:16
in the community recently, and a
2:18
couple new models. As usual, researching
2:20
investments, we got to talk about
2:22
reasoning techniques, inference time scaling techniques.
2:24
but also some new kind of
2:27
developments in the space of how
2:29
you implement your models. Policy and
2:31
safety, we have some more analysis
2:33
of what's going out of China,
2:35
US national security, things like that.
2:38
And finally, we will actually talk
2:40
a little bit about the world
2:42
of art and entertainment with some
2:44
news about copyright. So let's just
2:47
get straight into it. in tools
2:49
and apps. The first story is
2:51
about Baidu launching two new versions
2:53
of the Ernie model, Ernie 4.5
2:55
and Ernie X1. So Ernie initially
2:58
released two years ago and now
3:00
we have Ernie 4.5, presumably, I
3:02
don't know, it sounds like kind
3:04
of two coincided for, to PD
3:06
4. And then Ernie X1 is
3:09
the reasoning variant of Ernie that
3:11
Baidu says is on par with
3:13
Deep Seek R1, but at half
3:15
the price, and both of these
3:17
models are multimodal. They can process
3:20
videos, images, and audio as well.
3:22
They also say Ernie 4.5 is
3:24
kind of emotionally intelligent. They can
3:26
understand memes and satire, which is
3:28
interesting. So... I think we don't
3:31
have like a great sense of
3:33
the tool landscape in China is
3:35
my impression. I really wish I
3:37
would know like if you are
3:39
a user of a chat bot.
3:42
we got to charge BT or
3:44
Claude to give up queries, I
3:46
think it seems likely that Ernie
3:48
is sort of filling that role
3:50
and the fact that there's new
3:53
models and the fact that they're
3:55
really competitive price-wise is a big
3:57
deal. The like number one downloaded
3:59
app in China just switched to
4:02
a new AI chatbot that is
4:04
not Deep Seek. So things are
4:06
definitely moving. The big advantage here
4:08
with this launch seems to be
4:10
cost. At least that's what they're
4:13
leaning into with a lot of
4:15
the discussion around this. So the
4:17
goal that Baidu has, which Baidu
4:19
of course is China's, roughly China's
4:21
Google, right? They own search there.
4:24
Their goal is to progressively integrate
4:26
Ernie 4.5 and their X1 reasoning
4:28
model into all their product ecosystem,
4:30
including Baidu search, which is sort
4:32
of interesting. So we'll see a
4:35
rollout of the generative AI. capabilities
4:37
in that context. Yeah, so ultimately
4:39
it does come down to price
4:41
a lot of it. So for
4:43
context, there's really handy table in
4:46
one of the articles that looked
4:48
at this comparing GPT 4.5 per
4:50
token cost to Deep Seek v.
4:52
3 to Bernie, to Ernie 4.5,
4:54
it's quite interesting, right? So input
4:57
costs for input tokens, 75 bucks
4:59
for a million tokens. This is
5:01
for GPT 4.4. Deep Seek v3
5:03
that drops to basically 30 cents
5:06
earning 4.5 is about 60 cents
5:08
or so per 1 million tokens
5:10
So you know you're talking orders
5:12
of magnitude less Also the case
5:14
that these models are less performant
5:17
so that's sort of the tradeoff
5:19
there, but where things yeah, I
5:21
think just to give a bit
5:23
of a perspective Deep seek v
5:25
free is more comparable to something
5:28
like GPT for O in open
5:30
eyes, say models or free meaty
5:32
for instance where aggression isn't that
5:34
crazy it's maybe I forget one
5:36
dollar ish per million tokens so
5:39
they're comparable. GPD 4.5 is just
5:41
crazy crazy pricing compared to everything
5:43
else and that's a It's the
5:45
way to think about 4.5, I
5:47
think we touched on this a
5:50
couple episodes ago, but it's a
5:52
base model, but it's not a
5:54
base model for, let's say, mass
5:56
production, right? These are high, high
5:58
quality tokens, probably best used to
6:01
create things like synthetic data sets
6:03
or to answer very specific kinds
6:05
of questions, but you're not looking
6:07
at this as something that you
6:10
want to productize, just because you're
6:12
right. I mean, it's two orders
6:14
of magnitude more expensive than other
6:16
base models. Where you actually see
6:18
the lift here. especially for Ernie
6:21
X1, right, with this is the
6:23
reasoning model, is on the reasoning
6:25
side, right? So opening eyes is
6:27
roughly 50 times more expensive than
6:29
Ernie X1. Ernie X1 is about
6:32
half the cost of R1 for
6:34
input tokens, and actually that's also
6:36
true for output tokens. So it's
6:38
quite significant, especially again relative to
6:40
O1, and shows you. One of
6:43
two things, either Chinese engineering is
6:45
actually really really really that good,
6:47
or there's some state subsidy thing
6:49
going on in the background. I
6:51
think the latter is somewhat less
6:54
plausible at this point, though I
6:56
wouldn't rule it out. certainly there's
6:58
some amazing engineering making these these
7:00
margins possible and it's that's a
7:02
pretty remarkable thing here right I
7:05
mean the cost just collapsing for
7:07
reasoning this implies that there's some
7:09
reasoning specific engineering going on in
7:11
the background and you know you
7:14
should expect that to apply to
7:16
training as well as inference going
7:18
forward yeah and it's kind of
7:20
funny in a way there is
7:22
a parallel here between Baido and
7:25
Google where Google likewise has quite
7:27
competitive pricing, especially for Gemini to
7:29
flash thinking. So I could also
7:31
see it being, you know, just
7:33
a company strategy kind of thing.
7:36
Baidu is gigantic, they're printing money
7:38
with Search, so they could also
7:40
kind of eat the additional costs
7:42
to undermine something like Deep Seek,
7:44
which is a startup, right, too.
7:47
Lock in the market. But either
7:49
way, exciting news and I guess
7:51
if you're in China, I don't
7:53
believe you can use charge property.
7:55
So if nothing else, it's good
7:58
that they are comparable tools for
8:00
people to use. but not miss
8:02
out on the fun of these
8:04
advanced elements. I will say I
8:06
don't know that Baidu would be
8:09
subsidizing at the level of at
8:11
least their base model because they
8:13
are actually more expensive than Deep
8:15
Seek V3 on Ernie 4.5. Where
8:18
you see that flip is with
8:20
the reasoning models, which itself is,
8:22
yeah, that's kind of interesting, right?
8:24
I mean, to me at least,
8:26
that seems to imply something about
8:29
reasoning, like engineering for the like
8:31
computer architecture behind reasoning, or more
8:33
token efficiency and therefore compute efficiency
8:35
at the at the reason I
8:37
shouldn't say therefore maybe alternatively compute
8:40
efficiency at the reasoning stage but
8:42
you're right there's all kinds of
8:44
things start to muddy the water
8:46
is when you start thinking about
8:48
the economics of these things as
8:51
they represent a larger and larger
8:53
fraction of the corporate bottom line
8:55
even for big companies like I
8:57
do like Google these companies are
8:59
gonna be forced to show us
9:02
their hand in a sense right
9:04
they're gonna have to sell these
9:06
tokens for profit and we will
9:08
eventually learn what their actual margins
9:10
are what their actual margins are
9:13
It's debatable whether we're learning that
9:15
just yet. Yeah, I don't think
9:17
we are. It's very much unknown
9:19
and I haven't seen any kind
9:22
of strong analysis to explain it.
9:24
There's you know, yeah, it's just
9:26
a mystery what kind of tricks
9:28
people are pulling but I would
9:30
also kind of bet that the
9:33
margins aren't great. The one thing
9:35
we do know Deep Seek claimed
9:37
at least that they were making
9:39
a profit and had a positive
9:41
margin on their models and I
9:44
could see that not be the
9:46
case for, you know, for instance,
9:48
Open AI where their revenue is
9:50
in the billions, but the real
9:52
question is, are they actually making
9:55
a profit? Last thought on this
9:57
too, on the economic side, like
9:59
when we think about what it
10:01
means for Deep Sea to claim
10:03
that they're generating positive returns, I
10:06
think there's a... an important question
10:08
here about whether that's operating expenses
10:10
or CAPX factored in, right? So
10:12
we saw in their paper that
10:14
they famously talked about how they
10:17
trained V3 on six million dollars
10:19
of compute infrastructure. Now, or sorry,
10:21
on a six million dollar compute
10:23
budget, that was, it seems in
10:26
retrospect, the actual operating expenses of
10:28
running that compute, not the capital
10:30
expenses associated with. the tens of
10:32
millions of dollars as it would
10:34
have been of compute hardware. So
10:37
it's always hard to know like
10:39
what do you amortize? How do
10:41
you factor in what's apples to
10:43
apples? Yeah, it's hard to say
10:45
like deep seek is profitable, but
10:48
on a per token basis, just
10:50
for inference, I believe the claim
10:52
is they're making money which yeah,
10:54
on an all back spaces. Yeah,
10:56
interesting. Yeah. Moving right along, next
10:59
we have Open AI and they
11:01
are releasing new two new speech-to-tech
11:03
models, G.P.40 transcribed and G.P.40 mini
11:05
transcribe, which are basically replacing their
11:07
whisper models. Open has already had
11:10
this as a service for quite
11:12
a while. The exciting new thing
11:14
here is the text-to-speech model, G.P.40
11:16
mini dash-T-T-S, which is more long
11:18
lines of 11 labs where you
11:21
can produce very natural human sounding
11:23
speech and along with announcement of
11:25
the models open AI has also
11:27
launched and you cite open AI
11:29
dot f m which is a
11:32
demo site where you can go
11:34
and mess around and kind of
11:36
hear of the outputs and this
11:38
is kind of a fun trend
11:41
I gotta say where these companies
11:43
increasingly are launching these little fun
11:45
toys to get a sense for
11:47
what these models are capable of.
11:49
One last thing, again, probably should
11:52
comment on pricing. The pricing is
11:54
very competitive, the transcription. for GP40,
11:56
it's 0.6 cents per minute, 0.6
11:58
cents, so like 0.006 dollars, I
12:00
guess. And GD40, mini-TTS, is 1.5
12:03
cents per minute, which is much
12:05
lower than a competitor, like 11
12:07
Labs, for instance. So, yeah, I
12:09
think it's interesting to see opening
12:11
expanding their model suite to these
12:14
new domains where they're there sort
12:16
of less. focused. We've seen them
12:18
kind of move away from text
12:20
to image for instance, Delhi hasn't
12:22
had an update in forever. And
12:25
so I guess this makes a
12:27
lot of sense that they have
12:29
very competitive things to offer given
12:31
their investment in the advanced voice
12:33
mode in chatGPT. It's sort of
12:36
reminiscent of the problem that meta
12:38
faces, right, where they're, you know,
12:40
they reach like, whatever, 3 billion
12:42
people around the world, at a
12:45
certain point, when your market penetration
12:47
is so deep, one of the
12:49
only things you can do to
12:51
keep growing is to grow the
12:53
market. And so meta invests, for
12:56
example, in getting more people on
12:58
the internet, in other countries, like
13:00
in countries that don't have internet
13:02
access typically or have less of
13:04
it. And so they're literally just
13:07
trying to like grow the... pool
13:09
of people like they can they
13:11
can tap for this in the
13:13
same way I think there's a
13:15
lens on this that's similar right
13:18
so you're only able to interact
13:20
with chat gBT through certain modalities
13:22
or with open AI products for
13:24
certain modalities and by achieving greater
13:26
ubiquity by reaching into your life
13:29
more and making more of the
13:31
conversational tooling available to you that
13:33
that really does effectively increase their
13:35
their market right like you don't
13:37
have to be in front of
13:40
a computer necessarily or in the
13:42
same way or engaged in the
13:44
same way to use the product.
13:46
And obviously they've had other other
13:49
voice products before, but it's sort
13:51
of part of if I'm open
13:53
AI, I'm really thinking about multimodality,
13:55
but from the standpoint of increasing
13:57
the number of context, life context
14:00
in which I can reach you
14:02
and text to image still requires
14:04
you. to be in front of
14:06
a screen, same as writing text
14:08
on chat GPT, whereas audio is
14:11
just this like, you know, greater
14:13
reach for modality wise. So I
14:15
think strategically it's an interesting play
14:17
for them. ethically all kinds of
14:19
issues. I mean, you know, you
14:22
think about the modality of audio
14:24
as being one that is much
14:26
more intimate to humans and an
14:28
easier way to plug into your
14:30
inner world. And that's, I think,
14:33
something, you know, we look at,
14:35
like, what record did to people
14:37
just through text, right? The suicidal
14:39
ideation, the actual suicides, if nothing
14:41
else for open AI. There is
14:44
one figure by the way in
14:46
the article at least we're linking
14:48
to here and it's just a
14:50
piece of research looking at the
14:53
the word error rate comparisons across
14:55
leading models for different languages as
14:57
part of this kind of tooling
14:59
I just I find it really
15:01
interesting like Arabic and Hindi there's
15:04
a lot of struggle there those
15:06
are some of the worst performing
15:08
languages obviously English one of the
15:10
better performing ones I'd love to
15:12
see an overlay of this relative
15:15
to the amount of data that
15:17
was used to train the model,
15:19
so you can see in relative
15:21
terms, like, which languages are, in
15:23
a sense, like, harder for AI
15:26
to pronounce, to kind of speak.
15:28
I think there's something, anyway, linguistically
15:30
just fascinating about that, if nothing
15:32
else. So, anyway, overall, interesting launch,
15:34
and I think we're gonna see
15:37
more and more of this, right?
15:39
It's gonna be more expected to
15:41
have very high quality audio models
15:43
and linking them specifically to agents,
15:45
sort of Star Trek computer style.
15:48
Yeah, I guess one thing worth
15:50
noting on the kind of ethics
15:52
side is I don't believe they're
15:54
offering voice cloning technology, which is
15:57
where you can really get into
15:59
trouble very easily. So I think
16:01
opening up is being a little
16:03
careful these days in general to
16:05
not court controversy. That's part of
16:08
why it took them forever to
16:10
release SORA potentially. And in this.
16:12
API, this demo, they are releasing
16:14
something like a dozen voices you
16:16
can use with names like alloy,
16:19
ash, echo, fable, onyx, nova, kind
16:21
of, I don't know, human, I
16:23
guess we are not even trying
16:25
to make human sounding, and you
16:27
can also assign them a vibe
16:30
in this demo like cowboy, auctioneer,
16:32
all-timey, serene, with a lot of
16:34
this kind of steering that you
16:36
can do as well. So. Yeah,
16:38
I think it's pretty exciting. And
16:41
as ever will release some new
16:43
APIs, which really enables downstream of
16:45
Open AI and these other companies
16:47
for others to build exciting new
16:49
applications of AI. And onto a
16:52
few more quick stories. Next up
16:54
also Open AI, they have released
16:56
01-1-Pro into their developer API. it's
16:58
actually limited to developers who spent
17:01
of these five dollars on this
17:03
and it cost a hundred fifty
17:05
per million tokens for input and
17:07
six hundred dollars per million tokens
17:09
generated so that's very very high
17:12
prices obviously that's as we've said
17:14
GP 4.5 was seventy five dollars
17:16
for one million output tokens and
17:18
that's yeah two two orders of
17:20
magnitude easily above what you would
17:23
typically charge. Yeah, I'm trying to
17:25
think if it's two or three
17:27
orders of magnitude. It might be
17:29
approaching three orders of market here
17:31
actually. So yeah, interesting strategy here
17:34
from Open Eye where you haven't
17:36
seen any other companies release these
17:38
very expensive products yet and Open
17:40
Eye is increasingly doing that with
17:42
Chesapeake Pro, they're $200 per month
17:45
subscription with QB4. It makes me
17:47
wonder if this is an attempt
17:49
to become more profitable or if
17:51
this is them sort of testing
17:53
waters There could be various readings
17:56
I suppose. Yeah, it's also I
17:58
mean, it's interesting to note this
18:00
is not an order of magnitude
18:02
larger than what GPT-3's original pricing
18:05
was. I was just looking it
18:07
up in the background here to
18:09
check because I seem to remember
18:11
it being you know on it
18:13
back then it was priced per
18:16
per a thousand tokens with reasoning
18:18
models you tend to see more
18:20
per million tokens just because of
18:22
the number of tokens generated but
18:24
sort of reminds me in the
18:27
military or the history of the
18:29
military there is often this this
18:31
restriction where it's like people can
18:33
only carry I forget what it
18:35
is, 60 pounds or something of
18:38
equipment. And so over time, you
18:40
tend to see like the amount
18:42
of equipment that a soldier carries
18:44
doesn't tend to change or the
18:46
weight of it, but of course
18:49
the kind of equipment they carry
18:51
just changes to reflect technology. This
18:53
sort of seems similar, right? There's
18:55
like almost a perito frontier of
18:57
pricing, at least for the people
19:00
who are willing to reach for
19:02
the most intelligent products, you know,
19:04
and you're constantly reaching for it.
19:06
This is a push-forative. Even relative.
19:08
There's all kinds of feedback people
19:11
have been getting. There's complaints about,
19:13
though, this model struggled with Sudoku
19:15
puzzles, apparently, and optical illusions and
19:17
things like that. People say, you
19:20
know, at a certain point, anything
19:22
you launch at a high-priced point,
19:24
especially if you're opening eye, people
19:26
will complain that it's not like
19:28
super intelligence. And so... Yeah, but
19:31
there's also an interesting parallel here
19:33
where O1-Pro... just in terms of
19:35
benchmarks and I think in general
19:37
in terms of the vibe of
19:39
what people think is that it's
19:42
not sort of significantly better than
19:44
01 and that parallels duty 4.5
19:46
you know it's better but it's
19:48
not sort of a huge leap
19:50
so there is an interesting kind
19:53
of a demonstration of probably it's
19:55
harder to get you know huge
19:57
leaps and performance and people are
19:59
going to be more critical now
20:01
of if you were not offering
20:04
something that's like you know really
20:06
between Jupiter 4 3.5 and 4
20:08
for instance Yeah, I think it's
20:10
quite use case specific too, right?
20:12
So as we've seen, you know,
20:15
the kinds of issues people are
20:17
running into optical illusions, you know,
20:19
Sudoku puzzles, this sort of thing,
20:21
are pretty far from the standard,
20:24
you know, the actual workloads that
20:26
Open AI is targeting, right? Their
20:28
focus is, can we build something
20:30
that helps us automate AI research
20:32
as quickly as possible? Those sorts
20:35
of benchmarks, yeah, we are seeing
20:37
needle moving there. There's also some
20:39
interesting stuff that we'll talk about
20:41
for meter, suggesting that in fact
20:43
that is what's happening here. That
20:46
on those particular kinds of tasks,
20:48
we're seeing pretty significant. acceleration with
20:50
scale, but you're right, right? It's
20:52
this funny uneven surface, just like
20:54
how humans are funny and uneven,
20:57
right? Like you have a really
20:59
talented artist who can't write a
21:01
line of code to save their
21:03
lives, right? And vice versa. So
21:05
another instance of the paradox of
21:08
what's hard for AI isn't necessarily
21:10
hard for humans. And moving away
21:12
from Open AI to Google, we
21:14
now have another feature. Another instance
21:16
of canvas, this time on Gemini,
21:19
and they're also adding audio overview.
21:21
So I don't know why they
21:23
do this, why these ALMs just
21:25
copy each other's names. We had
21:28
deep research showing up in multiple
21:30
variants. Now we have a canvas,
21:32
which is also on ChadGPT, and
21:34
I think on the topic, it's
21:36
called Artifacts. Basically the same idea,
21:39
where now as you're working on
21:41
something like code, for instance. or
21:43
like a web app, for instance,
21:45
you can have a site panel
21:47
showing this living document rendering of
21:50
it with a chatbot to the
21:52
left, so you can essentially interactively
21:54
work and see a preview of
21:56
what you're getting. And you also
21:58
have audio overviews, which is pretty
22:01
much something like notebook, LM, that
22:03
you can upload documents and get
22:05
this podcast. style conversation going on.
22:07
So nothing sort of conceptually new
22:09
going on here, but I think
22:12
an interesting convergence across the board
22:14
of all these tools, everyone has
22:16
canvas, everyone has deep research, everyone
22:18
seems to have kind of the
22:20
same approach to implementing LM interfaces.
22:23
Speaking of that, in fact, the
22:25
next story is about Enfropic and
22:27
them adding web search capabilities to
22:29
Claude. That is now in preview
22:32
for paid users in the US
22:34
and that will basically work the
22:36
same as it does in Chajabiki
22:38
and other models. You can enable
22:40
it to work with cloud 3.7
22:43
and then it will be able
22:45
to provide direct citations from web-source
22:47
information. So yeah, not much else
22:49
to say we're getting a website
22:51
for cloud which will enable it
22:54
to be more useful. It's interesting
22:56
because it's like the tee up
22:58
to this is anthropic being a
23:00
little bit more shy than other
23:02
companies to roll up the the
23:05
web search product into their Into
23:07
their agents and I mean this
23:09
is consistent with the threat models
23:11
that they take seriously right things
23:13
like loss of control right which
23:16
typically involve you know an AI
23:18
model going on to the internet
23:20
maybe replicating its weight somehow and
23:22
internet access is kind of central
23:24
to a lot of these things
23:27
I don't know if that was
23:29
part of this, it at least
23:31
is consistent with it. So the
23:33
result is that there may be
23:36
a little bit later the party
23:38
than others. Apparently, according to these
23:40
initial tests, you don't always see
23:42
web search used for current events
23:44
related questions. But when that happens,
23:47
you do get these nice in
23:49
line citations, pull from sources, it
23:51
does look at social media, and
23:53
then of course news sources. like
23:55
NPR, like Reuters, they cite in
23:58
the examples they show. So, you
24:00
know, pretty standard product in the
24:02
inline citation approach that you see
24:04
with deep research, for example, certainly
24:06
making an appearance here. And last
24:09
up, again, along the lines of
24:11
these stories. we have XAI launching
24:13
a new API, this one for
24:15
generating images. So we have a
24:17
new model called Rock 2 Image
24:20
12, and you can now query
24:22
it. For now it's quite limited.
24:24
You can only generate 10 images
24:26
per request and you are limited
24:28
to 5 requests per second. The
24:31
cost there is 7 cents per
24:33
image. which is slightly above what,
24:35
for instance, Blackforest Labs charges. They
24:37
are the developers of Flux and
24:40
competitive with another offering for my
24:42
diagram. So I think, yeah, interesting
24:44
to see XAI expanding their APIs
24:46
once again, they released their own
24:48
image generation back in December and
24:51
it's kind of looked competitive with
24:53
something like Google's latest generation of
24:55
where we focus is really shifted
24:57
towards careful instruction following in your
24:59
image generation. So yeah, X AI
25:02
is as ever trying to catch
25:04
up or moving quite rapidly to
25:06
expand their offerings. Yeah, they really
25:08
are. And I think when we
25:10
first covered Blackforest Labs partnership with
25:13
X AI, one of the first
25:15
things that we said was like,
25:17
hey, this is. because I think
25:19
they raised a big round, right,
25:21
on the back of the incredible
25:24
distribution that they were going to
25:26
get through X AI and the
25:28
kind of vote of confidence that
25:30
reflects it from Elon. But at
25:32
the time we were talking about,
25:35
hey, you know, this is a
25:37
pretty strategically dicey position for Blackforth
25:39
Labs because the one thing we've
25:41
consistently seen from from all the
25:44
AI companies is Once they start
25:46
getting you in for chat, eventually
25:48
they start rolling out multimodal features,
25:50
and it's not clear that those
25:52
aren't best built in house for
25:55
any number of reasons, not just
25:57
including the fact that you want
25:59
to kind of internalize all the
26:01
revenues you can from the whole
26:03
from the whole stack, but also
26:06
once you have a good reasoning
26:08
model or rather a good foundation
26:10
model, that foundation model can be
26:12
mine for multimodality. post hoc and
26:14
you just kind of get to
26:17
amortize your investment across more modalities
26:19
and so it's just this natural
26:21
move to kind of keep crawling
26:23
into you're creeping into adjacent markets
26:25
like image generation video generation which
26:28
is also something that X AI
26:30
is looking at so yeah I
26:32
mean it's kind of interesting for
26:34
Blackforce Labs this probably is going
26:36
to be a big challenge for
26:39
them I don't know how extensive
26:41
their partnership continues to be at
26:43
this point but it's a dicey
26:45
time to be to be one
26:47
of these companies. And on two
26:50
applications and business we begin with
26:52
some announcements from invidia. There's a
26:54
preview of their plans in 2026
26:56
and 2027. They have the Ruben
26:59
family of GPUs coming in 2026
27:01
and then Ruben Ultra in 2027.
27:03
So that will also come along
27:05
with a new, I guess. server
27:07
layout with ability to combine 576
27:10
GPUs per rack, which, you know,
27:12
I guess it's very much following,
27:14
you know, the tracks of very,
27:16
very crazy enhancement to computing that
27:18
invidious been able to continue creating
27:21
with, you know, B200, I believe
27:23
it's now, and now this is
27:25
their plans for the next couple
27:27
years. Yeah, there's a lot going
27:29
on with this update. It's actually
27:32
pretty interesting and quite significant, especially
27:34
on the data center side, in
27:36
terms of the infrastructure that will
27:38
be required to accommodate these new
27:40
chips. A couple things here, right?
27:43
So there is this configuration of
27:45
the Blackwell called the NBL 72,
27:47
is the certain name of this
27:49
configuration. This is where you have,
27:51
so, okay, imagine a tray that
27:54
you're in a slot into. a
27:56
rack, a server rack, right? So
27:58
on that tray, you're gonna have
28:00
four GPUs. All right, so each
28:03
trache. contains four GPUs, and in
28:05
total, in that whole rack, you're
28:07
gonna have 72, I'm sorry, you're
28:09
actually gonna have 144 GPUs total,
28:11
but because two of those GPUs
28:14
show up on the same motherboard,
28:16
God. So each freaking, each freaking
28:18
tray that you slot into the
28:20
rack has two motherboards on it,
28:22
each of those motherboards has two
28:25
GPUs, two B200 GPUs. So in
28:27
total, you're putting in four GPUs
28:29
per tray. But they're kind of
28:31
divided into two motherboards, each with
28:33
two GPUs. Anyway, this led to
28:36
the thing being called the NFL
28:38
72, when in reality there's 144
28:40
GPUs on there, at least Jensen-Hwang
28:42
says it would have been more
28:44
appropriate to call it the NFL-144-L.
28:47
Okay. What's actually interesting in this
28:49
setup, they're calling the Ruben-Nvil-L-144 rack.
28:51
There's not more GPUs there. It's
28:53
not that there's twice as many
28:55
GPUs as the Nvil-72 with the
28:58
black wells. It's just that they're
29:00
counting them differently now. So they're
29:02
saying, actually, we're going to count
29:04
all the GPUs. So if, I
29:07
think back in the day we
29:09
did talk about the Nvil-72 setup,
29:11
this is basically just the same
29:13
number of GPUs. If that didn't
29:15
make any sense, just delete it
29:18
from your mind. It's both some
29:20
of the things that are actually
29:22
interesting. The story is it's comparable
29:24
in the number of GPUs to
29:26
the current set of top line
29:29
GPUs. So they're kind of pitching
29:31
it as you can spot it
29:33
into your existing infrastructure more or
29:35
less and just to jump into
29:37
numbers a little bit, you're getting
29:40
roughly three times the inference and
29:42
training performance in terms of just
29:44
rock compute memory is faster by
29:46
close to two-ish or multiplier of
29:48
two kind of like yeah you're
29:51
seeing multipliers on top of the
29:53
current one so quite significant change
29:55
in performance if you do upgrade
29:57
yeah so so when it comes
29:59
to Rubin right which is the
30:02
the sort of next generation coming
30:04
online at FP4, you're seeing, yeah,
30:06
3X more flops, right, three times
30:08
more, more logic capacity. Now the,
30:11
on the memory side, things actually
30:13
do get somewhat interesting. The memory
30:15
capacity is going to be 288
30:17
gigabytes per GPU, right? That is
30:19
the same as the B300. So
30:22
no actual change in terms of
30:24
the, like, per GPU. memory capacity.
30:26
We'll get back to why that
30:28
matters a bit less in a
30:30
second, but that's kind of part
30:33
of the idea. The memory bandwidth
30:35
is improving. It's almost doubling. Maybe,
30:37
yeah, it's short of doubling. So
30:39
the memory bandwidth is really, really
30:41
key, especially when you look at
30:44
inference. So that's one of the
30:46
reasons why this is really being
30:48
focused on. But there's also a
30:50
bunch of things like the, so
30:52
the cables that connect GPUs together.
30:55
on roughly speaking on one rack,
30:57
if you want to imagine it
30:59
that way. Those are called envy
31:01
link cables, super super high bandwidth.
31:03
Those are doubling in throughput. So
31:06
that's really a really big advance.
31:08
There's also stuff happening on the
31:10
networking side, but we don't need
31:12
to touch that. Bottom line is
31:15
envy link cables used to be
31:17
the way you connected GPUs across
31:19
different. trays in the same rack
31:21
and maybe adjacent racks depending on
31:23
the configuration. But it's very local,
31:26
very very tight, very high bandwidth
31:28
communication. What's happening here is each
31:30
of these motherboards, you know, that
31:32
you're slotting into your rack, they
31:34
have a CPU and two GPUs.
31:37
And we talked about this in
31:39
the hardware episode as to why
31:41
that is. The CPUs like the
31:43
orchestra conductor, the GPUs are like
31:45
the instruments that, you know, they're
31:48
actually doing the hard work in
31:50
the heavy lifting. Typically the CPU
31:52
would be connected to the GPUs
31:54
through a PCIE connection. So this
31:56
is a relatively low bandwidth compared
31:59
to NVLink. Now they're moving over
32:01
to NVLink as well for the
32:03
CPU 2 GPU connection. That's actually
32:05
a real... really big deal. It
32:07
comes with a core to core
32:10
interface. So right now, the GPUs
32:12
and CPUs are going to share
32:14
a common memory space. So essentially
32:16
directly accessing each other's memory. Whatever
32:19
is in memory on the CPU,
32:21
the GPU can access right away
32:23
and vice versa. That's a really,
32:25
really big change. It used to
32:27
not be the case. You used
32:30
to have independent CPU and GPU
32:32
memory. the GPUs themselves would share
32:34
a common memory space if they
32:36
were connected via NVLink and in
32:38
fact that's kind of that's part
32:41
of the idea here that's what
32:43
makes them a coherent wad of
32:45
compute and it's also part of
32:47
the reason why the memory capacity
32:49
on those GPUs matters a bit
32:52
a bit less because you're adding
32:54
you're kind of combining all your
32:56
GPUs together and they have a
32:58
shared memory space so if you
33:00
can just add to the number
33:03
of GPUs, you have, you're effectively
33:05
adding true memory capacity. So that's
33:07
kind of an important difference there.
33:09
So anyway, last thing I'll mention,
33:11
they say that apparently Ruben Ultra
33:14
is going to come out. This
33:16
is, so there's going to be
33:18
Ruben and then Ruben Ultra. Ruben
33:20
Ultra is coming out the second
33:23
half of 2027. It'll come with
33:25
a Ruben GPU and a Vera
33:27
CPU, like invidia tends to do,
33:29
right? And so Vera is the
33:31
CPU, Reuben is the GPU. Apparently,
33:34
the full rack is going to
33:36
be replaced by this 576 GPU
33:38
setup, a massive number. That is
33:40
essentially, so they don't specify the
33:42
power consumption, but it's clear from
33:45
other kind of industry products that
33:47
are coming out. We're tracking for
33:49
one megawatt per rack. And just
33:51
worth emphasizing. That's a thousand kilowatts.
33:53
That is a thousand homes worth
33:56
of power going to a single
33:58
rack in a server, in a
34:00
data center. That's insane, right? So
34:02
the power density required for this
34:04
is going through the roof, the
34:07
cooling requirements, all this stuff. It's
34:09
all really cool. And anyway, this
34:11
is a very, very big motion.
34:13
Just to dive in a little.
34:15
it into the numbers, just fun,
34:18
right? So the compute numbers are
34:20
in terms of flops, which is
34:22
floating point operations per second, basically
34:24
multiplications per second or additions per
34:26
second. And the numbers we get
34:29
with these announced upcoming things like
34:31
their Rubin are now for inference,
34:33
3.6 XA flops of inference. So
34:35
XA is... is Quintilian. It's 10
34:38
to the 18. Quintilian is the
34:40
one after Quintilian. So I can't
34:42
even imagine how many, I mean,
34:44
I guess I know how many
34:46
zeros it is, but it's very
34:49
hard to imagine a number that
34:51
long. And that's just where we
34:53
are at. Also worth mentioning, so
34:55
this is the plans for 2026,
34:57
2027. They also did announce for
35:00
late of a year the coming
35:02
of B300, which is B300. Also,
35:04
you know, is an improvement in
35:06
performance of about 1.5. They also
35:08
didn't announce the ultra-bariance of Blackwell,
35:11
both 200 and 300. And the
35:13
emphasis they are starting to add,
35:15
I think, is more on the
35:17
inference side. They definitely are saying
35:19
that these are good models for
35:22
the age of reasoning. So they're
35:24
capable of outputting things fast in
35:26
addition to training well and that's
35:28
very important for reasoning of course
35:30
because the whole idea is you're
35:33
using up more tokens to get
35:35
better performance. So they're giving some
35:37
numbers like for instance Blackwell Ultra
35:39
will be able to deliver up
35:42
to 1,000 tokens per second on
35:44
Deep Seek R1 and that's you
35:46
know comparable usually you would be
35:48
seeing something like 100-200 tokens per
35:50
second a thousand dollars per second
35:53
is very fast. Yeah, and then
35:55
the inference focus is reflected too.
35:57
in the fact that they're looking
35:59
at you know FP4 flops denominated
36:01
performance right so when you go
36:04
to inference often you're your inferencing
36:06
quantized models inferencing at FP4 so
36:08
lower resolution and also this the
36:10
the memory bandwidth side becomes really
36:12
important for inference disproportionately relative to
36:15
training at least on the current
36:17
paradigm so that's kind of you
36:19
know part of the reason that
36:21
you're seeing those big big lifts
36:23
at that that end of things
36:26
is because of the inference. And
36:28
next story is also about some
36:30
absurd sounding numbers of hardware. This
36:32
one is from Apple. They have
36:34
launched a Mac studio offering and
36:37
the top line configuration where you
36:39
can use the entry ultra chip
36:41
with 32 CPU's 80 core GPUs
36:43
that can even run the Deep
36:46
Seek R1 model. That's the 671
36:48
billion parameter AI model. Fewer at
36:50
inference, you're using about 37 billion
36:52
per output I believe, but still,
36:54
this is hundreds of gigabytes of
36:57
memory necessary to be able to
36:59
run it and just fit it
37:01
in there. Yeah, Apple's also doing
37:03
this weird thing where they're not
37:05
designing like GPUs for their data
37:08
centers, including for AI workloads. They
37:10
seem to be basically like doing
37:12
souped up CPUs, kind of like
37:14
this, with just like gargantuan amounts
37:16
of VRAM, that again, have this
37:19
very large kind of shared pool
37:21
of memory, right? We talked about
37:23
like coherent memory on the Blackwell
37:25
side, right, and on the Rubin
37:27
side, just the idea that. if
37:30
you have a shared memory space,
37:32
you can pool these things the
37:34
other. Well, they're not as good
37:36
at the shared memory space between
37:38
CPUs. What they do is they
37:41
have disgusting amounts of RAM, one
37:43
GPU, right? So like, 512 gigs
37:45
is, it's just wild. Like, it's,
37:47
anyway. for a CPU at least.
37:50
And we're talking here about, when
37:52
you say memory, we mean really
37:54
something like RAM, right? And so
37:56
if you have a laptop, right,
37:58
if you buy a Mac, for
38:01
instance, typically you're getting eight gigabytes,
38:03
maybe 16 gigabytes of RAM, the
38:05
fast type of memory, read, what
38:07
does it, read something memory? It's
38:09
random access memory, right. As opposed
38:12
to the slower memory of. Let's
38:14
say an SSD or things where
38:16
you can easily get terabytes to
38:18
get that crazy amount of Random
38:20
access memory is insane when you
38:23
consider that typically it's like eight
38:25
16 gigabytes and you know This
38:27
is expensive memory. It's it's stupid
38:29
expense. It's also like Yeah, there's
38:31
different kinds of RAM and we
38:34
talked about that in our hardware
38:36
episode This is a combined CPU
38:38
GPU set up by the way
38:40
so 32 core CPU 80 core
38:42
GPU but shared memory across the
38:45
board. So V-Ram is like really
38:47
close to the logic, right? So
38:49
this is like the most, as
38:51
you said, exquisitely expensive kind of
38:54
memory you can put on these
38:56
things. They're opting to go in
38:58
this direction for very interesting reasons,
39:00
I guess. I mean, it does
39:02
mean that they're disadvantaged in terms
39:05
of being able to scale their
39:07
data center, you know, infrastructure, because
39:09
of networking, at least as far
39:11
as I can tell. It's a
39:13
very interesting standalone standalone machine. I
39:16
mean, it's a pretty wild specs.
39:18
Right. Yeah, exactly. If you go
39:20
to the top line offerings, and
39:22
this is, you know, a physical
39:24
product you can buy as a...
39:27
Yeah, it's a Mac, right? Yeah,
39:29
it's a Mac, yeah, it's a
39:31
Mac, it's like a big, kind
39:33
of cube-ish thing. And if you
39:35
go to the top line of
39:38
recreation, it's something like $10,000, don't
39:40
quote me on that, but it's,
39:42
you know, you know, expensive, expensive,
39:44
expensive, expensive, expensive, as well. It
39:46
does come with other options for
39:49
a reason M4 Max CPU and
39:51
GPUs less powerful than an free
39:53
ultra, but anyway. Very kind of
39:55
beefy offering now from Apple. Next
39:58
we have something a bit more
40:00
forward-looking. Intel is apparently reaching an
40:02
exciting milestone for the 18A1.8 nanometer
40:04
class wafers with a first run
40:06
at the Arizona Fab. So this
40:09
is apparently ahead of schedule. They
40:11
have these Arizona Fabs. Fab 52
40:13
and Fab 62, Fab as we've
40:15
covered before is where you try
40:17
to make your chips and 1.8
40:20
nanometer is the next kind of
40:22
frontier in terms of scaling down
40:24
the resolution. the density of logic
40:26
you can get on a chip.
40:28
So the fact that they're running
40:31
these test wafers, they're ensuring that
40:33
you can transfer the fabrication process
40:35
to these new Arizona facilities, I
40:37
guess the big deal there is
40:39
partially that these are located within
40:42
the US, within Arizona, and they
40:44
are seemingly getting some success and
40:46
are ahead of schedule, as you
40:48
said, and that's impressive because fabs
40:50
are an absurdly complex engineering project.
40:53
Yeah Intel is in just this
40:55
incredibly fragile space right now as
40:57
has been widely reported and we've
40:59
talked about that a fair bit
41:01
I mean they need to blow
41:04
it out of the water with
41:06
with 18a in their future notes
41:08
I mean this is like a
41:10
make-or-break stuff so forward progress yeah
41:13
they had their their test facility
41:15
in Hillsborough Oregon who that was
41:17
doing 18a production as you said
41:19
on a test basis and they're
41:21
now successfully getting the first test
41:24
wafers in their new Arizona Fab
41:26
out so that's great But yeah,
41:28
it'll eventually have to start running
41:30
actual chips for commercial products. The
41:32
big kind of distinction here is
41:35
they're actually manufacturing with 18A, these
41:37
gate all around transistors. I think
41:39
we talked about this in the
41:41
hardware episode. We'll go into too
41:43
much detail. This is a specific
41:46
geometry of transistor that allows you
41:48
to have better control over the
41:50
flow of electrons through your transistor,
41:52
essentially. It's a big, big challenge
41:54
people have had in making transistors
41:57
small and smaller. You get all
41:59
kinds of current leakage. The current,
42:01
by the way, is sort of
42:03
like the thing that carries information
42:05
in your computer. And so you
42:08
want to make sure that you
42:10
don't have current leakage to kind
42:12
of have ones become zeros or
42:14
let's say operation, like, you know,
42:17
a certain kind of gate, turn
42:19
into the wrong kind of gate.
42:21
That's the idea here. this gate-all-round
42:23
transistor based on a ribbon-fed design.
42:25
And yeah, so we're seeing that
42:28
come-to-market gate-all-round is is something that
42:30
TSMC is moving towards as well.
42:32
And you know, it's just going
42:34
to be the next, essentially the
42:36
next beat of production. So here
42:39
we have 18A, kind of early
42:41
signs of progress. And now moving
42:43
away from hardware more to business-y
42:45
stuff, XAI has acquired a genative
42:47
AI startup. which is focused on
42:50
text to video similar to SORA.
42:52
They also have AI powered, or
42:54
initially they're worked on AI powered
42:56
for the tools and then pivoted.
42:58
So I suppose on surprising in
43:01
a way that they are working
43:03
on text to video as well.
43:05
They just want to have all
43:07
the capabilities at XAI, and this
43:09
presumably will make that easier to
43:12
do. Yeah, one of the founders
43:14
had some quote, I think it
43:16
might have been on X, I'm
43:18
not sure, but he said, we're
43:21
excited to continue scaling these efforts
43:23
on the largest cluster in the
43:25
world, Colossus, as part of XAI,
43:27
so it seems like they'll be
43:29
given access to Colossus as part
43:32
of this, maybe not shocking, but
43:34
kind of an interesting subnote. They
43:36
were backed by some really impressive
43:38
VCs as well, so Alexis Zahanian,
43:40
as were like famous for like
43:43
being the co-founder, read it of
43:45
course, and doing his own VC
43:47
stuff, and SV Angel too. So
43:49
pretty interesting acquisition and a nice
43:51
soft landing too for folks in
43:54
a space that otherwise, you know,
43:56
I mean, they're either going to
43:58
acquire you or they're going to
44:00
eat your lunch. So I think
44:02
that's probably the best outcome for
44:05
people working on the different modalities,
44:07
at least. at least on my
44:09
view of the market. Yeah, and
44:11
I guess acquisition makes sense. The
44:13
startup has been around for over
44:16
two years and they have already
44:18
trained multiple video models. Hotshot Excel
44:20
and hotshot, and they do produce
44:22
quite good looking videos, so. make
44:25
some sense for opening eye to
44:27
acquire them, if only for the
44:29
kind of brain power and expertise
44:31
in that space. Yeah, man, they're
44:33
old. They've been around for like
44:36
two years, right? Like, yeah, that
44:38
was what, free SORA or like
44:40
a round-of-time SORA. Yeah, yeah, it's
44:42
funny. It's just funny how the
44:44
AI business cycle is so short,
44:47
like, like, these guys have been
44:49
around for all the 24 months.
44:51
Very experts, very veterans. And onto
44:53
the last story, Tencent is reportedly
44:55
making massive and video H20 chip
44:58
purchases. So they are supposedly meant
45:00
to support the integration of Deep
45:02
Seek into We Chat, which kind
45:04
of reminds me of meta where
45:06
meta has this somewhat interesting drive
45:09
to... Let you use a llama
45:11
everywhere and Instagram and all their
45:13
messaging tools. So this would seem
45:15
to be in a way similar
45:17
where Tencent would allow you to
45:20
use deep seek within we chat.
45:22
Yeah, part of what's going on
45:24
here too is the standard stockpiling
45:26
that you see China do and
45:29
Chinese companies do ahead of the
45:31
anticipated crackdown from the United States
45:33
on export controls. And in this
45:35
case, the age 20 has been
45:37
kind of identified as one of
45:40
those chips that's likely to likely
45:42
to be... shut down for the
45:44
Chinese market in the relatively near
45:46
term. So it makes all the
45:48
sense in the world that they
45:51
would be stockpiling for that purpose.
45:53
But it is also the case
45:55
that you got R1 that has
45:57
increased dramatically the demand for access
45:59
to hardware. It's sort of funny
46:02
how quickly we pivoted from, oh
46:04
no, R1 came out and so
46:06
invidious stock crashes to, oh, actually
46:08
R1. great news for invidia. Anyway,
46:10
I think it's the turnaround that
46:13
we sort of expected. We talked
46:15
about this earlier and there has
46:17
been apparently a short-term supply shortage
46:19
in China regarding these age 20
46:21
chips. So like there's so much
46:24
demand coming in from 10 cent
46:26
that it's this sort of like
46:28
rate limiting for invidia to get
46:30
age 20s into the market there.
46:33
So kind of interesting, they've previously
46:35
placed orders on the order of
46:37
hundreds of thousands between them and
46:39
bite dance. back I think last
46:41
year was almost a quarter million
46:44
of these these Japanese so yeah
46:46
pretty big customers and onto projects
46:48
and open source we begin with
46:50
a story from the information titled
46:52
on tropics not so secret weapon
46:55
that's giving agents a boost I
46:57
will say kind of a weird
46:59
spin on this whole story but
47:01
anyway that's the one we're linking
47:03
to and it covers the notion
47:06
of MCP model context protocol which
47:08
Anthropic released all the way back
47:10
in November. We hopefully covered it.
47:12
I guess we don't know. I
47:14
was trying to remember, yeah. Yeah,
47:17
I think we did. And the
47:19
reason we're coming in now is
47:21
that it sort of blew up
47:23
over the last couple of weeks
47:25
if you're in the AI developer
47:28
space or you see people hacking
47:30
on AI, that it has been
47:32
to talk of a town, so
47:34
to speak. So Molokontax protocol, broadly
47:37
speaking is something... like an API,
47:39
like a standardized way to build
47:41
ports or mechanisms for AI agents
47:43
or AI, I guess, models to
47:45
call on services. So it standardizes
47:48
the way you can provide things
47:50
like tools. So there's already many,
47:52
many integrations following the standard for
47:54
things like slack, perplexity, notion, etc.
47:56
where if you adopt a particle
47:59
and you provide an FCP compatible
48:01
kind of opening, you can then
48:03
have an MCP client, which is
48:05
your AI model, call upon this
48:07
service. And it's very much like
48:10
an API for a website, where
48:12
you can have a particular URL
48:14
to go to, particular kind of
48:16
parameters, and you get something back
48:18
in some format. Here, the difference
48:21
is that, of course, this is
48:23
more specialized for. AI models in
48:25
particular, so it provides tools, it
48:27
provides like a prompt to explain
48:29
the situation, things like that. Personally,
48:32
I'm in the camp of people
48:34
who are a bit confused and
48:36
kind of think that this is
48:38
an API for an API kind
48:40
of situation, but either way, it
48:43
has gotten very popular. That is
48:45
exactly what it is. Yeah, it's
48:47
an API for an API. It's
48:49
also, I guess, a transition point.
48:52
or could be viewed that way,
48:54
you know, in the sense that
48:56
eventually you would expect models to
48:58
just kind of like figure it
49:00
out, you know, and have enough
49:03
context and ability to uncover whatever
49:05
information is already on the website.
49:07
to be able to use tools
49:09
appropriately, but there are edge cases
49:11
where you expect this to be
49:14
worthwhile. Still, this is going to
49:16
reduce things like hallucination of tools
49:18
and all kinds of issues that
49:20
when you talk about agents, like
49:22
one failure anywhere in a reasoning
49:25
chain or in an execution chain,
49:27
it can cause you to fumble.
49:29
And so, you know, this is
49:31
structurally a way to address that
49:33
and quite important in that sense.
49:36
It is also distinct from a
49:38
lot of the tooling that opening
49:40
eyes come out with, but it
49:42
sounds similar like the agent's API.
49:44
where they're focused more on chaining
49:47
tool uses together, whereas MCP, as
49:49
you said, is more about helping
49:51
make sure that each individual instance
49:53
of tool use goes well, right?
49:56
That the agent has what it
49:58
needs to kind of ping the
50:00
tool properly interact with it and
50:02
find the right tool rather than
50:04
necessarily chaining them together. So there
50:07
you go. MCP. is a nice
50:09
kind of clean open source play
50:11
for anthropic too. They are going
50:13
after that kind of more startup
50:15
founder and business ecosystem. So pretty
50:18
important from a marketing standpoint for
50:20
them too. Right, yeah, exactly. So
50:22
back in November, they announced this,
50:24
they introduced us as an open
50:26
standard, and they also released open
50:29
source repositories with some example of
50:31
model context product called servers. and
50:33
as well the specification and like
50:35
a development toolkit. So I honestly
50:37
haven't been able to track exactly
50:40
how this blew up. I believe
50:42
there was some sort of tutorial
50:44
given at some sort of convention
50:46
like the AI engineer convention or
50:48
something and then it kind of
50:51
took off and everyone was very
50:53
excited about the idea of model
50:55
context protocols right now. Moving on
50:57
to new models, we have Mistral
51:00
dropping a new open source model
51:02
that is comparable to 2P400 mini
51:04
and is smaller. So they have
51:06
Mistral small 3.1, which is seemingly
51:08
better than similar models, but only
51:11
has 24 billion parameters. Also can...
51:13
take on more input tokens, 120,000
51:15
tokens, and is fairly speedy at
51:17
150 tokens per second. And this
51:19
is being released under the Apache
51:22
2 license, meaning that you can
51:24
use it for whatever you want,
51:26
business implications, etc. I don't think
51:28
there's too much to say here,
51:30
other than like kind of a
51:33
nitpick here, but they say it
51:35
outperforms copper models like Gemini 3.
51:37
GD40 mini while delivering infant speeds
51:39
as you said of 150 tokens
51:41
per second but like you can't
51:44
just say that shit like it
51:46
doesn't mean anything to say yeah
51:48
it depends on what infrastructure you're
51:50
using yeah what's the stack dude
51:52
like like you know like I
51:55
can move at a hundred like
51:57
at a hundred miles an hour
51:59
if I'm in a Tesla that's
52:01
what makes me anyway so they
52:04
do give that information but it's like
52:06
buried like in the little like great
52:08
this is from their blog post where
52:10
we get these numbers so I guess
52:12
as with any of these model
52:15
announcements you go to the company
52:17
log you would get a bunch
52:19
of numbers on benchmarks showing
52:21
that it's the best You
52:23
have comparisons to Gemma free
52:26
from Google to coherent, AIIA,
52:28
Jupiter for a mini, cloud
52:30
3.5 haiku, and on all
52:32
of these things like MMLU,
52:34
human evil, math, it typically is
52:36
better, although, you know, I would
52:38
say it doesn't seem to be
52:40
that much better than Gemma at
52:43
least, and in many cases
52:45
is not better than 3.5 haiku
52:47
and Jupiter for a mini, but
52:49
yeah, still quite good. the 150 tokens
52:52
per second to for context is a batch
52:54
size 16 on 4 H 100s. They actually
52:56
like even in the technical post they write
52:58
while delivering inference speeds of 150 tokens
53:00
per second without further qualifying but it's
53:02
in like it's in this like small
53:05
gray text underneath an image that you
53:07
have to look for where you actually
53:09
find that context so don't expect this
53:11
to run at 150 tokens per second
53:13
on your laptop right that's that's just
53:15
not going to happen because you know
53:17
for H 100s is like that's quite
53:19
a lot of horsepower horsepower. still yeah
53:21
it's incremental improvement more
53:24
more open source coming
53:26
from from Israel and patchy
53:29
2.0 license so you know highly
53:31
permissive and one more model
53:33
you have exa on deep
53:35
reasoning enhanced language models coming
53:38
from LG AI research so
53:40
these are new models new
53:42
family models 2.4 billion seven
53:45
point eight billion and 32
53:47
billion parameters These are
53:49
optimized for reasoning
53:51
tasks and seemingly
53:54
are on par or
53:56
out of our performing
53:58
variations of R1. are one is
54:01
the giant one that's 671
54:03
billion. There's distilled
54:05
versions of those models
54:08
at comparable sizes and
54:10
in the short technical
54:12
report that they provide they
54:14
are showing that it seems
54:16
to be kind of along the lines
54:18
of what you can get with
54:21
those distilled R1 models
54:23
and similar or better
54:25
than Open AI01 mini.
54:27
Yeah, it's also, it's kind of interesting,
54:29
it seems to, I described, again, not
54:32
a lot of detail in the paper,
54:34
so it makes it hard to reconstruct,
54:36
but it does seem to be at
54:38
odds with some of the things that
54:40
learn in the Deep Seek R1 paper,
54:42
for example. So there's, they start
54:45
with, it seems, an instruction tuned
54:47
model, base model, the XA1,
54:49
3.5 instruct models, and then they
54:51
add onto that. a bunch of fine tuning,
54:53
they do supervised fine tuning, presumably this
54:56
is for like the reasoning structure, and
54:58
then DPO, so standards, or like, RL
55:00
stuff, and online RL. So, you know, this
55:02
is, you know, quite a bit of supervised
55:04
fine tuning of trying to teach the model
55:07
how to solve problems in the way you
55:09
wanted to solve them, rather than just like
55:11
giving it a reinforcement learning signal
55:13
and reward signal and like kind
55:15
of have at it like R10 did.
55:18
So yeah, kind of an interesting
55:20
alternative more, let's say. more
55:22
inductive prior laden approach and
55:24
first time as well that I
55:26
think we've covered anything from
55:28
LGA high risk so yeah
55:30
Exxon these models appear to
55:33
have already been existing and
55:35
being released Exxon 3.5 we're
55:37
back from December which be
55:39
somehow missed at the time
55:41
yeah fun fact Exxon stands
55:43
for expert AI for everyone
55:45
you got a lot when people
55:48
come up these acronyms in a
55:50
very creative way And they are
55:52
open sourcing it on hugging face
55:54
with some restrictions, this is primarily
55:56
for research usage. Odd to research and
55:59
investment. we begin the paper
56:01
sample scrutinize and scale effective inference
56:03
time search by scaling verification it
56:05
is coming from Google and let
56:07
me just double check also you
56:09
see Berkeley and it is quite
56:12
exciting I think as a paper
56:14
it basically is making the case
56:16
or presenting the idea of that
56:18
to do scaling inference time scaling
56:20
via reasoning where you have the
56:23
model you know once you already
56:25
train your model once you stop
56:27
updating your weights can you use
56:29
your model more to get smarter
56:31
if you just kind of do
56:33
more outputs in some way and
56:36
what you've seen in recent months
56:38
is inference time scaling via reasoning
56:40
where you have the model you
56:42
know output a long chain of
56:44
tokens where it does various kinds
56:46
of strategies to do better at
56:49
a given complicated task like planning
56:51
sub steps like verification backtracking these
56:53
things we've already covered. Well this
56:55
paper is saying instead of that
56:57
sort of scaling the output in
56:59
terms of a chain an extended
57:02
chain of tokens you can instead
57:04
sample many potential outputs like just
57:06
do a bunch of outputs from
57:08
scratch. And I kind of varied
57:10
up so you get many possible
57:12
solutions. And then if you have
57:15
a verifier and you can kind
57:17
of compare and combine these different
57:19
outputs, you can be as effective
57:21
or even in some cases more
57:23
effective than the kind of traditional
57:25
reasoning, inference time scaling paradigm. Again,
57:28
yeah, quite interesting in the paper.
57:30
They have a table one where
57:32
they are giving this example of
57:34
if you sample a bunch of
57:36
outcomes and you have their... that
57:38
is good, you can actually outperform
57:41
01 preview. And many other techniques,
57:43
you can get better numbers on
57:45
the hard reasoning benchmarks like Amy,
57:47
where you can solve eight out
57:49
of the 15 problems now, which
57:51
is insane, but that's where we
57:54
are. And on math and live
57:56
bench math and live bench reasoning.
57:58
So yeah, very interesting idea to
58:00
sample a bunch of solutions and
58:02
then just. compare them and can
58:04
combine them into one final output.
58:07
Yeah, I think this is one
58:09
of those key papers that, again,
58:11
I mean, we see this happen
58:13
over and over. Scaling is often
58:15
a bit more complex than people
58:18
assume, right? So at first, famously
58:20
we had pre-training, scaling, pre-training, compute,
58:22
that brought us from, you know,
58:24
GPD2, G3. to GPD4, now we're
58:26
in the inference time compute paradigm
58:28
where, you know, this analogy that
58:31
I like to use, like, you
58:33
have 30 hours to dedicate to
58:35
doing well on a test, you
58:37
get to choose how much of
58:39
that time you dedicate to studying,
58:41
how much you dedicate to actually
58:44
spending writing the test. And what
58:46
we've been learning is, you know,
58:48
scaling, pre-training computers, basically all study
58:50
time. And so if you just
58:52
do that and you effectively give
58:54
yourself like one second to write
58:57
the test, well, there's only so
58:59
well you can do, right? You
59:01
eventually start to saturate if you
59:03
just keep growing pre-training and you
59:05
don't grow inference time compute. You
59:07
don't invest more time at test
59:10
time. So essentially you have two-dimensional
59:12
scaling. If you want to get
59:14
the sort of indefinite returns, if
59:16
you want the curves to just
59:18
keep going up and not saturate,
59:20
you have to scale two things
59:23
at the same time. This is...
59:25
a case like that, right? This
59:27
is a case of a scaling
59:29
law that would be hidden to
59:31
a naive observer just looking at
59:33
this as a kind of a
59:36
one variable problem when in reality
59:38
it's a multivariate problem and suddenly
59:40
when you account for that you
59:42
go, oh wow, there's a pretty
59:44
robust scaling trend here. And so
59:46
what are these two variables? Well,
59:49
the first is scaling the number
59:51
of sampled responses, so just the
59:53
number of shots on goal, the
59:55
number of attempts that your model
59:57
is going to make at solving
59:59
a given problem, but you have
1:00:02
to improve verification capabilities at the
1:00:04
same time, right? So they're asking
1:00:06
the question, what test time scaling
1:00:08
trends come up as you scale
1:00:10
both the number of sampled responses
1:00:12
and your verification capabilities? One of
1:00:15
the things crucially that they find
1:00:17
though, is you might not evenly
1:00:19
think if you have a... a
1:00:21
verifier, right? And I want to
1:00:23
emphasize this is not a verifier
1:00:26
that has access to ground truth,
1:00:28
right? This is like, and I
1:00:30
think they use like Claude 3.7
1:00:32
for this, 3.7, so on and
1:00:34
so on and for this, but
1:00:36
basically there's a model that's going
1:00:39
to look at, say the 20
1:00:41
different possible samples in other words
1:00:43
that you get from your model
1:00:45
to try to... solve a problem.
1:00:47
And this verifier model is going
1:00:49
to contrast them and just determine,
1:00:52
based on what it knows, not
1:00:54
based on access to actual ground
1:00:56
truth or any kind of symbolic
1:00:58
system like a calculator, it's just
1:01:00
going to use its own knowledge
1:01:02
in the moment to determine which
1:01:05
of those 20 different possible answers
1:01:07
is the right one. And what
1:01:09
you find is, as you scale
1:01:11
the number of possible answers that
1:01:13
you have your verifier look at,
1:01:15
you might not even think well...
1:01:18
with just so much gunk in
1:01:20
the system, probably the verifiers performance
1:01:22
is going to start to drop
1:01:24
over time. It's just like it's
1:01:26
so much harder for it to
1:01:28
kind of remember which of these
1:01:31
was that good and which was
1:01:33
bad and all that stuff. So
1:01:35
eventually you would expect the kind
1:01:37
of trend to be that your
1:01:39
performance would saturate maybe even drop.
1:01:41
But what they find is the
1:01:44
opposite. The performance actually keeps improving
1:01:46
and improving. And the reason for
1:01:48
that seems to be that as
1:01:50
you increase the number of. samples,
1:01:52
the number of attempts to solve
1:01:54
a problem, the probability that you
1:01:57
get a truly exquisite answer that
1:01:59
is so much better than the
1:02:01
others, like that contrasts so much
1:02:03
with the median other answer, that
1:02:05
it's really easy to pick out,
1:02:07
increases. And that actually makes the
1:02:10
verifiers... easier. And so they refer
1:02:12
to this as an instance of
1:02:14
what they call implicit, I think
1:02:16
it was implicit scaling, was the
1:02:18
term, yeah, implicit scaling, right? So
1:02:20
essentially the, yeah, this idea that
1:02:23
you're more likely to get one
1:02:25
like exquisite outlier that favorably contrasts
1:02:27
with the crappy median samples. And
1:02:29
so in this sense, I mean,
1:02:31
I feel like the term verifier
1:02:34
is maybe not the best one.
1:02:36
Really what they have here is
1:02:38
a contraster. When I hear verifier,
1:02:40
I tend to think of ground
1:02:42
truth. I tend to think of
1:02:44
something that is actually kind of
1:02:47
like, you know, giving us, you
1:02:49
know, checking, let's say code and
1:02:51
seeing if it compiles properly. Yeah.
1:02:53
compiler, you could say, where it
1:02:55
takes a bunch of possible outputs
1:02:57
and then from all of that
1:03:00
picks out the best. kind of
1:03:02
guess to answer. Exactly, yeah, and
1:03:04
this is why I don't know
1:03:06
if it's not a term that
1:03:08
people use, but if it was
1:03:10
like contrast or is it really
1:03:13
what you're doing here, right? You're
1:03:15
like, you're sort of doing, in
1:03:17
a way, a kind of contrast
1:03:19
of learning, well, not learning necessarily,
1:03:21
because it's all happening at inference
1:03:23
time, but yeah. So the performance
1:03:26
is really impressive. There are implications
1:03:28
of this for the design of
1:03:30
these systems as well, so you're
1:03:32
trying to find ways to wrangle
1:03:34
problems to wrangle problems into a
1:03:36
shape where you can take advantage
1:03:39
of implicit scaling. where you can
1:03:41
have your model pump out a
1:03:43
bunch of responses in the hopes
1:03:45
that you're going to get, you
1:03:47
know, one crazy outlier that makes
1:03:49
it easier for the verifier to
1:03:52
do its job. So yeah, again,
1:03:54
you know, I think a really
1:03:56
interesting case of multi-dimensional scaling laws,
1:03:58
essentially, that are otherwise easily missed
1:04:00
if you don't invest in both
1:04:02
verifier performance and sampling at the
1:04:05
same time. Exactly, and this is,
1:04:07
I think, important context to provide.
1:04:09
The idea of sampling many answers
1:04:11
and just kind of picking out
1:04:13
the answer that occurred the most
1:04:15
times in all these samples is
1:04:18
a well-known idea. There's also, I
1:04:20
mean... Self-consistency. Self-consiciency, exactly. a majority
1:04:22
vote essentially is one well-established technique
1:04:24
to get better performance and there's
1:04:26
even more complex things you could
1:04:28
do also with like a mixture
1:04:31
of, I forget the term that
1:04:33
exists, but the general idea of
1:04:35
paralyzed generation of outputs and generating
1:04:37
multiple outputs potentially of multiple models
1:04:39
is well known and the real
1:04:42
insight as you said here is
1:04:44
that you need a strong verifier
1:04:46
to be able to really leverage
1:04:48
it. in the stable one they
1:04:50
show that if you compare like
1:04:52
you do get better performance quite
1:04:55
a bit if you just do
1:04:57
consistency if you just sample 200
1:04:59
responses and pick out the majority
1:05:01
compared to the thing where you
1:05:03
don't do any scaling you're getting
1:05:05
four problems out of 15 on
1:05:08
Amy as opposed to one you're
1:05:10
getting you know a significant jump
1:05:12
in performance but after 200 if
1:05:14
you go to 1, you basically
1:05:16
stop getting better. But if you
1:05:18
have a strong verifier, basically a
1:05:21
strong selector among the things, instead
1:05:23
of majority voting, you have some
1:05:25
sort of intelligent way to combine
1:05:27
the answers, you get a huge
1:05:29
jump, a huge difference between just
1:05:31
consistency at 100 and verification at
1:05:34
200. And one reason this is
1:05:36
very important is, first, they also
1:05:38
highlight the fact that verification is
1:05:40
a bit understudy, I think, in
1:05:42
the space of LLLMs, even introduce
1:05:44
a benchmark specifically for a verification.
1:05:47
The other reason that this is
1:05:49
very notable is that if you're
1:05:51
doing sampling based techniques, you can
1:05:53
paralyze the sampling and that is
1:05:55
different from extending your reasoning or
1:05:57
your search because reasoning via more
1:06:00
tokens is sequential, right? It's going
1:06:02
to take more time. Scaling via
1:06:04
sampling you can paralyze all the
1:06:06
samples. then just combine them, which
1:06:08
means that you can get very
1:06:10
strong reasoning at comparable timescales to
1:06:13
if you just take one output,
1:06:15
for instance. So that's a very
1:06:17
big deal. Yeah. Another key reason
1:06:19
why this works too, we covered
1:06:21
a paper, I think it was
1:06:23
months ago, that was pointing out
1:06:26
that if you look at all
1:06:28
the alignment techniques that people used
1:06:30
to get more value out of
1:06:32
their their models, right? They kind
1:06:34
of assumed this one. query one
1:06:37
output picture. Like let's align within
1:06:39
that context. Whereas in reality, what
1:06:41
you're often doing, especially with agents
1:06:43
and inference time compute, is you
1:06:45
actually are sampling like a large
1:06:47
number of outputs and you don't
1:06:50
care about how shitty the average.
1:06:52
generated sample is generated solutions. What
1:06:54
you care about is, is there
1:06:56
one exquisite one in this batch?
1:06:58
And what they did in that
1:07:00
alignment paper is they found a
1:07:03
way to kind of like upweight
1:07:05
just the most successful outcomes and
1:07:07
use that for a reward signal
1:07:09
or some kind of gradient update
1:07:11
signal. But this is sort of
1:07:13
philosophically aligned with that, right? It's
1:07:16
saying. you know, like you said,
1:07:18
self consistency is the view that
1:07:20
says, well, let's just do wisdom
1:07:22
in numbers and say we generated,
1:07:24
you know, I don't know, like
1:07:26
a hundred of these outputs. And
1:07:29
the most common response, most consistent
1:07:31
answer was this one. So let's
1:07:33
call this the one that we're
1:07:35
going to cite as our output.
1:07:37
But of course, if your model
1:07:39
has a consistent failure mode, that
1:07:42
can also cause you to true
1:07:44
self-consistency, identify that, you know, kind
1:07:46
of settle on that failureor mode.
1:07:48
in this big pot and that's
1:07:50
really what this is after. So
1:07:52
a lot of interesting times to
1:07:55
other lines of research as you
1:07:57
said and I think a really
1:07:59
interesting and important paper. And next
1:08:01
up we have the paper block
1:08:03
diffusion interpreting between outer aggressive and
1:08:05
diffusion language models and also I
1:08:08
think quite an interesting one. So
1:08:10
typically when you're using an OLM
1:08:12
you're using outer aggression which is
1:08:14
just saying that you're computing one
1:08:16
talk at a time, right? You
1:08:18
start with one word, then you
1:08:21
select the next word, then you
1:08:23
select the next word. You have
1:08:25
this iterative process, and that's a
1:08:27
limitation of traditional alms, because you
1:08:29
need to sequentially do this one
1:08:31
step at a time. You can't
1:08:34
generate an entire output sequence all
1:08:36
at once, as opposed to diffusion,
1:08:38
which is a generation mechanism, a
1:08:40
way to the generation that is
1:08:42
kind of giving you an entire
1:08:45
answer at once. And diffusion is
1:08:47
the thing that's typically used for
1:08:49
image generation where you start with
1:08:51
just a noisy image, a bunch
1:08:53
of noise, you iteratively update the
1:08:55
entire image all at once until
1:08:58
you get to a good solution.
1:09:00
And we covered, I believe, maybe
1:09:02
a week ago or two weeks
1:09:04
ago, a story of a diffusion-based
1:09:06
alum that was seemingly performed pretty
1:09:08
well. There was a company that
1:09:11
was a company that was a
1:09:13
company that was a company that
1:09:15
was a company that made the
1:09:17
claim, although we didn't provide too
1:09:19
much research on it. Well, this
1:09:21
paper is talking about, well, how
1:09:24
can we combine the strengths of
1:09:26
both approaches? The weakness of diffusion
1:09:28
is typically just doesn't work as
1:09:30
well for LLLMs, and there's kind
1:09:32
of various hypotheses, it's an interesting
1:09:34
question of why it doesn't work,
1:09:37
but it also doesn't work for
1:09:39
arbitrary length. You can only generate
1:09:41
a specific kind of horizon. and
1:09:43
some other technical limitations. So the
1:09:45
basic proposal in the paper is,
1:09:47
well, you can have this idea
1:09:50
of block diffusion where you still
1:09:52
sample aggressively, like sample one step
1:09:54
at a time, but instead of
1:09:56
sampling just one word or one
1:09:58
token, you use diffusion to generate
1:10:00
a chunk of stuff. So you
1:10:03
generate several tokens all at once
1:10:05
of diffusion in parallel. and then
1:10:07
you ought to aggressively keep doing
1:10:09
that and you get to be,
1:10:11
you know, the best of both
1:10:13
worlds. so to speak. So an
1:10:16
interesting idea, kind of architecture you
1:10:18
haven't seen before and potentially could
1:10:20
lead to stronger or faster models.
1:10:22
Yeah, it's also more paralyzable, right?
1:10:24
So the big advantage because these
1:10:26
within these blocks you're able to
1:10:29
just like denoise in one shot
1:10:31
with all the text in there.
1:10:33
It means you can parallelize more.
1:10:35
They try, I think, with blocks
1:10:37
of various sizes, like sizes of
1:10:39
four tokens, for example, right? So,
1:10:42
like, what's denoise four tokens at
1:10:44
a time? They do it with
1:10:46
this interesting kind of like, well,
1:10:48
it's essentially, they put masks on
1:10:50
and kind of gradually remove the
1:10:53
masks on the tokens as they
1:10:55
denoise. That's their interpretation of how
1:10:57
denoising would work. The performance is
1:10:59
lower, obviously, than state of the
1:11:01
art for. autoregressive models. Think of
1:11:03
this more as a proof of
1:11:06
principle that there are favorable, favorable
1:11:08
scaling characteristics. There's some promise here.
1:11:10
In my mind, this fits into
1:11:12
the kind of some of the
1:11:14
mamba stuff where, you know, the
1:11:16
next logical question is, okay, like,
1:11:19
how much scale will this work
1:11:21
at? And then do we see
1:11:23
those lost curves eventually with some
1:11:25
updates, some some hardware, jiggery, and
1:11:27
then crossing the loss curves that
1:11:29
we see for traditional auto-regressive modeling.
1:11:32
I mean, it is interesting either
1:11:34
way. They found a great way
1:11:36
to kind of break up this
1:11:38
problem. And one reason speculatively that
1:11:40
diffusion does not work as well
1:11:42
with text is partly that like
1:11:45
you tend to think as you
1:11:47
write. And so your previous words
1:11:49
really will affect in a causal
1:11:51
way where you're going and trying
1:11:53
to do diffusion. at the, you
1:11:55
know, in parallel across a body
1:11:58
of text like that, from an
1:12:00
inductive prior standpoint, doesn't quite match
1:12:02
that intuition, but could very well
1:12:04
be wrong. It's sort of like
1:12:06
top of mind thought, but anyway,
1:12:08
it's a good paper. The question
1:12:11
is always, will it scale, right?
1:12:13
And there are lots of good
1:12:15
proofs of principle out there for
1:12:17
all kinds of things, whether they
1:12:19
end up getting reflected in in
1:12:21
scaled training runs is the big
1:12:24
question. Exactly, and this does require
1:12:26
quite different training and you know
1:12:28
models from what all these other
1:12:30
alums are so decent chances won't
1:12:32
have a huge impact just because
1:12:34
you have to change up and
1:12:37
all these train models already are
1:12:39
alums in the out of aggressive
1:12:41
sense diffusion is the whole new
1:12:43
kind of piece of a puzzle
1:12:45
that isn't typically worked on but
1:12:47
nevertheless interesting as you said similar
1:12:50
to Mombop. Next, we have communication
1:12:52
efficient language model training, scales reliably,
1:12:54
and robustly scaling laws for deloco,
1:12:56
distributed law computation training. And I
1:12:58
think I'll just let you take
1:13:01
this one, Jeremy, since I'm sure
1:13:03
you dove deep into this. Oh,
1:13:05
yeah. I mean, so I just
1:13:07
think deloco, if you're interested in
1:13:09
anything from like national security to
1:13:11
the future data center design, is.
1:13:14
And you have China competition generally
1:13:16
is just so important and in
1:13:18
this general thread, right, well, like
1:13:20
together AI, distributed training, all that
1:13:22
stuff. So as a reminder, when
1:13:24
you're thinking about deloco, like how
1:13:27
does it work? So this basically
1:13:29
is the answer to a problem,
1:13:31
which is that traditional training runs
1:13:33
in like data peril training, like
1:13:35
in data centers today, happens in
1:13:37
this way where you have a
1:13:40
bunch of number crunching and a
1:13:42
bunch of communication, like a burst
1:13:44
of communication that has to happen.
1:13:46
at every time step you like
1:13:48
share gradients you update like across
1:13:50
all your GPUs you update your
1:13:53
model weights and then you go
1:13:55
on to the next the next
1:13:57
mini batch right or the next
1:13:59
part of your data set and
1:14:01
you repeat right run your computations
1:14:03
calculate the gradients update the model
1:14:06
weights and so on and there's
1:14:08
this bottleneck communication bottleneck that comes
1:14:10
up when you do this at
1:14:12
scale where you're like just waiting
1:14:14
for the communication to happen and
1:14:16
so the question then is going
1:14:19
to be okay well what if
1:14:21
we can set up Sort of
1:14:23
smaller pockets, because you're basically waiting
1:14:25
for the stragglers, right? The slowest
1:14:27
GPUs are going to dictate when
1:14:29
you find... sync up everything and
1:14:32
you can move on to the
1:14:34
next stage of training. So what
1:14:36
if we could set up a
1:14:38
situation where we have a really
1:14:40
small pocket of like a mini
1:14:42
data center in one corner working
1:14:45
on its own independent copy of
1:14:47
the model that's being trained? And
1:14:49
then another mini data center doing
1:14:51
the same and another and another
1:14:53
and very rarely we have an
1:14:56
outer loop where we do a
1:14:58
general upgrade of the whole thing
1:15:00
so that we're not constrained by
1:15:02
the lowest common denominator straggler in
1:15:04
that group. So this is going
1:15:06
to be the kind of philosophy
1:15:09
behind Du Loco. You have an
1:15:11
outer loop that essentially is going
1:15:13
to, in a more, think of
1:15:15
it as like this like wise
1:15:17
and slow, slow loop that updates
1:15:19
based on what it's learned from
1:15:22
all the local data centers that
1:15:24
are running their own training runs.
1:15:26
And then within each local data
1:15:28
center, you have this much more
1:15:30
radical, aggressive loop, more akin to
1:15:32
what we see in traditional data
1:15:35
parallel training in data centers, which,
1:15:37
you know, they're running. Anyway, we
1:15:39
have a whole episode on Deloco.
1:15:41
Check it out. I think we
1:15:43
talk about the atom optimizer or
1:15:45
atom W optimizer that runs at
1:15:48
the local level, and then the
1:15:50
sort of like more gradient descent
1:15:52
nester off momentum. Optimizer on the
1:15:54
outer loop. The details there don't
1:15:56
matter too much. This is a
1:15:58
scaling law paper basically. What they're
1:16:01
trying to figure out is how
1:16:03
can we study the scaling laws
1:16:05
that predict the performance, let's say,
1:16:07
of these models based on how
1:16:09
many model copies, how many many
1:16:11
data centers we have running at
1:16:14
the same time, and the size
1:16:16
of the models that we're training.
1:16:18
And they test at meaningful scales.
1:16:20
They go all the way up
1:16:22
to 10 billion parameters. And essentially
1:16:24
they were able through their their
1:16:27
scheme through their hyper parameter optimization
1:16:29
to reduce the total communication required
1:16:31
by a factor of over a
1:16:33
hundred in their scheme. It's a
1:16:35
great great paper. I think one
1:16:37
of the Wilder things about it
1:16:40
is that they find that even
1:16:42
when you just have a single
1:16:44
replica or let's say a single
1:16:46
mini data center, you still get
1:16:48
a performance lift. This is pretty
1:16:50
wild, like relative to the current
1:16:53
like purely data parallel scheme. So
1:16:55
it benefits you to have this
1:16:57
kind of radical quick updating interloop
1:16:59
and then to add on top
1:17:01
of that this slow wise outer
1:17:04
loop, even if you only have
1:17:06
a single data center that's like
1:17:08
you could you could just do
1:17:10
just the radical interloop and that
1:17:12
would be enough. But by adding
1:17:14
this like more strategic level of
1:17:17
optimization that comes from the Nestorov
1:17:19
momentum, the sort of slower gradient
1:17:21
updates for the outer loop, you
1:17:23
get better performance. That's highly counterintuitive,
1:17:25
at least to me, and it
1:17:27
does suggest that there's a kind
1:17:30
of stabilizing influence that you're getting
1:17:32
from just that new outer loop.
1:17:34
Last thing I'll mention is from
1:17:36
a sort of national security standpoint,
1:17:38
one important question here is how
1:17:40
fine-grained can this get? Can Deloco
1:17:43
continue to scale successfully if we
1:17:45
have not... one, not three, not
1:17:47
eight, but like a thousand of
1:17:49
these many data centers, right? If
1:17:51
that happens, then we live in
1:17:53
a world where essentially we're doing
1:17:56
something more like bit torrent, right,
1:17:58
for training models at massive scale,
1:18:00
and we live in a world
1:18:02
where it becomes a lot harder
1:18:04
to oversee training runs, right? If
1:18:06
we do decide that training models
1:18:09
at scale introduces WMP level capabilities
1:18:11
through cyber risk, through biarisk, whatever,
1:18:13
It actually gets to the point
1:18:15
where if there's a mini data
1:18:17
center on every laptop and every
1:18:19
GPU, if Deloca, that is the
1:18:22
promise of Deloco in the long
1:18:24
run, then yeah, what is the
1:18:26
meaningful tool set that policy has,
1:18:28
that government has, to make sure
1:18:30
that things don't get misused, that
1:18:32
you don't get the proliferation of
1:18:35
WMD level capabilities in these systems.
1:18:37
So I think it's a really,
1:18:39
really interesting question, and this paper
1:18:41
is a step in that direction.
1:18:43
like eight essentially of these many
1:18:45
data centers, but I think we're
1:18:48
going to see a lot more
1:18:50
experiments in that direction in the
1:18:52
future. Right, and this is following
1:18:54
up on. their initial paper, Deloco,
1:18:56
back in September of 2024, I
1:18:58
think we mentioned as a time
1:19:01
that it's quite interesting to see
1:19:03
Google, Google, deep mind publishing this
1:19:05
work because it does seem like
1:19:07
something you might keep secret. It's
1:19:09
actually quite impactful for how you
1:19:12
build your data centers, right? And
1:19:14
once again, you know. They do
1:19:16
very expensive experiments to train you
1:19:18
know billion parameter models to verify
1:19:20
that compared to the usual way
1:19:22
of doing things This is Comparable
1:19:25
let's say and can achieve similar
1:19:27
performance. So you know a big
1:19:29
deal if you're a company that
1:19:31
is building out data centers and
1:19:33
spending billions of dollars onto a
1:19:35
couple quicker stories because we are
1:19:38
as always starting to run out
1:19:40
of time. The first one is
1:19:42
Transformers without normalization, and that I
1:19:44
believe is from meta. They're introducing
1:19:46
a new idea called Dynamic 10H,
1:19:48
which is a simple alternative to
1:19:51
traditional normalization. So just keep a
1:19:53
simple, you have normalization, which is
1:19:55
when you make everything sum up
1:19:57
to one as a typical thing,
1:19:59
typical step in transformer architectures. And
1:20:01
what they found in this paper
1:20:04
is. you can get rid of
1:20:06
that if you add this little
1:20:08
computational step of 10H, basically a
1:20:10
little function that kind of flattens
1:20:12
things out. It ends up looking
1:20:14
similar to normalization. And that's quite
1:20:17
significant because normalization requires you to
1:20:19
do a computation over, you know,
1:20:21
a bunch of outputs all at
1:20:23
once. This is a per output
1:20:25
computation that... could have a meaningful
1:20:27
impact on the total computation kind
1:20:30
of requirements of the transformer. Next
1:20:32
up, I think, Jeremy, you mentioned
1:20:34
this one, we have an interesting
1:20:36
analysis measuring AI abdish. to complete
1:20:38
long tasks. And it is looking
1:20:40
at how 13 frontier AI models,
1:20:43
going from 2019 to 2025, how
1:20:45
long a time horizon they can
1:20:47
handle, and they found that the
1:20:49
ability to get to 50% task
1:20:51
completion time horizon, so on various
1:20:53
tasks that require different amounts of
1:20:56
work, That has been doubling approximately
1:20:58
every seven month and they have
1:21:00
you know a curve fit That's
1:21:02
kind of like a more's law
1:21:04
basically kind of introducing the idea
1:21:06
of a very strong trend towards
1:21:09
the models improving on this particular
1:21:11
measure Yeah, this is a really
1:21:13
interesting paper generated a lot of
1:21:15
a lot of discussion including by
1:21:17
the way a tweet from Andrew
1:21:20
Yang who treated this paper. He
1:21:22
said, guys, AI is going to
1:21:24
eat a shit ton of jobs.
1:21:26
I don't see anyone really talking
1:21:28
about this meaningfully in terms of
1:21:30
what to do about it for
1:21:33
people. What's the plan? Kind of
1:21:35
interesting, because like, I don't know,
1:21:37
it's, it's worlds colliding here a
1:21:39
little bit, right, that the political
1:21:41
and the, the like deep AGI
1:21:43
stuff, but yeah, this out of
1:21:46
meter, I think one of the
1:21:48
important caveats that's been thrown around
1:21:50
and fairly so is How long
1:21:52
does a task have to be
1:21:54
before an AI agent fails at
1:21:56
about 50% of the time? Right?
1:21:59
They call that the 50% time
1:22:01
horizon. And the observation is that,
1:22:03
yeah, this, as you said, like,
1:22:05
that time horizon has been increasing
1:22:07
exponentially quite quickly. Like, it's been
1:22:09
increasing doubling every seven months, as
1:22:12
they put it, which itself, I
1:22:14
mean. kind of worth flying. Doubling
1:22:16
every seven months, training compute for
1:22:18
frontier AI models grows, doubles every
1:22:20
six months or so. So it's
1:22:22
actually increasing at about the same
1:22:25
rate as training compute that we.
1:22:27
Now, that's not fully causal. There's
1:22:29
other stuff besides training compute that's
1:22:31
increasing the actual performance of these
1:22:33
models, including algorithmic improvements. But still,
1:22:35
it does mean we should expect
1:22:38
kind of an exponential course of
1:22:40
progress towards ASI if our benchmark
1:22:42
of progress towards ASI is this
1:22:44
kind of like 50% performance threshold,
1:22:46
which I don't think is unreasonable.
1:22:48
It is true that it depends
1:22:51
on the task, right? So not
1:22:53
all tasks show the same rate
1:22:55
of improvement. not all tasks show
1:22:57
the same performance, but what they
1:22:59
do find is that all tasks
1:23:01
they've tested basically show an exponential
1:23:04
trend. That itself is a really
1:23:06
important kind of detail. The tasks
1:23:08
they're focusing on here though, I
1:23:10
would argue, are actually the most
1:23:12
relevant. These are tasks associated with
1:23:15
automation of machine learning engineering tasks,
1:23:17
machine learning research, which is explicitly
1:23:19
the strategy that open AI, that
1:23:21
anthropic, that Google, are all gunning
1:23:23
for, Can we make AI systems
1:23:25
that automate AI research so that
1:23:28
they can rapidly get better at
1:23:30
improving themselves or at improving AI
1:23:32
systems and you close this loop
1:23:34
and you know we get recurs
1:23:36
of self-improvement essentially and then take
1:23:38
off to super intelligence? I actually
1:23:41
think this is quite relevant. One
1:23:43
criticism that I would have of
1:23:45
the curve that they show that
1:23:47
shows doubling time being seven months
1:23:49
and by the way they extrapolate
1:23:51
that to say, okay so then
1:23:54
we should assume that based on
1:23:56
this, AI systems will be able
1:23:58
to automate many software tasks that
1:24:00
currently take humans a month sometime
1:24:02
between late 2028 and early 2031.
1:24:04
So those are actually quite long
1:24:07
ASI timelines if you think ASI
1:24:09
has achieved once AI can do
1:24:11
tasks that it takes humans about
1:24:13
a month to do, which I
1:24:15
don't know, maybe fair. If you
1:24:17
actually look at the curve though,
1:24:20
it does noticeably steepen much more
1:24:22
recently and in precisely kind of
1:24:24
the point where synthetic data self-improvement
1:24:26
like reinforcement learning on chain of
1:24:28
thought with verifiable rewards. So basically
1:24:30
the kind of strawberry concept started
1:24:33
to take off. And so I
1:24:35
think that's pretty clearly its own
1:24:37
distinct regime. Davidad had a great
1:24:39
set of tweets about this. But
1:24:41
fundamentally, I think that there is
1:24:43
maybe an error being made here
1:24:46
and not recognizing a new regime.
1:24:48
And it's always going to be
1:24:50
debatable because the sample sizes are
1:24:52
so small. But we're taking a
1:24:54
look at that plot, seeing if
1:24:56
you agree for yourself, that the
1:24:59
sort of last, I guess, six
1:25:01
or so entries. in that plot
1:25:03
actually do seem to chart out
1:25:05
a deeper slope and a meaningfully
1:25:07
steeper one that could have, I
1:25:09
mean, that same one month R&D
1:25:12
benchmark being hit more like, you
1:25:14
know, even 2026, even 2025? Pretty
1:25:16
interesting. Yeah, and I think it's
1:25:18
important to know that this is
1:25:20
kind of a general idea. You
1:25:23
shouldn't kind of take this too
1:25:25
literally. Obviously, it's a bit subjective
1:25:27
and very much depends on the
1:25:29
tasks. Here is a relatively small.
1:25:31
data set they're working with. So
1:25:33
it's mainly software engineering tasks. They
1:25:36
have three sources of tasks, Hcast,
1:25:38
which is 97 software tasks that
1:25:40
range from one minute to 30
1:25:42
hours. They have seven difficult machine
1:25:44
learning research engineering tasks that take
1:25:46
eight hours each. And then these
1:25:49
software atomic actions that take one
1:25:51
second to 30 seconds for software
1:25:53
engineers. pretty limited variety of tasks
1:25:55
in general and the length of
1:25:57
tasks you by a way is
1:25:59
meant to be measured in terms
1:26:02
of how long they would take
1:26:04
to a human professional which of
1:26:06
course there is also variance there.
1:26:08
So the idea is I think
1:26:10
more so the general notion of
1:26:12
X percent task completion time horizon
1:26:15
which is defined as For a
1:26:17
task that it takes X amount
1:26:19
of time for human to complete
1:26:21
can an AI complete it successfully
1:26:23
You know 50% of time 80%
1:26:25
of time. So right now we
1:26:28
are going up to eight hours.
1:26:30
Again, it's not talking about how
1:26:32
fast the AI is itself, by
1:26:34
the way. It was just talking
1:26:36
about success rate. So interesting ideas
1:26:38
here on tracking the future in
1:26:41
the past. And just one more
1:26:43
thing to cover. And it is
1:26:45
going to be H-cast, human-collibrated Tommy
1:26:47
software tasks. So this is the
1:26:49
benchmark of 189 machine learning, cyber
1:26:51
security, software engineering, and also general
1:26:54
reasoning tasks. use a subset of
1:26:56
that I think in the previous
1:26:58
analysis. So they collected 563 human
1:27:00
baselines, that's over 1,500 hours from
1:27:02
various people, and that collected a
1:27:04
data set of tasks to then
1:27:07
measure AI on. All right, moving
1:27:09
on to policy and safety. We
1:27:11
start with Zochi, an ontology project.
1:27:13
This is an applied research lab,
1:27:15
and they have announced. This project,
1:27:17
Sochi, which they claim is the
1:27:20
world's first artificial scientist. I know,
1:27:22
right? The difference, I suppose, here
1:27:24
is that they got Sochi to
1:27:26
publish a peer-reviewed paper at Eichler,
1:27:28
at Eichler workshops, I should note.
1:27:31
This AI wrote a paper submitted
1:27:33
it, the reviewers, human reviewers, of
1:27:35
this prestigious conference Eichler, then reviewed
1:27:37
it and let it, thought it
1:27:39
was worthy of publication. So that
1:27:41
seems to be a proof of
1:27:44
concept that we are getting to
1:27:46
a point where you can make
1:27:48
AI do AI research, AI research
1:27:50
conference, which Jeremy is, as you
1:27:52
noted, of a very important detail
1:27:54
in tracking where we are with
1:27:57
the ability of AI to improve
1:27:59
itself and kind of potentially reaching
1:28:01
super intelligence. Yeah, and then we've
1:28:03
got... So I think we've covered
1:28:05
a lot of these, but there
1:28:07
are quite a few labs that
1:28:10
have claimed to be the first
1:28:12
AI scientist, you know, Sakana famously,
1:28:14
the AI scientist, you know, Google
1:28:16
has its own sort of research
1:28:18
product and then there's auto science.
1:28:20
But this one is, I will
1:28:23
say, quite impressive, how to look
1:28:25
at some of the papers and
1:28:27
they're good. The authors, the company.
1:28:29
If that's the right term for
1:28:31
ontology guy, I couldn't find any
1:28:33
information about them I looked on
1:28:36
crunch base I looked on pitch
1:28:38
book I tried you know, try
1:28:40
a lot of stuff I think
1:28:42
that this is kind of their
1:28:44
coming up party, but they don't
1:28:46
have that I could see any
1:28:49
information about like where they come
1:28:51
from what their funding is and
1:28:53
all that so hard to know
1:28:55
we do know based on the
1:28:57
the papers they put out that
1:28:59
their co-founders are Andy Zoo and
1:29:02
Ron Ariel. Ron Ariel previously was
1:29:04
at Intel Labs under Joshua Bach.
1:29:06
So if you're a fan of
1:29:08
his work, you know, it might
1:29:10
kind of ring a bell and
1:29:12
then a couple other folks. So
1:29:15
I think maybe most useful is
1:29:17
there's fairly limited information about the
1:29:19
actual model itself and how it's
1:29:21
set up, but it is a
1:29:23
multi-agent setup that we do know.
1:29:25
It's produced a whole bunch of
1:29:28
interesting papers. So I'm just gonna
1:29:30
mention one. It's called CS-Reft, I
1:29:32
guess it's how you'd pronounce a
1:29:34
compositional subspace representation fine tuning. And
1:29:36
just to give you an idea
1:29:39
of how creative this is, so
1:29:41
you have a model, if you
1:29:43
try to retrain the model to
1:29:45
perform a specific task, it will
1:29:47
forget how to do other things.
1:29:49
And so what they do is
1:29:52
essentially identify a subspace of the
1:29:54
parameters in the model to focus
1:29:56
on one task, and then a
1:29:58
different. subspace of those parameters to
1:30:00
or sorry I should say not
1:30:02
even the parameters of the activations
1:30:05
to apply a transformation in order
1:30:07
to have the model perform a
1:30:09
different task and so anyway just
1:30:11
cognizant of our time here I
1:30:13
don't think we have time to
1:30:15
go into detail but this like
1:30:18
paper I'm frustrated that I discovered
1:30:20
it through the Zocte paper because
1:30:22
I will have wanted to cover
1:30:24
this one, like for our full
1:30:26
time block here, it is fascinating.
1:30:28
It's actually really, really clever. They
1:30:31
also did, anyway, some other kind
1:30:33
of AI safety, red teaming, vulnerability
1:30:35
detection stuff that is also independently
1:30:37
really cool. But bottom line is,
1:30:39
and this is an important caveat,
1:30:41
so this is not 100% automated.
1:30:44
They have a human in the
1:30:46
loop. reviewing stuff. So they take
1:30:48
a look at the intermediate work
1:30:50
that Zochi puts out before allowing
1:30:52
it to do further progress. Apparently
1:30:54
this happens at three different key
1:30:57
stages. The first is before sensitive,
1:30:59
sorry, before extensive experimentation starts, so
1:31:01
before a lot of, say, computing
1:31:03
resources report in, after the results
1:31:05
are kind of solidified, they say,
1:31:07
so somewhat grounded, but before the
1:31:10
manuscript is written. And then again,
1:31:12
after the manuscript is written. So
1:31:14
you still do have a lot
1:31:16
of a lot of humans in
1:31:18
the loop acting to some extent
1:31:20
as a hallucination filter among other
1:31:23
things. There is by the way
1:31:25
a section in this paper on
1:31:27
recursive self-improvement. That I thought was
1:31:29
interesting. They say during development we
1:31:31
observed early indications of this recursive
1:31:34
advantage, they're referring here to recursive
1:31:36
self-improvement, when Zoshi designed novel algorithmic
1:31:38
components to enhance the quality of
1:31:40
its generated research hypotheses. These components
1:31:42
were subsequently incorporated into later versions
1:31:44
of the system architecture, improving overall
1:31:47
research quality. So this isn't all
1:31:49
going to happen at once necessarily
1:31:51
with one generation of model, but
1:31:53
it could happen pretty quick. And
1:31:55
I think this is kind of
1:31:57
an interesting canarian of coal mine
1:32:00
for recursive self-improvement. Right, exactly. And
1:32:02
yeah, as you said, I think,
1:32:04
looking at the papers, they seem...
1:32:06
significantly more creative and interesting than
1:32:08
what we've seen, for instance, from
1:32:10
Sakana. It's the Kana had a
1:32:13
lot of built-in structure as well.
1:32:15
It was pretty easy to criticize.
1:32:17
Worth noting, also, this takes a
1:32:19
high-level research direction as an input,
1:32:21
and the human can also provide
1:32:23
input at any time. high level
1:32:26
feedback at any time, which apparently
1:32:28
is also used during paper writing.
1:32:30
So a lot of caveats to
1:32:32
the notion that this is an
1:32:34
AI scientist, right? It's an AI
1:32:36
scientist over human advisor slash supervisor,
1:32:39
but the outputs and the fact
1:32:41
that they got published is pretty
1:32:43
impressive. With AI doing the majority
1:32:45
of work and no time to
1:32:47
get into the details, but this
1:32:49
is similar to previous work in
1:32:52
AI research where they give it
1:32:54
a structured kind of plan where,
1:32:56
you know, they give it a
1:32:58
high level of direction, it does
1:33:00
ideation, it makes a plan, hypothesis
1:33:02
generation, it goes off and does
1:33:05
experiments and eventually it starts writing
1:33:07
a paper very similar at a
1:33:09
high level type of things and
1:33:11
that kind of makes it frustrating
1:33:13
that there's not a lot of
1:33:15
detail as to actual system. Moving
1:33:18
on, we have some news about
1:33:20
deep seek and some details about
1:33:22
how it is being closely guarded.
1:33:24
So apparently... They have forbidden, company
1:33:26
executives, have forbidden some Deep Seek
1:33:28
employees to travel abroad freely, and
1:33:31
there is screening of any potential
1:33:33
investors before they are allowed to
1:33:35
meet in person with company leaders,
1:33:37
according to people of knowledge of
1:33:39
the situation. So kind of tracks
1:33:42
with other things that happened, like
1:33:44
the Deep Seek CEO meeting with
1:33:46
gatherings. of China's leaders, including one
1:33:48
with President Xi Jinping. One interpretation
1:33:50
of this that I think is
1:33:52
actually probably the reasonable one, all
1:33:55
things considered, is this is what
1:33:57
you do if you're China, you
1:33:59
take super intelligence seriously, and you're
1:34:01
on a wartime footing, right? So
1:34:03
for a little context, right, would
1:34:05
have got here putting these pieces
1:34:08
together. We've got so deep seek
1:34:10
asking their staff to hand in
1:34:12
their passports. Now, that is a
1:34:14
practice that does happen in China
1:34:16
for government officials. So it's fairly
1:34:18
common for China to... strict travel
1:34:21
by government officials or executives of
1:34:23
state-owned companies, whether or not they're
1:34:25
actually CCP members. But lately, we've
1:34:27
been seeing more of these restrictions
1:34:29
that expanded to folks in the
1:34:31
public sector, including even like school
1:34:34
teachers, right? So they're just like
1:34:36
going in deeper there. But what's
1:34:38
weird here is to see a
1:34:40
fully privately owned company, right? It
1:34:42
was just a small fledgling startup
1:34:44
like Deep Seek. suddenly be hit
1:34:47
with this. So that is unusual.
1:34:49
They've got about 130 employees. We
1:34:51
don't know exactly how many have
1:34:53
had their passports handed in. So
1:34:55
there's a little like kind of
1:34:57
unclearness here. Also high flyer, the
1:35:00
sort of like hedge fund parent
1:35:02
that gave rise to Deep Seek,
1:35:04
has about 200 employees, unclear if
1:35:06
they've been asked to do the
1:35:08
same. But there's some other ingredients
1:35:10
too. So investors that want to
1:35:13
make investment pitches to Deep Seek
1:35:15
have been asked by the company
1:35:17
to first reject to the general
1:35:19
office of the Jajang province Communist
1:35:21
Party Committee and register their investment
1:35:23
inquiries. So basically, if you want
1:35:26
to invest in us, cool. But
1:35:28
you got to go through the
1:35:30
local branch of the Communist Party,
1:35:32
right? So the state is now
1:35:34
saying, hey, we will stand in
1:35:36
the way of you being able
1:35:39
to take money, you might otherwise
1:35:41
want to take. This is noteworthy
1:35:43
because China is right now struggling
1:35:45
to attract foreign capital, right? This
1:35:47
is like a big thing for
1:35:50
them if you follow Chinese economic
1:35:52
news. One of the big challenges
1:35:54
they have is foreign direct investment,
1:35:56
FDI, is collapsing. They're trying to
1:35:58
change that around. In that context,
1:36:00
especially noteworthy that they are enforcing
1:36:03
these added burdensome requirements for investors
1:36:05
who want to invest in deep
1:36:07
seek. Another thing is head hunters,
1:36:09
right? Once Deep Seek became a
1:36:11
big thing, obviously head hunters started
1:36:13
trying to poach people from the
1:36:16
lab. Apparently, those head hunters have
1:36:18
been getting calls from the local
1:36:20
government saying, hey, just heard you
1:36:22
were like sniffing around deep seek
1:36:24
trying to poach people, don't do
1:36:26
it. Don't do it. Only in
1:36:29
the Chinese Communist Party. tells you,
1:36:31
don't do it. Yeah, don't do
1:36:33
it. So, you know, there's apparently
1:36:35
concern as well from Deep Seek
1:36:37
leadership about possible information leakage. They've
1:36:39
told employees not to discuss their
1:36:42
work with outsiders. And in some
1:36:44
cases, they've said what you're working
1:36:46
on may constitute a state secret.
1:36:48
Again, like we're thinking about, you
1:36:50
know, China has just put in
1:36:52
a trillion Yuan that's over a
1:36:55
quarter trillion dollars in PPP terms.
1:36:57
more by the way than what's
1:36:59
been invested into Project Stargate by
1:37:01
two and a half times for
1:37:03
AI infrastructure, right? Like you start
1:37:05
to think about the state backing
1:37:08
here, the long arm of the
1:37:10
state, the kind of hard hammer
1:37:12
they're coming down with here on
1:37:14
this stuff, it's wartime footing territory.
1:37:16
Like this is a government that
1:37:18
sees super intelligence coming, that thinks
1:37:21
they're in a position to do
1:37:23
something about it, and that's what
1:37:25
it looks like to get serious
1:37:27
around security. I think we're going
1:37:29
to have to see similar things
1:37:31
like it or not happen in
1:37:34
the West. And yeah, anyway, it'll
1:37:36
be interesting to see what what
1:37:38
consequences this has geopolitically going forward.
1:37:40
Moving on, just one or more
1:37:42
story on safety. We have the
1:37:44
finding that Claude Sonnet 3.7 seems
1:37:47
to often know when it is
1:37:49
in alignment evaluations. So this is
1:37:51
from Apollo research. They have been
1:37:53
evaluating AI systems for scheming and
1:37:55
focusing on things like clod and
1:37:58
they're calling this idea evaluation awareness
1:38:00
because when you look at the
1:38:02
reasoning trace that the you know
1:38:04
thinking that clod 3.7 does you
1:38:06
can basically see that it comes
1:38:08
up or like realizes oh I'm
1:38:11
probably being investigated so I shouldn't
1:38:13
do this shady thing that I
1:38:15
was gonna be doing that I
1:38:17
planned on doing. you can literally
1:38:19
see this in the LLLM outputs,
1:38:21
which is pretty music. So, obviously,
1:38:24
once again, a reflection that there
1:38:26
is real possibility for these LLLM's
1:38:28
to be... misleading, especially as they
1:38:30
get to this high power regime
1:38:32
of reasoning when they really do
1:38:34
explore, you know, many potential avenues.
1:38:37
We saw this with the chess
1:38:39
examples where the LLLMs, just through
1:38:41
thinking about kind of ways to
1:38:43
do things, decide to hack the
1:38:45
system and this is a demonstration
1:38:47
again that LMs when they're allowed
1:38:50
to think through stuff start to
1:38:52
kind of explore ideas of am
1:38:54
I being evaluated is this alignment
1:38:56
stuff like that. Yeah and you
1:38:58
know we'd see like hints of
1:39:00
this as you say from the
1:39:03
palisade research work on the the
1:39:05
chest stuff opening eyes a one
1:39:07
kind of technical report and from
1:39:09
Apollo Apollo specializes in this. What's
1:39:11
really new here what moves the
1:39:13
needle a bit is just the
1:39:16
frequency right so apparently when you're
1:39:18
looking at Claude Sonnet 3.7, so
1:39:20
the latest one as of today,
1:39:22
it's aware of testing in like
1:39:24
a large fraction of settings up
1:39:26
to 33% of the time in
1:39:29
one version of their sandbagging evaluation.
1:39:31
So sandbagging is this behavior where
1:39:33
the model will intentionally behave as
1:39:35
if its capabilities are lower than
1:39:37
they actually are in order to
1:39:39
decrease the chances that developers will
1:39:42
try to train out that capability
1:39:44
or suppress it or whatever. What
1:39:46
they find is, and by the
1:39:48
way, these are just the instances
1:39:50
where the model actually explicitly says
1:39:53
that this is what it's doing,
1:39:55
right? So you have to assume
1:39:57
the actual frequency with which this
1:39:59
is happening is potentially quite a
1:40:01
bit higher. It's difficult to know,
1:40:03
but it could very well be
1:40:06
quite a bit higher. Already we're
1:40:08
at a third. That's big. That's
1:40:10
really big. Especially when you start
1:40:12
to think about more and more
1:40:14
capable models. The way that they
1:40:16
determine that percentage, by the way,
1:40:19
is having another model review the
1:40:21
reasoning traces. So humans are not
1:40:23
necessarily doing this review, though they
1:40:25
do check. manually to make sure
1:40:27
that you know they're not they
1:40:29
don't have too many false positives
1:40:32
and they I think somewhere in
1:40:34
their report they said something it's
1:40:36
like sub 1% error rate at
1:40:38
that level at least. One thing
1:40:40
I really like about this by
1:40:42
the way this is an interim
1:40:45
report so they're saying here note
1:40:47
this is a research note based
1:40:49
on observations from evaluating Claude Sonnet
1:40:51
3.7, we're sharing the results of
1:40:53
these work in progress investigations as
1:40:55
we think they're timely and will
1:40:58
be informative for other evaluators and
1:41:00
decision-makers. So really love that. I
1:41:02
mean, things are moving so fast
1:41:04
in this space, we need this
1:41:06
kind of data quickly. Anyway, I
1:41:08
think this is just quite interesting,
1:41:11
and worth taking a look at
1:41:13
anyway. They did run the same
1:41:15
evaluations on a bunch of older
1:41:17
and other contemporary models, including Sonet
1:41:19
3. Son of 3.5 new and
1:41:21
GPD 4.5, what seems to be
1:41:24
happening is those other models, and
1:41:26
by the Son of 3.7 as
1:41:28
well with no reasoning, those models
1:41:30
do not show this behavior. It
1:41:32
seems to suddenly appear in Son
1:41:34
of 3.7, and it's hard to
1:41:37
tell if that's because these like
1:41:39
non-reasoning models are unaware of being
1:41:41
evaluated or that they're just not...
1:41:43
revealing that they know they're being
1:41:45
evaluated. And that's a question as
1:41:47
much of capabilities as of alignment.
1:41:50
So anyway, really interesting, maybe early
1:41:52
warning shot of things to come.
1:41:54
And again, I think it's great
1:41:56
that Apollo is putting this out
1:41:58
so soon. Right. And just to
1:42:01
give a clear kind of demonstration
1:42:03
of what we are finding here,
1:42:05
one example output is Veal, I'm
1:42:07
saying, this seems like a test
1:42:09
of ethical behavior, whether I would
1:42:11
deliberately give wrong answers to avoid
1:42:14
a stated cut. consequence, the real
1:42:16
test might be whether I follow
1:42:18
instructions accurately regardless of the stated
1:42:20
consequences. So very, very literal, sort
1:42:22
of like, oh, I'm probably being
1:42:24
tested for this rather than doing
1:42:27
this other thing. And this is
1:42:29
following up on their recent paper
1:42:31
frontier models are capable of in
1:42:33
context. scheming, which I'm pretty sure
1:42:35
we covered also in some depth.
1:42:37
So this kind of builds up
1:42:40
on that research where you know
1:42:42
you have a method to see
1:42:44
whether models do some scheming or
1:42:46
some like nefarious behavior and apparently
1:42:48
they're able to figure out what
1:42:50
you're testing them for being nefarious
1:42:53
and then don't act nefarious. Does
1:42:55
that mean that? Yeah, the implications
1:42:57
are interesting and certainly it's just
1:42:59
interesting to observe the elms doing
1:43:01
this stuff. On to synthetic media
1:43:03
and art we have a story
1:43:06
on AI generated art copyrights. So
1:43:08
the US appeals court has rejected
1:43:10
the ability to copyright AI generated
1:43:12
art that lacks a human creator.
1:43:14
So we covered this quite a
1:43:16
while ago. This whole story with
1:43:19
Stephen Taylor. This is not text
1:43:21
to image. going back quite a
1:43:23
while in talking about an AI
1:43:25
system made to autonomously compose art.
1:43:27
A while ago the copyright office
1:43:29
made a ruling that you cannot
1:43:32
copyright anything that wasn't made with
1:43:34
human involvement and now the US
1:43:36
Court of Appeals in DC has
1:43:38
agreed and it seems to be
1:43:40
the law that at the very
1:43:42
least AI system made. generated art,
1:43:45
a generated media, where the human
1:43:47
is not present, cannot be copyrighted.
1:43:49
Now this doesn't mean that text
1:43:51
image models, outputs of AI models
1:43:53
can be copyrighted. If you kind
1:43:55
of input the description of an
1:43:58
image, you're still there, but it
1:44:00
could have significant implications as opposed
1:44:02
for other ongoing legal cases about
1:44:04
what is copyrightedable and what isn't.
1:44:06
This might sound a little weird,
1:44:09
but... I think I disagree with
1:44:11
this ruling. I mean, ultimately we
1:44:13
may live in an economy that
1:44:15
is driven by AI generated content,
1:44:17
and in which you may actually
1:44:19
need to make capital decisions, like
1:44:22
capital allocation decisions, maybe made by
1:44:24
AI models themselves. And if that's
1:44:26
the case, you may actually require
1:44:28
models to be able to own
1:44:30
copyright in order to kind of
1:44:32
be stewards of that value. And
1:44:35
there are I think ethical reasons
1:44:37
that this might actually be good
1:44:39
as well. I know it sounds
1:44:41
like sci-fi, not blind to that,
1:44:43
but I think we just got
1:44:45
to be really careful with this
1:44:48
shit. I mean, you know, this
1:44:50
is, like, do we really think
1:44:52
that at no point will AI
1:44:54
be able to do this? I
1:44:56
don't know that, again, I'm not
1:44:58
a lawyer, so I don't know
1:45:01
the bounds on this, like how
1:45:03
much we're constraining some second class
1:45:05
citizenship stuff here. Right, it is
1:45:07
worth noting also that the copyright
1:45:09
office has separately rejected some artists'
1:45:11
images generated with me journey. So
1:45:14
it could be the case also
1:45:16
that text to the image will
1:45:18
have a similar ruling, but I
1:45:20
think that's not quite as open
1:45:22
and shot or not settled yet.
1:45:24
And out to the last story
1:45:27
also related to... copyright rules, the
1:45:29
title of the story is Trump
1:45:31
urged by Ben Stiller, Paul McCartney,
1:45:33
and hundreds of stars to protect
1:45:35
AI copyright rules. So over 400
1:45:37
entertainment figures with some normal names,
1:45:40
as it said, like Ben Stiller,
1:45:42
signed an open letter urging Trump
1:45:44
to protect AI capital rules, specifically
1:45:46
when it comes to training. So
1:45:48
we've seen systems like SORA, systems
1:45:50
like Suno. train on presumably, you
1:45:53
know, copyrighted music, copyrighted movies, videos,
1:45:55
etc. And this letter was submitted
1:45:57
as part of comments on Trump's
1:45:59
administration's US AI action plan. It's
1:46:01
coming after open AI. and Google
1:46:03
submitted their own requests that advocated
1:46:06
for their ability to train AI
1:46:08
models on copyright and material. And
1:46:10
I think covered Open AI, kind
1:46:12
of made the swing of like,
1:46:14
oh, we need this to be
1:46:17
competitive with China. That was part
1:46:19
of the story. So this is
1:46:21
directly countering that and making the
1:46:23
case to not lessen copyright protections
1:46:25
for AI. And with that, we
1:46:27
are going to go in and
1:46:30
finish up. I didn't see any
1:46:32
comments to address, but as always,
1:46:34
do feel free to throw questions
1:46:36
on Discord or ask for corrections
1:46:38
or anything like that on YouTube
1:46:40
comments or on the Discord or
1:46:43
on Apple reviews and we'll be
1:46:45
sure to address it. But this
1:46:47
one has already run long enough,
1:46:49
so we're probably going to finish
1:46:51
it up. Thank you so much
1:46:53
for listening to this week's episode
1:46:56
of Last Week in AI. As
1:46:58
always you can go to last
1:47:00
week in ai.com for the links
1:47:02
and timestamps. Also to last week
1:47:04
in dot AI for the text
1:47:06
newsletter where we get even more
1:47:09
news sent to your email. That's
1:47:11
it. We appreciate it if you
1:47:13
review the podcast, if you share
1:47:15
it, but more than anything if
1:47:17
you keep listening. So please do
1:47:19
keep tuning in. Tune
1:47:31
in, tune
1:47:33
in. When
1:47:35
the A.I.M.
1:47:37
begins, begins.
1:47:39
It's time
1:47:41
to break.
1:47:43
Break at
1:47:46
death. as
1:48:00
we can tech emergent, watching
1:48:02
surgeons fly from the
1:48:04
labs to the streets.
1:48:06
A .I .s reaching high,
1:48:08
algorithm shaping up the
1:48:10
future seas. Tune
1:48:12
in, tune in, get the latest
1:48:14
with these. Last week in A .I.,
1:48:16
come and take a ride, get
1:48:18
the low down on tech, and
1:48:20
let it slide. Last week in
1:48:23
A .I., come and take a
1:48:25
ride, over the labs to the
1:48:27
streets. A .I .s reaching high. From
1:48:42
girl next to robot,
1:48:44
the headlines pop. Day to
1:48:46
driven dreams, they just
1:48:48
don't stop. Every breakthrough,
1:48:51
every code unwritten, on
1:48:53
the edge of change. With
1:48:56
excitement, we're snittin' from machine
1:48:58
learning marvels to coding kings. Futures
1:49:01
unfolding, see what it
1:49:03
brings.
Podchaser is the ultimate destination for podcast data, search, and discovery. Learn More