Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements may have changed.
Use Ctrl + F to search
0:01
Welcome to Practical AI, the
0:03
podcast that makes artificial intelligence
0:05
practical, productive, and accessible to
0:08
all. If you like this
0:10
show, you will love the
0:12
change log. It's news on
0:15
Mondays, deep technical interviews on
0:17
Wednesdays, and on Fridays, an
0:19
awesome talk show for your
0:21
weekend enjoyment. Find us by
0:24
searching for The Change Log
0:26
wherever you get your podcasts.
0:28
Thanks to our partners at
0:31
fly.io. Launch your AI apps
0:33
in five minutes or less.
0:35
Learn how at fly.io. Welcome
0:44
to another episode of the practical
0:46
AI podcast. This is Chris Benson.
0:49
I am your co-host. Normally, Daniel
0:51
White Neck is joining me as
0:53
the other co-host, but he's not
0:56
able to today. I am a
0:58
principal AI research engineer at Lockheed
1:00
Martin. Daniel is the CEO of
1:03
Prediction Guard. And with us today,
1:05
we have Kate Sol, who is
1:07
director of technical product management at
1:10
Granite for IBM. Welcome to the
1:12
show, Kate. Hey Chris, thanks
1:14
for having me. So I wanted
1:17
to, I know we're going to
1:19
dive shortly into what granite is
1:21
and some of our listeners are
1:23
probably already familiar with it, some
1:25
may not be, but before we
1:27
dive into that, wondering, we're talking
1:29
about AI models, that's what granite
1:31
is, and the world of LLLM's
1:33
generative AI. wondering if you could
1:36
start off talking a little bit
1:38
about your own background, how you
1:40
arrived at this, and we'll get
1:42
into a little bit about what
1:44
IBM is doing and why it's interested
1:46
in how it fits into the landscape
1:48
here for those who are not already
1:51
familiar with it. Perfect. Yeah, thanks
1:53
Chris. So I lead the technical
1:55
product management for granite, which is
1:57
IBM's large family of large language
1:59
models. that is produced by IBM
2:01
Research. And so I actually joined IBM
2:03
and IBM research a number of years
2:06
ago before large language models were really
2:08
became popular. You know, they had a
2:10
bit of a Netscape moment right back
2:13
in November of 2022. So I've been
2:15
working at the lab for a little
2:17
while. I'm... a little bit of a
2:20
odd duck, so to speak, in that
2:22
I don't have a research background, I
2:24
don't have a PhD, I come from
2:27
a business background, I worked in consulting
2:29
for a number of years, went to
2:31
business school. and joined IBM Research and
2:34
the AI lab here in order to
2:36
get more involved in technology. You know,
2:38
I've always kind of had one foot
2:41
in the tech space. I was a
2:43
data scientist for most of my tenure
2:45
as a consultant and always thought that
2:48
there was a lot of exciting things
2:50
going on in AI, and so I
2:52
joined the lab. and basically got to
2:54
work with a lot of generative AI
2:57
researchers before large language models really kind
2:59
of became big. And you know about
3:01
two and a half years ago a
3:04
lot of the technology we're working on
3:06
all of a sudden we started to
3:08
find and see that there were tremendous
3:11
business applications. You know, Open AI really
3:13
demonstrated what could happen if you took
3:15
this type of technology and Porsche fed
3:18
it enough compute to make it powerful.
3:20
It could do some really cool things.
3:22
So from there we worked as a
3:25
team really to spin up a program
3:27
and offering at IBM for our own
3:29
family of large language models that we
3:32
could offer our customers and the broader
3:34
open source ecosystem. Do you I'm curious
3:36
with one of the things that I've
3:39
you know we've noticed over time is
3:41
different organizations kind of are positioning the
3:43
These large language models within their product
3:46
offerings and in very unique ways and
3:48
and we've you know We could go
3:50
through some of your competitors and say
3:52
they do this way the how do
3:55
you guys see that in terms of?
3:57
You know how large language models fit
3:59
into your product offering is there is
4:02
there a vision that IBM has for
4:04
that? Yeah, I think the fundamental. premise
4:06
of large language models are that they
4:09
are a building block that you get
4:11
to build on and reuse in many
4:13
different ways, right, where one model can
4:16
drive a number of different use cases.
4:18
So, you know, from my perspective, that
4:20
value proposition resonates really clearly. We see
4:23
a lot of our customers, our own
4:25
internal offerings, where, you know, there's a
4:27
lot of effort on data curation and
4:30
collection and kind of creating and training
4:32
bespoke models for a specific task. And
4:34
now with large language models we get
4:37
to kind of use one model and
4:39
with very little label data all of
4:41
a sudden, you know, the world's here
4:43
oyster, there's a lot you can do.
4:46
And so that's a bit of the
4:48
reason why we have centralized the development
4:50
of our language language models within IBM
4:53
Research, not a specific product. It's one
4:55
offering that then feeds into many of
4:57
our different products in downstream applications. And
5:00
it allows us to kind of create
5:02
this building block that we can then
5:04
also offer customers to be able to
5:07
build on top of as well. And
5:09
open source ecosystem developers, you know, we
5:11
think there's a lot of different applications
5:14
for them. that one offering. And so,
5:16
you know, that's a little bit kind
5:18
of from the organizational side why we're
5:21
why it's kind of exciting, right, that
5:23
we get to do this all within
5:25
research. We don't have a P&L, so
5:28
to speak. We're doing this to create.
5:30
ultimately a tool that can support any
5:32
number of different use cases and downstream
5:35
applications. Very cool. And you mentioned open
5:37
source. I want to ask you because
5:39
that's always a big topic among organizations
5:41
is if I remember correctly, granted is
5:44
an under an Apache 2 license. Is
5:46
that correct? That's correct. I'm just curious
5:48
because we've seen strong arguments on both
5:51
sides. Why? Why is granite an open
5:53
source license like that? What was the
5:55
decision from IBM to go that direction?
5:58
Yeah, well there was kind of two
6:00
levels of decision making that we had
6:02
to. make when we talked about how
6:05
to license granite. One was open or
6:07
closed. So are we going to release
6:09
this model, release the weights out into
6:12
the world so that anyone can use
6:14
it regardless if they spend a dime
6:16
with IBM? And ultimately, IBM, you know,
6:19
believes strongly in the power of open
6:21
source ecosystems, a huge part of our
6:23
business is built around Red Hat and
6:26
being able to provide open source software
6:28
to our customers with enterprise guarantees guarantees.
6:30
And we felt that Open AI was
6:33
a far more responsible environment to develop
6:35
and to incubate this technology as a
6:37
whole and when you say open-source AI
6:39
you mean open-source AI just making sure
6:42
very important clarification very important clarification so
6:44
that was why we released our models
6:46
out into the open and then the
6:49
question was under what license because there
6:51
are a lot of models there are
6:53
a lot of licenses and a bit
6:56
of like a moment that everyone seeing
6:58
is you have a gamma license for
7:00
a gamma model. You've got a llama
7:03
license for a llama model. Everyone's coming
7:05
up with their own license. And, you
7:07
know, it kind of, in some ways,
7:10
it makes sense. Models are a bit
7:12
of a. weird artifact. They're not code.
7:14
You can't execute them on their own.
7:17
They're not software. They're not data per
7:19
se, but they are kind of like
7:21
a big bag of numbers at the
7:24
end of the day. So like, you
7:26
know, some of the traditional licenses, I
7:28
think some people didn't see a clear
7:31
fit, and so they came up with
7:33
their own. They're also all these different
7:35
kind of... potential risks that you might
7:37
want to solve for with a license
7:40
with a large language model that are
7:42
different than risk that you look at
7:44
with software or data. But at the
7:47
end of the day, IBM really wanted
7:49
just to keep this simple, like a
7:51
no-nonsense license that we felt would be
7:54
able to promote the broadest use from
7:56
the ecosystem without any restrictions. So we
7:58
went with Apache 2 because that's probably
8:01
the most widely used and just easy
8:03
to understand license that's out there. And
8:05
you know I think it really speaks
8:08
also to where we see models being
8:10
important building blocks that are further customized.
8:12
So we really believe the true value
8:15
in generative AIs being able to take
8:17
some of these. smaller open-source models and
8:19
build on top of it and even
8:22
start to customize it. And if you're
8:24
doing all that work and, you know,
8:26
building on top of something, you want
8:28
to make sure there are no restrictions
8:31
on all that IP you've just created.
8:33
And so that's ultimately why we went
8:35
with Apache 2.0. Understood. And one last
8:38
follow-up on licensing and then I'll move
8:40
on. It's more, it's partially just a
8:42
comment. IBM has a really strong legacy
8:45
as someone in the AI world and
8:47
decades of software development along with that.
8:49
I know both Red Hat with the
8:52
acquisition some years back, being strong on
8:54
open source and IBM both before and
8:56
after has, was it, was it, I'm
8:59
just curious, did that make it any
9:01
easier, do you think to go with
9:03
open source? Like, hey, we've done this
9:06
so much that we're gonna do that
9:08
with this thing too, even though it's
9:10
a little bit newer, you know, in
9:13
context. Culturally, did it seem easier to
9:15
get there than some companies that possibly
9:17
really struggle with that? They don't have
9:20
such a legacy in open source in
9:22
open source? I think it did make
9:24
it easier. I think there are always
9:26
going to be like any company going
9:29
down this journey has to take a
9:31
look at weight. We're spending how much
9:33
on what and you're going to give
9:36
it away for free and come up
9:38
with their own kind of equations on
9:40
how this starts to make sense. And
9:43
I think we've just experienced as a
9:45
company that the software and offerings we
9:47
create are so much stronger when we're
9:50
creating them as part of an open-source
9:52
ecosystem than something that we just keep
9:54
close to the best. So, you know,
9:57
it was a much easier business case,
9:59
so to speak, to make and to
10:01
get the sign-off that we needed. Ultimately,
10:04
our leadership was very supportive in order
10:06
to encourage this kind of open ecosystem.
10:08
Fantastic. Turning a little bit, as IBM
10:11
was diving into this into this realm
10:13
and starting, you know, and obviously like,
10:15
you have a history with grand that,
10:18
you know, you guys are on 3.2
10:20
at this point, that means that you've
10:22
been working on this for a period
10:24
of time, but as you're diving into
10:27
this very competitive ecosystem of building out
10:29
these open source models that are big,
10:31
they are expensive to make, and you're
10:34
looking for an outsized impact in the
10:36
world, how do you decide? how to
10:38
proceed with what kind of architecture you
10:41
want, you know, how did you guys
10:43
think about like, like you're looking at
10:45
competitors, some of them are closed source
10:48
like open AI is, some of them
10:50
like meta AI, you know, has llama
10:52
and you know, that series, as you're
10:55
looking at what's out there, how do
10:57
you make a choice about what is
10:59
right for what you guys are about
11:02
to go build, you know, because that's
11:04
one heck of an investment to make.
11:06
And I'm kind of curious how you,
11:09
when you're looking at that landscape. how
11:11
you make sense of that in terms
11:13
of where to invest? Yeah, absolutely. So,
11:16
you know, I think it's all about
11:18
trying to make educated bets that kind
11:20
of match your constraints that you're operating
11:22
with and your broader strategy. So, you
11:25
know, early on into our gener of
11:27
AI journey when we're kind of getting
11:29
the program up and running, you know,
11:32
we wanted to take fewer risks, we
11:34
wanted to learn how to do, you
11:36
know, common architectures, common patterns before we
11:39
started to get more quote-unquote innovative and
11:41
coming up with net new additions on
11:43
top. So early on the gen, and
11:46
you know, also you have to keep
11:48
in mind this field has just been
11:50
like. changing so quickly over the past
11:53
couple of years. So no one really
11:55
knew what they were doing. Like if
11:57
we look at how models were trained
12:00
two years ago and the decisions that
12:02
were made, the game was all about
12:04
as many parameters as possible and having
12:07
as little data as possible to keep
12:09
your training costs down. And now we've
12:11
totally switched. The general wisdom is as
12:13
much data as possible in a few
12:16
parameters as possible to keep your inference
12:18
costs down. once the model is finally
12:20
deployed. So the whole whole field's been
12:23
going. through a learning curve. But I
12:25
think early on, you know, our goal
12:27
was really working on trying to replicate
12:30
some of the architectures that were ORIA
12:32
out there, but innovate on the data.
12:34
So really focus in on how do
12:37
we create versions of these models that
12:39
are being released that deliver the same
12:41
type of functionality, but that we're trained
12:44
by IBM as a trusted partner working
12:46
very closely with all of our teams
12:48
to have a very. clear and ethical
12:51
data curation and sourcing pipeline to train
12:53
the models. So that was kind of
12:55
the first major innovation aim that we
12:58
had was actually not on the architecture
13:00
side. Then as we started to get
13:02
more confident as the fields started, I
13:05
don't want to say mature because we're
13:07
still very, again, very early innings, but
13:09
you know. We started to call less
13:11
to some shared understandings of how these
13:14
models should be trained and what works
13:16
or doesn't. You know, then our goal
13:18
really has started to focus on from
13:21
an architecture side, how can we be
13:23
as efficient as possible? How can we
13:25
train models that are going to be
13:28
economical for our customers to run? And
13:30
so that's where you've seen us focus
13:32
a lot on smaller models for right
13:35
now. And we're working on new architectures.
13:37
So for example, mixture of experts. There's
13:39
all sorts of things that we are
13:42
really focusing in really with kind of
13:44
the mantra of how do we make
13:46
this as efficient as possible for people
13:49
to further customize and to run in
13:51
their own environments. So that was a
13:53
fantastic start to as we dive into
13:56
granite itself, kind of laying it out.
13:58
You know, your last comments, you talked
14:00
about kind of the smaller, more economical
14:03
models so that you're getting efficient inference
14:05
on the customer side. You mentioned a
14:07
phrase, which some people may know, some
14:09
people may not mixture of experts, maybe
14:12
talk as we dive into, you know,
14:14
what granite is in its versions going
14:16
forward here. you start with mixture of
14:19
experts and what you mean by that?
14:21
Absolutely. So if we think of how
14:23
these models are being built, there are
14:26
essentially billions of parameters that are representing
14:28
small little numbers that basically are encoding
14:30
information. And, you know, to like draw
14:33
a really simple explanation, if you have
14:35
a, you know, a linear regression, like
14:37
you've got a scatterpoint and you're fitting
14:40
a line, y equals mx plus b,
14:42
like m is a parameter in that
14:44
equation, right? So this, that, except on
14:47
the scale of billions, with mixture of
14:49
experts, what we're looking at is, do
14:51
I really need all one billion parameters
14:54
every single time I run inference? Can
14:56
I use a subset? a large language
14:58
model, so that at inference time I'm
15:01
being far more selective and smart about
15:03
which parameters get called. Because if I'm
15:05
not using... all 8 billion or 120
15:07
billion parameters, I can run that inference
15:10
far faster. So it's much more efficient.
15:12
And so really it's just getting a
15:14
little bit more nuanced of instead of
15:17
like, I think a lot of early
15:19
days of generative AI is just throw
15:21
more compute at it and hope the
15:24
problem goes away. We're now trying to
15:26
like figure out how can we be
15:28
far more efficient in how we build
15:31
these models. So I appreciate the explanation
15:33
on a mixture of experts and that
15:35
makes a lot of sense in terms
15:38
of trying to use the model efficiently
15:40
for an inference by reducing the number
15:42
of parameters. I believe you're right now
15:45
you guys have, is it 8 billion
15:47
and 2 billion or the model? sizes
15:49
in terms of the parameters or have
15:52
I gotten that wrong? We got actually
15:54
a couple of sizes. So you're right,
15:56
we've got 8 billion and 2 billion.
15:58
But speaking of those mixture of expert
16:01
models, we actually have a couple of
16:03
tiny MOE models. MOE stands for a
16:05
mixture of experts. So we've got MOE
16:08
model with only a billion parameters and
16:10
a MOE model with 3 billion parameters.
16:12
But they respectively use far fewer parameters
16:15
at inference time. So they run really,
16:17
really quick, designed for more local applications,
16:19
like running out a CPU. So and
16:22
when. When you make the decision to
16:24
have different size models in terms of
16:26
the number of parameters and stuff, do
16:29
you have different use cases in mind
16:31
of how those models might be used?
16:33
And is there one set of scenarios
16:36
that you would put your $8 billion,
16:38
another one that would be that $3
16:40
billion that you mentioned? Yeah, absolutely. So
16:43
if we think about it, when we're
16:45
kind of designing the model sizes that
16:47
we want to train, a huge question
16:50
that we're trying to solve for is,
16:52
you know, what are the environments these
16:54
models going to be run on and
16:56
how do I, you know, maximize performance
16:59
without forcing someone to have to buy
17:01
another GPU to host it. So, you
17:03
know, there are models like the small
17:06
MOE models that were actually designed much
17:08
more for running on the edge locally
17:10
or on the computer, like just a
17:13
local laptop. We've got models that are
17:15
designed to run on a single GPU,
17:17
which is like our two billion and
17:20
eight billion models. Those are standard architecture,
17:22
not MOE. And we've got models on
17:24
our roadmap that are looking at how
17:27
can we kind of max out what
17:29
a single GPU. could run, and then
17:31
how can we max out what a
17:34
box of GPUs could run? So if
17:36
you got eight GPUs stitched together. So
17:38
we are definitely thinking about those different
17:41
kind of. tranches of compute availability that
17:43
customers might have. And each of those
17:45
tranches could relate to different use cases.
17:48
Like obviously, if you're thinking about something
17:50
that is local, you know, there's all
17:52
sorts of IOT type of use cases
17:54
that that could target. If you are
17:57
looking at something that has to be
17:59
run on, you know, a box of
18:01
HEPUs, you know, you're looking at something
18:04
that you have to be okay with
18:06
having a little bit more latency, you
18:08
know, time it takes for the model
18:11
to respond. bit higher value because it
18:13
costs more to run that. model and
18:15
so you're not going to run like
18:18
a really simple like you know help
18:20
me summarize this email task hitting you
18:22
know eight GPUs at once. So as
18:25
you talk about the segmentation of these
18:27
of the of the family of models
18:29
and how you're doing that, I know
18:32
one of the things you guys have
18:34
a white paper which will be linking
18:36
in on the show notes for folks
18:39
to go and take a look at
18:41
either during or after they listen here
18:43
and you talk about some of the
18:45
models being experimental chain of thought reasoning
18:48
capabilities. I was wondering if you could.
18:50
talk a little bit about what that
18:52
means. Yeah, so really excited with the
18:55
latest release of our granite models. Just
18:57
the end of February released granite 3.2,
18:59
which is an update to our 2
19:02
billion parameter model and our 8 billion
19:04
parameter model. And one of the kind
19:06
of superpowers we give this model in
19:09
the new release is we bring in
19:11
an experimental feature for reasoning. And so
19:13
what we mean by that is there's
19:16
this new concept, relatively new concept in
19:18
general. of AI called inference time compute,
19:20
where if you, what that really equates
19:23
to, just to put in plain language,
19:25
if you think longer and harder about
19:27
a prompt, about a question, you can
19:30
get a better response. I mean, this
19:32
works for humans, this is how you
19:34
and I think, but it's the same
19:37
is true for large language models. And
19:39
thinking here, you know, is a bit
19:41
of a risk of anthropomorphizing the term,
19:43
but it's where we've landed as a
19:46
field, so I'll run with it for
19:48
now. generate more tokens. So have the
19:50
model think through what's called a chain
19:53
of thought, you know, generates logical thought
19:55
processes and sequences of how the model
19:57
might approach answering before triggering the model
20:00
to then respond. And so we've trained
20:02
granite 8B 3.2 in order to be
20:04
able to do that chain of thought
20:07
reasoning natively, take advantage of this new
20:09
inference time compute area of innovation. And
20:11
what we've done is we've made it
20:14
selective selective. So if you've made it
20:16
selective. Don't to think long and hard
20:18
about what is 2 plus 2, you
20:21
turn it off and the model responds
20:23
faster just with the answer. If you
20:25
are giving it a more difficult question,
20:28
you know, pondering the meaning of life,
20:30
you might turn thinking on, and it's
20:32
going to think through a little bit
20:35
first before answering an answer, or with
20:37
a much, in general, a longer, kind
20:39
of more chain of thought style approach
20:41
towards explaining kind of step by step
20:44
why it's responding the way it is.
20:46
Do you anticipate, kind of, and I've
20:48
seen this done from different organizations in
20:51
different ways, do you anticipate that your
20:53
inference time compute capability is going to
20:55
be kind of there on all the
20:58
models and you're turning it on and
21:00
off? Or do you anticipate that some
21:02
of the models in your family are
21:05
more specializing in that and that's always
21:07
on versus others? Which way you kind
21:09
of mentioned the on and off? So
21:12
it sounded like you might have it
21:14
in all of the above. Yeah, you
21:16
know, right now. it's marked as an
21:19
experimental feature. I think we're still learning
21:21
a lot about how this is useful
21:23
and what it's going to be used
21:26
for, and that might dictate what makes
21:28
sense moving forward. But what we're seeing
21:30
is kind of universally, it's useful, one,
21:33
to try and improve the quality of
21:35
the answers, but two, as an explainability
21:37
feature, like if the model is going
21:39
through and explaining more how it came
21:42
up with a response, moving forward, which
21:44
is a different approach, right, than some
21:46
models which are just focused on reasoning.
21:49
I don't think we're going to see
21:51
that very long. You know, I think
21:53
more and more we're going to see
21:56
more selective reasoning, so like Claude 3.7
21:58
came out, they're actually doing a really
22:00
nice job with this, where you can
22:03
think longer or harder about something or
22:05
just think for a short amount of
22:07
time. So I think we're going to
22:10
see increasingly more and more folks move
22:12
in that direction. But there's still, again,
22:14
early innings, I'll say it again. So
22:17
we're going to learn a lot over
22:19
the next couple of months about where
22:21
this is having the most impact. And
22:24
I think that could have some structural
22:26
implications of how. we design our roadmap
22:28
moving forward. Gotcha. There has been a
22:30
larger push in the industry toward smaller
22:33
models. So kind of going back over
22:35
the. the recent history of LLLMs and
22:37
you know you saw initially you know
22:40
the just the number of parameters exploding
22:42
and the models becoming huge and obviously
22:44
you know we talked a little bit
22:47
about the fact that that's very expensive
22:49
on inference yeah to run these things
22:51
and over the last especially over the
22:54
last I don't know a year year
22:56
and a half there's been a much
22:58
stronger push especially with open source models
23:01
we've seen a lot of them on
23:03
hugging face pushing to smaller Do you
23:05
anticipate, is you're thinking about this capability
23:08
of being able to reason that that's
23:10
going to drive smaller model use toward
23:12
models like what you guys are creating
23:15
where you're saying, okay, we have these
23:17
large, you know, Claude has the, you
23:19
know, big models and out there, you
23:22
know, is an option or or a
23:24
llama model that's very large? Are you
23:26
guys anticipating kind of pulling a lot
23:28
more mind shear towards some of the
23:31
smaller ones? And do you anticipate? that
23:33
you're going to continue to focus on
23:35
on the smaller more efficient ones where
23:38
people can actually get them deployed out
23:40
there without without breaking the bank of
23:42
the organization. How is that fit in?
23:45
Yeah, so look at one thing to
23:47
keep in mind is even without thinking
23:49
about it without trying we're seeing small
23:52
models are increasingly able to do what
23:54
it took a big model to do
23:56
yesterday. So you look at what a
23:59
tiny, you know, 2 billion parameter, our
24:01
granite 2B model, for example, outperforms on
24:03
numerous benchmarks, you know, Lama 270B, which
24:06
is a much larger, but older generation.
24:08
I mean, it was state-of-the-art when it
24:10
was released, but the technology is just
24:13
moving so quickly. So, you know, we
24:15
do believe that by focusing on some
24:17
of the smaller sizes, that ultimately we're
24:20
going to get a lot of lift
24:22
just natively. because that is where the
24:24
technology is evolving. Like we're continuing to
24:26
find ways to pack more and more
24:29
performance in fewer and fewer parameters and
24:31
expand the scope of what you can
24:33
accomplish with a small language model. I
24:36
don't think that means we're going to.
24:38
ever get rid of big models? I
24:40
just think if you look at where
24:43
we're focusing, we're really looking at kind
24:45
of where are the models, you know,
24:47
if you think of the 80-20 rule,
24:50
like 80% of the use cases can
24:52
be handled by a model, you know,
24:54
maybe 8 billion parameters or less. That's
24:57
what we're targeting with granite and we're
24:59
really trying to focus in. We think
25:01
that there's definitely still always going to
25:04
be... innovation in opportunity and complex use
25:06
cases that you need larger models to
25:08
handle. And that's where we're really interested
25:11
to see, okay, how do we expand
25:13
the Granite family potentially focusing on more
25:15
efficient architectures like mixture of experts to
25:18
target those larger models and more complex
25:20
model sizes so that you still get
25:22
a little bit more of a more
25:24
practical implementation of a big model, recognizing
25:27
that again, you're always going to need
25:29
there's always going to be those outliers,
25:31
those really big cases. We just don't
25:34
think there's going to be as much
25:36
business value, frankly, behind those compared to
25:38
really focusing and delivering value on the
25:41
small to medium model space. I think
25:43
we've, that's one thing Daniel and I
25:45
have talked quite a bit about is
25:48
that we would agree with that. It's
25:50
I think the bulk of the use
25:52
cases are for the smaller ones. While
25:55
we're at it, you know, we've been
25:57
talking about various aspects of granite a
25:59
bit, but could we take a moment
26:02
and have you kind of go back
26:04
through the granite family and kind of.
26:06
talk about each component in the family,
26:09
what it does, you know, what it's
26:11
called, what it does, and just kind
26:13
of lay out the array of things
26:15
that you have to offer. Absolutely. So
26:18
the granite model family has the language
26:20
models that I went over. So between
26:22
1 billion to 8 billion parameters in
26:25
size. And again, we think those are
26:27
like the the work. course models, you
26:29
know, 80% of the tasks, we think
26:32
you can probably get away with a
26:34
model that's 8 billion parameters or less.
26:36
We also with 3.2 recently released a
26:39
vision model. So these models are for
26:41
vision understanding tasks. That's important. It's not
26:43
vision or image generation, which is where
26:46
a lot of the early, like, hyped
26:48
excitement on generative AI came from, is
26:50
like Dolly and those. We're focused on
26:53
models where you provide an image in
26:55
a prompt, and then the output is
26:57
text, the model response. So really useful
27:00
for things like. image and document understanding.
27:02
We specifically prioritize a very large amount
27:04
of document and chart Q&A type data
27:07
in its training data, really focusing on
27:09
performance on those types of tasks. So
27:11
you can think of, you know, having
27:13
a picture or an extract of a
27:16
chart from a PDF and being able
27:18
to answer questions about it. We think
27:20
there's a lot of opportunity. So rag
27:23
is a very popular workflow in enterprise,
27:25
right? Retrival augmented generation. Right now, all
27:27
of the images in your PDFs and
27:30
documents, they all get basically thrown away.
27:32
But we really like our working on
27:34
can we use our vision model to
27:37
actually include all of those charts, images,
27:39
figures, diagrams to help improve the model's
27:41
ability to answer questions in a rag
27:44
workflow. So I think that's going to
27:46
be huge. So lots of use cases
27:48
on the on the vision side. And
27:51
then we also have a number of
27:53
kind of companion models that are designed
27:55
to work in parallel with a language
27:58
model or a vision language model. So
28:00
we've got our granite guardian family of
28:02
models. And these are, we call them
28:05
guardrails. They're meant to sit in. right
28:07
in parallel with the large language model
28:09
that's running the main workflow. And they
28:11
monitor all the inputs that are coming
28:14
into the model and all the outputs
28:16
that are being provided by the model,
28:18
looking for potential adversarial prompts, jailbreaking attacks,
28:21
harmful inputs, harmful and biased outputs. They
28:23
can detect hallucination. and model responses. So
28:25
it's really meant to be a governance
28:28
layer that can sit and work right
28:30
alongside Granite, can actually work alongside any
28:32
model. So even if you've got an
28:35
open AI model, for example, you've deployed,
28:37
you can have Granite Guardian work right
28:39
in parallel. And ultimately, just be a
28:42
tool for responsible AI. And, you know,
28:44
the last model I'll talk about is
28:46
our embedding models, which again is meant
28:49
to be, you know, assist a model
28:51
in a broader general AI workflow. So
28:53
in a rag workflow, you'll often need
28:56
to take large amounts of documents or
28:58
text and convert them into what are
29:00
called embeddings that you can search over
29:03
in order to retrieve the most relevant
29:05
info and give it to the model.
29:07
So our granite embedding models are used
29:09
for that embedding models. meant to do
29:12
that conversion and can support in a
29:14
number of different similar kind of search
29:16
and retrieval style workflows working directly with
29:19
the granite large language model. Gotcha. I
29:21
know there was there was some comment
29:23
in the white paper also about time
29:26
series. Yes. Talk a little bit to
29:28
that for a second. Absolutely. So I
29:30
mentioned granted is multimodal net supports vision.
29:33
We also have time series as a
29:35
modality and I'm really glad you brought
29:37
these up because these models are really
29:40
exciting. So we talked about our focus
29:42
on efficiency. These models are like one
29:44
to two million parameters in size. That
29:47
is teeny tiny in today's generative AI
29:49
context. Even compared to other forecasting models,
29:51
these are really small generative AI based
29:54
time series forecasting models, but they are
29:56
right now. delivering top of the top
29:58
marks when it comes to performance. So
30:00
we just as part of this release
30:03
submitted our time series models to Salesforce
30:05
has a time series leaderboard called Gift.
30:07
They're the number one leaderboard on Gift
30:10
right now, number one model on Gifts
30:12
Leaderboard right now. And we're really excited.
30:14
They've got over 10 million downloads on
30:17
Hugging Face. They're really taking off in
30:19
the community. So it's a really excellent
30:21
offering in the time. series modality for
30:24
the Granite family. Okay, well thank you
30:26
for going through kind of the layout
30:28
of the family of models that you
30:31
guys have. I actually want to go
30:33
back and ask a quick question that
30:35
you talked a bit about guardian kind
30:38
of providing guardrails and stuff and that's
30:40
something that if you take a moment
30:42
to dive into, I think we often
30:45
tend to focus kind of on, you
30:47
know, the model and it's going to
30:49
do X, you know, whatever. I love
30:52
the notion of integrating these guardrails that
30:54
Guardian represents into a larger architecture to
30:56
address. kind of the quality issues surrounding
30:58
the inputs and the outputs on that.
31:01
How did you guys arrive at that?
31:03
I'm just, you know, and how did
31:05
you, you know, it's pretty cool. I
31:08
love the idea that not only is
31:10
it there for your own models, obviously,
31:12
but that, you know, that you could
31:15
have an end user go and apply
31:17
it to something else that they're doing,
31:19
maybe from a competitor or whatever. How
31:22
did you decide to do that? And,
31:24
you know, I think that's a fairly
31:26
unique thing that we don't tend to
31:29
hear as much from other organizations. Yeah,
31:31
you know, so Chris, the one of
31:33
the values again of being in the
31:36
open source ecosystem is we get to
31:38
like build on top of other people's
31:40
great ideas. So we actually weren't the
31:43
first ones to come up with it.
31:45
There's a few other guardrail type models
31:47
out there, but you know, IBM has
31:50
quite a large, especially IBM research presence
31:52
in security space, and there are challenges
31:54
in security that are very similar to
31:56
the large language models in general AI
31:59
that, you know. It's not totally new.
32:01
And what I think we've learned as
32:03
a company and as a field is
32:06
that you always need layers of security
32:08
when it comes to creating a robust
32:10
system against potential adversarial attacks and dealing
32:13
with even the model's own innate safety
32:15
alignment itself. So, you know, when we
32:17
saw some of the work going out
32:20
in the open source ecosystem on guardrails,
32:22
you know, I think it was kind
32:24
of a no-brainer from a perspective of
32:27
this is another great way to add
32:29
an additional layer on that generative AI
32:31
stack of security and safety to better
32:34
improve model robustness and figure out. you
32:36
know, IBM's hyper focused on what is
32:38
the practical way to implement general AI.
32:41
So what else is needed beyond efficiency?
32:43
We need trust, we need safety. Let's
32:45
create tools in that space. So it
32:47
kind of, you know, number of different
32:50
reasons all made it a very clear,
32:52
clear and easy when to go and
32:54
pursue. And we are actually able to
32:57
build on top of granite. So granite
32:59
Guardian is a fine-tuned version of granite.
33:01
that's laser focused on these tasks of
33:04
detecting and monitoring inputs going into the
33:06
model and outputs going out. And the
33:08
team has done a really excellent job
33:11
first starting at basic harm and bias
33:13
detectors, which I think is pretty prevalent
33:15
in other guardrail models that are out
33:18
there. But now we've really started to
33:20
kind of make it our own and
33:22
innovate. So some of the new features
33:25
that were released in the 3.2 granite
33:27
guardian models include hallucination detection, very few
33:29
models do that today, specifically hallucination detection
33:32
with function calling. So if you think
33:34
of an agent. you know, whenever an
33:36
LLLM agent is trying to access or
33:39
submit external information, it'll make a what's
33:41
called a tool call. And so when
33:43
it's making that tool call, it's providing
33:45
information based off of the conversation history
33:48
saying, you know, I need to look
33:50
up, you know, Kate Soles information in
33:52
the HR database. This is her first
33:55
name. She lives in Cambridge Mass, X,
33:57
Y, Z. And we want to make
33:59
sure the agent isn't hallucinating. made up
34:02
the wrong name or said Cambridge UK
34:04
instead of Cambridge Mass, the tool will
34:06
provide the incorrect response back but the
34:09
agent will have no idea and it
34:11
will keep operating with utmost certainty that
34:13
it's operating on correct information. So you
34:16
know it's just an interesting example of
34:18
you know some of the observability we're
34:20
trying to inject into response. responsible AI
34:23
workflows, particularly around things like agents, because
34:25
there's all sorts of new safety concerns
34:27
that really have to be taken into
34:30
account to make this technology practical and
34:32
implementable. and you know that's actually having
34:34
brought up agents and stuff and that
34:37
being kind of the really hot topic
34:39
of the moment of you know 2025
34:41
so far could you talk a little
34:43
bit about granite and agents and how
34:46
you guys you know how are you're
34:48
thinking you've gone through one example right
34:50
there but if you could expand on
34:53
that a little bit in terms of
34:55
you know how does how is IBM
34:57
thinking about positioning granite how do agents
35:00
fit in what does that ecosystem look
35:02
like you know, you've started to talk
35:04
about security a bit. Could you kind
35:07
of weave that story for us a
35:09
little bit? Absolutely. So yeah, obviously, IBM
35:11
is all in on agents and there's
35:14
just so much going on in the
35:16
space. A couple of key things that
35:18
I think are interesting to bring up.
35:21
So one is looking at the open
35:23
source ecosystem for building agents. So we
35:25
actually have a really fantastic team located
35:28
right here in Cambridge, Massachusetts that is
35:30
working on an agent framework and broader
35:32
agent stack called B AI, like a
35:35
bumble B. So we're working really closely
35:37
with them on how do we kind
35:39
of co-optimize a framework for agents with
35:41
a model that in order to be
35:44
able to have all sorts of new
35:46
tips and tricks so to speak that
35:48
you can harness when building agents. So
35:51
I don't want to give too much
35:53
away but I think there's a lot
35:55
of really interesting things that IBM's thinking
35:58
about agent framework and model co-design and
36:00
that only unlocks so much potential when
36:02
it comes to safety and security. because
36:05
there needs to be parts, for example,
36:07
of an LLLM, of an agent, that
36:09
agent developer programs that you never want
36:12
the user to be able to see.
36:14
There are parts of data that an
36:16
agent might retrieve as part of a
36:19
tool call that you don't want the
36:21
user to see. an agent that I'm
36:23
working with might have access to anybody's
36:26
HR records, but I... only have permission
36:28
to see my HR records. So how
36:30
can we design models and frameworks with
36:32
those concepts in mind in order to
36:35
better demark types of sensitive information that
36:37
should be hidden in order to protect
36:39
information that the prevent those types of
36:42
attack vectors through model co-design and agent
36:44
model and agent framework co-design. So I
36:46
think there's a lot of really exciting
36:49
work there. More broadly though, you know,
36:51
I think even on more traditional ideas
36:53
and implementations of agent, not that there's
36:56
a traditional one, this is so new,
36:58
but more classical agent implementations were working,
37:00
for example, with IBM Consulting. They have
37:03
an agent in assistant platform that is
37:05
where granite is the default agent and
37:07
assistant that gets built. And so that
37:10
allows IBM all sorts of economies of
37:12
scale. If you think about, we've now
37:14
got 160,000 consultants out in the world
37:17
using agents and assistants built off of
37:19
granite in order to be more efficient
37:21
and to help them with their client
37:24
and consulting projects. So we see a
37:26
ton of client zero, what we call
37:28
client zero. IBM is our, you know,
37:30
first client in that case of how
37:33
do we even internally build a... with
37:35
granite in order to improve IBM productivity.
37:37
Very cool. I'm kind of curious as
37:40
as you guys are looking at this
37:42
this array of considerations that you've just
37:44
been going through and as there is
37:47
more more push out into the edge
37:49
environments and you've already talked a little
37:51
bit about that earlier. As we're starting
37:54
to wind down could you talk a
37:56
little bit about kind of as as
37:58
things push a bit out of the
38:01
cloud and of the data center and
38:03
as we have been migrating away from
38:05
these gigantic models into a lot more
38:08
smaller hyper efficient models often that have
38:10
that that are doing better on performance
38:12
and stuff And we see so many
38:15
opportunities out there in a variety of
38:17
edge environments. Could you talk a little
38:19
bit about kind of where granite might
38:22
be going with that or where it
38:24
is now and kind of what the
38:26
what the thoughts about granite at the
38:28
edge might look like? Yeah, so I
38:31
think with granite at the edge, there's
38:33
a couple of different aspects. One is
38:35
how can we think about building with
38:38
models so that we can optimize for
38:40
smaller models in size? So when I
38:42
say building, I mean building prompts, building
38:45
applications so that we're not, you know,
38:47
designing prompts, how they're written today, which
38:49
I like to call it like the
38:52
Yolo method where I'm going to give
38:54
10 pages of instructions all at once
38:56
and say, go and do this, and
38:59
hope to get it, you know, the
39:01
model follows all those instructions and does
39:03
everything beautifully, like small models, no matter
39:06
how much this technology advances, probably aren't
39:08
going to get, you know, perfect scores
39:10
on that type of approach. So how
39:13
can we think about... broader kind of
39:15
programming frameworks for dividing things up into
39:17
much smaller pieces that a small model
39:20
can operate on. And then how do
39:22
we leverage model and hardware code design
39:24
to run those small pieces really fast?
39:26
So, you know, I think there's a
39:29
lot of opportunity, you know, across
39:31
the stack of how people are
39:33
building with models, the models themselves
39:35
and the hardware that the model
39:37
is running on, that's going to
39:39
allow us to push things. much
39:41
further to the edge than we've
39:43
really experienced so far. It's going
39:45
to require a bit of a
39:47
mine shift again. Like right now
39:49
I think we're all really happy
39:51
that we can be a bit
39:53
lazy when we write our prompts
39:55
and just like, you know, write
39:57
kind of word vomit prompts down.
39:59
But I think if we can
40:01
get a little bit more like
40:03
kind of software engineering, mine set
40:05
in terms of how you program
40:07
and build, it's going to allow
40:09
us to break things into much
40:11
smaller components and push those components
40:13
even farther to the edge. That
40:15
makes sense. That makes a lot
40:18
of sense. I guess, kind of
40:20
final question for you as we
40:22
talk about this, kind of, any
40:24
other thought, you talked a little
40:26
bit about kind of where you
40:28
think things are going, what the
40:30
future looks like when you are
40:32
kind of winding up for the
40:34
day and you're at that moment
40:36
where you're kind of just your
40:38
mind wonders a little bit any
40:40
anything that appeals to you that
40:42
kind of goes through your head.
40:44
So I think the thing I've
40:46
been most obsessed about lately is
40:48
you know we need to get
40:50
to the point as a field
40:52
where models are measured by like
40:54
how efficient they're efficient frontier is
40:56
not by like you know did
40:58
they get to 0.01 higher on
41:00
a metric or benchmark. So I
41:02
think we're starting to see this
41:04
with like the reasoning with granite,
41:06
you can turn it on and
41:08
off, with the reasoning with claw,
41:10
you can pay more, you know,
41:12
have harder thoughts, you know, longer
41:14
thoughts or shorter thoughts. But you
41:16
know, I really want to see
41:18
us get to the point, and
41:20
I think we've got the, like,
41:23
the table is set for this.
41:25
We've got the pieces in place
41:27
to really start to focus in
41:29
on how can I make my
41:31
model as efficient as possible, but
41:33
as flexible, but as flexible as
41:35
possible. So I can choose anywhere
41:37
that I want to be on
41:39
that performance cost-cost curve. So if
41:41
my task isn't, you know, very
41:43
difficult, I... don't want to spend
41:45
a lot of money on it,
41:47
I'm going to route this in
41:49
such a way with very little
41:51
thinking to a small model and
41:53
I'm going to be able to
41:55
achieve, you know, acceptable performance. And
41:57
if my task is really high
41:59
value, you know, I'm going to
42:01
pay more and I don't need
42:03
to like think about this. It's
42:05
just going to happen either from
42:07
the model architecture, from being able
42:09
to reason or not reason, from
42:11
routing that might be happening behind
42:13
an API. but cheaper model. I
42:15
think all of that needs to
42:17
be, you know, we need to
42:19
get to the point where no
42:21
one's having to think about this
42:23
or solve or design it. And
42:25
I really want to see, I
42:28
want to see these curves, and
42:30
I want to be able to
42:32
see us push those curves as
42:34
far to the left as possible,
42:36
making things more and more efficient,
42:38
versus like here's a number on
42:40
the leaderboard. Like I'm ready to
42:42
move beyond that. Fantastic. A great
42:44
conversation. Thank you so much. Kate
42:46
Sol for joining us on the
42:48
Prat Clay I podcast today. Really
42:50
appreciate it. A lot of insight
42:52
there. So thanks for coming on.
42:54
Hope we can get you back
42:56
on sometime. Thanks so much Chris.
42:58
Really appreciate you having me on
43:00
the show. If you haven't checked
43:02
out our change log newsletter, head
43:04
to change log.com/news. There you'll find
43:06
29 reasons, yes, 29 reasons why
43:08
you should subscribe. I'll tell you
43:10
reason number 17, you might actually
43:12
start looking forward to Mondays. Sounds
43:14
like somebody's got a case of
43:16
the Mondays! 28 more reasons are
43:18
waiting for you at change log.com/news.
43:20
Thanks again to our partners at
43:22
fly.io to break master cylinder for
43:24
the beats and to you for
43:26
listening. That is all for now,
43:28
but we'll talk to you again
43:30
next time.
Podchaser is the ultimate destination for podcast data, search, and discovery. Learn More