Optimizing for efficiency with IBM’s Granite

Optimizing for efficiency with IBM’s Granite

Released Friday, 14th March 2025
Good episode? Give it some love!
Optimizing for efficiency with IBM’s Granite

Optimizing for efficiency with IBM’s Granite

Optimizing for efficiency with IBM’s Granite

Optimizing for efficiency with IBM’s Granite

Friday, 14th March 2025
Good episode? Give it some love!
Rate Episode

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.

Use Ctrl + F to search

0:01

Welcome to Practical AI, the

0:03

podcast that makes artificial intelligence

0:05

practical, productive, and accessible to

0:08

all. If you like this

0:10

show, you will love the

0:12

change log. It's news on

0:15

Mondays, deep technical interviews on

0:17

Wednesdays, and on Fridays, an

0:19

awesome talk show for your

0:21

weekend enjoyment. Find us by

0:24

searching for The Change Log

0:26

wherever you get your podcasts.

0:28

Thanks to our partners at

0:31

fly.io. Launch your AI apps

0:33

in five minutes or less.

0:35

Learn how at fly.io. Welcome

0:44

to another episode of the practical

0:46

AI podcast. This is Chris Benson.

0:49

I am your co-host. Normally, Daniel

0:51

White Neck is joining me as

0:53

the other co-host, but he's not

0:56

able to today. I am a

0:58

principal AI research engineer at Lockheed

1:00

Martin. Daniel is the CEO of

1:03

Prediction Guard. And with us today,

1:05

we have Kate Sol, who is

1:07

director of technical product management at

1:10

Granite for IBM. Welcome to the

1:12

show, Kate. Hey Chris, thanks

1:14

for having me. So I wanted

1:17

to, I know we're going to

1:19

dive shortly into what granite is

1:21

and some of our listeners are

1:23

probably already familiar with it, some

1:25

may not be, but before we

1:27

dive into that, wondering, we're talking

1:29

about AI models, that's what granite

1:31

is, and the world of LLLM's

1:33

generative AI. wondering if you could

1:36

start off talking a little bit

1:38

about your own background, how you

1:40

arrived at this, and we'll get

1:42

into a little bit about what

1:44

IBM is doing and why it's interested

1:46

in how it fits into the landscape

1:48

here for those who are not already

1:51

familiar with it. Perfect. Yeah, thanks

1:53

Chris. So I lead the technical

1:55

product management for granite, which is

1:57

IBM's large family of large language

1:59

models. that is produced by IBM

2:01

Research. And so I actually joined IBM

2:03

and IBM research a number of years

2:06

ago before large language models were really

2:08

became popular. You know, they had a

2:10

bit of a Netscape moment right back

2:13

in November of 2022. So I've been

2:15

working at the lab for a little

2:17

while. I'm... a little bit of a

2:20

odd duck, so to speak, in that

2:22

I don't have a research background, I

2:24

don't have a PhD, I come from

2:27

a business background, I worked in consulting

2:29

for a number of years, went to

2:31

business school. and joined IBM Research and

2:34

the AI lab here in order to

2:36

get more involved in technology. You know,

2:38

I've always kind of had one foot

2:41

in the tech space. I was a

2:43

data scientist for most of my tenure

2:45

as a consultant and always thought that

2:48

there was a lot of exciting things

2:50

going on in AI, and so I

2:52

joined the lab. and basically got to

2:54

work with a lot of generative AI

2:57

researchers before large language models really kind

2:59

of became big. And you know about

3:01

two and a half years ago a

3:04

lot of the technology we're working on

3:06

all of a sudden we started to

3:08

find and see that there were tremendous

3:11

business applications. You know, Open AI really

3:13

demonstrated what could happen if you took

3:15

this type of technology and Porsche fed

3:18

it enough compute to make it powerful.

3:20

It could do some really cool things.

3:22

So from there we worked as a

3:25

team really to spin up a program

3:27

and offering at IBM for our own

3:29

family of large language models that we

3:32

could offer our customers and the broader

3:34

open source ecosystem. Do you I'm curious

3:36

with one of the things that I've

3:39

you know we've noticed over time is

3:41

different organizations kind of are positioning the

3:43

These large language models within their product

3:46

offerings and in very unique ways and

3:48

and we've you know We could go

3:50

through some of your competitors and say

3:52

they do this way the how do

3:55

you guys see that in terms of?

3:57

You know how large language models fit

3:59

into your product offering is there is

4:02

there a vision that IBM has for

4:04

that? Yeah, I think the fundamental. premise

4:06

of large language models are that they

4:09

are a building block that you get

4:11

to build on and reuse in many

4:13

different ways, right, where one model can

4:16

drive a number of different use cases.

4:18

So, you know, from my perspective, that

4:20

value proposition resonates really clearly. We see

4:23

a lot of our customers, our own

4:25

internal offerings, where, you know, there's a

4:27

lot of effort on data curation and

4:30

collection and kind of creating and training

4:32

bespoke models for a specific task. And

4:34

now with large language models we get

4:37

to kind of use one model and

4:39

with very little label data all of

4:41

a sudden, you know, the world's here

4:43

oyster, there's a lot you can do.

4:46

And so that's a bit of the

4:48

reason why we have centralized the development

4:50

of our language language models within IBM

4:53

Research, not a specific product. It's one

4:55

offering that then feeds into many of

4:57

our different products in downstream applications. And

5:00

it allows us to kind of create

5:02

this building block that we can then

5:04

also offer customers to be able to

5:07

build on top of as well. And

5:09

open source ecosystem developers, you know, we

5:11

think there's a lot of different applications

5:14

for them. that one offering. And so,

5:16

you know, that's a little bit kind

5:18

of from the organizational side why we're

5:21

why it's kind of exciting, right, that

5:23

we get to do this all within

5:25

research. We don't have a P&L, so

5:28

to speak. We're doing this to create.

5:30

ultimately a tool that can support any

5:32

number of different use cases and downstream

5:35

applications. Very cool. And you mentioned open

5:37

source. I want to ask you because

5:39

that's always a big topic among organizations

5:41

is if I remember correctly, granted is

5:44

an under an Apache 2 license. Is

5:46

that correct? That's correct. I'm just curious

5:48

because we've seen strong arguments on both

5:51

sides. Why? Why is granite an open

5:53

source license like that? What was the

5:55

decision from IBM to go that direction?

5:58

Yeah, well there was kind of two

6:00

levels of decision making that we had

6:02

to. make when we talked about how

6:05

to license granite. One was open or

6:07

closed. So are we going to release

6:09

this model, release the weights out into

6:12

the world so that anyone can use

6:14

it regardless if they spend a dime

6:16

with IBM? And ultimately, IBM, you know,

6:19

believes strongly in the power of open

6:21

source ecosystems, a huge part of our

6:23

business is built around Red Hat and

6:26

being able to provide open source software

6:28

to our customers with enterprise guarantees guarantees.

6:30

And we felt that Open AI was

6:33

a far more responsible environment to develop

6:35

and to incubate this technology as a

6:37

whole and when you say open-source AI

6:39

you mean open-source AI just making sure

6:42

very important clarification very important clarification so

6:44

that was why we released our models

6:46

out into the open and then the

6:49

question was under what license because there

6:51

are a lot of models there are

6:53

a lot of licenses and a bit

6:56

of like a moment that everyone seeing

6:58

is you have a gamma license for

7:00

a gamma model. You've got a llama

7:03

license for a llama model. Everyone's coming

7:05

up with their own license. And, you

7:07

know, it kind of, in some ways,

7:10

it makes sense. Models are a bit

7:12

of a. weird artifact. They're not code.

7:14

You can't execute them on their own.

7:17

They're not software. They're not data per

7:19

se, but they are kind of like

7:21

a big bag of numbers at the

7:24

end of the day. So like, you

7:26

know, some of the traditional licenses, I

7:28

think some people didn't see a clear

7:31

fit, and so they came up with

7:33

their own. They're also all these different

7:35

kind of... potential risks that you might

7:37

want to solve for with a license

7:40

with a large language model that are

7:42

different than risk that you look at

7:44

with software or data. But at the

7:47

end of the day, IBM really wanted

7:49

just to keep this simple, like a

7:51

no-nonsense license that we felt would be

7:54

able to promote the broadest use from

7:56

the ecosystem without any restrictions. So we

7:58

went with Apache 2 because that's probably

8:01

the most widely used and just easy

8:03

to understand license that's out there. And

8:05

you know I think it really speaks

8:08

also to where we see models being

8:10

important building blocks that are further customized.

8:12

So we really believe the true value

8:15

in generative AIs being able to take

8:17

some of these. smaller open-source models and

8:19

build on top of it and even

8:22

start to customize it. And if you're

8:24

doing all that work and, you know,

8:26

building on top of something, you want

8:28

to make sure there are no restrictions

8:31

on all that IP you've just created.

8:33

And so that's ultimately why we went

8:35

with Apache 2.0. Understood. And one last

8:38

follow-up on licensing and then I'll move

8:40

on. It's more, it's partially just a

8:42

comment. IBM has a really strong legacy

8:45

as someone in the AI world and

8:47

decades of software development along with that.

8:49

I know both Red Hat with the

8:52

acquisition some years back, being strong on

8:54

open source and IBM both before and

8:56

after has, was it, was it, I'm

8:59

just curious, did that make it any

9:01

easier, do you think to go with

9:03

open source? Like, hey, we've done this

9:06

so much that we're gonna do that

9:08

with this thing too, even though it's

9:10

a little bit newer, you know, in

9:13

context. Culturally, did it seem easier to

9:15

get there than some companies that possibly

9:17

really struggle with that? They don't have

9:20

such a legacy in open source in

9:22

open source? I think it did make

9:24

it easier. I think there are always

9:26

going to be like any company going

9:29

down this journey has to take a

9:31

look at weight. We're spending how much

9:33

on what and you're going to give

9:36

it away for free and come up

9:38

with their own kind of equations on

9:40

how this starts to make sense. And

9:43

I think we've just experienced as a

9:45

company that the software and offerings we

9:47

create are so much stronger when we're

9:50

creating them as part of an open-source

9:52

ecosystem than something that we just keep

9:54

close to the best. So, you know,

9:57

it was a much easier business case,

9:59

so to speak, to make and to

10:01

get the sign-off that we needed. Ultimately,

10:04

our leadership was very supportive in order

10:06

to encourage this kind of open ecosystem.

10:08

Fantastic. Turning a little bit, as IBM

10:11

was diving into this into this realm

10:13

and starting, you know, and obviously like,

10:15

you have a history with grand that,

10:18

you know, you guys are on 3.2

10:20

at this point, that means that you've

10:22

been working on this for a period

10:24

of time, but as you're diving into

10:27

this very competitive ecosystem of building out

10:29

these open source models that are big,

10:31

they are expensive to make, and you're

10:34

looking for an outsized impact in the

10:36

world, how do you decide? how to

10:38

proceed with what kind of architecture you

10:41

want, you know, how did you guys

10:43

think about like, like you're looking at

10:45

competitors, some of them are closed source

10:48

like open AI is, some of them

10:50

like meta AI, you know, has llama

10:52

and you know, that series, as you're

10:55

looking at what's out there, how do

10:57

you make a choice about what is

10:59

right for what you guys are about

11:02

to go build, you know, because that's

11:04

one heck of an investment to make.

11:06

And I'm kind of curious how you,

11:09

when you're looking at that landscape. how

11:11

you make sense of that in terms

11:13

of where to invest? Yeah, absolutely. So,

11:16

you know, I think it's all about

11:18

trying to make educated bets that kind

11:20

of match your constraints that you're operating

11:22

with and your broader strategy. So, you

11:25

know, early on into our gener of

11:27

AI journey when we're kind of getting

11:29

the program up and running, you know,

11:32

we wanted to take fewer risks, we

11:34

wanted to learn how to do, you

11:36

know, common architectures, common patterns before we

11:39

started to get more quote-unquote innovative and

11:41

coming up with net new additions on

11:43

top. So early on the gen, and

11:46

you know, also you have to keep

11:48

in mind this field has just been

11:50

like. changing so quickly over the past

11:53

couple of years. So no one really

11:55

knew what they were doing. Like if

11:57

we look at how models were trained

12:00

two years ago and the decisions that

12:02

were made, the game was all about

12:04

as many parameters as possible and having

12:07

as little data as possible to keep

12:09

your training costs down. And now we've

12:11

totally switched. The general wisdom is as

12:13

much data as possible in a few

12:16

parameters as possible to keep your inference

12:18

costs down. once the model is finally

12:20

deployed. So the whole whole field's been

12:23

going. through a learning curve. But I

12:25

think early on, you know, our goal

12:27

was really working on trying to replicate

12:30

some of the architectures that were ORIA

12:32

out there, but innovate on the data.

12:34

So really focus in on how do

12:37

we create versions of these models that

12:39

are being released that deliver the same

12:41

type of functionality, but that we're trained

12:44

by IBM as a trusted partner working

12:46

very closely with all of our teams

12:48

to have a very. clear and ethical

12:51

data curation and sourcing pipeline to train

12:53

the models. So that was kind of

12:55

the first major innovation aim that we

12:58

had was actually not on the architecture

13:00

side. Then as we started to get

13:02

more confident as the fields started, I

13:05

don't want to say mature because we're

13:07

still very, again, very early innings, but

13:09

you know. We started to call less

13:11

to some shared understandings of how these

13:14

models should be trained and what works

13:16

or doesn't. You know, then our goal

13:18

really has started to focus on from

13:21

an architecture side, how can we be

13:23

as efficient as possible? How can we

13:25

train models that are going to be

13:28

economical for our customers to run? And

13:30

so that's where you've seen us focus

13:32

a lot on smaller models for right

13:35

now. And we're working on new architectures.

13:37

So for example, mixture of experts. There's

13:39

all sorts of things that we are

13:42

really focusing in really with kind of

13:44

the mantra of how do we make

13:46

this as efficient as possible for people

13:49

to further customize and to run in

13:51

their own environments. So that was a

13:53

fantastic start to as we dive into

13:56

granite itself, kind of laying it out.

13:58

You know, your last comments, you talked

14:00

about kind of the smaller, more economical

14:03

models so that you're getting efficient inference

14:05

on the customer side. You mentioned a

14:07

phrase, which some people may know, some

14:09

people may not mixture of experts, maybe

14:12

talk as we dive into, you know,

14:14

what granite is in its versions going

14:16

forward here. you start with mixture of

14:19

experts and what you mean by that?

14:21

Absolutely. So if we think of how

14:23

these models are being built, there are

14:26

essentially billions of parameters that are representing

14:28

small little numbers that basically are encoding

14:30

information. And, you know, to like draw

14:33

a really simple explanation, if you have

14:35

a, you know, a linear regression, like

14:37

you've got a scatterpoint and you're fitting

14:40

a line, y equals mx plus b,

14:42

like m is a parameter in that

14:44

equation, right? So this, that, except on

14:47

the scale of billions, with mixture of

14:49

experts, what we're looking at is, do

14:51

I really need all one billion parameters

14:54

every single time I run inference? Can

14:56

I use a subset? a large language

14:58

model, so that at inference time I'm

15:01

being far more selective and smart about

15:03

which parameters get called. Because if I'm

15:05

not using... all 8 billion or 120

15:07

billion parameters, I can run that inference

15:10

far faster. So it's much more efficient.

15:12

And so really it's just getting a

15:14

little bit more nuanced of instead of

15:17

like, I think a lot of early

15:19

days of generative AI is just throw

15:21

more compute at it and hope the

15:24

problem goes away. We're now trying to

15:26

like figure out how can we be

15:28

far more efficient in how we build

15:31

these models. So I appreciate the explanation

15:33

on a mixture of experts and that

15:35

makes a lot of sense in terms

15:38

of trying to use the model efficiently

15:40

for an inference by reducing the number

15:42

of parameters. I believe you're right now

15:45

you guys have, is it 8 billion

15:47

and 2 billion or the model? sizes

15:49

in terms of the parameters or have

15:52

I gotten that wrong? We got actually

15:54

a couple of sizes. So you're right,

15:56

we've got 8 billion and 2 billion.

15:58

But speaking of those mixture of expert

16:01

models, we actually have a couple of

16:03

tiny MOE models. MOE stands for a

16:05

mixture of experts. So we've got MOE

16:08

model with only a billion parameters and

16:10

a MOE model with 3 billion parameters.

16:12

But they respectively use far fewer parameters

16:15

at inference time. So they run really,

16:17

really quick, designed for more local applications,

16:19

like running out a CPU. So and

16:22

when. When you make the decision to

16:24

have different size models in terms of

16:26

the number of parameters and stuff, do

16:29

you have different use cases in mind

16:31

of how those models might be used?

16:33

And is there one set of scenarios

16:36

that you would put your $8 billion,

16:38

another one that would be that $3

16:40

billion that you mentioned? Yeah, absolutely. So

16:43

if we think about it, when we're

16:45

kind of designing the model sizes that

16:47

we want to train, a huge question

16:50

that we're trying to solve for is,

16:52

you know, what are the environments these

16:54

models going to be run on and

16:56

how do I, you know, maximize performance

16:59

without forcing someone to have to buy

17:01

another GPU to host it. So, you

17:03

know, there are models like the small

17:06

MOE models that were actually designed much

17:08

more for running on the edge locally

17:10

or on the computer, like just a

17:13

local laptop. We've got models that are

17:15

designed to run on a single GPU,

17:17

which is like our two billion and

17:20

eight billion models. Those are standard architecture,

17:22

not MOE. And we've got models on

17:24

our roadmap that are looking at how

17:27

can we kind of max out what

17:29

a single GPU. could run, and then

17:31

how can we max out what a

17:34

box of GPUs could run? So if

17:36

you got eight GPUs stitched together. So

17:38

we are definitely thinking about those different

17:41

kind of. tranches of compute availability that

17:43

customers might have. And each of those

17:45

tranches could relate to different use cases.

17:48

Like obviously, if you're thinking about something

17:50

that is local, you know, there's all

17:52

sorts of IOT type of use cases

17:54

that that could target. If you are

17:57

looking at something that has to be

17:59

run on, you know, a box of

18:01

HEPUs, you know, you're looking at something

18:04

that you have to be okay with

18:06

having a little bit more latency, you

18:08

know, time it takes for the model

18:11

to respond. bit higher value because it

18:13

costs more to run that. model and

18:15

so you're not going to run like

18:18

a really simple like you know help

18:20

me summarize this email task hitting you

18:22

know eight GPUs at once. So as

18:25

you talk about the segmentation of these

18:27

of the of the family of models

18:29

and how you're doing that, I know

18:32

one of the things you guys have

18:34

a white paper which will be linking

18:36

in on the show notes for folks

18:39

to go and take a look at

18:41

either during or after they listen here

18:43

and you talk about some of the

18:45

models being experimental chain of thought reasoning

18:48

capabilities. I was wondering if you could.

18:50

talk a little bit about what that

18:52

means. Yeah, so really excited with the

18:55

latest release of our granite models. Just

18:57

the end of February released granite 3.2,

18:59

which is an update to our 2

19:02

billion parameter model and our 8 billion

19:04

parameter model. And one of the kind

19:06

of superpowers we give this model in

19:09

the new release is we bring in

19:11

an experimental feature for reasoning. And so

19:13

what we mean by that is there's

19:16

this new concept, relatively new concept in

19:18

general. of AI called inference time compute,

19:20

where if you, what that really equates

19:23

to, just to put in plain language,

19:25

if you think longer and harder about

19:27

a prompt, about a question, you can

19:30

get a better response. I mean, this

19:32

works for humans, this is how you

19:34

and I think, but it's the same

19:37

is true for large language models. And

19:39

thinking here, you know, is a bit

19:41

of a risk of anthropomorphizing the term,

19:43

but it's where we've landed as a

19:46

field, so I'll run with it for

19:48

now. generate more tokens. So have the

19:50

model think through what's called a chain

19:53

of thought, you know, generates logical thought

19:55

processes and sequences of how the model

19:57

might approach answering before triggering the model

20:00

to then respond. And so we've trained

20:02

granite 8B 3.2 in order to be

20:04

able to do that chain of thought

20:07

reasoning natively, take advantage of this new

20:09

inference time compute area of innovation. And

20:11

what we've done is we've made it

20:14

selective selective. So if you've made it

20:16

selective. Don't to think long and hard

20:18

about what is 2 plus 2, you

20:21

turn it off and the model responds

20:23

faster just with the answer. If you

20:25

are giving it a more difficult question,

20:28

you know, pondering the meaning of life,

20:30

you might turn thinking on, and it's

20:32

going to think through a little bit

20:35

first before answering an answer, or with

20:37

a much, in general, a longer, kind

20:39

of more chain of thought style approach

20:41

towards explaining kind of step by step

20:44

why it's responding the way it is.

20:46

Do you anticipate, kind of, and I've

20:48

seen this done from different organizations in

20:51

different ways, do you anticipate that your

20:53

inference time compute capability is going to

20:55

be kind of there on all the

20:58

models and you're turning it on and

21:00

off? Or do you anticipate that some

21:02

of the models in your family are

21:05

more specializing in that and that's always

21:07

on versus others? Which way you kind

21:09

of mentioned the on and off? So

21:12

it sounded like you might have it

21:14

in all of the above. Yeah, you

21:16

know, right now. it's marked as an

21:19

experimental feature. I think we're still learning

21:21

a lot about how this is useful

21:23

and what it's going to be used

21:26

for, and that might dictate what makes

21:28

sense moving forward. But what we're seeing

21:30

is kind of universally, it's useful, one,

21:33

to try and improve the quality of

21:35

the answers, but two, as an explainability

21:37

feature, like if the model is going

21:39

through and explaining more how it came

21:42

up with a response, moving forward, which

21:44

is a different approach, right, than some

21:46

models which are just focused on reasoning.

21:49

I don't think we're going to see

21:51

that very long. You know, I think

21:53

more and more we're going to see

21:56

more selective reasoning, so like Claude 3.7

21:58

came out, they're actually doing a really

22:00

nice job with this, where you can

22:03

think longer or harder about something or

22:05

just think for a short amount of

22:07

time. So I think we're going to

22:10

see increasingly more and more folks move

22:12

in that direction. But there's still, again,

22:14

early innings, I'll say it again. So

22:17

we're going to learn a lot over

22:19

the next couple of months about where

22:21

this is having the most impact. And

22:24

I think that could have some structural

22:26

implications of how. we design our roadmap

22:28

moving forward. Gotcha. There has been a

22:30

larger push in the industry toward smaller

22:33

models. So kind of going back over

22:35

the. the recent history of LLLMs and

22:37

you know you saw initially you know

22:40

the just the number of parameters exploding

22:42

and the models becoming huge and obviously

22:44

you know we talked a little bit

22:47

about the fact that that's very expensive

22:49

on inference yeah to run these things

22:51

and over the last especially over the

22:54

last I don't know a year year

22:56

and a half there's been a much

22:58

stronger push especially with open source models

23:01

we've seen a lot of them on

23:03

hugging face pushing to smaller Do you

23:05

anticipate, is you're thinking about this capability

23:08

of being able to reason that that's

23:10

going to drive smaller model use toward

23:12

models like what you guys are creating

23:15

where you're saying, okay, we have these

23:17

large, you know, Claude has the, you

23:19

know, big models and out there, you

23:22

know, is an option or or a

23:24

llama model that's very large? Are you

23:26

guys anticipating kind of pulling a lot

23:28

more mind shear towards some of the

23:31

smaller ones? And do you anticipate? that

23:33

you're going to continue to focus on

23:35

on the smaller more efficient ones where

23:38

people can actually get them deployed out

23:40

there without without breaking the bank of

23:42

the organization. How is that fit in?

23:45

Yeah, so look at one thing to

23:47

keep in mind is even without thinking

23:49

about it without trying we're seeing small

23:52

models are increasingly able to do what

23:54

it took a big model to do

23:56

yesterday. So you look at what a

23:59

tiny, you know, 2 billion parameter, our

24:01

granite 2B model, for example, outperforms on

24:03

numerous benchmarks, you know, Lama 270B, which

24:06

is a much larger, but older generation.

24:08

I mean, it was state-of-the-art when it

24:10

was released, but the technology is just

24:13

moving so quickly. So, you know, we

24:15

do believe that by focusing on some

24:17

of the smaller sizes, that ultimately we're

24:20

going to get a lot of lift

24:22

just natively. because that is where the

24:24

technology is evolving. Like we're continuing to

24:26

find ways to pack more and more

24:29

performance in fewer and fewer parameters and

24:31

expand the scope of what you can

24:33

accomplish with a small language model. I

24:36

don't think that means we're going to.

24:38

ever get rid of big models? I

24:40

just think if you look at where

24:43

we're focusing, we're really looking at kind

24:45

of where are the models, you know,

24:47

if you think of the 80-20 rule,

24:50

like 80% of the use cases can

24:52

be handled by a model, you know,

24:54

maybe 8 billion parameters or less. That's

24:57

what we're targeting with granite and we're

24:59

really trying to focus in. We think

25:01

that there's definitely still always going to

25:04

be... innovation in opportunity and complex use

25:06

cases that you need larger models to

25:08

handle. And that's where we're really interested

25:11

to see, okay, how do we expand

25:13

the Granite family potentially focusing on more

25:15

efficient architectures like mixture of experts to

25:18

target those larger models and more complex

25:20

model sizes so that you still get

25:22

a little bit more of a more

25:24

practical implementation of a big model, recognizing

25:27

that again, you're always going to need

25:29

there's always going to be those outliers,

25:31

those really big cases. We just don't

25:34

think there's going to be as much

25:36

business value, frankly, behind those compared to

25:38

really focusing and delivering value on the

25:41

small to medium model space. I think

25:43

we've, that's one thing Daniel and I

25:45

have talked quite a bit about is

25:48

that we would agree with that. It's

25:50

I think the bulk of the use

25:52

cases are for the smaller ones. While

25:55

we're at it, you know, we've been

25:57

talking about various aspects of granite a

25:59

bit, but could we take a moment

26:02

and have you kind of go back

26:04

through the granite family and kind of.

26:06

talk about each component in the family,

26:09

what it does, you know, what it's

26:11

called, what it does, and just kind

26:13

of lay out the array of things

26:15

that you have to offer. Absolutely. So

26:18

the granite model family has the language

26:20

models that I went over. So between

26:22

1 billion to 8 billion parameters in

26:25

size. And again, we think those are

26:27

like the the work. course models, you

26:29

know, 80% of the tasks, we think

26:32

you can probably get away with a

26:34

model that's 8 billion parameters or less.

26:36

We also with 3.2 recently released a

26:39

vision model. So these models are for

26:41

vision understanding tasks. That's important. It's not

26:43

vision or image generation, which is where

26:46

a lot of the early, like, hyped

26:48

excitement on generative AI came from, is

26:50

like Dolly and those. We're focused on

26:53

models where you provide an image in

26:55

a prompt, and then the output is

26:57

text, the model response. So really useful

27:00

for things like. image and document understanding.

27:02

We specifically prioritize a very large amount

27:04

of document and chart Q&A type data

27:07

in its training data, really focusing on

27:09

performance on those types of tasks. So

27:11

you can think of, you know, having

27:13

a picture or an extract of a

27:16

chart from a PDF and being able

27:18

to answer questions about it. We think

27:20

there's a lot of opportunity. So rag

27:23

is a very popular workflow in enterprise,

27:25

right? Retrival augmented generation. Right now, all

27:27

of the images in your PDFs and

27:30

documents, they all get basically thrown away.

27:32

But we really like our working on

27:34

can we use our vision model to

27:37

actually include all of those charts, images,

27:39

figures, diagrams to help improve the model's

27:41

ability to answer questions in a rag

27:44

workflow. So I think that's going to

27:46

be huge. So lots of use cases

27:48

on the on the vision side. And

27:51

then we also have a number of

27:53

kind of companion models that are designed

27:55

to work in parallel with a language

27:58

model or a vision language model. So

28:00

we've got our granite guardian family of

28:02

models. And these are, we call them

28:05

guardrails. They're meant to sit in. right

28:07

in parallel with the large language model

28:09

that's running the main workflow. And they

28:11

monitor all the inputs that are coming

28:14

into the model and all the outputs

28:16

that are being provided by the model,

28:18

looking for potential adversarial prompts, jailbreaking attacks,

28:21

harmful inputs, harmful and biased outputs. They

28:23

can detect hallucination. and model responses. So

28:25

it's really meant to be a governance

28:28

layer that can sit and work right

28:30

alongside Granite, can actually work alongside any

28:32

model. So even if you've got an

28:35

open AI model, for example, you've deployed,

28:37

you can have Granite Guardian work right

28:39

in parallel. And ultimately, just be a

28:42

tool for responsible AI. And, you know,

28:44

the last model I'll talk about is

28:46

our embedding models, which again is meant

28:49

to be, you know, assist a model

28:51

in a broader general AI workflow. So

28:53

in a rag workflow, you'll often need

28:56

to take large amounts of documents or

28:58

text and convert them into what are

29:00

called embeddings that you can search over

29:03

in order to retrieve the most relevant

29:05

info and give it to the model.

29:07

So our granite embedding models are used

29:09

for that embedding models. meant to do

29:12

that conversion and can support in a

29:14

number of different similar kind of search

29:16

and retrieval style workflows working directly with

29:19

the granite large language model. Gotcha. I

29:21

know there was there was some comment

29:23

in the white paper also about time

29:26

series. Yes. Talk a little bit to

29:28

that for a second. Absolutely. So I

29:30

mentioned granted is multimodal net supports vision.

29:33

We also have time series as a

29:35

modality and I'm really glad you brought

29:37

these up because these models are really

29:40

exciting. So we talked about our focus

29:42

on efficiency. These models are like one

29:44

to two million parameters in size. That

29:47

is teeny tiny in today's generative AI

29:49

context. Even compared to other forecasting models,

29:51

these are really small generative AI based

29:54

time series forecasting models, but they are

29:56

right now. delivering top of the top

29:58

marks when it comes to performance. So

30:00

we just as part of this release

30:03

submitted our time series models to Salesforce

30:05

has a time series leaderboard called Gift.

30:07

They're the number one leaderboard on Gift

30:10

right now, number one model on Gifts

30:12

Leaderboard right now. And we're really excited.

30:14

They've got over 10 million downloads on

30:17

Hugging Face. They're really taking off in

30:19

the community. So it's a really excellent

30:21

offering in the time. series modality for

30:24

the Granite family. Okay, well thank you

30:26

for going through kind of the layout

30:28

of the family of models that you

30:31

guys have. I actually want to go

30:33

back and ask a quick question that

30:35

you talked a bit about guardian kind

30:38

of providing guardrails and stuff and that's

30:40

something that if you take a moment

30:42

to dive into, I think we often

30:45

tend to focus kind of on, you

30:47

know, the model and it's going to

30:49

do X, you know, whatever. I love

30:52

the notion of integrating these guardrails that

30:54

Guardian represents into a larger architecture to

30:56

address. kind of the quality issues surrounding

30:58

the inputs and the outputs on that.

31:01

How did you guys arrive at that?

31:03

I'm just, you know, and how did

31:05

you, you know, it's pretty cool. I

31:08

love the idea that not only is

31:10

it there for your own models, obviously,

31:12

but that, you know, that you could

31:15

have an end user go and apply

31:17

it to something else that they're doing,

31:19

maybe from a competitor or whatever. How

31:22

did you decide to do that? And,

31:24

you know, I think that's a fairly

31:26

unique thing that we don't tend to

31:29

hear as much from other organizations. Yeah,

31:31

you know, so Chris, the one of

31:33

the values again of being in the

31:36

open source ecosystem is we get to

31:38

like build on top of other people's

31:40

great ideas. So we actually weren't the

31:43

first ones to come up with it.

31:45

There's a few other guardrail type models

31:47

out there, but you know, IBM has

31:50

quite a large, especially IBM research presence

31:52

in security space, and there are challenges

31:54

in security that are very similar to

31:56

the large language models in general AI

31:59

that, you know. It's not totally new.

32:01

And what I think we've learned as

32:03

a company and as a field is

32:06

that you always need layers of security

32:08

when it comes to creating a robust

32:10

system against potential adversarial attacks and dealing

32:13

with even the model's own innate safety

32:15

alignment itself. So, you know, when we

32:17

saw some of the work going out

32:20

in the open source ecosystem on guardrails,

32:22

you know, I think it was kind

32:24

of a no-brainer from a perspective of

32:27

this is another great way to add

32:29

an additional layer on that generative AI

32:31

stack of security and safety to better

32:34

improve model robustness and figure out. you

32:36

know, IBM's hyper focused on what is

32:38

the practical way to implement general AI.

32:41

So what else is needed beyond efficiency?

32:43

We need trust, we need safety. Let's

32:45

create tools in that space. So it

32:47

kind of, you know, number of different

32:50

reasons all made it a very clear,

32:52

clear and easy when to go and

32:54

pursue. And we are actually able to

32:57

build on top of granite. So granite

32:59

Guardian is a fine-tuned version of granite.

33:01

that's laser focused on these tasks of

33:04

detecting and monitoring inputs going into the

33:06

model and outputs going out. And the

33:08

team has done a really excellent job

33:11

first starting at basic harm and bias

33:13

detectors, which I think is pretty prevalent

33:15

in other guardrail models that are out

33:18

there. But now we've really started to

33:20

kind of make it our own and

33:22

innovate. So some of the new features

33:25

that were released in the 3.2 granite

33:27

guardian models include hallucination detection, very few

33:29

models do that today, specifically hallucination detection

33:32

with function calling. So if you think

33:34

of an agent. you know, whenever an

33:36

LLLM agent is trying to access or

33:39

submit external information, it'll make a what's

33:41

called a tool call. And so when

33:43

it's making that tool call, it's providing

33:45

information based off of the conversation history

33:48

saying, you know, I need to look

33:50

up, you know, Kate Soles information in

33:52

the HR database. This is her first

33:55

name. She lives in Cambridge Mass, X,

33:57

Y, Z. And we want to make

33:59

sure the agent isn't hallucinating. made up

34:02

the wrong name or said Cambridge UK

34:04

instead of Cambridge Mass, the tool will

34:06

provide the incorrect response back but the

34:09

agent will have no idea and it

34:11

will keep operating with utmost certainty that

34:13

it's operating on correct information. So you

34:16

know it's just an interesting example of

34:18

you know some of the observability we're

34:20

trying to inject into response. responsible AI

34:23

workflows, particularly around things like agents, because

34:25

there's all sorts of new safety concerns

34:27

that really have to be taken into

34:30

account to make this technology practical and

34:32

implementable. and you know that's actually having

34:34

brought up agents and stuff and that

34:37

being kind of the really hot topic

34:39

of the moment of you know 2025

34:41

so far could you talk a little

34:43

bit about granite and agents and how

34:46

you guys you know how are you're

34:48

thinking you've gone through one example right

34:50

there but if you could expand on

34:53

that a little bit in terms of

34:55

you know how does how is IBM

34:57

thinking about positioning granite how do agents

35:00

fit in what does that ecosystem look

35:02

like you know, you've started to talk

35:04

about security a bit. Could you kind

35:07

of weave that story for us a

35:09

little bit? Absolutely. So yeah, obviously, IBM

35:11

is all in on agents and there's

35:14

just so much going on in the

35:16

space. A couple of key things that

35:18

I think are interesting to bring up.

35:21

So one is looking at the open

35:23

source ecosystem for building agents. So we

35:25

actually have a really fantastic team located

35:28

right here in Cambridge, Massachusetts that is

35:30

working on an agent framework and broader

35:32

agent stack called B AI, like a

35:35

bumble B. So we're working really closely

35:37

with them on how do we kind

35:39

of co-optimize a framework for agents with

35:41

a model that in order to be

35:44

able to have all sorts of new

35:46

tips and tricks so to speak that

35:48

you can harness when building agents. So

35:51

I don't want to give too much

35:53

away but I think there's a lot

35:55

of really interesting things that IBM's thinking

35:58

about agent framework and model co-design and

36:00

that only unlocks so much potential when

36:02

it comes to safety and security. because

36:05

there needs to be parts, for example,

36:07

of an LLLM, of an agent, that

36:09

agent developer programs that you never want

36:12

the user to be able to see.

36:14

There are parts of data that an

36:16

agent might retrieve as part of a

36:19

tool call that you don't want the

36:21

user to see. an agent that I'm

36:23

working with might have access to anybody's

36:26

HR records, but I... only have permission

36:28

to see my HR records. So how

36:30

can we design models and frameworks with

36:32

those concepts in mind in order to

36:35

better demark types of sensitive information that

36:37

should be hidden in order to protect

36:39

information that the prevent those types of

36:42

attack vectors through model co-design and agent

36:44

model and agent framework co-design. So I

36:46

think there's a lot of really exciting

36:49

work there. More broadly though, you know,

36:51

I think even on more traditional ideas

36:53

and implementations of agent, not that there's

36:56

a traditional one, this is so new,

36:58

but more classical agent implementations were working,

37:00

for example, with IBM Consulting. They have

37:03

an agent in assistant platform that is

37:05

where granite is the default agent and

37:07

assistant that gets built. And so that

37:10

allows IBM all sorts of economies of

37:12

scale. If you think about, we've now

37:14

got 160,000 consultants out in the world

37:17

using agents and assistants built off of

37:19

granite in order to be more efficient

37:21

and to help them with their client

37:24

and consulting projects. So we see a

37:26

ton of client zero, what we call

37:28

client zero. IBM is our, you know,

37:30

first client in that case of how

37:33

do we even internally build a... with

37:35

granite in order to improve IBM productivity.

37:37

Very cool. I'm kind of curious as

37:40

as you guys are looking at this

37:42

this array of considerations that you've just

37:44

been going through and as there is

37:47

more more push out into the edge

37:49

environments and you've already talked a little

37:51

bit about that earlier. As we're starting

37:54

to wind down could you talk a

37:56

little bit about kind of as as

37:58

things push a bit out of the

38:01

cloud and of the data center and

38:03

as we have been migrating away from

38:05

these gigantic models into a lot more

38:08

smaller hyper efficient models often that have

38:10

that that are doing better on performance

38:12

and stuff And we see so many

38:15

opportunities out there in a variety of

38:17

edge environments. Could you talk a little

38:19

bit about kind of where granite might

38:22

be going with that or where it

38:24

is now and kind of what the

38:26

what the thoughts about granite at the

38:28

edge might look like? Yeah, so I

38:31

think with granite at the edge, there's

38:33

a couple of different aspects. One is

38:35

how can we think about building with

38:38

models so that we can optimize for

38:40

smaller models in size? So when I

38:42

say building, I mean building prompts, building

38:45

applications so that we're not, you know,

38:47

designing prompts, how they're written today, which

38:49

I like to call it like the

38:52

Yolo method where I'm going to give

38:54

10 pages of instructions all at once

38:56

and say, go and do this, and

38:59

hope to get it, you know, the

39:01

model follows all those instructions and does

39:03

everything beautifully, like small models, no matter

39:06

how much this technology advances, probably aren't

39:08

going to get, you know, perfect scores

39:10

on that type of approach. So how

39:13

can we think about... broader kind of

39:15

programming frameworks for dividing things up into

39:17

much smaller pieces that a small model

39:20

can operate on. And then how do

39:22

we leverage model and hardware code design

39:24

to run those small pieces really fast?

39:26

So, you know, I think there's a

39:29

lot of opportunity, you know, across

39:31

the stack of how people are

39:33

building with models, the models themselves

39:35

and the hardware that the model

39:37

is running on, that's going to

39:39

allow us to push things. much

39:41

further to the edge than we've

39:43

really experienced so far. It's going

39:45

to require a bit of a

39:47

mine shift again. Like right now

39:49

I think we're all really happy

39:51

that we can be a bit

39:53

lazy when we write our prompts

39:55

and just like, you know, write

39:57

kind of word vomit prompts down.

39:59

But I think if we can

40:01

get a little bit more like

40:03

kind of software engineering, mine set

40:05

in terms of how you program

40:07

and build, it's going to allow

40:09

us to break things into much

40:11

smaller components and push those components

40:13

even farther to the edge. That

40:15

makes sense. That makes a lot

40:18

of sense. I guess, kind of

40:20

final question for you as we

40:22

talk about this, kind of, any

40:24

other thought, you talked a little

40:26

bit about kind of where you

40:28

think things are going, what the

40:30

future looks like when you are

40:32

kind of winding up for the

40:34

day and you're at that moment

40:36

where you're kind of just your

40:38

mind wonders a little bit any

40:40

anything that appeals to you that

40:42

kind of goes through your head.

40:44

So I think the thing I've

40:46

been most obsessed about lately is

40:48

you know we need to get

40:50

to the point as a field

40:52

where models are measured by like

40:54

how efficient they're efficient frontier is

40:56

not by like you know did

40:58

they get to 0.01 higher on

41:00

a metric or benchmark. So I

41:02

think we're starting to see this

41:04

with like the reasoning with granite,

41:06

you can turn it on and

41:08

off, with the reasoning with claw,

41:10

you can pay more, you know,

41:12

have harder thoughts, you know, longer

41:14

thoughts or shorter thoughts. But you

41:16

know, I really want to see

41:18

us get to the point, and

41:20

I think we've got the, like,

41:23

the table is set for this.

41:25

We've got the pieces in place

41:27

to really start to focus in

41:29

on how can I make my

41:31

model as efficient as possible, but

41:33

as flexible, but as flexible as

41:35

possible. So I can choose anywhere

41:37

that I want to be on

41:39

that performance cost-cost curve. So if

41:41

my task isn't, you know, very

41:43

difficult, I... don't want to spend

41:45

a lot of money on it,

41:47

I'm going to route this in

41:49

such a way with very little

41:51

thinking to a small model and

41:53

I'm going to be able to

41:55

achieve, you know, acceptable performance. And

41:57

if my task is really high

41:59

value, you know, I'm going to

42:01

pay more and I don't need

42:03

to like think about this. It's

42:05

just going to happen either from

42:07

the model architecture, from being able

42:09

to reason or not reason, from

42:11

routing that might be happening behind

42:13

an API. but cheaper model. I

42:15

think all of that needs to

42:17

be, you know, we need to

42:19

get to the point where no

42:21

one's having to think about this

42:23

or solve or design it. And

42:25

I really want to see, I

42:28

want to see these curves, and

42:30

I want to be able to

42:32

see us push those curves as

42:34

far to the left as possible,

42:36

making things more and more efficient,

42:38

versus like here's a number on

42:40

the leaderboard. Like I'm ready to

42:42

move beyond that. Fantastic. A great

42:44

conversation. Thank you so much. Kate

42:46

Sol for joining us on the

42:48

Prat Clay I podcast today. Really

42:50

appreciate it. A lot of insight

42:52

there. So thanks for coming on.

42:54

Hope we can get you back

42:56

on sometime. Thanks so much Chris.

42:58

Really appreciate you having me on

43:00

the show. If you haven't checked

43:02

out our change log newsletter, head

43:04

to change log.com/news. There you'll find

43:06

29 reasons, yes, 29 reasons why

43:08

you should subscribe. I'll tell you

43:10

reason number 17, you might actually

43:12

start looking forward to Mondays. Sounds

43:14

like somebody's got a case of

43:16

the Mondays! 28 more reasons are

43:18

waiting for you at change log.com/news.

43:20

Thanks again to our partners at

43:22

fly.io to break master cylinder for

43:24

the beats and to you for

43:26

listening. That is all for now,

43:28

but we'll talk to you again

43:30

next time.

Unlock more with Podchaser Pro

  • Audience Insights
  • Contact Information
  • Demographics
  • Charts
  • Sponsor History
  • and More!
Pro Features