Towards high-quality (maybe synthetic) datasets

Released Wednesday, 9th October 2024

Episode Transcript
0:06

Welcome to Practical AI. If

0:09

you work in artificial intelligence, aspire

0:12

to, or are curious how

0:14

AI-related tech is changing the world,

0:17

this is the show for you.

0:20

Thank you to our partners at

0:22

fly.io, the home of changelog.com. Fly

0:26

transforms containers into micro VMs that

0:28

run on their hardware in 30

0:30

plus regions on six continents. So

0:32

you can launch your app near

0:34

your users. Learn more at

0:36

fly.io. Okay

0:45

friends, I'm here with Annie Sexton over

0:47

at Fly. Annie, you know we use

0:50

Fly here at ChangeLog. We love Fly.

0:53

It is such an awesome platform and we love building on

0:55

it. For those who don't

0:57

know much about Fly, what's special

0:59

about building on Fly? Fly gives

1:01

you a lot of flexibility, like a

1:03

lot of flexibility on multiple fronts. And

1:06

on top of that you get, so I've

1:08

talked a lot about the networking and

1:10

that's obviously one thing, but there's various

1:12

data stores that we partner with that

1:15

are really easy to use. Actually

1:17

one of my favorite partners is Tigris.

1:20

I can't say enough good things about

1:22

them when it comes to object storage.

1:25

I never in my life thought I would have so many

1:27

opinions about object storage, but I do now. Tigris

1:29

is a partner of Fly and it's

1:31

S3 compatible object storage that basically

1:33

seems like it's a CDN, but

1:35

it's not. It's basically object storage

1:37

that's globally distributed without needing to

1:39

actually set up a CDN at

1:41

all. It's like automatically distributed

1:44

around the world. And it's also incredibly

1:46

easy to use and set up. Like

1:48

creating a bucket is literally one command.

1:50

So it's partners like that, that I

1:52

think are this sort of extra icing

1:54

on top of Fly that really makes

1:56

it sort of the platform that has

1:58

everything that you need. So we

2:00

use Tigris here at Changelog. Are they

2:02

built on top of Fly? Is this one

2:05

of those examples of being able to build

2:07

on Fly? Yeah, so Tigris is built on

2:09

top of Fly's infrastructure and that's what allows

2:11

it to be globally distributed. I do have

2:14

a video on this, but basically the way

2:16

it works is whenever, like let's say a

2:18

user uploads an asset to a particular bucket.

2:20

Well, that gets uploaded directly to the region

2:22

closest to the user, whereas with a CDN,

2:25

there's sort of like a centralized place where

2:27

assets need to get copied to and then

2:29

eventually they get sort of trickled out to

2:31

all of the different global locations. Whereas with

2:33

Tigris, the moment you upload something, it's available

2:36

in that region instantly and then it's eventually

2:38

cached in all the other regions as well

2:40

as it's requested. In fact, with Tigris, you

2:42

don't even have to select which regions things

2:44

are stored in. You just get these regions

2:47

for free. And then on top of that,

2:49

it is so much easier

2:51

to work with. I feel like the

2:53

way they manage permissions, the way they

2:56

handle bucket creation, making things public or

2:58

private is just so much

3:00

simpler than other solutions. And the good news is

3:02

that you don't actually need to change your code

3:05

if you're already using S3. It's S3 compatible,

3:07

so like whatever SDK you're using is probably

3:10

just fine and all you got to do

3:12

is update the credentials. So it's super easy.

3:14

Very cool. Thanks, Annie. So

3:16

Fly has everything you need.

3:18

Over 3 million applications, including

3:20

ours here at ChangeLog multiple

3:22

applications, have launched on Fly,

3:24

boosted by global anycast load

3:26

balancing, zero configuration, private networking,

3:28

hardware isolation, instant WireGuard

3:31

VPN connections, push button deployments

3:33

that scale to thousands of

3:35

instances. It's all there for

3:37

you right now. Deploy

3:39

your app in five minutes,

3:41

go to fly.io. Again, fly.io.

3:56

Welcome to another episode of the

3:58

Practical AI podcast. This is Daniel

4:00

Whitenack. I am CEO at Prediction

4:03

Guard where we're building a private

4:05

secure gen AI platform. And I'm joined

4:07

as always by Chris Benson, who is

4:09

a principal AI research engineer at Lockheed

4:12

Martin. How are you doing, Chris? Great

4:14

today, Daniel. How are you? It's

4:16

a beautiful fall day and a good

4:19

day to

4:23

take a walk around the block and think about

4:25

interesting AI things and clear your

4:27

mind before getting back

4:29

into some data collaboration, which

4:32

is what we're going to talk about

4:34

today. Chris, I don't know if you

4:36

remember our conversation. It

4:38

was just me on that one, but with Bing

4:41

Sun Chua, who talked about

4:43

broccoli AI, the type of

4:45

AI that's healthy for organizations.

4:48

And in that episode, he made a

4:51

call out to Argilla, which was a

4:53

big part of his solution

4:55

that he was developing in

4:57

a particular vertical. I'm really

5:00

happy today that we have with us

5:02

Ben Burtenshaw, who is a machine

5:04

learning engineer at Argilla, and

5:06

also David Berenstein,

5:08

who is a developer advocate

5:10

engineer working on building Argilla

5:12

and distilabel at Hugging

5:15

Face. Welcome, David and Ben.

5:17

Thank you. Great to be here. Hi. Thanks

5:20

for having us. Yeah, so like I was saying,

5:22

I think for some

5:24

time maybe if you're

5:26

coming from a data science perspective, there's

5:30

been tooling maybe around

5:32

data that manages

5:34

training data sets or evaluation

5:36

sets or maybe MLOps tooling

5:38

and this sort of thing.

5:41

And part of that has to do with preparation

5:43

and curation of data sets. But

5:46

I found interesting, I mentioned the

5:48

previous conversation with Bing Sun. He

5:50

talked a lot about collaborating

5:52

with his subject matter experts

5:54

and his company around the

5:56

data sets he was creating

5:58

for text classification. And

6:01

that's where Argilla came up. So I'm

6:03

wondering if maybe one of you could

6:05

talk a little bit at a higher

6:07

level when you're talking about

6:11

data collaboration in the context

6:13

of the current kind of

6:15

AI environment. What does that

6:17

mean generally? And how would

6:19

you maybe distinguish that from

6:21

previous generations of tooling in

6:23

maybe similar or different ways?

6:26

So data collaboration, at least from our point

6:28

of view, is the collaboration between

6:30

both the domain level experts that really

6:33

have high domain knowledge, actually

6:35

know what they're talking about, in

6:37

terms of the data, the inputs, and the outputs

6:39

that the models are supposed to give within

6:42

their domain. And then you have the data scientists,

6:44

or the AI engineers, on this side of the

6:47

coin that are more technical. They know from a

6:49

technical point of view what the models expect and

6:51

what the models should output. And

6:53

then the collaboration between them is, yeah,

6:55

now even higher because nowadays you can

6:57

actually prompt models with natural language,

7:00

and you actually need to ensure that

7:02

both the models actually

7:04

perform well, and also the prompts and

7:06

these kind of things. So the collaboration

7:08

is even more important nowadays. And

7:11

that's also the case for, still the case

7:13

for text classification models and these kind of things,

7:15

which we also support within Argilla. I

7:18

guess maybe in the context of, let's say

7:20

there's a new team that's

7:23

exploring the adoption of AI technology, maybe

7:25

for the first time. Maybe they're not

7:28

coming from that data science background, the

7:30

sort of heavy MLOps stuff,

7:32

but maybe they've been excited

7:34

by this latest wave of

7:36

AI technologies. How would

7:39

you go about helping them understand

7:41

how their own data, the data

7:43

that they would curate, the data

7:45

that they would maybe collaborate on

7:48

is relevant to and where that fits

7:50

into the certain workflow.

7:52

So yeah, I imagine someone may

7:54

be familiar with what you can

7:57

do with ChatGPT

7:59

or pasting in certain documents

8:01

or other things. And

8:03

now they're kind of wrestling through how

8:05

to set up their own domain specific

8:07

AI workflows in their organization. What

8:10

would you kind of describe about

8:12

how their own domain data and

8:14

how collaborating around that fits into

8:16

common AI workflows? Yeah, so something

8:19

that I like to think about a

8:21

lot around this subject is like machine

8:24

learning textbooks. And they often

8:26

talk about modeling a problem as well

8:28

as building a model, right? There's a

8:30

famous mama and matter cycle. And

8:33

in that, when you model a problem,

8:35

you're basically trying to explain and define

8:37

the problem. So I have articles and

8:39

I need to know whether

8:42

they are a positive or

8:44

negative rating. And I'm describing

8:46

that problem. And then I'm

8:48

going to need to describe that problem to

8:51

a domain expert or an annotator through guidelines.

8:54

And when I can describe that problem in such a

8:56

way that the annotator or the

8:58

domain expert answers that question clearly

9:00

enough, then I know that that's a modeled and

9:03

clear problem. And it's something that I could then

9:05

take on to build a model around. In

9:08

simple terms, it makes sense. And

9:11

so I think when you're going

9:13

into a new space at Generative AI,

9:15

and you're trying to understand your business

9:17

context around these tools, you

9:20

can start off by modeling the problem in

9:22

simple terms, by looking at the data and

9:24

saying, okay, does this label make sense for

9:26

these articles? If I sort all these articles

9:29

down by these labels or by this ranking,

9:31

are these the kinds of things I'm expecting?

9:34

Starting off at quite low numbers, right? Like single

9:37

articles and kind of building up to tens

9:39

or hundreds. And as you do

9:41

that, you begin to understand and also

9:43

iterate on the problem and kind of change it and adapt

9:45

it as you go. And once

9:47

you've got up to a reasonable scale of the

9:49

problem, you can then say, right, this is something

9:51

that a machine learning model could learn. I

9:55

guess on that front, maybe one

9:58

of the big confusions, that I've

10:00

seen floating around these days is

10:02

the kind of

10:05

data that's relevant to some

10:07

of these workflows. So it

10:09

might be easy for people to

10:11

think about a labeled

10:14

data set for a text classification

10:16

problem, right? Like here's this text

10:18

coming in, I'm going to label

10:20

it spam or not spam or

10:22

in some categories. But I think

10:24

sometimes a sentiment that I've got

10:26

very often is, hey, our company

10:28

has this big file store of

10:30

documents. And somehow

10:32

I'm going to fine

10:35

tune, quote unquote, a

10:37

generative model with just this blob

10:39

of documents and then it will

10:41

perform better for me. And there's

10:44

two elements of that that are

10:46

kind of mushy. One

10:48

is like, well, to what end for

10:50

what task? What are you trying to do? And

10:52

then also how you curate that

10:55

data then really matters. Is

10:57

this a sentiment that you all are

10:59

seeing or how for this latest wave

11:01

of models, like how would

11:03

you describe if a company has a bunch

11:06

of documents and they're in this situation,

11:08

they're like, hey, we know we have

11:10

data and we know that these models

11:12

can get better. And maybe

11:14

we could even create our own private

11:16

model with our own domain of data.

11:18

What would you walk them through to

11:21

explain where to start with that process

11:23

and how to start curating their data

11:25

maybe in a less general way,

11:27

but towards some end? I think

11:30

in these scenarios, it's always good to

11:32

first establish a baseline or a benchmark

11:34

because what we often see is that

11:36

people come to us or come to

11:38

like the open source space. They say,

11:40

okay, we really want to fine tune

11:42

a model. We really want to do

11:44

like a super extensive RAG pipeline with

11:46

all of the bells and whistles included

11:48

and then kind of start working on

11:50

these documents. But what we often

11:52

see is that they don't even have a baseline

11:54

to actually start with. So that's

11:56

normally what we recommend also whenever you work

11:58

with a RAG pipeline: ensure that

12:00

all of the documents that you index

12:03

are actually properly indexed, properly chunked. Whenever

12:06

you actually execute a pipeline and

12:08

you would store these retrieved documents

12:10

based on the questions

12:12

and the queries in Argilla

12:14

or any other data annotation tool,

12:17

you can actually have a look at the

12:19

documents, see if they make sense, see if

12:22

the retrieval makes sense, but also if the

12:24

generated output makes sense. And then whenever you

12:26

have that baseline set up from there, actually

12:28

start iterating and kind of making additions to

12:31

your pipeline. Shall I add re-ranking potentially to

12:33

the retrieval if the retrieval isn't functioning

12:35

properly? Shall I add a fine-tuned version of

12:37

the model? Should I switch from the latest

12:40

Llama model of 3 billion to

12:42

7 billion or these kind of things?

12:45

And then from there on, you can actually consider

12:47

maybe either fine-tuning model if that's actually needed or

12:49

fine-tuning one of the retrievers or these

12:51

kind of things. As you're saying

12:54

that, as you're speaking from this kind of profound

12:56

expertise you have, and I think a lot of

12:58

folks really have trouble just

13:00

getting started. And you asked some

13:02

great questions there, but

13:04

I think some of those are really tough for someone

13:06

who's just getting into it, like which way to go

13:08

and some of the selections that you would go with

13:11

that. Could you talk a little

13:13

bit about, kind of go back over the same

13:15

thing, but kind of make up a little workflow

13:17

that's kind of hands-on on just like you might

13:19

see this and this is how I would decide

13:21

that just for a moment, just so people can

13:23

kind of grasp kind of the thought

13:26

process you're going, because you kind of described

13:28

a process, but if you could be a

13:30

little bit more descriptive about that, I

13:33

think when I talk to people, once they get going they kind

13:35

of go to the next step and go to the next step

13:37

and go to the next step, but the first four

13:39

or five big question marks at

13:41

the beginning, they don't know which one to handle. I

13:44

can add some practical steps onto that that

13:46

I've worked with in the past. That'd be

13:48

fantastic. Yeah, so

13:51

one thing that you can do that

13:53

is really straightforward is actually

13:55

to write down a list of the kinds

13:57

of questions that you're expecting your system to

13:59

have to answer. And

14:02

you can get that list by speaking to domain experts

14:04

or if you are a domain expert you can write

14:06

it yourself, right? And it doesn't

14:09

need to be an extensive exhaustive list, it can

14:11

be quite a small starting set. You

14:13

can then take those questions away and start

14:15

to look at documents or pools and sections

14:18

of documents from this lake that

14:20

you potentially have and associate

14:22

those documents with those questions and

14:25

then start to look if a

14:28

model can answer those questions with those

14:30

documents. In fact, by

14:32

not even building anything, by starting

14:35

to use a ChatGPT or Hugging

14:37

Chat or any of these kind of

14:39

interfaces and just seeing this very very

14:42

low simple benchmark, see is

14:44

that feasible? Whilst at

14:46

the same time starting to ask yourself

14:48

can I as a domain expert answer

14:50

this? And that's kind of where

14:52

Argilla comes in at the very first step. So you start to

14:54

put these documents in front

14:56

of people with those questions and you

14:59

start to search through those documents and

15:01

say to people can you answer this question or here's

15:04

an answer from a

15:06

model to this question in a very

15:08

small setting and you

15:10

start to get basic early signals of

15:12

quality and from there

15:15

you would start to introduce

15:17

proper retrieval. So you would

15:19

scale up your you

15:21

would take all of your documents, say you had

15:23

a hundred documents associated with your ten questions, you

15:26

put all those hundred documents in an index and iterate

15:28

over your ten questions and see

15:31

okay are the right documents aligning with the

15:33

right questions here? Then you start

15:35

to scale up your documents and make it more and

15:37

more of a real-world situation. You would

15:39

start to scale up your questions. You

15:41

could do both of these synthetically and

15:44

then if you still started to see positive

15:47

signals you could start to

15:49

scale and if you start to see negative

15:51

signals I'm no longer getting the right documents

15:53

associated with the right questions. I

15:56

personally would always start from the

15:59

simplest levers in a

16:01

RAG setup. And what I mean

16:03

there is that you have a number of different things

16:05

that you can optimize. So you

16:07

have retrieval, you can optimize

16:09

it semantically, or you

16:11

can optimize it in a rule-based retrieval.

16:15

You can optimize the generative model, you

16:17

can optimize the prompt. And

16:19

the simplest movers, the simplest levers

16:23

are the rule-based retrieval, the word

16:25

search, and then the semantic search.

16:27

So I would first of all add like a

16:29

hybrid search. What happens if I make sure

16:31

that there's an exact match in that document for the

16:33

word in my query? Does that

16:36

improve my results?
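
To make that lever concrete, here is a rough Python sketch of hybrid ranking that blends an exact word-match signal with whatever semantic similarity function the retriever already exposes. It is only an illustration: the function names, the `semantic_score` callable, and the `alpha` weighting are hypothetical, not anything specific to Argilla's tooling.

```python
from typing import Callable

def keyword_score(query: str, document: str) -> float:
    """Fraction of the query's words that appear verbatim in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)

def hybrid_rank(
    query: str,
    documents: list[str],
    semantic_score: Callable[[str, str], float],  # e.g. cosine similarity of embeddings
    alpha: float = 0.5,  # weight on the exact-match signal
) -> list[tuple[str, float]]:
    """Rank documents by a blend of keyword overlap and semantic similarity."""
    scored = [
        (doc, alpha * keyword_score(query, doc) + (1 - alpha) * semantic_score(query, doc))
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Tuning `alpha` against the small question-and-document baseline described earlier is one way to tell whether the exact-match signal is actually helping.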

16:39

And then I would just move

16:41

through that process basically. What's up

16:45

friends? I'm

16:56

here with a friend of mine, a good

16:58

friend of mine, Michael Grinich, CEO and founder

17:01

of WorkOS. WorkOS

17:03

is the all-in-one enterprise SSO and

17:05

a whole lot more solution for

17:07

everyone from a brand new

17:09

startup to an enterprise and all the

17:12

AI apps in between. So Michael, when

17:14

is too early or too late to

17:16

begin to think about being enterprise ready?

17:18

It's not just a single point in

17:20

time where people make this transition. It

17:22

occurs at many steps of the business.

17:25

Enterprise single sign-on like SAML auth, you

17:27

usually don't need that until you have

17:29

users. You're not gonna need

17:31

that when you're getting started. And we call it

17:33

an enterprise feature, but I think what you'll find

17:35

is there's companies when you sell to like a

17:38

50 person company, they might want this. They actually,

17:40

especially if they care about security, they might want

17:42

that capability in it. So it's more of like

17:44

SMB features even, if they're tech

17:46

forward. At Work OS we provide a ton of

17:48

other stuff that we give away for free for

17:51

people earlier in their life cycle. We just don't

17:53

charge you for it. So that AuthKit stuff

17:55

I mentioned, that identity service, we give that away

17:57

for free up to a million users. A million

18:00

users. And this competes with

18:02

Auth0 and other platforms that have much, much lower

18:04

free plans. I'm talking like 10,000, 50,000, like we

18:06

give you a million free. Because we really want

18:11

to give developers the best tools and capabilities

18:13

to build their products faster, you know, to

18:15

go to market much, much faster. And where

18:17

we charge people money for the service is

18:19

on these enterprise things. If you end up

18:21

being successful and grow and scale up market,

18:23

that's where we monetize. And that's also when

18:25

you're making money as a business. So we

18:27

really like to align, you know, our incentives

18:29

across that. So we have people using AuthKit

18:31

that are brand new apps,

18:33

just getting started. Companies in Y Combinator,

18:36

side projects, hackathon things, you know, things

18:38

that are not necessarily commercial focus, but

18:40

could be someday, they're kind of future

18:42

proofing their tech stack by using WorkOS.

18:45

On the other side, we have companies much, much

18:47

later that are really big, who typically

18:50

don't like us talking about them, their logos,

18:52

you know, because they're big, big customers. But they

18:54

say, hey, we tried to build the stuff

18:56

or we have some existing technology, but sort of

18:58

unhappy with it. The developer that built it

19:00

maybe has left. I was talking last week with

19:02

a company that does over a billion in

19:04

revenue each year. And their SCIM connection, the

19:06

user provisioning was written last summer by an intern

19:09

who's no longer obviously at the company and

19:11

the thing doesn't really work. And so they're looking

19:13

for a solution for that. So there's a

19:15

really wide spectrum. We'll serve companies that are in

19:17

a, you know, their offices in a coffee

19:19

shop or their living room all the way through.

19:21

They have a, you know, their own building in

19:24

downtown San Francisco or New York or something.

19:26

And it's the same platform, same technology, same tools

19:28

on both sides. The volume is obviously different.

19:30

And sometimes the way we support them from a

19:32

kind of customer support perspective is a little

19:34

bit different. Their needs are different, but same technology,

19:37

same platform, just like AWS, right? You can use

19:39

AWS and pay them $10 a month. You

19:41

can also pay them $10 million

19:43

a month, same product or more for sure.

19:45

Or more. Well,

19:47

no matter where you're at

19:50

on your enterprise ready journey,

19:52

WorkOS has a solution for

19:55

you. They're trusted by Perplexity,

19:57

Copy.ai, Loom, Vercel, Indeed,

20:00

And so many more. You

20:02

can learn more and check

20:05

them out at workos.com. That's

20:07

workos.com. Again, workos.com. I'm

20:23

guessing that you all, you know, the fact

20:25

that you're supporting all of these use cases

20:28

on top of Argilla on

20:30

the data side makes me think,

20:33

like you say, there's so many

20:35

things to optimize in terms of

20:37

that RAG process, but there's also

20:39

so many AI workflows that are

20:41

being thought of, whether

20:44

that be, you know, code generation

20:46

or assistance or, you

20:48

know, content generation, information extraction. But then

20:50

you kind of go beyond that. David,

20:53

you mentioned text classification and of

20:55

course there's image use cases. So

20:57

I'm wondering from you all,

21:00

at this point, you know,

21:02

one of the things Chris and I have talked about on

21:04

the show a bit is, you know,

21:06

we're still big proponents and, you know, believe

21:08

that in enterprises a lot of

21:10

times there is a lot of mixing of,

21:13

you know, rule-based systems and more

21:16

kind of traditional, I guess, if you

21:18

want to think about it that way,

21:20

machine learning and smaller models, and then

21:23

bringing in these larger Gen AI models

21:25

as kind of orchestrators or, you know,

21:27

query layer things. And

21:30

that's a story we've been kind of telling,

21:32

but I think it's interesting that we have

21:34

both of you here in the sense that,

21:36

like, you really, I'm sure there's certain things

21:38

that you don't or can't track about what

21:41

you're doing, but just even anecdotally out of

21:43

the users that

21:45

you're supporting on Argilla, what

21:47

have you seen in terms of what

21:50

is the mix between those, you know,

21:52

using Argilla for this sort of, maybe

21:54

what people would consider traditional data

21:57

science type of models like text

21:59

classification, and/or image

22:01

classification type of things and these

22:04

maybe newer workflows like RAG and other

22:06

things. How do you see that balance?

22:08

And do you see people using both

22:11

or one or the other? Yeah,

22:13

any insights there? I think we

22:16

recently had this company from Germany,

22:18

Ellamind, over at one of our

22:21

meetups that we host. And they

22:23

had an interesting use case where

22:26

they collaborated with this healthcare insurance

22:28

platform in Germany. And

22:30

one of the things that you see

22:32

with large language models is that these

22:35

large language models can't really produce German

22:37

language properly, mostly trained

22:39

on English text. And

22:41

that was also one of their

22:44

issues. And what they did was

22:46

actually a huge classification and generation

22:49

pipeline, combining a lot of these techniques

22:51

where they would initially get an email

22:53

in that they would classify to a

22:56

certain category, then based on the category,

22:58

they would kind of define what kind

23:00

of email template, what kind of front

23:02

template they would use. Then

23:05

based on the front template, they

23:07

would kind of start generating and

23:09

composing one of these response emails

23:11

that you would expect for like

23:13

a customer query requests coming in

23:15

for the healthcare insurance companies. And

23:17

then in order to actually ensure

23:19

that the formatting and phrasing and

23:21

the German language was applied properly,

23:23

they would then based on that

23:26

prompt, regenerate the email once more.

23:28

So prompt an LLM to kind

23:30

of improve the quality of the

23:32

initial proposed output. And then after

23:34

all of these different steps of

23:36

classification of retrieval, augmented generation of

23:38

an initial generation and

23:40

a regeneration, they would then end up

23:43

with their eventual output. So what we

23:45

see is that all

23:47

of these techniques are normally combined.

23:50

And also a thing that we

23:52

are strong believers in is that whenever

23:55

there is a smaller model or an

23:57

easier approach applicable, why not go for

23:59

that, instead of using one of these

24:01

super big large language models. So

24:03

if you can just classify is this relevant

24:06

or is this not relevant and based on

24:08

that, actually decide what to do. That makes

24:10

a lot of sense. And also,

24:13

one of the interesting things that I've seen one

24:16

of these open source platforms, Haystack, out there

24:19

using is also this query

24:22

classification pipeline, where they would

24:24

classify incoming queries as either

24:27

a key terminology

24:29

search, a question query,

24:31

or actually a phrase for an

24:33

LLM to actually start prompting LLM.

24:35

And based on that, actually redirect

24:37

all of their queries to the

24:39

correct model. And that's also an

24:41

interesting approach that we've seen. Quick

24:44

follow up on that. And it's

24:46

just something I wanted to draw out because we've

24:48

drawn it out across some other episodes a bit.

24:51

You were just making a recommendation go for

24:53

the smaller model versus the larger model. For

24:56

people trying to follow, and there's the

24:59

divergent mindsets, could you take just a

25:01

second and say why you would advocate

25:03

for that, what the benefit, what the

25:05

virtue is in the context of

25:07

everything else? I would say smaller

25:11

models are generally hostable by yourself.

25:13

So it's more private. Smaller

25:15

models, they are more cost efficient.

25:19

Smaller models can also be fine-tuned easier

25:21

to your specific use case. So even

25:24

what we see a lot of people coming

25:26

to us about is actually fine-tuning

25:28

LLMs. But even the big companies

25:31

out there with huge amounts of

25:33

money and resources and dedicated research

25:35

teams still have difficulties

25:37

for fine-tuning LLMs. So

25:40

whenever you, instead of, within your

25:42

retrieval augmented generation pipeline, fine-tuning

25:45

the LLM for the generation part,

25:47

you can actually choose to fine-tune

25:49

one of these retriever models that

25:51

you can actually fine-tune on consumer

25:54

grade hardware. You can actually fine-tune

25:56

it very easily on any arbitrary

25:58

data scientist developer device. And

26:00

then instead of having to

26:02

deploy anything on one of the cloud

26:04

providers, you can start with that. And

26:07

for a similar reason for a

26:10

RAC pipeline, whenever you provide an

26:12

LLM with garbage within such a

26:14

retrieval augmented generation pipeline, you actually also

26:16

ensure that there's less relevant content

26:19

and the output of the LLM

26:21

is also going to be worse.

26:24

Yeah, I've seen a lot of cases where,

26:27

I think it was Travis Fisher who was on the

26:29

show, he advocated for

26:31

this hierarchy of how you should approach these

26:33

problems and there's like maybe seven things on

26:36

his hierarchy that you should try before fine

26:38

tuning. And I think in a lot of cases I've

26:40

seen people maybe jump to that,

26:43

they're like, oh, I forget

26:45

which one of you said this, but this

26:47

naive RAG approach didn't get me quite

26:49

there, so now I need to fine

26:52

tune. When in reality, there's sort of

26:54

a huge number of things in between

26:56

those two places and you might end

26:58

up just getting a worse performing model,

27:00

depending on how you go about the

27:02

fine tune. One of the

27:04

things, David, you kind of walk through

27:06

these different, the example of

27:08

the specific company that had these

27:10

workflows that involved a variety of

27:12

different operations, which I

27:15

assume to Ben, you

27:18

mentioning earlier, starting with a test set

27:20

and that sort of thing and

27:22

how to think about the tasks.

27:25

I'm wondering if you can specifically

27:27

now talk just a little bit

27:29

about Argilla, specifically people might be

27:31

familiar generally with like data annotation,

27:33

they might be familiar, maybe

27:36

even with how to upload some data

27:38

to quote fine tune some of these

27:41

models in an API sense, or maybe

27:43

even in a more advanced way with

27:46

QLoRA or something like that, but

27:48

could you take a minute and

27:51

just talk through kind of Argilla's

27:53

approach to data annotation and data

27:55

collaboration? It's

27:57

kind of hard on a podcast because we don't have

28:00

a visual to show for people, but as best you

28:02

can help people to imagine

28:04

if I'm using Argilla to

28:07

do data collaboration, what

28:09

does that look like in terms of what I would

28:11

set up and who's involved, what

28:14

actions are they doing, that sort of thing? Argilla,

28:16

there's two sides to it, right?

28:20

There's a Python SDK, which is

28:22

intended for the AI machine learning

28:24

engineer, and there's a

28:27

UI, which is intended for your

28:29

domain expert. In reality,

28:31

the engineers often also use the UI, and

28:33

you kind of iterate on that as you

28:35

would because it gives you a representation of

28:37

your task. But there's these two

28:40

sides. The UI is kind

28:42

of lightweight. It can be deployed in a Docker

28:44

container or on Hugging Face Spaces, so it's really

28:46

easy to spin up. And

28:48

the SDK is really

28:51

about describing a feedback task and

28:53

describing the kind of information that

28:55

you want. So you

28:58

use Python classes to construct your

29:00

dataset settings. You'll say,

29:02

okay, my fields are a piece

29:04

of text, a chat,

29:07

or an image, and the questions

29:09

are a text question, so

29:11

like some kind of feedback, a comment, for

29:13

example, a label question,

29:16

so positive or negative labels, for

29:18

example, a rating,

29:21

say between one and five, or

29:23

a ranking. Example

29:25

one is better than example two, and you

29:27

can rank a set of examples. And

29:30

with that definition of a feedback

29:33

task, you can create

29:35

that on your server, in your

29:37

UI, and then you can push

29:39

what we call records, your samples,

29:42

into that dataset, and then they'll

29:45

be shown within the UI, and your

29:47

annotator can see all of the questions. They'll have

29:49

nice descriptions that were defined in the SDK. They

29:52

can tweak and kind of change those as well if you

29:54

need in the UI because that's a little bit easier. You

29:57

can distribute the task between a team. you

29:59

can say, OK, this record

30:01

will be accepted once we have at

30:04

least two reviews of it. And

30:06

you can say that some questions are required and

30:08

some aren't, and they can skip through some of

30:10

the questions. The UI has

30:12

loads of keyboard shortcuts, like with numbers and

30:15

arrows and returns. So you can move through

30:17

it really fast. It's kind of optimized for

30:19

that. And different screen sizes. One

30:22

thing we're starting to see is that, as

30:25

LLMs get really good at quite long

30:27

documents, that some of the stuff that

30:29

they're dealing with is a multi-page document

30:32

or a really detailed image and

30:34

then a chat conversation. And

30:37

then we want a comment and a

30:39

ranking question. So it's a

30:41

lot of information in the screen. So the

30:43

UI kind of scales a bit like an IDE. So

30:46

you can drag it around to give yourself

30:48

enough width to see all this stuff. And then

30:50

you can move through it in a reasonably

30:52

efficient way with keyboard shortcuts and stuff. Interesting.
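
For readers who want to picture the SDK side Ben describes, a minimal sketch of that flow might look roughly like the following. Exact class names and arguments depend on the Argilla version installed, and the server URL, API key, field names, and labels here are placeholders.

```python
import argilla as rg

# Connect to a running Argilla server (placeholder URL and key).
client = rg.Argilla(api_url="https://your-argilla-server", api_key="your-api-key")

# Describe the feedback task: what annotators see (fields) and what
# they are asked (questions), plus guidelines and task distribution.
settings = rg.Settings(
    guidelines="Label each article as positive or negative and rate its quality.",
    fields=[rg.TextField(name="article")],
    questions=[
        rg.LabelQuestion(name="sentiment", labels=["positive", "negative"]),
        rg.RatingQuestion(name="quality", values=[1, 2, 3, 4, 5]),
        rg.TextQuestion(name="comment"),
    ],
    # A record counts as complete once two annotators have responded.
    distribution=rg.TaskDistribution(min_submitted=2),
)

# Create the dataset on the server and push records for the UI.
dataset = rg.Dataset(name="article-feedback", settings=settings, client=client)
dataset.create()
dataset.records.log([{"article": "Example article text to review."}])
```

From there the annotators work entirely in the UI, and the collected responses come back through the same SDK for training or analysis.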

30:55

And what do you see as kind

30:57

of the backgrounds of the

30:59

roles of people that are using

31:01

this tool? Because one of the

31:04

interesting things from my perspective, especially

31:07

with this kind of latest

31:09

wave is there's

31:11

maybe less data scientists, kind of

31:13

AI people, with that background,

31:16

and more software engineers

31:18

and just non-technical domain experts. So

31:20

how do you kind of think

31:22

about the roles within

31:25

that and what are you seeing in terms

31:27

of who is using the system? For

31:29

us, I think it's, yeah, from

31:32

the SDK Python side, it's really

31:34

still developers. And then

31:36

from the UI side, it's like anyone

31:38

in the team that needs to have

31:40

some data labeled with domain knowledge. Often

31:43

these are also going to be like

31:45

the AI experts. And one

31:47

of the cool things is that whenever

31:49

an AI expert actually sets up a

31:51

data set besides these fields and questions,

31:53

they can actually come up with some

31:56

interesting features that they can add on top of the

31:58

data set. They are

32:00

also able to add semantic search,

32:02

like attach vectors, or a semantic representation

32:05

of the records, to each of the

32:07

records, which actually enables the users within

32:09

the UI to label way more efficiently.
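
As a sketch of what that looks like from the SDK side (again, class names and arguments vary by Argilla version, and the embedding model and scores here are only examples), vectors and model suggestions can be attached to records when they are logged:

```python
import argilla as rg
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

settings = rg.Settings(
    fields=[rg.TextField(name="article")],
    questions=[rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])],
    # A vector field enables "find similar records" (semantic search) in the UI.
    vectors=[rg.VectorField(name="article_embedding", dimensions=384)],
)
dataset = rg.Dataset(name="articles-with-vectors", settings=settings)
dataset.create()

texts = ["First article to review.", "Second article to review."]
records = [
    rg.Record(
        fields={"article": text},
        vectors={"article_embedding": encoder.encode(text).tolist()},
        # A (hypothetical) upstream model's draft label and confidence score,
        # which annotators can sort and filter on in the UI.
        suggestions=[rg.Suggestion(question_name="sentiment", value="positive", score=0.63)],
    )
    for text in texts
]
dataset.records.log(records)
```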

32:11

So for example, if someone

32:14

sees a very representative example of something

32:16

that's bad within their data set, they

32:18

can do a semantic search,

32:20

find the most similar documents, and then

32:22

continue with the labeling on top of

32:25

that. Besides that, you can also, for

32:27

example, filter based on model certainties. So

32:29

let's say that your model is very

32:31

uncertain about an initial prediction that you

32:33

have within your UI. And

32:36

it's really interesting for the domain expert

32:38

or for the data scientist to go

32:40

and have a look at that specific

32:42

record or that range of

32:45

uncertainties. And then based on that,

32:47

the labeling or the data curation

32:49

or whatever you would like to call

32:51

it becomes way more engaging and way

32:53

more interesting. And on top of

32:56

that, another thing that we are starting

32:58

to explore is actually using this AI

33:00

feedback and synthetic data within Argilla

33:02

as well. And that's actually

33:04

one of the other products that we're working

33:06

on, and it's called distilabel. So

33:09

nowadays, what you can do with

33:11

LLMs is also actually use LLMs

33:13

to evaluate questions, for

33:15

example, to evaluate whether something is

33:17

label A, B, or C, whether

33:20

something is a good or bad

33:22

response. And you see all

33:25

kinds of tools, open source tools out

33:27

there. And that's also a thing that

33:29

we are looking at for integrating with

33:31

the UI, where instead of doing this

33:33

more from a data science, SDK

33:35

perspective, users without any

33:37

technical knowledge would actually be able to

33:39

tweak these guidelines that Ben highlighted earlier

33:41

and then say, OK, maybe instead of

33:44

taking this into account, you should focus

33:46

a bit more on the

33:48

harm that potentially is within your data

33:50

or the risks that are within your

33:52

data. And then you would

33:54

be able to prompt an LLM once

33:56

again to label your data, and then

33:58

you wouldn't directly need the Python

34:01

SDK anymore. I was thinking about

34:03

as you were describing that I work at

34:05

a large organization and we certainly

34:07

have a lot of domain experts

34:09

in the organization I work at

34:11

that are either non-technical or semi-technical

34:14

and as users they

34:16

will sometimes find it intimidating you know kind

34:18

of getting into all this as they're starting

34:20

a project could you talk a little bit

34:22

about what it's like for a

34:25

non-technical person to sit down with Argilla

34:27

and start to work in a productive

34:29

way. What is that experience like

34:31

for them because it's one thing like the

34:33

technical people kind of just know they dive into

34:35

it they're gonna use the SDK they've used other

34:38

SDKs but there can

34:40

be a bit of handholding for people who

34:42

are not used to that could you describe

34:44

the user experience for that non-technical subject matter

34:46

expert coming in and what labeling is like

34:48

and just kind of paint a

34:51

picture of words on what their experience

34:53

might be like yeah I

34:55

mean one thing I guess I'd start off by

34:57

saying is that Argilla is kind

34:59

of the latest iteration of

35:02

a problem that has existed for a

35:04

long time in machine learning and data

35:06

science right about collecting feedback from domain

35:08

experts and it's kind of gone through spreadsheets

35:11

and various other tools

35:13

that were substandard and

35:16

really bad user experiences where

35:19

domain experts were asked for

35:21

information but information was

35:23

extracted and then models have been

35:25

trained really poorly on

35:28

that information so as a

35:30

field we kind of know that it's something that

35:32

we have to take really seriously, and that's

35:34

kind of what Argeela is built on top of

35:36

right that's part of our DNA as a product

35:38

is like optimizing the

35:42

feedback process as a user experience

35:44

problem and so when

35:46

the user sits down to use Argilla, the

35:49

intention is that all of

35:51

the information should be right there in

35:53

front of them inside their single like

35:55

record view so what that means

35:57

is they've got a set of guidelines that

36:00

are edited in Markdown. They

36:02

can contain images, links to various

36:04

pages or other external documents if they need

36:06

and they can just kind of scroll through

36:08

that. It's always there, it's always available. They've

36:11

then also got like basic metrics so

36:13

they'll know how many records they've got

36:15

left, how many they've labeled. They can

36:17

view that kind of team status and

36:19

see what's going on. And

36:21

then on the left they have their fields

36:24

which they can scroll through and on the

36:26

right they'll have a set of questions. As

36:29

I said they can move through these in

36:31

keyboard shortcuts and they can switch the view

36:33

so that they can scroll kind of infinitely

36:35

or they can move into a kind of

36:37

page swiping. If you're looking at

36:39

really small records with like a couple of lines

36:42

and you're just assigning a simple label to, you

36:45

can do that in bulk. So as

36:47

we said you could use a semantic

36:49

search, give me all the records

36:51

that are similar to this one and I'll bulk label those

36:54

or you could search for terms inside those

36:56

records and you can bulk label those. And

36:59

then once you're finished you'll know about

37:01

it. And one of the interesting things

37:03

that I've done personally quite often is

37:05

sit together with the

37:07

domain experts and their AI engineers

37:10

to kind of walk them through

37:12

how to configure Argilla most

37:14

usefully for both of them. And

37:16

then the domain experts come with a lot

37:18

of things to the table like I want

37:20

to see this specific representation. What if we

37:22

could do this, what if we could do

37:24

that. Then the AI engineers think

37:27

about like the data side of things. Is

37:29

this possible from our point of view from

37:31

our side? And then me as a mediator

37:33

so to say kind of make the

37:36

most out of the Argilla configuration. And

37:38

that's also how we see this collaboration

37:40

process going where domain experts really work

37:42

together also with AI engineers because AI

37:45

engineers or machine learning engineers actually know

37:47

what's possible from the data, what it

37:50

means to get high quality data for

37:52

fine-tuning a model. Because whenever a domain

37:55

expert comes up with something

37:57

that's useful for them in terms of

37:59

labeling, it doesn't necessarily mean that it's actually

38:01

proper data that's going to come out

38:03

of there in terms of fine tuning

38:05

a model. And that's also a part

38:07

of, I guess, the collaboration that we're

38:09

talking about. What's

38:26

up, friends? I've got something exciting to share

38:28

with you today. A sleep technology that's pushing

38:30

the boundaries of what's possible in our bedrooms.

38:33

Let me introduce you to Eight Sleep

38:35

and their cutting edge Pod 4 Ultra. I

38:37

haven't gotten mine yet, but it's on

38:39

its way. I'm literally counting the days.

38:42

So what exactly is the

38:44

Pod 4 Ultra? Imagine a

38:47

high-tech mattress cover that you can

38:49

easily add to any bed. But

38:51

this isn't just any cover. It

38:53

is packed with sensors, heating, and

38:55

cooling elements, and it's all controlled

38:57

by sophisticated AI algorithms. It's like

38:59

having a sleep lab, a smart

39:02

thermostat, and a personal sleep coach

39:04

all rolled into a single device.

39:07

It uses a network of sensors

39:09

to track a wide array of

39:11

biometrics while you sleep, sleep stages,

39:13

heart rate variability, respiratory rate, temperature,

39:16

and more. It uses

39:18

precision temperature control to regulate your

39:20

body's sleep cycles. It can

39:22

cool you down to a chilly 55 degrees

39:24

Fahrenheit or warm you up to a good,

39:27

nice, solid temperature of 110 Fahrenheit.

39:30

And it does this separately for

39:32

each side of the bed. This means

39:34

you and your partner can have your

39:37

own ideal sleep temperatures. But

39:39

the really cool part is that the

39:41

Pod uses AI and it

39:43

uses machine learning to learn your sleep

39:45

patterns over time. And it uses this

39:48

data to automatically adjust the temperature of

39:50

your bed throughout the night according to

39:52

your body's preferences. Instead of just giving

39:55

you some stats, it understands them and

39:57

it does something about it. Your

39:59

bed literally gets smarter as you

40:02

sleep over time. And all this functionality

40:04

is accessible through a comprehensive mobile app.

40:06

You get sleep analytics, trends over time,

40:09

and you even get a daily sleep

40:11

fitness score. Now, I don't have

40:13

mine yet. It is on its way, thanks to

40:15

our friends over at Eight Sleep. And I'm literally counting

40:17

the days till I get it, because I love

40:19

this stuff. But if you're ready to

40:22

take your sleep and your recovery to the next

40:24

level, head over to eightsleep.com/Practical

40:26

AI and

40:28

use our code Practical AI to get 350 bucks

40:30

off your very own

40:33

Pod 4 Ultra. And you can try it

40:35

free for 30 days. I

40:37

don't think you wanna send

40:39

it back, but you can

40:41

if you want to. They're

40:44

currently shipping to the US,

40:46

Canada, United Kingdom, Europe, and

40:48

Australia. Again, eightsleep.com/Practical AI. I

40:50

wanna maybe double click on

40:52

something that David, you

41:13

just said sort of in passing, which

41:15

I think is quite significant. I

41:17

don't know if some people might have

41:19

caught it, but when you were talking about distilabel,

41:21

you also talked about AI

41:23

feedback. So AI feedback and

41:26

synthetic data. So I'd love to get

41:28

into those topics a little bit. Maybe

41:30

first coming from the AI feedback side,

41:32

I think this is super interesting because,

41:36

Ben, you talked about how this is a

41:38

kind of more general problem that people have

41:40

been looking at in

41:43

various ways from various perspectives for

41:45

a long time in terms of

41:47

this data collaboration labeling piece. But

41:49

there is this kind of very

41:52

interesting element now where we have

41:54

the ability to utilize these very

41:57

powerful, maybe general purpose,

42:00

instruction following type of

42:02

models to actually

42:04

act as labelers within

42:06

the system or at least generate

42:09

drafts of labels or

42:12

feedback or even preferences

42:14

and scores and all of

42:16

those sorts of things. So

42:18

I'm wondering if one of you could speak

42:20

to that. Some people might

42:22

find this kind of strange that

42:25

we're kind of giving feedback

42:27

to AI systems with AI

42:29

systems, which seems circular

42:31

and maybe like why would that work

42:33

or just sort of maybe that's kind

42:36

of produces some weird feelings for people. But

42:38

I think it is a significant thing that

42:40

is happening. And so

42:42

yeah, either of you would want to

42:45

kind of dive into that. What does

42:47

it specifically mean in AI feedback? How

42:49

are you seeing that being used most

42:51

productively? So when we create

42:53

a data set either manually

42:55

or with AI feedback or AI

42:57

generation, we have all the

42:59

information there to understand the problem. We have

43:01

a set of guidelines, we have a set

43:04

of labels, definitions of those labels, with documents

43:06

and definitions of those documents. We

43:08

give those to a manual annotator or

43:10

we'll go out and collect those documents and

43:12

we'll give those documents to the manual annotator.

43:15

And we're trying to describe that problem so that the person understands

43:17

it to create the data. We

43:19

can essentially take all of the same resources

43:21

and give those to an LLM and get

43:23

the LLM to perform the same steps. So

43:25

there's two parts to that. There's

43:28

a generative part where the LLM

43:30

can generate documents. So

43:32

let's say we've got a hundred documents in our

43:34

data set that we want 10,000. We

43:37

can say generate a document like

43:40

this one, but add variation

43:42

on top of that. And

43:45

we can fan out our data set, our

43:47

documents from 100 to 10,000. We

43:50

could then take those same documents or

43:53

a pool of documents from elsewhere and

43:56

we could get feedback on that. So that

43:58

could be qualitative feedback: tell

44:00

me which of these documents are relevant to

44:03

this task, tell me which

44:05

of these documents are of a high quality,

44:07

are concise, are detailed,

44:09

these kind of attributes. So we could

44:12

filter down our large dataset or our

44:14

generated dataset to the best document. We

44:17

could also add labels. So we could say, tell

44:20

me which of these documents relates to my business

44:22

use case or not, these kinds

44:24

of things, apply topics to

44:26

these documents. And then

44:28

we can, in doing so, create a classification

44:30

dataset from those labels. Or

44:33

we could, in one

44:35

example, take a set of documents and

44:37

use a generative model to generate questions

44:39

or queries about those documents. And we

44:41

could use that to create a Q&A

44:44

dataset or a retrieval dataset

44:47

where we generate search queries based

44:49

on documents. When you're doing that

44:51

and you're generating the datasets with

44:54

another model, how

44:56

much do you have to worry about hallucination playing into

44:58

that? It sounds like you have a good process for

45:00

trying to catch it there.

45:03

But is that a small issue?

45:05

Is that a larger issue? Any guidance

45:07

on that? That's one of the

45:09

main issues, definitely. It is

45:11

probably the main issue. And

45:13

so really, it's about both sides

45:16

of that process that I described,

45:18

that generating side and that evaluating

45:20

side. So you get the

45:23

large-scale models to do as much as

45:25

possible to expose hallucination by

45:27

evaluating themselves. And

45:29

typically, you're getting larger models

45:31

to evaluate. So they're

45:33

a more performant model and they should

45:35

hallucinate less. The task

45:38

of identifying hallucinations is not the

45:40

same as generating a document. So

45:42

typically, LLMs are better at identifying

45:44

hallucinations and nonsense. If you give

45:46

them the context, than they

45:48

are at not generating it. And

45:51

so you combine that within a pipeline.

45:54

And then you would take that to a domain

45:56

expert in a tool like Argilla.

45:58

And so that's really why we

46:00

have these two tools, distilabel and

46:02

Argilla, because without

46:04

Argilla, distilabel would

46:07

suffer from a lot of those problems. Yeah,

46:10

and I guess that brings us to the

46:12

second tool, distilabel, which I know

46:15

has something to do with this

46:17

synthetic data piece as well. And I'm really

46:19

intrigued to hear about this, because I also

46:21

see some of what

46:23

you have on the documentation

46:26

about what are people building with

46:28

distilabel. I do note

46:30

a couple of data

46:32

sets like the OpenHermes data

46:34

set, the Intel Orca DPO data

46:36

set. These are data sets that

46:38

have been part of

46:41

the lineage of models that

46:43

I found very, very useful.

46:45

So first off, thanks for

46:47

building tooling that's created really

46:50

useful models in my own life.

46:52

But beyond that, David,

46:55

do you want to go into a little bit

46:57

about what distilabel is and maybe even tie

46:59

into some of those things and how it's proven

47:01

to be a useful piece of

47:03

the process in

47:05

creating some of those models? I

47:08

think the idea of

47:10

distilabel started about half

47:13

a year ago, more or less, or maybe

47:15

a year ago, where we saw these initial

47:19

new models coming out, like

47:21

Alpaca and Dolly from Databricks,

47:23

Alpaca from Stanford, where

47:25

there were data sets

47:27

being generated

47:30

with OpenAI frontier models being evaluated

47:32

with OpenAI frontier models, and then

47:34

published and actually used for fine

47:37

tuning one of these models. So

47:39

apparently there were research groups

47:41

or companies investing time in this. But what we

47:43

also saw is when we would upload these data

47:46

sets into Argilla, actually started looking at

47:48

the data that there were a lot

47:50

of flaws within there. And then whenever

47:52

like UltraFeedback, which is one

47:55

of these specific papers that really started

47:57

to scale the synthetic data and AI

47:59

feedback concept, came out, we thought,

48:01

okay, maybe it's worth to look

48:03

into a package that can actually

48:06

help us facilitate creating

48:08

datasets that we can then eventually

48:10

fine-tune within Argilla. And

48:12

that's when we started work on the initial

48:14

version of distilabel. So it's kind

48:20

of like an application framework, like Llama

48:22

Index or LangChain, if you're

48:22

familiar with those, but then specifically

48:24

focused on synthetic data generation and

48:27

AI feedback. What

48:29

we try to do is organize

48:31

everything into this pipelining structure,

48:33

where you have either steps

48:35

that are about basic data

48:37

operations, tasks that

48:39

are about prompt templates or prompting

48:43

and prompt templates. You can think about

48:45

either providing feedback, maybe rewriting some initial

48:47

input that you provide to that prompt

48:49

template, or maybe ranking or

48:52

generating from scratch or these kind of

48:54

things. And then

48:56

these tasks are actually executed

48:58

by LLMs, and these are

49:00

then all fit together within

49:02

a pipelining structure. The

49:05

thing for these tasks is

49:07

that nowadays we actually look at

49:09

all of the most recent research

49:11

implementations or most recent research papers,

49:13

and we try to implement them

49:15

whenever they come out and are

49:17

actually relevant for synthetic data generation.

49:19

So you really go from the

49:22

finicky prompt engineering, so to

49:24

say, to evaluated prompts that

49:27

we've implemented. And

49:29

the nice thing about our pipelining

49:31

structure is also that we run

49:33

everything asynchronously. So there's multiple LLM

49:35

executions being done at once, which

49:38

will really speed up your pipeline.
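
A minimal sketch of that pipelining structure, in the spirit of distilabel's API, is shown below. Import paths and class names differ between distilabel versions, and the seed instruction and model name are placeholders, so treat this as an outline rather than copy-paste code.

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="synthetic-demo") as pipeline:
    # A step for basic data operations: load a few seed rows.
    load_seeds = LoadDataFromDicts(
        data=[{"instruction": "Write a short reply to a customer asking about a delayed order."}]
    )
    # A task wraps a prompt template and is executed by an LLM.
    generate = TextGeneration(llm=OpenAILLM(model="gpt-4o-mini"))
    # Steps are wired together into the pipeline.
    load_seeds >> generate

if __name__ == "__main__":
    distiset = pipeline.run()  # intermediate results are cached, so reruns can resume
```

The resulting dataset can then be pushed into Argilla for the kind of human review discussed earlier.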

49:41

And on top of that, we also cache all

49:43

of the intermediate results. So as

49:45

you can imagine, calling the OpenAI API

49:47

can be quite costly. And

49:49

whenever you run a pipeline, a lot of things

49:52

can go wrong. But

49:55

whenever you actually rerun our pipelines within

49:57

Distilabel, you actually have these cached

49:59

results already there, so we avoid

50:01

kind of incurring additional costs whenever something

50:04

within the pipeline breaks.
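For readers who want to see that steps/tasks/LLMs structure in code, here is a minimal sketch. It is not taken from the episode; it roughly follows the distilabel 1.x Python API as we understand it, so the module paths, class names, and the `use_cache` flag should be treated as assumptions that may differ between versions:

```python
# Sketch of a Distilabel-style pipeline: a data-loading step feeds a
# prompt-template task that is executed by an LLM. Module paths, class
# names, and parameters are our best guess at the 1.x API.
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="synthetic-data-demo") as pipeline:
    # Step: a basic data operation that loads a few seed instructions.
    load = LoadDataFromDicts(
        data=[{"instruction": "Explain AI feedback in one sentence."}]
    )
    # Task: a prompt template executed by an LLM backend (OpenAI here,
    # but any supported provider or a local model should also work).
    generate = TextGeneration(llm=OpenAILLM(model="gpt-4o-mini"))
    # Wire the step into the task to form the pipeline graph.
    load >> generate

if __name__ == "__main__":
    # LLM calls are executed asynchronously, and intermediate results are
    # cached, so rerunning after a mid-pipeline failure should not re-pay
    # for generations that already completed.
    distiset = pipeline.run(use_cache=True)
```

The caching David describes is what makes iteration cheap: if a later step fails, a rerun picks up the cached upstream outputs instead of calling the API again.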

50:06

Yeah, that's awesome. And I know that one

50:08

element of this is the

50:11

kind of creation of synthetic

50:13

data for further fine tuning

50:15

LLMs to increase performance

50:18

or maybe to some sort

50:20

of alignment goal or something like that. But

50:23

also I know from working

50:25

with a lot of healthcare

50:28

companies, manufacturers, others

50:31

that are more security

50:33

and privacy-conscious in

50:35

my day job. Part of the pitch

50:38

around synthetic data is maybe

50:40

also creating data sets that

50:43

might not kind of poison

50:45

LLMs with a bunch of your own

50:47

sort of private information that could be

50:50

sort of exposed as part of an

50:52

answer when someone prompts the

50:54

model in some way. This data is embedded

50:56

in the data set and all of that.

50:58

So yeah, I would definitely encourage people to

51:01

check out Distilabel. And you said

51:03

it's been around for half a year or

51:05

so. How have you seen

51:08

the kind of usage and adoption

51:10

so far? The usage

51:12

and adoption has been quite good in

51:14

terms of the number of data

51:16

sets that have been released. So you

51:19

mentioned the Intel Orca DPO data set,

51:21

which was an example use case of

51:23

how we were initially

51:26

using it, where we had this original

51:28

data set that had been labeled by

51:30

Intel employees with preferences

51:33

of what would be like the preferred

51:35

response to a given prompt. And

51:38

we actually used Distilabel to

51:40

kind of clean that based on

51:42

prompting LLMs ourselves to re-evaluate

51:44

these chosen/rejected pairs within the

51:47

original data set, filtering

51:49

out all of the ambiguities. So

51:51

sometimes the LLM wouldn't align with

51:53

the original chosen/rejected pair.

51:56

And based on that, we were actually able to scale

51:58

down the data set by 50 percent, leading

52:01

to less training time, and also

52:04

leading to a higher-performing model.
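As a rough illustration of that cleaning step, and not the actual code behind the Intel Orca DPO work, the idea is to ask an LLM judge which of the two responses it prefers and keep only the rows where the judge agrees with the original chosen/rejected labels. The judge model, prompt wording, and data layout below are all illustrative assumptions:

```python
# Sketch of LLM-based re-evaluation of chosen/rejected preference pairs.
# Rows where the judge disagrees with the original label are treated as
# ambiguous and filtered out, shrinking the dataset to the clear-cut pairs.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Given the prompt and two candidate responses, reply with
only "A" or "B" for the response you consider better.

Prompt: {prompt}

Response A: {a}

Response B: {b}
"""


def judge_prefers_chosen(row: dict) -> bool:
    """Return True if the LLM judge picks the originally 'chosen' response."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model would do
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    prompt=row["prompt"], a=row["chosen"], b=row["rejected"]
                ),
            }
        ],
    )
    verdict = (completion.choices[0].message.content or "").strip().upper()
    return verdict.startswith("A")


def clean(dataset: list[dict]) -> list[dict]:
    """Keep only the pairs where the judge agrees with the original labels."""
    return [row for row in dataset if judge_prefers_chosen(row)]
```

A production version would also swap the A/B order between calls to control for position bias and would batch the requests asynchronously, which is the kind of plumbing a Distilabel pipeline handles for you.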

52:06

And that was one of the really famous

52:08

examples that inspired some people within the

52:11

open source community to actually start looking

52:13

at Distilabel and to start using it

52:15

to generate data sets. There

52:17

are some Hugging Face teams that

52:19

have actually been generating millions and

52:22

millions of rows of synthetic data

52:24

using Distilabel. And that's pretty

52:26

cool to see that people are

52:28

actually using it at scale. And

52:31

besides that, there's also these smaller companies,

52:33

so to say, like

52:36

Allemind, the

52:38

German startup that I mentioned

52:40

before, using it to also

52:43

rewrite and resynthesize emails

52:45

within actual production use

52:47

cases. It's really fascinating.

52:49

You guys are pushing the state of the

52:51

art in a big

52:54

way. With the work that you've

52:56

done in Distilabel and

52:58

Argilla, where do you think things

53:00

are going? Like when you're kind

53:02

of at the end of whatever your task is for the

53:04

day and you're kind of just letting your mind

53:06

wander and thinking about the

53:09

future, where do each of y'all

53:11

go in terms of what you think's going

53:13

to happen, what you're excited about, what you're

53:15

hoping will happen, what you might be

53:17

working on in a few months or maybe a year

53:19

or two? What are your thoughts? I

53:21

suppose for me, it's about two

53:23

main things. And the

53:25

first would be modalities. So moving

53:28

out of text and into image

53:30

and audio and video and

53:32

also kind of UX environments

53:35

so that maybe in Argilla, but

53:37

also in Distilabel, we can generate

53:39

synthetic data sets in different modalities and

53:41

that we can review those. And that's

53:45

a necessity and something that we're already working on and

53:47

we've already got features around, but we've got kind of

53:49

more coming. And then the second

53:52

one, which I suppose is a bit

53:54

more far-fetched, and that's a bit

53:56

more about kind of tightening the loop

53:58

between the various parts of the application.

54:00

So between Distilabel, Argilla, and

54:02

the application that you're building,

54:05

so that you can deal with feedback as

54:07

it's coming from your domain expert that's using

54:09

your application and potentially Argilla at the same

54:11

time. So we can kind of

54:13

synthesize on top of that to evaluate that feedback

54:16

that we're getting and generate based on that feedback.

54:19

So we can add that into Argilla and

54:21

then we can respond to that synthetic

54:24

generation, that synthetic data. And

54:26

then we can use that to train

54:28

our model, this kind of tight loop

54:30

between the end user, the application and

54:32

our feedback. Yeah, and for

54:34

me, it kind of aligns with what

54:36

you mentioned before, man, like the multi-modality,

54:38

smaller and more efficient models, things that

54:40

can actually run on a device. I've

54:43

been playing around with this app this morning

54:45

that you can actually load a local

54:47

LLM into, like a smaller

54:50

Qwen or a Llama model

54:52

from Meta. And it actually runs

54:54

on an iPhone 13, which is really cool.

54:56

It's private, it runs quite quickly. And

54:59

the thing that I've been wanting to

55:01

play around with is the speech-to-speech

55:03

models, where you can actually have

55:05

real-time speech-to-speech. I'm currently learning

55:07

Spanish at the moment. And one of

55:09

the difficult things there is not being

55:12

confident enough to actually talk to people out

55:14

on the streets and these kinds of things.

55:16

So whenever you would be able to kind

55:19

of practice that at home privately on your

55:21

device, talk some Spanish into

55:23

an LLM, get some Spanish back, maybe some

55:25

corrections in English. These kinds of scenarios are

55:28

super cool for me whenever they would be able to

55:31

come true. Yeah,

55:34

this is Muy Bueno. And yeah,

55:36

I've been really excited to talk to

55:38

you both and would love to have

55:40

you both back on the show sometime

55:42

to update on those things. Thank you

55:45

for what you all are doing, both

55:47

in terms of the tooling, and

55:49

at Argilla and Hugging Face

55:51

more broadly, for how you're

55:53

driving things forward in the

55:55

community and especially the open

55:57

source side. Thank

56:00

you for taking time to talk with us and

56:03

hope to talk again soon. Yeah,

56:05

thank you. And thanks for having us. Thank

56:14

you. All right,

56:16

that is Practical AI for this

56:18

week. Subscribe now.

56:21

If you haven't already, head

56:23

to practicalai.fm for all the

56:25

ways. And join our

56:27

free Slack team where you can hang

56:29

out with Daniel, Chris, and the entire

56:31

changelog community. Sign

56:33

up today at practicalai.fm

56:36

slash community. Thanks

56:38

again to our partners at fly.io, to

56:41

our beat freaking residents, Breakmaster Cylinder, and

56:43

to you for listening. We appreciate you

56:45

spending time with us. That's

56:48

all for now. We'll talk to you again

56:50

next time. Bye.
