Episode Transcript
0:06
Welcome to Practical AI. If
0:09
you work in artificial intelligence, aspire
0:12
to, or are curious how
0:14
AI-related tech is changing the world,
0:17
this is the show for you.
0:20
Thank you to our partners at
0:22
fly.io, the home of changelog.com. Fly
0:26
transforms containers into micro VMs that
0:28
run on their hardware in 30
0:30
plus regions on six continents. So
0:32
you can launch your app near
0:34
your users. Learn more at
0:36
fly.io. Okay
0:45
friends, I'm here with Annie Sexton over
0:47
at Fly. Annie, you know we use
0:50
Fly here at Changelog. We love Fly.
0:53
It is such an awesome platform and we love building on
0:55
it. For those who don't
0:57
know much about Fly, what's special
0:59
about building on Fly? Fly gives
1:01
you a lot of flexibility, like a
1:03
lot of flexibility on multiple fronts. And
1:06
on top of that you get, so I've
1:08
talked a lot about the networking and
1:10
that's obviously one thing, but there's various
1:12
data stores that we partner with that
1:15
are really easy to use. Actually
1:17
one of my favorite partners is Tigris.
1:20
I can't say enough good things about
1:22
them when it comes to object storage.
1:25
I never in my life thought I would have so many
1:27
opinions about object storage, but I do now. Tigris
1:29
is a partner of Fly and it's
1:31
S3 compatible object storage that basically
1:33
seems like it's a CDN, but
1:35
it's not. It's basically object storage
1:37
that's globally distributed without needing to
1:39
actually set up a CDN at
1:41
all. It's like automatically distributed
1:44
around the world. And it's also incredibly
1:46
easy to use and set up. Like
1:48
creating a bucket is literally one command.
1:50
So it's partners like that, that I
1:52
think are this sort of extra icing
1:54
on top of Fly that really makes
1:56
it sort of the platform that has
1:58
everything that you need. So we
2:00
use Tigris here at Changelog. Are they
2:02
built on top of Fly? Is this one
2:05
of those examples of being able to build
2:07
on Fly? Yeah, so Tigris is built on
2:09
top of Fly's infrastructure and that's what allows
2:11
it to be globally distributed. I do have
2:14
a video on this, but basically the way
2:16
it works is whenever, like let's say a
2:18
user uploads an asset to a particular bucket.
2:20
Well, that gets uploaded directly to the region
2:22
closest to the user, whereas with a CDN,
2:25
there's sort of like a centralized place where
2:27
assets need to get copied to and then
2:29
eventually they get sort of trickled out to
2:31
all of the different global locations. Whereas with
2:33
Tigris, the moment you upload something, it's available
2:36
in that region instantly and then it's eventually
2:38
cached in all the other regions as well
2:40
as it's requested. In fact, with Tigris, you
2:42
don't even have to select which regions things
2:44
are stored in. You just get these regions
2:47
for free. And then on top of that,
2:49
it is so much easier
2:51
to work with. I feel like the
2:53
way they manage permissions, the way they
2:56
handle bucket creation, making things public or
2:58
private is just so much
3:00
simpler than other solutions. And the good news is
3:02
that you don't actually need to change your code
3:05
if you're already using S3. It's S3 compatible,
3:07
so like whatever SDK you're using is probably
3:10
just fine and all you got to do
3:12
is update the credentials. So it's super easy.
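To make the "just update the credentials" point concrete, here is a minimal sketch of what that could look like with boto3; the endpoint URL and environment variable names are assumptions, so check the Tigris and Fly docs for the exact values your account uses.

```python
# Minimal sketch: pointing an existing S3 client at Tigris instead of AWS.
# The endpoint URL and environment variable names are assumptions; check the
# Tigris/Fly documentation for the values your account actually uses.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("AWS_ENDPOINT_URL_S3", "https://fly.storage.tigris.dev"),
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# The rest of the code is unchanged from a normal S3 integration.
s3.upload_file("local-file.txt", "my-bucket", "remote-key.txt")
```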
3:14
Very cool. Thanks, Annie. So
3:16
Fly has everything you need.
3:18
Over 3 million applications, including
3:20
ours here at Changelog, multiple
3:22
applications, have launched on Fly,
3:24
boosted by global anycast load
3:26
balancing, zero configuration, private networking,
3:28
hardware isolation, instant WireGuard
3:31
VPN connections, push-button deployments
3:33
that scale to thousands of
3:35
instances. It's all there for
3:37
you right now. To deploy
3:39
your app in five minutes,
3:41
go to fly.io. Again, fly.io.
3:56
Welcome to another episode of the
3:58
Practical AI podcast. This is Daniel
4:00
Whitenack. I am CEO at Prediction
4:03
Guard where we're building a private
4:05
secure gen AI platform. And I'm joined
4:07
as always by Chris Benson, who is
4:09
a principal AI research engineer at Lockheed
4:12
Martin. How are you doing, Chris? Great
4:14
today, Daniel. How are you? It's
4:16
a beautiful fall day and a good
4:19
day to
4:23
take a walk around the block and think about
4:25
interesting AI things and clear your
4:27
mind before getting back
4:29
into some data collaboration, which
4:32
is what we're going to talk about
4:34
today. Chris, I don't know if you
4:36
remember our conversation. It
4:38
was just me on that one, but with Bing
4:41
Sun Chua, who talked about
4:43
broccoli AI, the type of
4:45
AI that's healthy for organizations.
4:48
And in that episode, he made a
4:51
call out to Argilla, which was a
4:53
big part of his solution
4:55
that he was developing in
4:57
a particular vertical. I'm really
5:00
happy today that we have with us
5:02
Ben Burtenshaw, who is a machine
5:04
learning engineer at Argilla, and
5:06
also David Berenstein,
5:08
who is a developer advocate
5:10
engineer working on building Argilla
5:12
and Distilabel at Hugging
5:15
Face. Welcome, David and Ben.
5:17
Thank you. Great to be here. Hi. Thanks
5:20
for having us. Yeah, so like I was saying,
5:22
I think for some
5:24
time maybe if you're
5:26
coming from a data science perspective, there's
5:30
been tooling maybe around
5:32
data that manages
5:34
training data sets or evaluation
5:36
sets or maybe MLOps tooling
5:38
and this sort of thing.
5:41
And part of that has to do with preparation
5:43
and curation of data sets. But
5:46
I found interesting, I mentioned the
5:48
previous conversation with Bing Sun. He
5:50
talked a lot about collaborating
5:52
with his subject matter experts
5:54
and his company around the
5:56
data sets he was creating
5:58
for text classification. And
6:01
that's where Argilla came up. So I'm
6:03
wondering if maybe one of you could
6:05
talk a little bit at a higher
6:07
level when you're talking about
6:11
data collaboration in the context
6:13
of the current kind of
6:15
AI environment. What does that
6:17
mean generally? And how would
6:19
you maybe distinguish that from
6:21
previous generations of tooling in
6:23
maybe similar or different ways?
6:26
So data collaboration, at least from our point
6:28
of view, is the collaboration between
6:30
both the domain level experts that really
6:33
have high domain knowledge, actually
6:35
know what they're talking about, in
6:37
terms of the data, the inputs, and the outputs
6:39
that the models are supposed to give within
6:42
their domain. And then you have the data scientists,
6:44
or the AI engineers, and this side of the
6:47
coin that are more technical. They know from a
6:49
technical point of view what the models expect and
6:51
what the models should output. And
6:53
then the collaboration between them is, yeah,
6:55
now even higher because nowadays you can
6:57
actually prompt these models with natural language,
7:00
and you actually need to ensure that
7:02
both the models actually
7:04
perform well, and also the prompts and
7:06
these kind of things. So the collaboration
7:08
is even more important nowadays. And
7:11
that's also the case for, still the case
7:13
for text classification models and these kind of things,
7:15
which we also support within our tool. I
7:18
guess maybe in the context of, let's say
7:20
there's a new team that's
7:23
exploring the adoption of AI technology, maybe
7:25
for the first time. Maybe they're not
7:28
coming from that data science background, the
7:30
sort of heavy MLOps stuff,
7:32
but maybe they've been excited
7:34
by this latest wave of
7:36
AI technologies. How would
7:39
you go about helping them understand
7:41
how their own data, the data
7:43
that they would curate, the data
7:45
that they would maybe collaborate on
7:48
is relevant to and where that fits
7:50
into a certain workflow.
7:52
So yeah, I imagine someone may
7:54
be familiar with what you can
7:57
do with ChatGPT
7:59
or pasting in certain documents
8:01
or other things. And
8:03
now they're kind of wrestling through how
8:05
to set up their own domain specific
8:07
AI workflows in their organization. What
8:10
would you kind of describe about
8:12
how their own domain data and
8:14
how collaborating around that fits into
8:16
common AI workflows? Yeah, so something
8:19
that I like to think about a
8:21
lot around this subject is like machine
8:24
learning textbooks. And they often
8:26
talk about modeling a problem as well
8:28
as building a model, right? There's a
8:30
famous cycle of modeling and building. And
8:33
in that, when you model a problem,
8:35
you're basically trying to explain and define
8:37
the problem. So I have articles and
8:39
I need to know whether
8:42
they are a positive or
8:44
negative rating. And I'm describing
8:46
that problem. And then I'm
8:48
going to need to describe that problem to
8:51
a domain expert or an annotator through guidelines.
8:54
And when I can describe that problem in such a
8:56
way that the annotator or the
8:58
domain expert answers that question clearly
9:00
enough, then I know that that's a modeled and
9:03
clear problem. And it's something that I could then
9:05
take on to build a model around. In
9:08
simple terms, it makes sense. And
9:11
so I think when you're going
9:13
into a new space like generative AI,
9:15
and you're trying to understand your business
9:17
context around these tools, you
9:20
can start off by modeling the problem in
9:22
simple terms, by looking at the data and
9:24
saying, okay, does this label make sense for
9:26
these articles? If I sort all these articles
9:29
down by these labels or by this ranking,
9:31
are these the kinds of things I'm expecting?
9:34
Starting off at quite low numbers, right? Like single
9:37
articles and kind of building up to tens
9:39
or hundreds. And as you do
9:41
that, you begin to understand and also
9:43
iterate on the problem and kind of change it and adapt
9:45
it as you go. And once
9:47
you've got up to a reasonable scale of the
9:49
problem, you can then say, right, this is something
9:51
that a machine learning model could learn. I
9:55
guess on that front, maybe one
9:58
of the big confusions that I've
10:00
seen floating around these days is
10:02
the kind of
10:05
data that's relevant to some
10:07
of these workflows. So it
10:09
might be easy for people to
10:11
think about a labeled
10:14
data set for a text classification
10:16
problem, right? Like here's this text
10:18
coming in, I'm going to label
10:20
it spam or not spam or
10:22
in some categories. But I think
10:24
sometimes a sentiment that I've got
10:26
very often is, hey, our company
10:28
has this big file store of
10:30
documents. And somehow
10:32
I'm going to fine
10:35
tune, quote unquote, a
10:37
generative model with just this blob
10:39
of documents and then it will
10:41
perform better for me. And there's
10:44
two elements of that that are
10:46
kind of mushy. One
10:48
is like, well, to what end for
10:50
what task? What are you trying to do? And
10:52
then also how you curate that
10:55
data then really matters. Is
10:57
this a sentiment that you all are
10:59
seeing or how for this latest wave
11:01
of models, like how would
11:03
you describe if a company has a bunch
11:06
of documents and they're in this situation,
11:08
they're like, hey, we know we have
11:10
data and we know that these models
11:12
can get better. And maybe
11:14
we could even create our own private
11:16
model with our own domain of data.
11:18
What would you walk them through to
11:21
explain where to start with that process
11:23
and how to start curating their data
11:25
maybe in a less general way,
11:27
but towards some end? I think
11:30
in these scenarios, it's always good to
11:32
first establish a baseline or a benchmark
11:34
because what we often see is that
11:36
people come to us or come to
11:38
like the open source space. They say,
11:40
okay, we really want to fine tune
11:42
a model. We really want to do
11:44
like a super extensive RAG pipeline with
11:46
all of the bells and whistles included
11:48
and then kind of start working on
11:50
these documents. But what we often
11:52
see is that they don't even have a baseline
11:54
to actually start with. So that's
11:56
normally what we recommend also whenever you work
11:58
with a RAG pipeline, ensure that
12:00
all of the documents that you index
12:03
are actually properly indexed, properly chunked. Whenever
12:06
you actually execute a pipeline and
12:08
store these retrieved documents,
12:10
based on the questions
12:12
and the queries, in Argilla
12:14
or any other data annotation tool,
12:17
you can actually have a look at the
12:19
documents, see if they make sense, see if
12:22
the retrieval makes sense, but also if the
12:24
generated output makes sense. And then whenever you
12:26
have that baseline set up from there, actually
12:28
start iterating and kind of making additions to
12:31
your pipeline. Shall I add re-ranking potentially to
12:33
the retrieval, if the retrieval isn't functioning
12:35
properly? Shall I add a fine-tuned version of
12:37
the model? Should I switch from the latest
12:40
Llama model of 3 billion to
12:42
7 billion parameters or these kind of things?
12:45
And then from there on, you can actually consider
12:47
maybe either fine-tuning a model if that's actually needed or
12:49
fine-tuning one of the retrievers or these
12:51
kind of things.
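As a rough illustration of the baseline step described above, here is a small sketch of capturing the question, the retrieved chunks, and the generated answer so a human can review them; `retrieve` and `generate` are hypothetical stand-ins for whatever retriever and LLM call your own pipeline uses.

```python
# Minimal sketch of capturing a RAG baseline for human review.
# `retrieve` and `generate` are hypothetical stand-ins for your own
# retriever and LLM call; swap in whatever your pipeline uses.
import json

def run_baseline(questions, retrieve, generate, out_path="baseline.jsonl"):
    with open(out_path, "w") as f:
        for q in questions:
            chunks = retrieve(q)          # list of retrieved text chunks
            answer = generate(q, chunks)  # generated answer given those chunks
            record = {"question": q, "chunks": chunks, "answer": answer}
            f.write(json.dumps(record) + "\n")
    return out_path
```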
12:54
As you're saying that, as you're speaking from this kind of profound
12:56
expertise you have, and I think a lot of
12:58
folks really have trouble just
13:00
getting started. And you asked some
13:02
great questions there, but
13:04
I think some of those are really tough for someone
13:06
who's just getting into it, like which way to go
13:08
and some of the selections that you would go with
13:11
that. Could you talk a little
13:13
bit about, kind of go back over the same
13:15
thing, but kind of make up a little workflow
13:17
that's kind of hands-on on just like you might
13:19
see this and this is how I would decide
13:21
that just for a moment, just so people can
13:23
kind of grasp kind of the thought
13:26
process you're going, because you kind of described
13:28
a process, but if you could be a
13:30
little bit more descriptive about that, I
13:33
think when I talk to people, once they get going they kind
13:35
of go to the next step and go to the next step
13:37
and go to the next step, but the first four
13:39
or five big question marks at
13:41
the beginning, they don't know which one to handle. I
13:44
can add some practical steps onto that that
13:46
I've worked with in the past. That'd be
13:48
fantastic. Yeah, so
13:51
one thing that you can do that
13:53
is really straightforward is actually
13:55
to write down a list of the kinds
13:57
of questions that you're expecting your system to
13:59
have to answer. And
14:02
you can get that list by speaking to domain experts
14:04
or if you are a domain expert you can write
14:06
it yourself, right? And it doesn't
14:09
need to be an extensive exhaustive list, it can
14:11
be quite a small starting set. You
14:13
can then take those questions away and start
14:15
to look at documents or pools and sections
14:18
of documents from this lake that
14:20
you potentially have and associate
14:22
those documents with those questions and
14:25
then start to look if a
14:28
model can answer those questions with those
14:30
documents. In fact, by
14:32
not even building anything, by starting
14:35
to use ChatGPT or Hugging
14:37
Chat or any of these kind of
14:39
interfaces and just seeing this very very
14:42
low simple benchmark, see is
14:44
that feasible? Whilst at
14:46
the same time starting to ask yourself
14:48
can I as a domain expert answer
14:50
this? And that's kind of where
14:52
Argilla comes in at the very first step. So you start to
14:54
put these documents in front
14:56
of people with those questions and you
14:59
start to search through those documents and
15:01
say to people can you answer this question or here's
15:04
an answer from a
15:06
model to this question in a very
15:08
small setting and you
15:10
start to get basic early signals of
15:12
quality and from there
15:15
you would start to introduce
15:17
proper retrieval. So you would
15:19
scale up: you
15:21
would take all of your documents, say you had
15:23
a hundred documents associated with your ten questions, you
15:26
put all those hundred documents in an index and iterate
15:28
over your ten questions and see
15:31
okay are the right documents aligning with the
15:33
right questions here? Then you start
15:35
to scale up your documents and make it more and
15:37
more of a real-world situation. You would
15:39
start to scale up your questions. You
15:41
could do both of these synthetically and
15:44
then if you still started to see positive
15:47
signals you could start to
15:49
scale and if you start to see negative
15:51
signals I'm no longer getting the right documents
15:53
associated with the right questions.
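Here is one possible sketch of that "are the right documents aligning with the right questions" check; `search` is a hypothetical wrapper around whatever index you built, returning the ids of the top-k retrieved documents for a question, and the example mapping is a placeholder.

```python
# Sketch of the "do the right documents line up with the right questions" check.
# `search` is a hypothetical function wrapping whatever index you built; it
# should return the ids of the top-k retrieved documents for a question.
def recall_at_k(questions_to_expected_docs, search, k=5):
    hits = 0
    for question, expected_ids in questions_to_expected_docs.items():
        retrieved = set(search(question, k=k))
        if retrieved & set(expected_ids):
            hits += 1
    return hits / len(questions_to_expected_docs)

# Example: ten questions, each mapped to the documents that should answer it.
# eval_set = {"How do I reset my password?": ["doc_042"], ...}
# print(recall_at_k(eval_set, search, k=5))
```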
15:56
I personally would always start from the
15:59
simplest levers in a
16:01
rag setup. And what I mean
16:03
there is that you have a number of different things
16:05
that you can optimize. So you
16:07
have retrieval, you can optimize
16:09
it semantically, or you
16:11
can optimize it in a rule-based retrieval.
16:15
You can optimize the generative model, you
16:17
can optimize the prompt. And
16:19
the simplest movers, the simplest levers
16:23
are the rule-based retrieval, the word
16:25
search, and then the semantic search.
16:27
So I would first of all add like a
16:29
hybrid search. What happens if I make sure
16:31
that there's an exact match in that document for the
16:33
word in my query? Does that
16:36
improve my results?
16:39
And then I would just move
16:41
through that process basically.
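A toy sketch of the hybrid idea just described, combining an exact-word-match score with a semantic similarity score; `embed` is a hypothetical embedding function (for example, a sentence embedding model's encode method), and the weighting is an arbitrary choice to experiment with.

```python
# Toy sketch of hybrid retrieval: score exact keyword matches and combine
# that with a semantic similarity score. `embed` is a hypothetical embedding
# function returning a vector for a piece of text.
import numpy as np

def hybrid_scores(query, docs, embed, alpha=0.5):
    q_terms = set(query.lower().split())
    q_vec = np.asarray(embed(query))
    scores = []
    for doc in docs:
        keyword = len(q_terms & set(doc.lower().split())) / max(len(q_terms), 1)
        d_vec = np.asarray(embed(doc))
        semantic = float(np.dot(q_vec, d_vec) /
                         (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))
        scores.append(alpha * keyword + (1 - alpha) * semantic)
    return scores  # rank documents by descending score
```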
16:45
What's up friends? I'm
16:56
here with a friend of mine, a good
16:58
friend of mine, Michael Grinich, CEO and founder
17:01
of WorkOS. WorkOS
17:03
is the all-in-one enterprise SSO and
17:05
a whole lot more solution for
17:07
everyone from a brand new
17:09
startup to an enterprise and all the
17:12
AI apps in between. So Michael, when
17:14
is too early or too late to
17:16
begin to think about being enterprise ready?
17:18
It's not just a single point in
17:20
time where people make this transition. It
17:22
occurs at many steps of the business.
17:25
Enterprise single sign-on, like SAML auth, you
17:27
usually don't need that until you have
17:29
users. You're not gonna need
17:31
that when you're getting started. And we call it
17:33
an enterprise feature, but I think what you'll find
17:35
is there's companies when you sell to like a
17:38
50 person company, they might want this. They actually,
17:40
especially if they care about security, they might want
17:42
that capability in it. So it's more of like
17:44
SMB features even, if they're tech
17:46
forward. At WorkOS we provide a ton of
17:48
other stuff that we give away for free for
17:51
people earlier in their life cycle. We just don't
17:53
charge you for it. So that AuthKit stuff
17:55
I mentioned, that identity service, we give that away
17:57
for free up to a million
18:00
users. And this competes with
18:02
Auth0 and other platforms that have much, much lower
18:04
free plans. I'm talking like 10,000, 50,000, like we
18:06
give you a million free. Because we really want
18:11
to give developers the best tools and capabilities
18:13
to build their products faster, you know, to
18:15
go to market much, much faster. And where
18:17
we charge people money for the service is
18:19
on these enterprise things. If you end up
18:21
being successful and grow and scale up market,
18:23
that's where we monetize. And that's also when
18:25
you're making money as a business. So we
18:27
really like to align, you know, our incentives
18:29
across that. So we have people using AuthKit
18:31
that are brand new apps,
18:33
just getting started. Companies in Y Combinator,
18:36
side projects, hackathon things, you know, things
18:38
that are not necessarily commercial focus, but
18:40
could be someday, they're kind of future
18:42
proofing their tech stack by using WorkOS.
18:45
On the other side, we have companies much, much
18:47
later that are really big, who typically
18:50
don't like us talking about them, their logos,
18:52
you know, because they're big, big customers. But they
18:54
say, hey, we tried to build the stuff
18:56
or we have some existing technology, but sort of
18:58
unhappy with it. The developer that built it
19:00
maybe has left. I was talking last week with
19:02
a company that does over a billion in
19:04
revenue each year. And their SCIM connection, the
19:06
user provisioning was written last summer by an intern
19:09
who's no longer obviously at the company and
19:11
the thing doesn't really work. And so they're looking
19:13
for a solution for that. So there's a
19:15
really wide spectrum. We'll serve companies that are in
19:17
a, you know, their offices in a coffee
19:19
shop or their living room all the way through.
19:21
They have a, you know, their own building in
19:24
downtown San Francisco or New York or something.
19:26
And it's the same platform, same technology, same tools
19:28
on both sides. The volume is obviously different.
19:30
And sometimes the way we support them from a
19:32
kind of customer support perspective is a little
19:34
bit different. Their needs are different, but same technology,
19:37
same platform, just like AWS, right? You can use
19:39
AWS and pay them $10 a month. You
19:41
can also pay them $10 million
19:43
a month, same product or more for sure.
19:45
Or more. Well,
19:47
no matter where you're at
19:50
on your enterprise ready journey,
19:52
WorkOS has a solution for
19:55
you. They're trusted by Perplexity,
19:57
Copy.ai, Loom, Vercel, Indeed,
20:00
and so many more. You
20:02
can learn more and check
20:05
them out at workos.com. That's
20:07
workos.com. Again, workos.com. I'm
20:23
guessing that you all, you know, the fact
20:25
that you're supporting all of these use cases
20:28
on top of Argilla on
20:30
the data side makes me think,
20:33
like you say, there's so many
20:35
things to optimize in terms of
20:37
that RAG process, but there's also
20:39
so many AI workflows that are
20:41
being thought of, whether
20:44
that be, you know, code generation
20:46
or assistance or, you
20:48
know, content generation, information extraction. But then
20:50
you kind of go beyond that. David,
20:53
you mentioned text classification and of
20:55
course there's image use cases. So
20:57
I'm wondering from you all,
21:00
at this point, you know,
21:02
one of the things Chris and I have talked about on
21:04
the show a bit is, you know,
21:06
we're still big proponents and, you know, believe
21:08
that in enterprises a lot of
21:10
times there is a lot of mixing of,
21:13
you know, rule-based systems and more
21:16
kind of traditional, I guess, if you
21:18
want to think about it that way,
21:20
machine learning and smaller models, and then
21:23
bringing in these larger Gen AI models
21:25
as kind of orchestrators or, you know,
21:27
query layer things. And
21:30
that's a story we've been kind of telling,
21:32
but I think it's interesting that we have
21:34
both of you here in the sense that,
21:36
like, you really, I'm sure there's certain things
21:38
that you don't or can't track about what
21:41
you're doing, but just even anecdotally out of
21:43
the users that
21:45
you're supporting on Argilla, what
21:47
have you seen in terms of what
21:50
is the mix between those, you know,
21:52
using Argilla for this sort of, maybe
21:54
what people would consider traditional data
21:57
science type of models like text
21:59
classification, and/or image
22:01
classification type of things and these
22:04
maybe newer workflows like RAG and other
22:06
things. How do you see that balance?
22:08
And do you see people using both
22:11
or one or the other? Yeah,
22:13
any insights there? I think we
22:16
recently had this company from Germany,
22:18
ellamind, over at one of our
22:21
meetups that we host. And they
22:23
had an interesting use case where
22:26
they collaborated with this healthcare insurance
22:28
platform in Germany. And
22:30
one of the things that you see
22:32
with large language models is that these
22:35
large language models can't really produce German
22:37
language properly, mostly trained
22:39
on English text. And
22:41
that was also one of their
22:44
issues. And what they did was
22:46
actually a huge classification and generation
22:49
pipeline, combining a lot of these techniques
22:51
where they would initially get an email
22:53
in that they would classify to a
22:56
certain category, then based on the category,
22:58
they would kind of define what kind
23:00
of email template, what kind of prompt
23:02
template they would use. Then
23:05
based on the prompt template, they
23:07
would kind of start generating and
23:09
composing one of these response emails
23:11
that you would expect for like
23:13
a customer query request coming in
23:15
for the healthcare insurance companies. And
23:17
then in order to actually ensure
23:19
that the formatting and phrasing and
23:21
the German language was applied properly,
23:23
they would then based on that
23:26
prompt, regenerate the email once more.
23:28
So prompt an LLM to kind
23:30
of improve the quality of the
23:32
initial proposed output. And then after
23:34
all of these different steps of
23:36
classification, of retrieval-augmented generation, of
23:38
an initial generation and
23:40
a regeneration, they would then end up
23:43
with their eventual output. So what we
23:45
see is that all
23:47
of these techniques are normally combined.
23:50
And also a thing that we
23:52
are strong believers in is that whenever
23:55
there is a smaller model or an
23:57
easier approach applicable, why not go for
23:59
that, instead of using one of these
24:01
super big large language models. So
24:03
if you can just classify is this relevant
24:06
or is this not relevant and based on
24:08
that, actually decide what to do. That makes
24:10
a lot of sense. And also,
24:13
one of the interesting things that I've seen one
24:16
of these open source platforms, Haystack, out there
24:19
using is also this query
24:22
classification pipeline, where they would
24:24
classify incoming queries as either
24:27
a key terminology
24:29
search, a question query,
24:31
or actually a phrase for an
24:33
LLM, to actually start prompting the LLM.
24:35
And based on that, actually redirect
24:37
all of their queries to the
24:39
correct model. And that's also an
24:41
interesting approach that we've seen.
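Here is a minimal sketch of that routing idea; `classify_query`, `keyword_search`, `qa_pipeline`, and `llm` are hypothetical placeholders for your own components rather than anything from a specific framework.

```python
# Sketch of routing incoming queries by type before answering them.
# `classify_query` is a hypothetical classifier returning "keyword",
# "question", or "prompt".
def route_query(query, classify_query, keyword_search, qa_pipeline, llm):
    kind = classify_query(query)
    if kind == "keyword":
        return keyword_search(query)      # plain term/BM25 lookup
    if kind == "question":
        return qa_pipeline(query)         # retrieval + question answering
    return llm(query)                     # free-form prompt goes straight to the LLM
```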
24:44
Quick follow-up on that. And it's
24:46
just something I wanted to draw out because we've
24:48
drawn it out across some other episodes a bit.
24:51
You were just making a recommendation go for
24:53
the smaller model versus the larger model. For
24:56
people trying to follow, and there's the
24:59
divergent mindsets, could you take just a
25:01
second and say why you would advocate
25:03
for that, what the benefit, what the
25:05
virtue is in the context of
25:07
everything else? I would say smaller
25:11
models are generally hostable by yourself.
25:13
So it's more private. Smaller
25:15
models, they are more cost efficient.
25:19
Smaller models can also be fine-tuned easier
25:21
to your specific use case. So even
25:24
what we see a lot of people coming
25:26
to us about is actually fine-tuning
25:28
LLMs. But even the big companies
25:31
out there with huge amounts of
25:33
money and resources and dedicated research
25:35
teams still have difficulties
25:37
fine-tuning LLMs. So
25:40
instead of fine-tuning, within your
25:42
retrieval-augmented generation pipeline,
25:45
the LLM for the generation part,
25:47
you can actually choose to fine-tune
25:49
one of these retriever models that
25:51
you can actually fine-tune on consumer
25:54
grade hardware. You can actually fine-tune
25:56
it very easily on any arbitrary
25:58
data scientist or developer device. And
26:00
then instead of having to
26:02
deploy anything on one of the cloud
26:04
providers, you can start with that.
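As one possible illustration of fine-tuning a retriever on modest hardware, here is a sketch assuming the sentence-transformers library and its older `fit` API; the model name and the training pairs are placeholders for your own domain data.

```python
# Sketch of fine-tuning a small retriever model on (query, passage) pairs,
# assuming the sentence-transformers library. Runs end-to-end on a laptop.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from a small pretrained embedding model.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# (query, relevant passage) pairs drawn from your own domain data.
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "You can reset your password from the account settings page."]),
    InputExample(texts=["What is your refund policy?",
                        "Refunds are available within 30 days of purchase."]),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)

# A single short epoch is enough to verify the setup works.
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=1)
model.save("my-finetuned-retriever")
```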
26:07
And for a similar reason, for a
26:10
RAG pipeline, whenever you provide an
26:12
LLM with garbage within such a
26:14
retrieval-augmented generation pipeline, you actually also
26:16
ensure that there's less relevant content
26:19
and the output of the LLM
26:21
is also going to be worse.
26:24
Yeah, I've seen a lot of cases where,
26:27
I think it was Travis Fisher who was on the
26:29
show, he advocated for
26:31
this hierarchy of how you should approach these
26:33
problems and there's like maybe seven things on
26:36
his hierarchy that you should try before fine
26:38
tuning. And I think in a lot of cases I've
26:40
seen people maybe jump to that,
26:43
they're like, oh, I forget
26:45
which one of you said this, but this
26:47
naive RAG approach didn't get me quite
26:49
there, so now I need to fine
26:52
tune. When in reality, there's sort of
26:54
a huge number of things in between
26:56
those two places and you might end
26:58
up just getting a worse performing model,
27:00
depending on how you go about the
27:02
fine tune. One of the
27:04
things, David, you kind of walk through
27:06
these different, the example of
27:08
the specific company that had these
27:10
workflows that involved a variety of
27:12
different operations, which I
27:15
assume relates to, Ben, you
27:18
mentioning earlier, starting with a test set
27:20
and that sort of thing and
27:22
how to think about the tasks.
27:25
I'm wondering if you can specifically
27:27
now talk just a little bit
27:29
about Argilla. Specifically, people might be
27:31
familiar generally with like data annotation,
27:33
they might be familiar, maybe
27:36
even with how to upload some data
27:38
to quote fine tune some of these
27:41
models in an API sense, or maybe
27:43
even in a more advanced way with
27:46
QLoRA or something like that, but
27:48
could you take a minute and
27:51
just talk through kind of Argilla's
27:53
approach to data annotation and data
27:55
collaboration it's
27:57
kind of hard on a podcast because we don't have
28:00
a visual to show for people, but as best you
28:02
can help people to imagine
28:04
if I'm using Argilla to
28:07
do data collaboration, what
28:09
does that look like in terms of what I would
28:11
set up and who's involved, what
28:14
actions are they doing, that sort of thing? Argilla,
28:16
there's two sides to it, right?
28:20
There's a Python SDK, which is
28:22
intended for the AI machine learning
28:24
engineer, and there's a
28:27
UI, which is intended for your
28:29
domain expert. In reality,
28:31
the engineers often also use the UI, and
28:33
you kind of iterate on that as you
28:35
would because it gives you a representation of
28:37
your task. But there's these two
28:40
sides. The UI is kind
28:42
of lightweight. It can be deployed in a Docker
28:44
container or on hugging face spaces, so it's really
28:46
easy to spin up. And
28:48
the SDK is really
28:51
about describing a feedback task and
28:53
describing the kind of information that
28:55
you want. So you
28:58
use Python classes to construct your
29:00
dataset settings. You'll say,
29:02
okay, my fields are a piece
29:04
of text, a chat,
29:07
or an image, and the questions
29:09
are a text question, so
29:11
like some kind of feedback, a comment, for
29:13
example, a label question,
29:16
so positive or negative labels, for
29:18
example, a rating,
29:21
say between one and five, or
29:23
a ranking. Example
29:25
one is better than example two, and you
29:27
can rank a set of examples. And
29:30
with that definition of a feedback
29:33
task, you can create
29:35
that on your server, in your
29:37
UI, and then you can push
29:39
what we call records, your samples,
29:42
into that dataset, and then they'll
29:45
be shown within the UI, and your
29:47
annotator can see all of the questions. They'll have
29:49
nice descriptions that were defined in the SDK. They
29:52
can tweak and kind of change those as well if you
29:54
need in the UI because that's a little bit easier. You
29:57
can distribute the task between a team. You
29:59
can say, OK, this record
30:01
will be accepted once we have at
30:04
least two reviews of it. And
30:06
you can say that some questions are required and
30:08
some aren't, and they can skip through some of
30:10
the questions. The UI has
30:12
loads of keyboard shortcuts, like with numbers and
30:15
arrows and returns. So you can move through
30:17
it really fast. It's kind of optimized for
30:19
that. And different screen sizes. One
30:22
thing we're starting to see is that, as
30:25
LLMs get really good at quite long
30:27
documents, that some of the stuff that
30:29
they're dealing with is a multi-page document
30:32
or a really detailed image and
30:34
then a chat conversation. And
30:37
then we want a comment and a
30:39
ranking question. So it's a
30:41
lot of information on the screen. So the
30:43
UI kind of scales a bit like an IDE. So
30:46
you can drag it around to give yourself
30:48
enough width to see all this stuff. And then
30:50
you can move through it in a reasonably
30:52
efficient way with keyboard shortcuts and stuff.
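For a concrete picture of the SDK side just described, here is a minimal sketch assuming a 2.x-style Argilla Python client; class names differ between Argilla versions, and the server URL, API key, dataset name, and labels are all placeholders.

```python
# Minimal sketch of defining a feedback task and pushing records, assuming a
# 2.x-style Argilla Python SDK. Names and values below are placeholders.
import argilla as rg

# Connect to a running Argilla server (e.g. a Hugging Face Space or Docker container).
client = rg.Argilla(api_url="https://your-argilla-server.example", api_key="YOUR_API_KEY")

# Describe the task: what annotators see (fields) and what you ask them (questions).
settings = rg.Settings(
    guidelines="Label each customer message as positive or negative.",
    fields=[rg.TextField(name="text")],
    questions=[
        rg.LabelQuestion(name="sentiment", labels=["positive", "negative"]),
        rg.TextQuestion(name="comment", required=False),
    ],
)

dataset = rg.Dataset(name="support-sentiment", settings=settings, client=client)
dataset.create()

# Push records; they then show up in the UI for the domain experts to answer.
dataset.records.log([
    {"text": "The new release fixed my issue, thanks!"},
    {"text": "The app crashes every time I open it."},
])
```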
30:55
Interesting. And what do you see as kind
30:57
of the backgrounds of the
30:59
roles of people that are using
31:01
this tool? Because one of the
31:04
interesting things from my perspective, especially
31:07
with this kind of latest
31:09
wave is there's
31:11
maybe fewer data scientists, kind of
31:13
AI people, by background,
31:16
and more software engineers
31:18
and just non-technical domain experts. So
31:20
how do you kind of think
31:22
about the roles within
31:25
that and what are you seeing in terms
31:27
of who is using the system? For
31:29
us, I think it's, yeah, from
31:32
the SDK Python side, it's really
31:34
still developers. And then
31:36
from the UI side, it's like anyone
31:38
in the team that needs to have
31:40
some data labeled with domain knowledge. Often
31:43
these are also going to be like
31:45
the AI experts. And one
31:47
of the cool things is that whenever
31:49
an AI expert actually sets up a
31:51
data set besides these fields and questions,
31:53
they can actually come up with some
31:56
interesting features that they can add on top of the
31:58
data set. They are
32:00
also able to add semantic search,
32:02
like attaching a semantic representation
32:05
of the records to each of the
32:07
records, which actually enables the users within
32:09
the UI to label way more efficiently.
32:11
So for example, if someone
32:14
sees a very representative example of something
32:16
that's bad within their data set, they
32:18
can do a semantic search,
32:20
find the most similar documents, and then
32:22
continue with the labeling on top of
32:25
that. Besides that, you can also, for
32:27
example, filter based on model certainties. So
32:29
let's say that your model is very
32:31
uncertain about an initial prediction that you
32:33
have within your UI. And
32:36
it's really interesting for the domain expert
32:38
or for the data scientist to go
32:40
and have a look at that specific
32:42
record or that range of
32:45
uncertainties. And then based on that,
32:47
the labeling or the data curation
32:49
or whatever you would like to call
32:51
it becomes way more engaging and way
32:53
more interesting.
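As a small illustration of the uncertainty filter described here, this sketch sorts records by prediction entropy so the least certain ones surface first; `predict_proba` is a hypothetical function returning class probabilities for a piece of text.

```python
# Sketch of surfacing the records a model is least certain about, so the
# domain expert reviews those first. `predict_proba` is a hypothetical
# function returning a list of class probabilities for a text.
import math

def uncertainty(probs):
    # Shannon entropy: highest when the model is most unsure.
    return -sum(p * math.log(p + 1e-12) for p in probs)

def most_uncertain_first(records, predict_proba):
    scored = [(uncertainty(predict_proba(r["text"])), r) for r in records]
    return [r for _, r in sorted(scored, key=lambda x: x[0], reverse=True)]
```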
32:56
And on top of that, another thing that we are starting
32:58
to explore is actually using this AI
33:00
feedback and synthetic data within Argilla
33:02
as well. And that's actually
33:04
one of the other products that we're working
33:06
on, and it's called Distilabel. So
33:09
nowadays, what you can do with
33:11
LLMs is also actually use LLMs
33:13
to evaluate questions, for
33:15
example, to evaluate whether something is
33:17
label A, B, or C, whether
33:20
something is a good or bad
33:22
response. And you see all
33:25
kinds of tools, open source tools out
33:27
there. And that's also a thing that
33:29
we are looking at for integrating with
33:31
the UI, where instead of doing this
33:33
more from a data science, SDK
33:35
perspective, users without any
33:37
technical knowledge would actually be able to
33:39
tweak these guidelines that Ben highlighted earlier
33:41
and then say, OK, maybe instead of
33:44
taking this into account, you should focus
33:46
a bit more on the
33:48
harm that potentially is within your data
33:50
or the risks that are within your
33:52
data. And then you would
33:54
be able to prompt an LLM once
33:56
again to label your data, and then
33:58
you wouldn't directly need the Python
34:01
SDK anymore. I was thinking about
34:03
as you were describing that I work at
34:05
a large organization and we certainly
34:07
have a lot of domain experts
34:09
in the organization I work at
34:11
that are either non-technical or semi-technical
34:14
and as users they
34:16
will sometimes find it intimidating you know kind
34:18
of getting into all this as they're starting
34:20
a project could you talk a little bit
34:22
about what it's like for a
34:25
non-technical person to sit down with Argilla
34:27
and start to work in a productive
34:29
way. What is that experience like
34:31
for them because it's one thing like the
34:33
technical people kind of just know they dive into
34:35
it they're gonna use the SDK they've used other
34:38
SDKs but there can
34:40
be a bit of handholding for people who
34:42
are not used to that could you describe
34:44
the user experience for that non-technical subject matter
34:46
expert coming in and what labeling is like
34:48
and just kind of paint a
34:51
picture in words of what their experience
34:53
might be like. Yeah, I
34:55
mean one thing I guess I'd start off by
34:57
saying is that Argilla is kind
34:59
of the latest iteration of
35:02
a problem that has existed for a
35:04
long time in machine learning and data
35:06
science right about collecting feedback from domain
35:08
experts and it's kind of gone through spreadsheets
35:11
and various other tools
35:13
that were substandard and
35:16
really bad user experiences where
35:19
domain experts were asked for
35:21
information but information was
35:23
extracted and then models have been
35:25
trained really poorly on
35:28
that information so as a
35:30
field we kind of know that it's something that
35:32
we have to take really seriously and and that's
35:34
kind of what Argilla is built on top of
35:36
right that's part of our DNA as a product
35:38
is like optimizing the
35:42
feedback process as a user experience
35:44
problem and so when
35:46
the user sits down to use Argilla, the
35:49
intention is that all of
35:51
the information should be right there in
35:53
front of them inside their single like
35:55
record view so what that means
35:57
is they've got a set of guidelines that
36:00
are edited in Markdown. They
36:02
can contain images, links to various
36:04
pages or other external documents if they need
36:06
and they can just kind of scroll through
36:08
that. It's always there, it's always available. They've
36:11
then also got like basic metrics so
36:13
they'll know how many records they've got
36:15
left, how many they've labeled. They can
36:17
view that kind of team status and
36:19
see what's going on. And
36:21
then on the left they have their fields
36:24
which they can scroll through and on the
36:26
right they'll have a set of questions. As
36:29
I said they can move through these in
36:31
keyboard shortcuts and they can switch the view
36:33
so that they can scroll kind of infinitely
36:35
or they can move into a kind of
36:37
page swiping. If you're looking at
36:39
really small records with like a couple of lines
36:42
and you're just assigning a simple label to, you
36:45
can do that in bulk. So as
36:47
we said you could use a semantic
36:49
search, give me all the records
36:51
that are similar to this one and I'll bulk label those
36:54
or you could search for terms inside those
36:56
records and you can bulk label those. And
36:59
then once you're finished you'll know about
37:01
it. And one of the interesting things
37:03
that I've done personally quite often is
37:05
sit together with the
37:07
domain experts and their AI engineers
37:10
to kind of walk them through
37:12
how to configure Argilla most
37:14
usefully for both of them. And
37:16
then the domain experts come with a lot
37:18
of things to the table like I want
37:20
to see this specific representation. What if we
37:22
could do this, what if we could do
37:24
that. Then the AI engineers think
37:27
about like the data side of things. Is
37:29
this possible from our point of view from
37:31
our side? And then me as a mediator
37:33
so to say kind of make the
37:36
most out of the Argilla configuration. And
37:38
that's also how we see this collaboration
37:40
process going where domain experts really work
37:42
together also with AI engineers because AI
37:45
engineers or machine learning engineers actually know
37:47
what's possible from the data, what it
37:50
means to get high quality data for
37:52
fine-tuning a model. Because whenever a domain
37:55
expert comes up with something
37:57
that's useful for them in terms of
37:59
labeling, it doesn't necessarily mean that it's actually
38:01
proper data that's going to come out
38:03
of there in terms of fine tuning
38:05
a model. And that's also a part
38:07
of, I guess, the collaboration that we're
38:09
talking about. What's
38:26
up, friends? I've got something exciting to share
38:28
with you today. A sleep technology that's pushing
38:30
the boundaries of what's possible in our bedrooms.
38:33
Let me introduce you to Eight Sleep
38:35
and their cutting-edge Pod 4 Ultra. I
38:37
haven't gotten mine yet, but it's on
38:39
its way. I'm literally counting the days.
38:42
So what exactly is the
38:44
Pod 4 Ultra? Imagine a
38:47
high-tech mattress cover that you can
38:49
easily add to any bed. But
38:51
this isn't just any cover. It
38:53
is packed with sensors, heating, and
38:55
cooling elements, and it's all controlled
38:57
by sophisticated AI algorithms. It's like
38:59
having a sleep lab, a smart
39:02
thermostat, and a personal sleep coach
39:04
all rolled into a single device.
39:07
It uses a network of sensors
39:09
to track a wide array of
39:11
biometrics while you sleep, sleep stages,
39:13
heart rate variability, respiratory rate, temperature,
39:16
and more. It uses
39:18
precision temperature control to regulate your
39:20
body's sleep cycles. It can
39:22
cool you down to a chilly 55 degrees
39:24
Fahrenheit or warm you up to a good,
39:27
nice, solid temperature of 110 Fahrenheit.
39:30
And it does this separately for
39:32
each side of the bed. This means
39:34
you and your partner can have your
39:37
own ideal sleep temperatures. But
39:39
the really cool part is that the
39:41
Pod uses AI and it
39:43
uses machine learning to learn your sleep
39:45
patterns over time. And it uses this
39:48
data to automatically adjust the temperature of
39:50
your bed throughout the night according to
39:52
your body's preferences. Instead of just giving
39:55
you some stats, it understands them and
39:57
it does something about it. Your
39:59
bed literally gets smarter as you
40:02
sleep over time. And all this functionality
40:04
is accessible through a comprehensive mobile app.
40:06
You get sleep analytics, trends over time,
40:09
and you even get a daily sleep
40:11
fitness score. Now, I don't have
40:13
mine yet. It is on its way, thanks to
40:15
our friends over at 8sleep. And I'm literally counting
40:17
the days till I get it, because I love
40:19
this stuff. But if you're ready to
40:22
take your sleep and your recovery to the next
40:24
level, head over to 8sleep.com/Practical
40:26
AI and
40:28
use our code Practical AI to get 350 bucks
40:30
off your very own
40:33
Pod 4 Ultra. And you can try it
40:35
free for 30 days. I
40:37
don't think you wanna send
40:39
it back, but you can
40:41
if you want to. They're
40:44
currently shipping to the US,
40:46
Canada, United Kingdom, Europe, and
40:48
Australia. Again, 8sleep.com/Practical AI. I
40:50
wanna maybe double click on
40:52
something that David, you
41:13
just said sort of in passing, which
41:15
I think is quite significant. I
41:17
don't know if some people might have
41:19
caught it, but when you were talking about Distilabel,
41:21
you also talked about AI
41:23
feedback. So AI feedback and
41:26
synthetic data. So I'd love to get
41:28
into those topics a little bit. Maybe
41:30
first coming from the AI feedback side,
41:32
I think this is super interesting because,
41:36
Ben, you talked about how this is a
41:38
kind of more general problem that people have
41:40
been looking at in
41:43
various ways from various perspectives for
41:45
a long time in terms of
41:47
this data collaboration labeling piece. But
41:49
there is this kind of very
41:52
interesting element now where we have
41:54
the ability to utilize these very
41:57
powerful, maybe general-purpose,
42:00
instruction-following type of
42:02
models to actually
42:04
act as labelers within
42:06
the system or at least generate
42:09
drafts of labels or
42:12
feedback or even preferences
42:14
and scores and all of
42:16
those sorts of things. So
42:18
I'm wondering if one of you could speak
42:20
to that. Some people might
42:22
find this kind of strange that
42:25
we're kind of giving feedback
42:27
to AI systems with AI
42:29
systems, which seems circular
42:31
and maybe like why would that work
42:33
or just sort of maybe that's kind
42:36
of produces some weird feelings for people. But
42:38
I think it is a significant thing that
42:40
is happening. And so
42:42
yeah, either of you would want to
42:45
kind of dive into that. What does
42:47
it specifically mean in AI feedback? How
42:49
are you seeing that being used most
42:51
productively? So when we create
42:53
a data set either manually
42:55
or with AI feedback or AI
42:57
generation, we have all the
42:59
information there to understand the problem. We have
43:01
a set of guidelines, we have a set
43:04
of labels, definitions of those labels, with documents
43:06
and definitions of those documents. We
43:08
give those to a manual annotator or
43:10
we'll go out and collect those documents and
43:12
we'll give those documents to the manual annotator.
43:15
And we're trying to describe that problem so that the person understands
43:17
it to create the data. We
43:19
can essentially take all of the same resources
43:21
and give those to an LLM and get
43:23
the LLM to perform the same steps. So
43:25
there's two parts to that. There's
43:28
a generative part where the LLM
43:30
can generate documents. So
43:32
let's say we've got a hundred documents in our
43:34
data set that we want 10,000. We
43:37
can say generate a document like
43:40
this one, but and add variation
43:42
on top of that. And
43:45
we can fan out our data set, our
43:47
documents from 100 to 10,000. We
43:50
could then take those same documents or
43:53
a pool of documents from elsewhere and
43:56
we could get feedback on that. So that
43:58
could be qualitative feedback: tell
44:00
me which of these documents are relevant to
44:03
this task, tell me which
44:05
of these documents are of a high quality,
44:07
are concise, are detailed,
44:09
these kind of attributes. So we could
44:12
filter down our large dataset or our
44:14
generated dataset to the best document. We
44:17
could also add labels. So we could say, tell
44:20
me which of these documents relates to my business
44:22
use case or not, these kinds
44:24
of things, apply topics to
44:26
these documents. And then
44:28
we can, in doing so, create a classification
44:30
dataset from those labels. Or
44:33
we could, in one
44:35
example, take a set of documents and
44:37
use a generative model to generate questions
44:39
or queries about those documents. And we
44:41
could use that to create a Q&A
44:44
dataset or a retrieval dataset
44:47
where we generate search queries based
44:49
on documents.
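A bare-bones sketch of that idea: the same guidelines you would hand a human annotator become the prompt for an LLM, and its answers are treated as draft labels to be reviewed. `call_llm` is a hypothetical function that sends a prompt and returns the model's text, and the guidelines text is a placeholder.

```python
# Sketch of AI feedback: reuse the annotation guidelines as an LLM prompt and
# collect draft labels. `call_llm` is a hypothetical function that sends a
# prompt to a model and returns its text response.
GUIDELINES = """You are labeling support emails.
Answer with exactly one word: 'relevant' if the email is about billing,
otherwise 'not_relevant'."""

def ai_label(documents, call_llm):
    labels = []
    for doc in documents:
        prompt = f"{GUIDELINES}\n\nEmail:\n{doc}\n\nLabel:"
        labels.append(call_llm(prompt).strip().lower())
    return labels

# These AI-suggested labels are drafts: push them into an annotation tool so a
# domain expert can confirm or correct them before any model is trained on them.
```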
44:51
When you're doing that and you're generating the datasets with
44:54
another model, how
44:56
much do you have to worry about hallucination playing into
44:58
that? It sounds like you have a good process for
45:00
trying to catch it there.
45:03
But is that a small issue?
45:05
Is that a larger issue? Any guidance
45:07
on that? That's one of the
45:09
main issues, definitely. It is
45:11
probably the main issue. And
45:13
so really, it's about both sides
45:16
of that process that I described,
45:18
that generating side and that evaluating
45:20
side. So you get the
45:23
large-scale models to do as much as
45:25
possible to expose hallucination by
45:27
evaluating themselves. And
45:29
typically, you're getting larger models
45:31
to evaluate. So they're
45:33
a more performant model and they should
45:35
hallucinate less. The task
45:38
of identifying hallucinations is not the
45:40
same as generating a document. So
45:42
typically, LLMs are better at identifying
45:44
hallucinations and nonsense, if you give
45:46
them the context, than they
45:48
are at not generating it. And
45:51
so you combine that within a pipeline.
45:54
And then you would take that to a domain
45:56
expert in a tool like Argilla.
45:58
And so that's really why. we
46:00
have these two tools, Distilabel and
46:02
Argilla, because without
46:04
Argilla, Distilabel would
46:07
suffer from a lot of those problems. Yeah,
46:10
and I guess that brings us to the
46:12
second tool, Distilabel, which I know
46:15
has some to do with this
46:17
synthetic data piece as well. And I'm really
46:19
intrigued to hear about this, because I also
46:21
see some of what
46:23
you have on the documentation
46:26
about what are people building with
46:28
Distilabel. I do note
46:30
a couple of data
46:32
sets like the OpenHermes data
46:34
set, the Intel Orca DPO data
46:36
set. These are data sets that
46:38
have been part of
46:41
the lineage of models that
46:43
I found very, very useful.
46:45
So first off, thanks for
46:47
building tooling that's created really
46:50
useful models in my own life.
46:52
But beyond that, David,
46:55
do you want to go into a little bit
46:57
about what Distilabel is and maybe even tie
46:59
into some of those things and how it's proven
47:01
to be a useful piece of
47:03
the process in
47:05
creating some of those models? I
47:08
think the idea of
47:10
Distilabel started about
47:13
a year ago, more or less,
47:15
where we saw these initial
47:19
new models coming out, like
47:21
Alpaca and Dolly from Databricks,
47:23
Alpaca from Stanford, where
47:25
there were data sets
47:27
being generated
47:30
with OpenAI frontier models being evaluated
47:32
with OpenAI frontier models, and then
47:34
published and actually used for fine
47:37
tuning one of these models. So
47:39
apparently there were research groups
47:41
or companies investing time in this. But what we
47:43
also saw is when we would upload these data
47:46
sets into Argilla, and actually started looking at
47:48
the data that there were a lot
47:50
of flaws within there. And then whenever
47:52
like UltraFeedback, which is one
47:55
of these specific papers that really started
47:57
to scale the synthetic data and AI
47:59
feedback concept, came out, we thought,
48:01
okay, maybe it's worth to look
48:03
into a package that can actually
48:06
help us facilitate creating
48:08
datasets that we can then eventually
48:10
fine-tune on within Argilla. And
48:12
that's when we started work on the initial
48:14
version of this label. So it's kind
48:17
of like an application framework, like Llama
48:20
Index or LangChain, if you're
48:22
familiar with those, but then specifically
48:24
focused on synthetic data generation and
48:27
AI feedback. What
48:29
we try to do is organize
48:31
everything into this pipelining structure,
48:33
where you have either steps
48:35
that are about basic data
48:37
operations, tasks that
48:39
are about prompt templates or prompting
48:43
and prompt templates. You can think about
48:45
either providing feedback, maybe rewriting some initial
48:47
input that you provide to that prompt
48:49
template, or maybe ranking or
48:52
generating from scratch or these kind of
48:54
things. And then
48:56
these tasks are actually executed
48:58
by LLMs, and these are
49:00
then all fit together within
49:02
a pipelining structure. The
49:05
thing for these tasks is
49:07
that nowadays we actually look at
49:09
all of the most recent research
49:11
implementations or most recent research papers,
49:13
and we try to implement them
49:15
whenever they come out and are
49:17
actually relevant for synthetic data generation.
49:19
So you really go from the
49:22
finicky prompt engineering, so to
49:24
say, to evaluated prompts that
49:27
we've implemented. And
49:29
the nice thing about our pipelining
49:31
structure is also that we run
49:33
everything asynchronously. So there's multiple LLM
49:35
executions being done at once, which
49:38
will really speed up your pipeline.
49:41
And on top of that, we also cache all
49:43
of the intermediate results. So as
49:45
you can imagine, calling the OpenAI API
49:47
can be quite costly. And
49:49
whenever you run a pipeline, a lot of things
49:52
can go wrong. But
49:55
whenever you actually rerun our pipelines within
49:57
this label, you actually have these cached
49:59
results already there, so you would avoid
50:01
kind of incurring additional costs whenever something
50:04
within the pipeline breaks.
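To illustrate the async-plus-caching pattern being described (this is not the Distilabel API, just a generic sketch), something like the following runs many LLM calls concurrently and caches each result on disk so a failed run can resume without paying for completed calls again; `call_llm_async` is a hypothetical async client.

```python
# Generic sketch of the pattern: run many LLM calls concurrently and cache
# intermediate results on disk so a failed run can resume cheaply.
import asyncio, hashlib, json, os

CACHE_DIR = "llm_cache"

async def cached_call(prompt, call_llm_async):
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["output"]
    output = await call_llm_async(prompt)   # hypothetical async LLM client
    with open(path, "w") as f:
        json.dump({"prompt": prompt, "output": output}, f)
    return output

async def run_pipeline(prompts, call_llm_async):
    return await asyncio.gather(*(cached_call(p, call_llm_async) for p in prompts))
```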
50:06
Yeah, that's awesome. And I know that one
50:08
element of this is the
50:11
kind of creation of synthetic
50:13
data for further fine tuning
50:15
LLMs to increase performance
50:18
or maybe to some sort
50:20
of alignment goal or something like that. But
50:23
also I know from working
50:25
with a lot of healthcare
50:28
companies, manufacturers, others
50:31
that are more security
50:33
privacy conscious in
50:35
my day job. Part of the pitch
50:38
around synthetic data is maybe
50:40
also creating data sets that
50:43
might not kind of poison
50:45
LLMs with a bunch of your own
50:47
sort of private information that could be
50:50
sort of exposed as part of an
50:52
answer that someone prompts the
50:54
model in some way. This data is embedded
50:56
in the data set and all of that.
50:58
So yeah, I would definitely encourage people to
51:01
check out Distilabel. And you said
51:03
it's been around for half a year or
51:05
so. How have you seen
51:08
the kind of usage and adoption
51:10
so far? The usage
51:12
and adoption has been quite good in
51:14
terms of the number of data
51:16
sets that have been released. So you
51:19
mentioned the Intel Orca DPO data set,
51:21
which was an example use case of
51:23
how we were initially
51:26
using it, where we had this original
51:28
data set that had been labeled by
51:30
Intel employees with preferences
51:33
of what would be like the preferred
51:35
response to a given prompt. And
51:38
we actually use this label to
51:40
kind of clean that based on
51:42
prompting LLMs ourselves to reevaluate
51:44
these chosen rejected pairs within the
51:47
original data set, filtering
51:49
out all of the ambiguities. So
51:51
sometimes the LLM wouldn't align with
51:53
the original chosen rejected pair.
51:56
And based on that, we were actually able to scale
51:58
down the data set by 50 percent, leading
52:01
to less training time, and also
52:04
leading to a higher performing model.
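A simplified sketch of that cleaning step: keep a preference pair only when an LLM judge agrees with the original chosen/rejected ordering, and drop the ambiguous ones; `judge` is a hypothetical function returning "a" when it prefers the first response it is shown.

```python
# Sketch of agreement filtering on a preference dataset. `judge` is a
# hypothetical LLM-backed function returning "a" or "b" for the better response.
def filter_preference_pairs(pairs, judge):
    kept = []
    for pair in pairs:  # each pair: {"prompt": ..., "chosen": ..., "rejected": ...}
        verdict = judge(pair["prompt"], pair["chosen"], pair["rejected"])
        if verdict == "a":       # judge prefers the originally chosen response
            kept.append(pair)    # agreement: keep; disagreement: drop as ambiguous
    return kept
```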
52:06
And that was one of the really famous
52:08
examples that inspired some people within the
52:11
open source community to actually start looking
52:13
at this label, to start using this
52:15
label to generate data sets. There
52:17
are some Hugging Face teams that
52:19
have actually been generating millions and
52:22
millions of rows of synthetic data
52:24
using this label. And that's pretty
52:26
cool to see that people are
52:28
actually using it at scale. And
52:31
besides that, there's also these smaller companies,
52:33
so to say, but like
52:36
ellamind, the German consultancy, the
52:38
German startup that I mentioned
52:40
before, using it to also
52:43
rewrite and resynthesize emails
52:45
within actual production use
52:47
cases. It's really fascinating.
52:49
You guys are pushing the state of the
52:51
art in a big
52:54
way. With the work that you've
52:56
done in Distilabel and
52:58
Argilla, where do you think things
53:00
are going? Like when you're kind
53:02
of end of whatever your task is of the
53:04
day and you're kind of just letting your mind
53:06
wander and thinking about the
53:09
future, where do each of y'all
53:11
go in terms of what you think's going
53:13
to happen, what you're excited about, what you're
53:15
hoping will happen, what you might be
53:17
working on in a few months or maybe a year
53:19
or two? What were your thoughts? I
53:21
suppose for me, it's about two
53:23
main things. And the
53:25
first would be modalities. So moving
53:28
out of text and into image
53:30
and audio and video and
53:32
also kind of UX environments
53:35
so that maybe in Argilla, but
53:37
also in Distilabel, that we can generate
53:39
synthetic data sets in different modalities and
53:41
that we can review those. And that's
53:45
a necessity and something that we're already working on and
53:47
we've already got features around, but we've got kind of
53:49
more coming. And then the second
53:52
one, which I suppose is a bit
53:54
more far fetched and that's a bit
53:56
more about kind of tightening the loop
53:58
between the various parts of the application.
54:00
So between Distilabel, Argilla, and
54:02
the application that you're building,
54:05
so that you can deal with feedback as
54:07
it's coming from your domain expert that's using
54:09
your application and potentially Argilla at the same
54:11
time. So we can kind of
54:13
synthesize on top of that to evaluate that feedback
54:16
that we're getting and generate based on that feedback.
54:19
So we can add that into Argilla and
54:21
then we can respond to that synthetic
54:24
generation, that synthetic data. And
54:26
then we can use that to train
54:28
our model, this kind of tight loop
54:30
between the end user, the application and
54:32
our feedback. Yeah, and for
54:34
me, it kind of aligns with what
54:36
you mentioned before, Ben, like the multi-modality,
54:38
smaller and more efficient models, things that
54:40
can actually run on a device. I've
54:43
been playing around with this app this morning
54:45
that you can actually load local
54:47
LLM into, like a smaller
54:50
like Qwen or a Llama model
54:52
from Meta. And it actually runs
54:54
on an iPhone 13, which is really cool.
54:56
It's private, it runs quite quickly. And
54:59
the thing that I've been wanting to
55:01
play around with is the speech to
55:03
speech models, where you can actually have
55:05
real-time speech to speech. I'm currently learning
55:07
Spanish at the moment. And one of
55:09
the difficult things there is not being
55:12
confident enough to actually talk to people out
55:14
on the streets and these kind of things.
55:16
So whenever you would be able to kind
55:19
of practice that at home privately on your
55:21
device, talk some Spanish into
55:23
an LLM, get some Spanish back, maybe some
55:25
corrections in English. These kind of scenarios are
55:28
super cool for me whenever they would be able to
55:31
come true. Yeah,
55:34
this is Muy Bueno. And yeah,
55:36
I've been really excited to talk to
55:38
you both and would love to have
55:40
you both back on the show sometime
55:42
to update on those things. Thank you
55:45
for what you all are doing, both
55:47
in terms of tooling and
55:49
Argilla, and Hugging Face
55:51
more broadly in terms of how you're
55:53
driving things forward in the
55:55
community and especially the open
55:57
source side. Thank
56:00
you for taking time to talk with us and
56:03
hope to talk again soon. Yeah,
56:05
thank you. And thanks for having us. Thank
56:14
you. All right,
56:16
that is Practical AI for this
56:18
week. Subscribe now.
56:21
If you haven't already, head
56:23
to practicalai.fm for all the
56:25
ways to subscribe. And join our
56:27
free Slack team where you can hang
56:29
out with Daniel, Chris, and the entire
56:31
Changelog community. Sign
56:33
up today at practicalai.fm
56:36
slash community. Thanks
56:38
again to our partners at fly.io, to
56:41
our beat freak in residence, Breakmaster Cylinder, and
56:43
to you for listening. We appreciate you
56:45
spending time with us. That's
56:48
all for now. We'll talk to you again
56:50
next time. Bye.