Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements may have changed.
0:05
Welcome to Practical AI.
0:08
If you work in artificial intelligence,
0:10
aspire to, or are curious how
0:12
AI related tech is changing the
0:14
world, this is the show for
0:17
you. Thank you to our
0:19
partners at fly.io. Fly
0:21
transforms containers into micro VMs that run
0:23
on their hardware in 30 plus
0:26
regions on six continents, so you
0:28
can launch your app near your
0:30
users. Learn more at fly.io.
0:35
Hey friends, you know we're big
0:37
fans of fly.io and I'm here
0:39
with Kurt Mackey, co-founder and CEO
0:42
of Fly. Kurt, we've had some
0:44
conversations and I've heard you say
0:46
that public clouds suck. What
0:48
is your personal lens into public clouds sucking
0:50
and how does fly not suck? Alright, so
0:52
public clouds suck. I actually think most ways
0:55
of hosting stuff on the internet sucks and
0:57
I have a lot of theories about why
0:59
this is but it almost doesn't matter. The
1:01
reality is like I've built a new app
1:03
for like generating sandwich recipes because my family's
1:05
just into specific types of sandwiches that use
1:08
brown sugar as a component for example and
1:10
then I want to like put that somewhere.
1:12
You go to AWS and it's harder than
1:14
just going and getting like a dedicated server
1:16
from Hetzner. It's like it's actually like more
1:18
complicated to figure out how to deploy my
1:20
dumb sandwich app on top of AWS because
1:23
it's not built for me as a developer
1:25
to be productive with. It's built for other
1:27
people. It's built for platform teams to kind
1:29
of build the infrastructure of their dreams and
1:31
hopefully create a new UX that's useful for
1:33
the developers that they work with. And again,
1:36
I feel like every time I talk about
1:38
this, it's like I'm just too impatient. I
1:40
don't particularly want to go figure so many
1:42
things out purely to put my sandwich app
1:44
in front of people and I don't particularly
1:46
want to have to go talk to a
1:49
platform team once my sandwich app becomes a
1:51
huge startup and IPOs and I have to
1:53
like do a deploy. I
1:55
kind of feel like all that stuff should just
1:57
work for me without me having to go ask
1:59
permission. or talk to anyone else. And so this
2:01
is a lot of, it's informed a lot of
2:04
how we built Fly. Like we're still a public
2:06
cloud. We still have a lot of very similar
2:08
low level primitives as the bigger guys. But
2:11
in general, they're designed to be used
2:13
directly by developers. They're not built for
2:15
a platform team to kind of cobble
2:17
together. They're designed to be useful
2:19
quickly for developers. One of the ways we've
2:21
thought about this is, is if you can
2:24
turn a very difficult problem into a two
2:26
hour problem, people will build much more interesting
2:28
types of apps. And so this is why
2:30
we've done things like made it easy to
2:33
run an app multi-region. Most companies don't run
2:35
multi-region apps on public clouds because it's functionally
2:37
impossible to do without a huge amount of
2:39
upfront effort. It's why we've made things like
2:42
the virtual machine primitives behind just
2:44
a simple API. Most people don't do like
2:46
code sandboxing or their own virtualization because it's
2:48
just not really easy. It's not, there's no
2:50
path to that on top of the clouds.
2:52
So in general, like I feel like, and
2:54
it's not really fair of me to say
2:56
public clouds suck because they were built for
2:59
a different time. If you build one of
3:01
these things starting in 2007, the
3:04
world's very different than it is right now. And so a
3:06
lot of what I'm saying, I think, is that public clouds
3:08
are kind of old and there's a
3:10
new version of public clouds that we should
3:12
all be building on top of that are
3:14
definitely gonna make me as a developer much
3:16
happier than I was like five or six
3:18
years ago when I was kind of stuck
3:20
in this quagmire. So AWS was built for
3:22
a different era, a different cloud era, and
3:24
Fly, a public cloud, yes, but
3:27
a public cloud built for developers who
3:29
ship. That's the difference. And we here
3:31
at Changelog are developers who ship. So
3:34
you should trust us. Try out Fly,
3:36
fly.io. Over three
3:38
million apps, that includes us,
3:40
have launched on Fly. They
3:42
leverage the global anycast load
3:44
balancing, the zero config private
3:47
networking, hardware isolation, instant WireGuard
3:47
VPN connections with push
3:51
button deployments, scaling to thousands
3:53
of instances. This is the
3:55
cloud you want. Check it out,
3:57
fly.io again. fly.io.
4:18
Welcome to another episode of the
4:21
Practical AI podcast. This is Daniel
4:23
Whitenack. I am CEO at Prediction
4:25
Guard, where we're building a private
4:28
secure gen AI platform. And I'm
4:30
joined as always by my co-host,
4:32
Chris Benson, who is a principal
4:34
AI research engineer at Lockheed Martin.
4:37
How are you doing, Chris? Doing
4:40
very well today, Daniel. How's it going? It
4:42
is going great. I'm super excited about
4:44
this one because it's a very, we
4:47
schedule a lot of shows and they're all
4:50
interesting, of course. But
4:52
occasionally, there's a show on a topic
4:54
that intersects with something that I'm working
4:56
on at the moment or something that
4:58
I found that is really exciting and
5:01
found to be really useful. And so, selfishly,
5:04
I'm really extra excited about
5:07
this episode this week, which
5:09
is with Till and Aditya from MotherDuck.
5:12
How are you doing? Doing good.
5:14
Excited to be here. Yes.
5:16
And note, duck as in the
5:18
bird. So editors, you don't have to
5:20
bleep us out. Sure,
5:23
that's something that is an old joke
5:25
for you all. I can pinpoint
5:27
very easily how I ran across
5:30
DuckDB and MotherDuck is
5:32
there was a blog post. The
5:34
title is very simple. It said, Big Data is
5:36
Dead. And immediately when I
5:39
saw the title, I was like, thank
5:41
goodness, finally. But
5:43
I'm wondering if you can maybe just
5:46
step back. It doesn't
5:48
necessarily have to be the points in
5:50
that blog post. But how you see
5:52
the kind of data
5:55
analytics, big data, AI intersections
5:58
as of now. And
6:01
what are the sort of concerns and
6:03
issues that people are thinking about that
6:05
is driving them to DuckDB?
6:07
And then of course, we'll obviously get into
6:09
DuckDB and MotherDuck and all that you're
6:12
doing, but setting that stage of, you know,
6:14
what are people struggling with? What have they
6:16
realized in the past about this sort of
6:19
big data hype in one way or the
6:21
other, positive or negative? And how has that
6:24
kind of changed the way that
6:26
people are thinking about analytics and
6:28
databases? I can tell a story
6:30
about how I got in
6:32
touch with DuckDB. It started
6:34
at the very beginning of the
6:37
DuckDB project. I was actually doing
6:39
my master's thesis back then at
6:37
the CWI, where DuckDB
6:44
originated from. And after
6:47
I graduated, Hannes, who is the
6:50
developer or the founder of DuckDB
6:52
Labs, reached out and
6:54
we were talking and they were saying, hey,
6:56
we're working on this new project. We're working
6:58
on this database system.
7:01
Are you interested in
7:03
maybe joining, maybe working on it? But
7:05
I was very focused on machine learning
7:07
and stuff like this. So I wanted
7:09
to go into data analytics,
7:12
data science, these kinds of things.
7:15
So a year later or so,
7:17
I was working at a telco company
7:19
and we were analyzing, you
7:21
know, customer data with Spark and so on.
7:24
And one day there
7:26
was like one of the first versions of
7:28
DuckDB was released. So I pip installed it
7:30
and ran the first, like,
7:32
simple aggregation query on maybe a hundred
7:35
megabyte dataset or something like this. And I
7:38
was surprised because I thought something was
7:40
going wrong. I thought it's impossible that
7:42
it just did the aggregation, right? Because
7:44
from working with Spark, I was so
7:46
used to, okay, now the spinner's
7:49
starting for 10 seconds at least,
7:52
right? And then that was really
7:54
eye opening. And I've heard similar
7:56
experience from a lot of people
7:58
even until today. They
8:00
share very similar stories and
8:02
experiences. Yeah,
8:05
for me, it started in a different way.
8:08
I first figured out DuckDB-Wasm existed,
8:10
that you could run an analytical
8:13
engine in the browser. And
8:15
to think about something like
8:17
that was super crazy. And the
8:19
kind of stuff that you could do on top of
8:22
it started to look super crazy.
8:24
And one of the things that I was super
8:26
excited about when DuckDB-Wasm was released was the
8:28
possibility to do geospatial analytics. So back
8:31
then, when I started my
8:33
first encounter with DuckDB was doing
8:36
geospatial analytics. And
8:39
then to think that that could actually be done
8:41
in the browser was
8:43
like mind blowing. And that's when my
8:45
journey into DuckDB started. So
8:47
let me ask y'all a follow-up question
8:49
as you're diving into your passion. For
8:51
those out there who may be listening
8:53
who are not already familiar with it
8:55
and they're hearing database, they're hearing big
8:57
data is dead, they're hearing
9:00
doing this in the browser. Give me
9:02
a little bit of background on kind of
9:04
the ecosystem that
9:06
you were coming from a bit and also
9:09
what this idea was so
9:12
that people can kind of follow you into
9:14
that. What is it that caught your passion
9:16
and attention and made you say,
9:18
ah, this is the way and
9:20
assume somebody doesn't already have a familiarity with
9:23
it? So I guess
9:26
I was going into this coming from
9:29
the machine learning side of things. So
9:31
I was used to working with scikit-learn,
9:34
pandas or the
9:36
Spark equivalents to that like Spark
9:39
ML, building data prep
9:41
pipelines and so on and
9:43
so forth. And then, like,
9:45
encountering this DuckDB thing
9:48
that suddenly, apparently, is doing
9:50
aggregations of the
9:52
sizes of data I was working with much,
9:54
much, much faster. Yeah, that sparked
9:56
some fantasies around, hey, how
9:59
much of the data preparation
10:01
pipeline can we actually push into DuckDB?
10:03
And this
10:05
idea or this fantasy has been following me,
10:08
you know, for the past years. And I
10:10
think it's still an exciting topic. To
10:12
follow up a little bit on that, the way that
10:15
large data or big data has been analyzed
10:18
in the last years, I mean, predominantly
10:20
you required some server in
10:22
the cloud, you required resources that
10:24
were not local to be able to
10:26
perform, like, large analysis. But
10:29
something that DuckDB opened up, that it
10:31
made possible, was to use
10:33
local compute on your local
10:35
MacBook, for example, to
10:37
utilize that compute to the
10:39
fullest to, like, perform
10:41
these kinds of huge analysis.
10:44
And that, I guess, set a
10:47
spark to a
10:49
change in the ecosystem, I would say. And
10:51
I guess that's where we're at. I
10:53
resonate so much with this. So like
10:56
coming from a background also as a
10:58
data scientist, living through the
11:00
years of like being told, Hey,
11:03
you know, use spark for this,
11:05
like basically my experience in
11:08
this sort of ecosystem was like, I would
11:10
try to write a query and it would
11:12
get the right result. But to your point,
11:14
Till, like, I would just
11:16
be waiting forever to get a result. And
11:18
so I'd have to send it to some
11:20
like other guy whose name was Eugene. Eugene
11:23
was really smart and he could figure out
11:25
a way to like make it go fast.
11:27
And I never became Eugene. So like I
11:29
resonated with this very much. And
11:31
the fact that this concept of,
11:33
Hey, there's these seemingly big data
11:36
sets out there. And
11:39
I want to do maybe
11:41
even complicated analytics types of
11:44
queries over these, or even,
11:46
you know, execute workflows of,
11:48
as you mentioned, Till,
11:50
aggregation or other processes at
11:53
query time, I could do that with a
11:55
system that I could just run on my
11:58
laptop or I could run in
12:01
process is really intriguing. So maybe
12:03
now is a good time then
12:05
to like introduce DuckDB formally. So
12:07
like, I'm on the DuckDB site,
12:09
it says, DuckDB is a fast
12:11
in-process analytical database. So maybe
12:13
one of you could like take
12:15
a stab at, you
12:17
know, thinking about those data scientists out
12:19
there who are maybe at the point
12:21
of not, also not believing that what
12:24
we just described is maybe
12:26
possible or they're living in a world where
12:28
that's not possible. Describe what
12:30
DuckDB is and maybe why
12:32
that becomes possible as a
12:35
function of what it is. I
12:37
think I can talk a little
12:39
bit about motivation behind DuckDB, at
12:42
least the way I perceived it at the time.
12:44
And that was actually
12:46
originated from the R ecosystem.
12:50
Yeah, so Hannes was very
12:52
involved in that ecosystem and
12:55
people were using R to
12:58
essentially crunch relatively large
13:01
data with relatively primitive
13:04
methods. And
13:07
so at the time, CWI had
13:09
a database system,
13:12
an analytical database system called MonetDB that
13:16
has incorporated the
13:18
idea of vectorized columnar
13:22
query execution. And
13:24
it was a large system that
13:26
was not really easy for
13:29
the typical R users to adopt.
13:32
So the first idea
13:34
was to say, hey, let's maybe
13:37
build a light version of MonetDB and
13:40
integrate it with, I think it was
13:42
dplyr or something like this. And
13:45
we just let it run on the client. But
13:48
eventually it turned out to
13:50
be easier maybe to just
13:52
rebuild the database system from
13:54
scratch that was actually designed
13:56
to run in process
13:58
to be super lightweight, that's
14:01
super easy to install and everything essentially
14:03
to give the power of
14:07
this vectorized query execution into
14:09
the hands of data analysts.
14:11
I'm wondering if you could, when
14:13
you talk about that being in
14:15
process and lightweight, could
14:17
you describe what that means for someone that may
14:20
not be familiar with the term in process?
14:22
And how is that different from
14:25
other databases that are not in
14:27
process, that have their own processes?
14:29
Can you describe a little bit of what that
14:31
means? So classical
14:34
database systems operate in the
14:36
client server architecture. Usually you
14:38
have a database server running
14:40
somewhere and you have a
14:42
client that sends SQL
14:45
queries essentially to the database
14:47
server and then the result is transferred
14:49
back to the client through some
14:52
kind of transfer protocol. One
14:56
relevant paper here was by
14:58
Mark Raasveldt, who is also
15:00
a co-founder of DuckDB Labs;
15:03
they were working on a paper that basically
15:06
benchmarked these client protocols and it turned out
15:08
that that was actually a huge bottleneck. So
15:11
even when you're running Postgres on your
15:13
local machine, you still
15:15
have this client server protocol bottleneck.
15:19
And the way to get around this
15:21
is to have the database actually running
15:24
within your process that
15:26
is, in that case, maybe R
15:28
or Python and
15:30
has access to
15:32
the result set just
15:35
in memory and
15:38
no transfers happen. And
15:41
maybe I'd like to just add
15:43
in that for those who maybe
15:45
haven't done programming and stuff in
15:47
our audience that when it's expensive
15:49
to go between processes and
15:52
so that database server in a different process,
15:55
it takes a lot of resource to go from the process you're
15:57
in off to that and back. And so this
15:59
puts it all into one, you might
16:01
say one little sandbox where
16:03
you're able to maximize that. Would that be
16:06
a fair assessment? Yeah. Yeah, so
16:08
I think one of the other advantages of
16:10
having this type of a model is that
16:12
you can share memory between the processes. So
16:14
just to go a little bit inside the
16:16
technical aspects of this, is that
16:18
the bottleneck that Till was explaining was more like
16:21
the data transfer bottleneck. But in
16:23
this case, when it's running within the process, you
16:25
can share the same memory, you can share the
16:27
variables that are, crunching
16:30
inside, let's say a Python script, that you're
16:32
crunching a variable, and then you have access
16:34
to the variable inside your database as well,
16:36
for example. And this makes
16:38
it super powerful for the developer,
16:40
for the developer experience as well. And I
16:43
guess one of the things that apart from
16:45
the database itself being super fast,
16:48
the developer experience of using
16:50
DuckDB is so awesome in
16:52
that sense that I guess that has also led
16:55
to the success of it, you
16:57
know. Okay,
17:01
friends, I'm here with a new friend
17:03
of ours over at Timescale. So
17:07
to start, help us understand what exactly
17:09
Timescale is. So
17:12
Timescale is a Postgres company. We build
17:14
tools in the cloud and in the
17:16
open-source ecosystem that allow developers to do
17:18
more with Postgres. So
17:20
using it for things like time-series analytics and,
17:22
more recently, AI applications like
17:26
RAG and search and agents. Okay, if our listeners
17:28
were trying to get started with Postgres,
17:30
Timescale, AI application development, what would you
17:33
tell them? What's a good roadmap? If
17:35
you're a developer out there, you're either
17:37
getting tasked with building an AI application,
17:39
or you're interested in just seeing all
17:41
the innovation going on in the space
17:44
and want to get involved yourself. And
17:46
the good news is that any developer
17:48
today can become an AI engineer using
17:50
tools that they already know and love.
17:53
And so the work that we've been
17:55
doing at Timescale with the PGAI project
17:57
is allowing developers to build AI applications.
18:00
with the tools and with the database
18:02
that they already know, and that being
18:04
Postgres. What this means is that you
18:06
can actually level up your career, you
18:08
can build new interesting projects, you can
18:10
add more skills without learning a whole
18:12
new set of technologies. And the best
18:14
part is, it's all open source, both
18:16
PGAI and PG Vector Scale. They're open
18:18
source, you can go and spin it
18:20
up on your local machine via Docker,
18:22
follow one of the tutorials on the
18:24
Timescale blog, build these cutting edge applications
18:26
like RAG and such without having to
18:28
learn 10 different new technologies
18:30
and just using Postgres and the SQL
18:33
query language that you probably already know
18:35
and are familiar with. So yeah, that's
18:37
it, get started today. It's a PGAI
18:39
project and just go to any of
18:41
the Timescale, GitHub repos, either the PGAI
18:43
one or the PG Vector Scale one
18:45
and follow one of the tutorials to
18:48
get started with becoming an AI engineer,
18:50
just using Postgres. Okay,
18:52
just use Postgres and just
18:54
use Postgres to get started
18:57
with AI development, build RAG,
18:59
search, AI agents and it's
19:02
all open source. Go to
19:04
timescale.com/AI, play with PGAI, play
19:07
with PG Vector Scale all locally
19:09
on your desktop. It's open source.
19:12
Once again, timescale.com/AI.
19:16
So Aditya, you
19:18
were just describing the
19:20
developer experience, which
19:41
I would definitely say is kind of fitting
19:43
that magical experience that you alluded
19:46
to with DuckDB and maybe
19:48
just to give a sense of people, like
19:51
when I was initially exploring this, similar to some
19:53
of the experiences that you all talked about, I
19:55
would encourage our listeners to go out and install
19:58
DuckDB locally and try something, because it
20:00
is a really interesting experience, especially
20:03
for those that have worked with
20:05
traditional database systems in the past
20:07
and all of a sudden, so
20:10
you kind of install DuckDB
20:12
locally, import it as a library, then you
20:15
can query, you know, point to CSV files
20:18
or JSON files or parquet files,
20:21
or even a database like a
20:23
Postgres database or data stored in
20:25
an S3 bucket. And you have
20:27
this consistent then SQL interface that's
20:30
familiar that you can do queries over
20:32
that data. So
20:35
I don't know, maybe one
20:38
of you could describe some of
20:40
the, you know, just
20:42
to give people a sense of
20:44
the use cases for DuckDB
20:46
maybe on one side where it's
20:49
like the primary or the key
20:51
or the most often occurring
20:54
use cases that you see people grabbing
20:56
DuckDB and using it for. And
20:58
then maybe on the other side, just
21:00
to kind of help
21:03
people understand where it fits, maybe
21:05
where it wouldn't be as relevant
21:09
if you have any of those thoughts. I
21:11
can give like a brief overview of this. Some
21:14
of the biggest users of DuckDB
21:16
come from the Python ecosystem, and
21:18
which means that it's
21:21
being a stand-in for a
21:23
data frame, for example. And
21:26
one of the advantages of using DuckDB
21:28
is that it's really fast on aggregates. And
21:32
for the Python ecosystem, it helps with
21:34
standing in for a data frame to
21:36
be used with other ML
21:39
libraries, for example. So
21:41
that's like one part of the ecosystem. And the
21:43
other part of the ecosystem is for a data
21:46
engineer to be able to pull in data
21:48
from different sources, like you said, you
21:50
know, Postgres from CSV, and to
21:52
be able to join those different data
21:54
sets. Joins are
21:56
really good with DuckDB as well, and
21:58
to create transformed data sets
22:00
is also pretty useful.
22:03
And the third ecosystem is
22:06
for a data analyst who is writing SQL,
22:08
and one of the really nice aspects
22:11
of DuckDB is the SQL dialect
22:13
itself. It's pretty flavored,
22:15
in that you have a lot of
22:17
DuckDB functions that make data
22:19
cleaning easy, data transformation easy. For
22:22
example, we also have a dialect that says
22:25
from table, and that's just gonna show
22:27
you the table. Instead of
22:29
going select star from table, you can go
22:31
from table, and that will just
22:34
fetch data from that table. So there
22:36
are these flavors of the dialect in
22:38
DuckDB that make it nice. I
22:41
was also looking through the DuckDB website and
22:43
stuff, and I know it runs
22:45
on kind of all the major
22:47
platforms and architectures and you support
22:50
a variety of languages on it.
22:52
I'm curious, 'cause I'm
22:54
asking a question to my own
22:56
interest selfishly, as Dan would say,
22:58
do you support kind
23:01
of embedded environments and kind of on
23:03
the edge, that kind of stuff where
23:05
you find it embedded and operating
23:08
where it's not necessarily on
23:10
a cloud server on one of the major platforms. Is
23:12
that a typical use case? That is one of the
23:16
good use cases for DuckDB. Since
23:18
it has this in-process model,
23:20
DuckDB
23:23
can run wherever you run
23:25
Python or R or anywhere. And
23:28
they've also optimized it to
23:31
run in different architectures as well. So
23:33
this makes it possible. And to kind
23:36
of go beyond that, you can also run
23:38
it in the browser. So any edge environment,
23:40
you can run it. Of course, there's a
23:43
lot of optimization needed; there are like a
23:45
lot of edge environments at the moment, and not
23:47
everything is optimized to
23:49
run DuckDB, but I guess it's also
23:51
moving towards being run in every edge environment
23:54
as well. Some of our
23:56
listeners might be curious why, you
23:59
know, a person like me, sort
24:01
of living day to day in the
24:03
AI world is thinking, is
24:05
super excited to talk about DuckDB. I
24:07
mean, certainly I have a past in
24:10
more broadly data science and this is
24:12
pain I felt over time, but also
24:14
there's a very relevant piece of this
24:16
that intersects with the needs of the
24:20
AI community more broadly and the
24:22
workflows that they're executing.
24:26
One of those is where I started getting into
24:29
this is in these dashboard-killing AI apps that
24:35
people are trying to build in the sense
24:37
that like, hey, another
24:39
pain of mine as a data scientist in
24:41
my life is building dashboards because you always
24:43
build them and they never answer
24:45
the questions that people actually have. And
24:48
so there's this real desire to
24:51
have a natural language question input
24:53
and then you can then compute
24:55
very quickly the answer
24:57
to that natural language question by using
24:59
the LLM to generate a SQL query
25:02
to a number of data sources. But
25:05
then when you start thinking about, oh, well,
25:07
now I have these CSV files that people
25:10
have uploaded into a chat interface, or I
25:12
have these types of databases that I need
25:15
to connect to or have this data in
25:17
S3 buckets and my answer could come from
25:19
these different places, all of a
25:21
sudden this kind of rich SQL
25:23
dialect that you talked about that's very
25:26
quick and can run with
25:29
a standardized API across
25:31
those sources becomes incredibly
25:34
intriguing for me. Transparently,
25:36
that's how I sort of like got into
25:38
this is I'm like thinking
25:41
of all of these sources of
25:43
data that I could answer questions out of
25:45
using an LLM, but how do
25:47
I standardize a fast
25:49
interface to all of these diverse sets
25:51
of data and also do it in
25:53
a way that, you know, is
25:56
easy to use from a developer's perspective.
25:59
But I also know that you all
26:01
see much more than I do and
26:04
maybe that is an entry point that
26:06
you're seeing. I'm wondering if one
26:08
of you could talk a little bit more
26:10
broadly of how the problems that DuckDB is
26:12
solving and the problems that your customers are
26:15
looking at are intersecting
26:17
with this rapidly developing world
26:19
of AI workflows. I
26:21
mean, one way to describe DuckDB
26:24
is it's the SQLite for analytics.
26:28
So it is basically
26:30
a very easy
26:32
way, a very developer friendly way to
26:34
achieve what you just described. If I
26:37
want to create a demo for
26:39
my new text to SQL model,
26:42
if I use DuckDB for it, I can
26:45
even make a completely, like,
26:47
Wasm-based demo out of it,
26:49
for example. I don't have
26:52
any issues with CSV
26:54
upload. There might be
26:56
databases where I have to specify the
26:59
delimiter of the file that the user uploads. So
27:01
I would have to show a dialogue to my
27:03
user where they say, oh, that's comma-separated and
27:05
it has a header row and
27:08
so on. With DuckDB, it
27:10
just works. So it
27:12
takes away some of the edges
27:14
you might have with other databases.
27:17
And on top of that, as you said,
27:19
it integrates with different
27:21
storage backends like it can read
27:24
from S3, it can read from
27:26
HTTP. When
27:28
I see an interesting file on, let's
27:30
say, Hugging Face or GitHub, I
27:33
just run read CSV from
27:35
this URL and I have the data
27:37
set locally in my CLI or in
27:39
my Python. Furthermore, when I
27:41
have, say,
27:44
a Python environment, I start
27:46
a Colab notebook, right?
27:48
And I create some data frames. With
27:51
DuckDB, I can just read those
27:53
data frames. I've seen very cool demos
27:55
of people basically using text
27:58
to SQL for, yeah, for
28:00
analytics on pandas data frames.
28:02
And under the hood, it's just
28:05
DuckDB sitting there and
28:07
basically reading straight from those
28:09
pandas data frames, which by the way, is
28:11
one of the other benefits of shared
28:14
memory, of being in process. It's
28:17
not only for fetching results,
28:19
it's also for reading data straight from
28:21
the process. So in that case from
28:23
pandas. That's very
28:26
exciting. I'm happy to talk more about
28:28
text to SQL. We have
28:30
had a project about that at
28:32
MotherDuck. But yeah, yeah,
28:35
and maybe also, before
28:37
we get into maybe some of those
28:40
stories, I think that that's
28:42
one side of it is like the
28:44
integration of this analytics piece into AI
28:47
workflows. But then also, if I'm not
28:49
mistaken, there is sort of vector search
28:52
capabilities within DuckDB as well. I don't
28:54
know if one of you could speak
28:56
to that. Yeah, that's one of the
28:58
exciting aspects of DuckDB as well. So
29:01
if I could take
29:03
a step back and think about other
29:05
ecosystems where, let's say, Postgres has been
29:07
shining a lot: Postgres has exploded into
29:09
the kind of possibilities that you can
29:11
do because it has kind of like
29:13
an amazing extension mechanism, where
29:15
you could add extensions and capabilities.
29:17
And in a similar
29:21
way, DuckDB has an extension mechanism
29:23
that you have access to the
29:25
internal workings of DuckDB. And you
29:27
could add more workflows
29:30
on top of what DuckDB can
29:32
do, right? DuckDB has these capabilities
29:34
of doing vector search, for example,
29:36
and it also has hybrid search,
29:39
where you also have full
29:41
text search, and vector search
29:43
that you could put together to create
29:45
hybrid search. One of the ways it
29:47
does this is that it has a really nice data type.
29:50
I can go into the rabbit hole of the
29:52
inner workings of how they make this happen, which
29:54
is also pretty exciting. But one
29:57
of the ways they make this possible is
29:59
to provide like an array data type where you
30:01
can have an array of floating
30:04
points, and then you can store this
30:06
as a data type, and then that
30:09
eventually becomes an embedding vector that you
30:11
can do cosine similarity against. So that
30:14
is to do like an embedding-based search. Then
30:16
you can also have full-text search where
30:19
you can create an
30:22
inverted index of keywords to
30:24
your documents, and you can search across your
30:27
keywords to find your ideal
30:29
documents and rank them according to the score. And then
30:32
you could fuse both of these
30:34
scores from embedding search and from full-text
30:37
search to have like a hybrid search.
30:40
So yeah, so all of these are possible, and
30:42
they're very accessible. Well,
30:45
there's no shortage
30:47
of helpful AI tools
30:50
out there, but
31:00
using these AI tools means you got
31:02
to switch back and forth, back and
31:04
forth between yet one more tool. So
31:07
instead of simplifying your workflow, it just
31:09
gets more complicated, but that's not how
31:11
it works when you're using Notion. Notion
31:13
is the perfect place to organize
31:16
lots of stuff, tasks, tracking your
31:18
habits, writing beautiful docs, collaborating with
31:20
your team, knowledge bases, and the
31:23
more content you add to Notion,
31:25
the more this cool thing called
31:27
Notion AI can personalize all of
31:30
the responses for you. Unlike generic
31:32
chatbots, Notion AI already has the
31:34
context of your work. Plus, it
31:36
has multiple knowledge sources. It uses
31:39
AI knowledge from GPT-4 and Claude,
31:41
and that helps you chat about
31:43
any topic. And here's the kicker.
31:46
Now in beta, Notion AI can
31:48
search across Slack discussions, Google Docs,
31:50
Sheets, Slides, and even more tools
31:53
like GitHub and Jira. Those are
31:55
coming soon. And unlike
31:57
specialized tools or legacy suites
32:01
that have you bouncing between
32:03
different applications, Notion is seamlessly
32:03
integrated, infinitely flexible, and beautifully
32:05
easy to use. So you
32:07
are empowered to do your
32:09
most meaningful work inside Notion.
32:12
From small teams to massive Fortune
32:14
500 companies, these
32:16
teams, both small and large, use
32:19
Notion to send less email, cancel
32:22
more meetings, save time searching
32:24
for their work, and they
32:26
reduce spending on tools, which
32:28
helps everyone stay on the
32:30
same page. You can try
32:32
Notion for free today by
32:35
going to notion.com/practicalai. That's
32:37
all lowercase notion.com/practicalai
32:39
to try the powerful,
32:42
easy to use Notion
32:44
AI today. And
32:46
of course, when you use our
32:48
link, you're supporting our show, and
32:51
I know you love that. Again
32:53
notion.com/practicalai. So,
33:05
Till, you were starting to get
33:07
into even some of the things
33:10
now that you're doing at MotherDuck on
33:12
top of DuckDB. I'm wondering,
33:14
hopefully we can get to some of
33:16
those use cases or the
33:18
things that you've been doing with
33:20
customers or internally. But I'm wondering before
33:23
we do that, I
33:25
see also this
33:27
story about DuckDB's
33:30
efficiency, but with this
33:32
multiplayer aspect as
33:35
part of what you're doing at MotherDuck. So,
33:37
maybe one of you could describe, now
33:40
I think we have a sense of what DuckDB is, and
33:43
it's this free thing that is open and
33:45
I can pull down, I can install, I
33:47
can run it very quickly, run it on
33:50
my laptop, run it in my browser, do
33:52
these analytics queries. So, now kind of
33:55
describe maybe a little bit of how
33:57
you're taking that further with MotherDuck,
33:59
and how you're thinking about some of
34:01
the enterprise use cases. I
34:04
like to describe MotherDuck
34:06
as giving your
34:09
DuckDB a cloud companion.
34:11
So it's easy
34:14
to think or
34:16
to associate, okay, we
34:19
bring DuckDB to the cloud, which
34:21
is one way we describe ourselves as
34:24
well, to associate that with
34:26
we provide infinite scale up in the
34:28
cloud. You give us a workload and
34:30
we start however
34:32
many hundreds of DuckDBs in
34:35
the background that, in a
34:37
task-like fashion, let's say,
34:39
process your data concurrently. But
34:42
actually, one
34:46
of the hypotheses that MotherDuck
34:49
is based on or that the company
34:51
was founded on is that actually single
34:54
node compute, which means one
34:56
DuckDB database with
34:59
today's cloud hardware
35:01
actually gets you very, very,
35:03
very far. So
35:06
when your local compute resources
35:10
reach their limit, you
35:12
have
35:14
single cloud instances with up
35:17
to, how much is it, 24 terabytes
35:19
of memory, that's
35:21
relatively big data. So
35:24
that's one aspect, right? So scaling up
35:27
with one cloud-companion DuckDB.
35:29
Another aspect is
35:32
collaboration. So once
35:34
you're connected to a cloud
35:36
instance, you can have
35:38
shared context with other users in
35:41
your organization, you can create
35:43
shared data sets, you
35:45
can have shared notebooks, and
35:48
so on and so forth. And with that,
35:51
of course, comes all the enterprise SOC
35:53
2 kind of things that
35:55
some of the enterprise customers require
35:58
to adopt it. Thank
36:00
you. I'm curious if you
36:02
could, uh, that you really
36:05
captured my imagination with that, uh,
36:07
that description. And so like, because, you
36:09
know, by drawing, for instance, with kind
36:11
of, you know, the old-school Postgres
36:13
things that people would do with that.
36:15
And you just talked about having many
36:18
DuckDB instances operating
36:20
concurrently, you know, what
36:22
kinds of problems and kind of, you
36:24
know, grounding it in a, in a practical
36:26
way for, from a user's perspective, what kind
36:29
of problems, uh, do you see
36:31
people solving with that kind of architecture, um, and
36:34
that new capability that they may not have
36:36
historically had over the years with previous database
36:39
capabilities on other platforms? What new
36:41
sets of concerns can they address
36:44
now with those? I would
36:46
come at this from the perspective that,
36:48
um, there are a lot
36:50
of companies out there that when they
36:53
want to go to the cloud
36:55
with analytics workload, they have relatively
36:57
limited choices. One of
36:59
those choices is, uh, like Snowflake or
37:02
Databricks. And,
37:04
of course, those systems are
37:06
optimized for big data scale. So,
37:09
but then one
37:11
of our observations is that a lot
37:13
of companies actually don't have
37:15
that amount of data when
37:18
they run queries or they might have big
37:20
data, but the queries they are running,
37:23
uh, only access a very small subset
37:25
of the data. For example, you
37:27
know, you run, um, monthly reports,
37:30
they don't touch your, your entire
37:32
historic dataset. So those
37:35
companies, um, might
37:37
want to have something that is, first, easier
37:40
to use, easier to
37:42
set up. And that's also more cost
37:44
efficient than other existing solutions. One
37:47
of the things that we haven't touched
37:49
upon in this yet is kind
37:52
of how MotherDuck and DuckDB
37:54
go hand in hand with like the
37:56
remote and the local aspect where you
37:59
have, on your local and
38:01
your remote the same client, so
38:03
that you're running the same thing.
38:05
It's easy to go from one place to the other doing
38:08
the same thing. What MotherDuck
38:10
also provides is a dual execution
38:12
where your local DuckDB,
38:14
if you're running it locally, can
38:17
communicate with your remote MotherDuck
38:20
and execute seamlessly between
38:22
both. For example, a
38:24
query where you have a table in
38:27
your local DuckDB and you want to
38:29
join it with a remote DuckDB, you
38:31
can join both of these tables
38:34
together to run an aggregate. Then
38:36
there's a query optimization that we
38:38
run where we transfer
38:40
the data which was required from the remote
38:43
to your local or from your local to
38:45
remote and execute it intelligently in
38:47
a way, if I could say that. This
38:50
opens up new opportunities in the
38:53
dual execution aspect of running your
38:55
local and the remote with the
38:58
same client. I'm curious, and it's a
39:00
selfish question: as you're doing that and you
39:02
have the local version and the remote version,
39:05
the connection between the two there. What
39:08
does that look like? Is it something
39:10
that if they're widely separated, if MotherDuck's
39:12
in the Cloud and I'm out on
39:15
a device that's not Cloud-based, is
39:17
that efficient communication? How do you all handle
39:19
those different types of use cases? One
39:23
of the principles of this dual execution
39:25
is to reduce
39:27
the amount of data that
39:29
has to be transferred as much as possible.
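That principle can be sketched as a tiny planner rule, compare the sizes of the two sides of the join and ship the smaller one. This is only an illustration of the idea, not MotherDuck's actual optimizer; the function name and byte-size inputs are made up.

```python
# Toy sketch of dual-execution planning: minimize data transfer by
# moving the smaller side of a local/remote join to the other side.

def plan_join(local_table_bytes: int, remote_table_bytes: int) -> str:
    """Decide which side of a local/remote join to transfer."""
    if local_table_bytes <= remote_table_bytes:
        return "upload_local"    # e.g. a small lookup table on your laptop
    return "download_remote"     # the remote side is the small one

# A 1 TB dataset in the cloud joined against a 2 MB table on a laptop:
# the planner uploads the 2 MB table rather than downloading 1 TB.
print(plan_join(2 * 1024**2, 1024**4))  # -> "upload_local"
```

A real optimizer weighs more than raw size (predicates, network cost, which side holds the aggregation), but the size comparison captures the core trade-off the speaker describes next.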
39:31
One of the use cases, for example, is
39:33
I have a really large dataset
39:36
on S3 and I want
39:38
to join it with a small table that
39:40
I have on my notebook. In
39:44
that case, an optimizer, query
39:47
optimizer will make the
39:49
decision to, instead of downloading
39:51
the one terabyte dataset to
39:53
your local device and doing the join there,
39:56
to instead upload your small local file
39:58
to the. at
44:00
it. It's pretty cool
44:02
what you've talked about today in terms
44:04
of what is possible for us. How
44:06
are you thinking about the future? What
44:09
are the new cool things that you
44:11
have in mind? I often say when
44:13
you're not necessarily working hard on a
44:15
problem, but you're chilling out at the
44:17
end of the day and your
44:19
mind is just wandering in free form and
44:21
you're thinking, boy, what if we could do
44:23
this? I could imagine that and I can
44:25
see a path forward to get there. How
44:28
are each of you thinking about
44:30
MotherDuck and DuckDB in
44:33
terms of what the future might offer
44:35
if you want to
44:37
get out there and wax poetic a little
44:39
bit and it doesn't have to
44:41
be grounded in current work, but more in imagination
44:43
and aspiration? One of the things that
44:45
I really like about the current state of AI is
44:49
how good the local models are, the small models
44:51
that you can run locally. There's
44:53
a great ecosystem out there building on top
44:55
of that. One of
44:57
the things that I see with the
44:59
local models is, of course, they hallucinate, but
45:02
to prevent hallucination, you can use a
45:04
really nice RAG mechanism to put context
45:06
into those local models. These
45:09
local models could be on the edge as well. It could be
45:11
on your local laptop. It could be on
45:14
the edge. Knowledge
45:16
bases are essentially created
45:18
to prevent these hallucinations.
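As a minimal sketch of that RAG mechanism, here is a toy retriever that pulls the most relevant snippets from a knowledge base and prepends them to the prompt, so a small local model answers from grounded context. The word-overlap scoring stands in for real embedding search, and the knowledge base is made up for the example.

```python
# Minimal RAG sketch: retrieve relevant context from a knowledge base,
# then build a grounded prompt for a (local) language model.
import re

knowledge_base = [
    "DuckDB is an in-process analytical database.",
    "MotherDuck adds a cloud companion to DuckDB.",
    "Ducks are waterfowl in the family Anatidae.",
]

def words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, k: int = 2) -> list:
    # Rank documents by word overlap with the question; a real system
    # would rank by embedding similarity instead.
    q = words(question)
    ranked = sorted(knowledge_base,
                    key=lambda doc: len(q & words(doc)),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is DuckDB?"))
```

The retrieved context is what keeps the model from inventing an answer; swapping the knowledge base out (or dropping it) changes what the same model can answer, which is the sharing idea discussed next.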
45:21
One wasteful aspect of creating
45:24
knowledge bases is that everybody's
45:26
creating very similar knowledge bases.
45:29
What if there could be a mechanism where
45:31
we could share these knowledge bases?
45:34
A user could create a knowledge base and
45:36
they could share a knowledge base. One of
45:38
the imaginative worlds that I've
45:40
drawn to is how MotherDuck could
45:42
be there to do these kind
45:44
of shareable knowledge bases where you
45:46
essentially have a world of
45:49
remote knowledge bases out there in your
45:51
remote tables. Then you have
45:53
a local DuckDB client that helps you
45:55
pull a knowledge base that you want,
45:58
use the local knowledge base to augment
46:00
your local model with the
46:03
relevant context for your current
46:05
question. And then when you don't want the
46:07
knowledge base, you could also drop the knowledge
46:09
base. And that's like having a remote
46:11
knowledge base repository, and pulling
46:14
whatever you want. This is like
46:16
one of the dreams that I
46:19
think about how MotherDuck and DuckDB
46:21
could be useful for this. And
46:24
another aspect of talking
46:27
about knowledge bases and RAG applications
46:29
is that not all
46:32
applications and workflows require
46:34
a real-time database to
46:36
build agents on top of them. And
46:38
some of these agents could be running
46:40
as background agents that do some workflow
46:43
once every day. And instead
46:45
of having a real-time database for that, what
46:47
if you could provide a very lightweight analytical
46:50
engine that's quite cheap to run locally as well?
46:52
And then you could also offload some
46:55
work to the remote cloud. So
46:57
this is another thing that keeps me excited
47:00
at night to think about what could
47:02
be these kind of use cases. But
47:04
yeah, these are the two use cases that I
47:06
am quite excited about. Yeah, I
47:08
mean, maybe I can add
47:11
two things. One
47:13
thing that actually connects to that is
47:18
bringing AI and machine
47:21
learning capabilities more into
47:23
the database. So one of the
47:25
things we've seen in the
47:27
past is that the inference costs
47:29
of language models have
47:31
dropped quite significantly compared
47:33
to two years ago. It's
47:36
now, I think, only
47:38
2% of the price
47:40
for inference with GPT-4o
47:43
Mini compared to GPT-3. And
47:47
that actually makes it possible to
47:49
run language model
47:51
inference on your tables and
47:55
also to do things like embedding
47:58
compute on your tables. SQL
48:00
is just a really, really convenient user interface
48:02
for that. So we added this embedding function
48:05
some time ago that works really well together
48:07
with a vector search. So you can basically
48:10
do embedding based search only
48:12
in SQL. Now we're adding
48:14
the prompting capabilities so you can do
48:17
language model based data wrangling in your
48:19
database and that together with
48:22
local models and this
48:24
hybrid execution model, we say, okay, we
48:26
do part of the work locally. Maybe
48:28
if you have a GPU, do part
48:30
of the embedding inference locally. If
48:32
you want to do it faster, do it in the cloud
48:35
with a few A100s. And
48:38
again, everything is in SQL.
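The "everything in SQL" pattern, registering model inference as a SQL scalar function, can be illustrated with Python's stdlib sqlite3 standing in for DuckDB/MotherDuck. The `embedding` function here is a fake character-count model; MotherDuck's real embedding function calls a hosted model, so everything below other than the general pattern is an assumption for illustration.

```python
# Toy "SQL as the interface for model inference": register a fake
# embedding model and a cosine-similarity function as SQL scalar
# functions, then do embedding-based ranking entirely in SQL.
import json
import math
import sqlite3

def embed(text: str) -> str:
    # Fake "embedding": letter-frequency vector. A real model would
    # return hundreds of learned dimensions.
    counts = [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]
    return json.dumps(counts)

def cosine_sim(a_json: str, b_json: str) -> float:
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

con = sqlite3.connect(":memory:")
con.create_function("embedding", 1, embed)
con.create_function("cosine_sim", 2, cosine_sim)

con.execute("CREATE TABLE docs (body TEXT)")
con.executemany("INSERT INTO docs VALUES (?)",
                [("duck databases",), ("tiny",), ("analytical engines",)])

# Embedding-based ranking expressed entirely in SQL:
rows = con.execute("""
    SELECT body,
           cosine_sim(embedding(body), embedding('duck database')) AS sim
    FROM docs
    ORDER BY sim DESC
""").fetchall()
print(rows)
```

The point is the shape of the query, not the toy model: once inference is a SQL function, embedding search and prompt-based wrangling become ordinary expressions over tables.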
48:42
That's awesome. Yeah, well, thank you both
48:44
for taking time out of your analytics,
48:47
AI, database work to come
48:49
talk to us. This has
48:51
been super amazing. And
48:53
I would definitely encourage people out there,
48:55
please, please, please go try out some
48:57
things. Try out some examples
48:59
with DuckDB. Check out the MotherDuck
49:01
website and some of the great blog
49:04
post content that they have there, examples
49:06
or things that they're doing. Check it
49:08
out because it's definitely a really
49:11
wonderful thing that you can add into your
49:14
AI stack and think about and experiment with.
49:16
So thank you so much, Till and Aditya,
49:18
for joining. It's been a pleasure. Thank you
49:20
guys for having us. Thank you, guys. It
49:22
was pretty awesome to be here. Thanks
49:55
again to our partners at fly.io
49:57
and to our beat freakin' resident, Breakmaster
49:59
Cylinder.