Big data is dead, analytics is alive

Released Thursday, 24th October 2024

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.

0:05

Welcome to Practical AI.

0:08

If you work in artificial intelligence,

0:10

aspire to, or are curious how

0:12

AI related tech is changing the

0:14

world, this is the show for

0:17

you. Thank you to our

0:19

partners at fly.io. Fly

0:21

transforms containers into micro VMs that run

0:23

on their hardware in 30 plus

0:26

regions on six continents, so you

0:28

can launch your app near your

0:30

users. Learn more at fly.io.

0:35

Hey friends, you know we're big

0:37

fans of fly.io and I'm here

0:39

with Kurt Mackey, co-founder and CEO

0:42

of Fly. Kurt, we've had some

0:44

conversations and I've heard you say

0:46

that public clouds suck. What

0:48

is your personal lens into public clouds sucking

0:50

and how does fly not suck? Alright, so

0:52

public clouds suck. I actually think most ways

0:55

of hosting stuff on the internet sucks and

0:57

I have a lot of theories about why

0:59

this is but it almost doesn't matter. The

1:01

reality is like I've built a new app

1:03

for like generating sandwich recipes because my family's

1:05

just into specific types of sandwiches that use

1:08

brown sugar as a component for example and

1:10

then I want to like put that somewhere.

1:12

You go to AWS and it's harder than

1:14

just going and getting like a dedicated server

1:16

from Hetzner. It's like it's actually like more

1:18

complicated to figure out how to deploy my

1:20

dumb sandwich app on top of AWS because

1:23

it's not built for me as a developer

1:25

to be productive with. It's built for other

1:27

people. It's built for platform teams to kind

1:29

of build the infrastructure of their dreams and

1:31

hopefully create a new UX that's useful for

1:33

the developers that they work with. And again,

1:36

I feel like every time I talk about

1:38

this, it's like I'm just too impatient. I

1:40

don't particularly want to go figure so many

1:42

things out purely to put my sandwich app

1:44

in front of people and I don't particularly

1:46

want to have to go talk to a

1:49

platform team once my sandwich app becomes a

1:51

huge startup and IPOs and I have to

1:53

like do a deploy. I

1:55

kind of feel like all that stuff should just

1:57

work for me without me having to go ask

1:59

permission or talk to anyone else. And so this

2:01

is a lot of, it's informed a lot of

2:04

how we built Fly. Like we're still a public

2:06

cloud. We still have a lot of very similar

2:08

low level primitives as the bigger guys. But

2:11

in general, they're designed to be used

2:13

directly by developers. They're not built for

2:15

a platform team to kind of cobble

2:17

together. They're designed to be useful

2:19

quickly for developers. One of the ways we've

2:21

thought about this is, is if you can

2:24

turn a very difficult problem into a two

2:26

hour problem, people will build much more interesting

2:28

types of apps. And so this is why

2:30

we've done things like made it easy to

2:33

run an app multi-region. Most companies don't run

2:35

multi-region apps on public clouds because it's functionally

2:37

impossible to do without a huge amount of

2:39

upfront effort. It's why we've made things like

2:42

the virtual machine primitives behind just

2:44

a simple API. Most people don't do like

2:46

code sandboxing or their own virtualization because it's

2:48

just not really easy. It's not, there's no

2:50

path to that on top of the clouds.

2:52

So in general, like I feel like, and

2:54

it's not really fair of me to say

2:56

public clouds suck because they were built for

2:59

a different time. If you build one of

3:01

these things starting in 2007, the

3:04

world's very different than it is right now. And so a

3:06

lot of what I'm saying, I think, is that public clouds

3:08

are kind of old and there's a

3:10

new version of public clouds that we should

3:12

all be building on top of that are

3:14

definitely gonna make me as a developer much

3:16

happier than I was like five or six

3:18

years ago when I was kind of stuck

3:20

in this quagmire. So AWS was built for

3:22

a different era, a different cloud era, and

3:24

Fly, a public cloud, yes, but

3:27

a public cloud built for developers who

3:29

ship. That's the difference. And we here

3:31

at Changelog are developers who ship. So

3:34

you should trust us. Try out Fly,

3:36

fly.io. Over three

3:38

million apps, that includes us,

3:40

have launched on Fly. They

3:42

leverage the global anycast load

3:44

balancing, the zero config private

3:47

networking, hardware isolation, instant

3:49

WireGuard VPN connections with push

3:51

button deployments, scaling to thousands

3:53

of instances. This is the

3:55

cloud you want. Check it out,

3:57

fly.io again. fly.io.

4:18

Welcome to another episode of the

4:21

Practical AI podcast. This is Daniel

4:23

Whitenack. I am CEO at Prediction

4:25

Guard, where we're building a private

4:28

secure gen AI platform. And I'm

4:30

joined as always by my co-host,

4:32

Chris Benson, who is a principal

4:34

AI research engineer at Lockheed Martin.

4:37

How are you doing, Chris? Doing

4:40

very well today, Daniel. How's it going? It

4:42

is going great. I'm super excited about

4:44

this one because it's a very, we

4:47

schedule a lot of shows and they're all

4:50

interesting, of course. But

4:52

occasionally, there's a show on a topic

4:54

that intersects with something that I'm working

4:56

on at the moment or something that

4:58

I found that is really exciting and

5:01

found to be really useful. And so, selfishly,

5:04

I'm really extra excited about

5:07

this episode this week, which

5:09

is with Till and Aditya from MotherDuck.

5:12

How are you doing? Doing good.

5:14

Excited to be here. Yes.

5:16

And note, Duck as in the

5:18

bird. So editors, you don't have to

5:20

bleep us out. Sure,

5:23

that's something that is an old joke

5:25

for you all. I can pinpoint

5:27

very easily how I ran across

5:30

DuckDB and MotherDuck is

5:32

there was a blog post. The

5:34

title is very simple. It said, Big Data is

5:36

Dead. And immediately when I

5:39

saw the title, I was like, thank

5:41

goodness, finally. But

5:43

I'm wondering if you can maybe just

5:46

step back. It doesn't

5:48

necessarily have to be the points in

5:50

that blog post. But how you see

5:52

the kind of data

5:55

analytics, big data, AI intersections

5:58

as of now. And

6:01

what are the sort of concerns and

6:03

issues that people are thinking about that

6:05

is driving them to DuckDB?

6:07

And then of course, we'll obviously get into

6:09

DuckDB and Mother Duck and all that you're

6:12

doing, but setting that stage of, you know,

6:14

what are people struggling with? What have they

6:16

realized in the past about this sort of

6:19

big data hype in one way or the

6:21

other, positive or negative? And how has that

6:24

kind of changed the way that

6:26

people are thinking about analytics and

6:28

databases? I can tell a story

6:30

about how I got in

6:32

touch with DuckDB. It started

6:34

at the very beginning of the

6:37

DuckDB project. I was actually doing

6:39

my master's thesis back then at

6:41

the CWI, where DuckDB

6:44

originated from. And after

6:47

I graduated, Hannes, who is the

6:50

developer or the founder of DuckDB

6:52

Labs, reached out and

6:54

we were talking and they were saying, hey,

6:56

we're working on this new project. We're working

6:58

on this database system.

7:01

Are you interested in

7:03

maybe joining, maybe working on it? But

7:05

I was very focused on machine learning

7:07

and stuff like this. So I wanted

7:09

to go into data analytics,

7:12

data science, these kinds of things.

7:15

So a year later or so,

7:17

I was working at a telco company

7:19

and we were analyzing, you

7:21

know, customer data with Spark and so on.

7:24

And one day there

7:26

was like one of the first versions of

7:28

DuckDB was released. So I pip install it

7:30

and run the first like

7:32

simple aggregation query on a maybe a hundred

7:35

megabyte dataset or something like this. And I

7:38

was surprised because I thought something was

7:40

going wrong. I thought it's impossible that

7:42

it just did the aggregation, right? Because

7:44

from working with Spark, I was so

7:46

used to, okay, now, spinners

7:49

starting for 10 seconds at least,

7:52

right? And then that was really

7:54

eye opening. And I've heard similar

7:56

experience from a lot of people

7:58

even until today. They

8:00

share very similar stories and

8:02

experiences. Yeah,

8:05

for me, it started in a different way.

8:08

I first figured out DuckDB-Wasm existed

8:10

that you could run an analytical

8:13

engine in the browser. And

8:15

to think about something like

8:17

that was super crazy. And the

8:19

kind of stuff that you could do on top of

8:22

it started to look super crazy.

8:24

And one of the things that I was super

8:26

excited about when DuckDB-Wasm released was the

8:28

possibility to do geospatial analytics. So back

8:31

then, when I started my

8:33

first encounter with DuckDB was doing

8:36

geospatial analytics. And

8:39

then to think that that could actually be done

8:41

in the browser was

8:43

like mind blowing. And that's when my

8:45

journey into DuckDB started. So

8:47

let me ask y'all a follow-up question

8:49

as you're diving into your passion. For

8:51

those out there who may be listening

8:53

who are not already familiar with it

8:55

and they're hearing database, they're hearing big

8:57

data is dead, they're hearing

9:00

doing this in the browser. Give me

9:02

a little bit of background on kind of

9:04

the ecosystem that

9:06

you were coming from a bit and also

9:09

what this idea was so

9:12

that people can kind of follow you into

9:14

that. What is it that caught your passion

9:16

and attention and made you say,

9:18

ah, this is the way and

9:20

assume somebody doesn't already have a familiarity with

9:23

it? So I guess

9:26

I was going into this coming from

9:29

the machine learning side of things. So

9:31

I was used to working with scikit-learn,

9:34

pandas or the

9:36

Spark equivalents to that like Spark

9:39

ML, building data prep

9:41

pipelines and so on and

9:43

so forth. So, and then like

9:45

encountering this DuckDB thing

9:48

suddenly that apparently is doing

9:50

aggregations of, the

9:52

sizes of data I was working with much,

9:54

much, much faster. Yeah, that sparked

9:56

some fantasies around, hey, how

9:59

much of the data preparation

10:01

pipeline can we push into DuckDB

10:03

actually? And this

10:05

idea or this fantasy has been following me,

10:08

you know, for the past years. And I

10:10

think it's still an exciting topic. To

10:12

follow up a little bit on that, the way that

10:15

large data or big data has been analyzed

10:18

in the last years, I mean, predominantly

10:20

that you, you required some server in

10:22

the cloud, you, you required resources that

10:24

were not local to be able to

10:26

perform like large analysis, but

10:29

something that DuckDB opened up that

10:31

made possible was to use

10:33

local compute in your local MacBook,

10:35

for example, to

10:37

utilize that compute at the

10:39

most to like, perform

10:41

these kinds of huge analyses.

10:44

And that, I guess, set a

10:47

spark to a

10:49

change in the ecosystem, I would say. And

10:51

I guess that's where we're at. I

10:53

resonate so much with this. So like

10:56

coming from a background also as a

10:58

data scientist, living through the

11:00

years of like being told, Hey,

11:03

you know, use spark for this,

11:05

like basically my experience in

11:08

this sort of ecosystem was like, I would

11:10

try to write a query and it would

11:12

get the right result. But to your point,

11:14

Till, like, I would just

11:16

be waiting forever to get a result. And

11:18

so I'd have to send it to some

11:20

like other guy whose name was Eugene. Eugene

11:23

was really smart and he could figure out

11:25

a way to like make it go fast.

11:27

And I never became Eugene. So like I

11:29

resonated with this very much. And

11:31

the fact that this concept of,

11:33

Hey, there's these seemingly big data

11:36

sets out there. And

11:39

I want to do maybe

11:41

even complicated analytics types of

11:44

queries over these, or even,

11:46

you know, execute workflows of,

11:48

as you mentioned, Till,

11:50

aggregation or other processes at

11:53

query time, I could do that with a

11:55

system that I could just run on my

11:58

laptop or I could run in

12:01

process is really intriguing. So maybe

12:03

now is a good time then

12:05

to like introduce DuckDB formally. So

12:07

like I'm on the DuckDB site,

12:09

it says, DuckDB is a fast

12:11

in-process analytical database. So maybe

12:13

one of you could like take

12:15

a stab at, you

12:17

know, thinking about those data scientists out

12:19

there who are maybe at the point

12:21

of not, also not believing that what

12:24

we just described is maybe

12:26

possible or they're living in a world where

12:28

that's not possible. Describe what

12:30

DuckDB is and maybe why

12:32

that becomes possible as a

12:35

function of what it is. I

12:37

think I can talk a little

12:39

bit about motivation behind DuckDB, at

12:42

least the way I perceived it at the time.

12:44

And that was actually

12:46

originated from the R ecosystem.

12:50

Yeah, so Hannes was very

12:52

involved in that ecosystem and

12:55

people were using R to

12:58

essentially crunch relatively large

13:01

data with relatively primitive

13:04

methods. And

13:07

so at the time, CWI had

13:09

a database system and

13:12

an analytical database system called MonetDB that

13:16

has incorporated the

13:18

idea of vectorized columnar

13:22

query execution. And

13:24

it was a large system that

13:26

was not really easy for

13:29

the typical R users to adopt.

13:32

So the first idea

13:34

was to say, hey, let's maybe

13:37

build a light version of MonetDB and

13:40

integrate it with, I think it was

13:42

dplyr or something like this. And

13:45

we just let it run on the client. But

13:48

eventually it turned out to

13:50

be easier maybe to just

13:52

rebuild the database system from

13:54

scratch that was actually designed

13:56

to run in process

13:58

to be super lightweight, that's

14:01

super easy to install and everything essentially

14:03

to give the power of

14:07

this vectorized query execution into

14:09

the hands of data analysts.

14:11

I'm wondering if you could, when

14:13

you talk about that being in

14:15

process and lightweight, could

14:17

you describe what that means for someone that may

14:20

not be familiar with the term in process?

14:22

And how is that different from

14:25

other databases that are not in

14:27

process, that have their own processes?

14:29

Can you describe a little bit of what that

14:31

means? So classical

14:34

database systems operate in the

14:36

client server architecture. Usually you

14:38

have a database server running

14:40

somewhere and you have a

14:42

client that sends SQL

14:45

queries essentially to the database

14:47

server and then the result is transferred

14:49

back to the client through some

14:52

kind of transfer protocol. One

14:56

paper that turned out to be important here:

14:58

Mark Raasveldt, who is also

15:00

a co-founder of DuckDB Labs,

15:03

was working on a paper that basically

15:06

benchmarked these client protocols and it turned out

15:08

that that was actually a huge bottleneck. So

15:11

even when you're running Postgres on your

15:13

local machine, you still

15:15

have this client server protocol bottleneck.

15:19

And the way to get around this

15:21

is to have the database actually running

15:24

within your process that

15:26

is, in that case, maybe R

15:28

or Python and

15:30

has access to

15:32

the result set just

15:35

in memory and

15:38

no transfers happen. And

15:41

maybe I'd like to just add

15:43

in that for those who maybe

15:45

haven't done programming and stuff in

15:47

our audience that when it's expensive

15:49

to go between processes and

15:52

so that database server in a different process,

15:55

it takes a lot of resource to go from the process you're

15:57

in off to that and back. And so this puts a

15:59

lot of resources in there. This puts it all into one, you might

16:01

say one little sandbox where

16:03

you're able to maximize that. Would that be

16:06

a fair assessment? Yeah. Yeah, so

16:08

I think one of the other advantages of

16:10

having this type of a model is that

16:12

you can share memory between the processes. So

16:14

just to go a little bit inside the

16:16

technical aspects of this, is that

16:18

the bottleneck that Till was explaining was more like

16:21

the data transfer bottleneck. But in

16:23

this case, when it's running within the process, you

16:25

can share the same memory, you can share the

16:27

variables that are, crunching

16:30

inside, let's say a Python script, that you're

16:32

crunching a variable, and then you have access

16:34

to the variable inside your database as well,

16:36

for an example. And this makes

16:38

it super powerful for the developer,

16:40

for the developer experience as well. And I

16:43

guess one of the things that apart from

16:45

the database itself being super fast,

16:48

the developer experience of using that

16:50

DB is so awesome in

16:52

that sense that I guess that has also led

16:55

to the success of it. You

16:57

know. Okay,

17:01

friends, I'm here with a new friend

17:03

of ours over at Timescale. So

17:07

help us understand what exactly Timescale

17:09

is. So

17:12

Timescale is a Postgres company. We build

17:14

tools in the cloud and in the

17:16

open-source ecosystem that allow developers to do

17:18

more with Postgres. So

17:20

using it for things like time-series analytics, and

17:22

more recently, AI applications like

17:26

RAG and Search and Agents. Okay, if our listeners

17:28

were trying to get started with Postgres,

17:30

Timescale, AI application development, what would you

17:33

tell them? What's a good roadmap? If

17:35

you're a developer out there, you're either

17:37

getting tasked with building an AI application,

17:39

or you're interested in just seeing all

17:41

the innovation going on in the space

17:44

and want to get involved yourself. And

17:46

the good news is that any developer

17:48

today can become an AI engineer using

17:50

tools that they already know and love.

17:53

And so the work that we've been

17:55

doing at Timescale with the PGAI project

17:57

is allowing developers to build AI applications.

18:00

with the tools and with the database

18:02

that they already know, and that being

18:04

Postgres. What this means is that you

18:06

can actually level up your career, you

18:08

can build new interesting projects, you can

18:10

add more skills without learning a whole

18:12

new set of technologies. And the best

18:14

part is, it's all open source, both

18:16

pgai and pgvectorscale. Both open

18:18

source. You can go and spin it

18:20

up on your local machine via Docker,

18:22

follow one of the tutorials on the

18:24

Timescale blog, build these cutting edge applications

18:26

like RAG and such without having to

18:28

learn 10 different new technologies

18:30

and just using Postgres and the SQL

18:33

query language that you probably already know

18:35

and are familiar with. So yeah, that's

18:37

it, get started today. It's a PGAI

18:39

project and just go to any of

18:41

the Timescale GitHub repos, either the pgai

18:43

one or the pgvectorscale one

18:45

and follow one of the tutorials to

18:48

get started with becoming an AI engineer,

18:50

just using Postgres. Okay,

18:52

just use Postgres and just

18:54

use Postgres to get started

18:57

with AI development, build RAG,

18:59

search, AI agents and it's

19:02

all open source. Go to

19:04

timescale.com/AI, play with pgai, play

19:07

with pgvectorscale, all locally

19:09

on your desktop. It's open source.

19:12

Once again, timescale.com/AI.

19:16

So Aditya, you

19:18

were just describing the

19:20

developer experience, which

19:41

I would definitely say is kind of fitting

19:43

that magical experience that you alluded

19:46

to with DuckDB and maybe

19:48

just to give a sense of people, like

19:51

when I was initially exploring this, similar to some

19:53

of the experiences that you all talked about, I

19:55

would encourage our listeners to go out and install

19:58

DuckDB locally and try something because it

20:00

is a really interesting experience, especially

20:03

for those that have worked with

20:05

traditional database systems in the past

20:07

and all of a sudden, so

20:10

you kind of install DuckDB

20:12

locally, import it as a library, then you

20:15

can query, you know, point to CSV files

20:18

or JSON files or parquet files,

20:21

or even a database like a

20:23

Postgres database or data stored in

20:25

an S3 bucket. And you have

20:27

then this consistent SQL interface that's

20:30

familiar that you can do queries over

20:32

that data. So

20:35

I don't know, maybe one

20:38

of you could describe some of

20:40

the, you know, just

20:42

to give people a sense of

20:44

the use cases for DuckDB

20:46

maybe on one side where it's

20:49

like the primary or the key

20:51

or the most often occurring

20:54

use cases that you see people grabbing

20:56

DuckDB and using it for. And

20:58

then maybe on the other side, just

21:00

to kind of help

21:03

people understand where it fits, maybe

21:05

where it wouldn't be as relevant

21:09

if you have any of those thoughts. I

21:11

can give like a brief overview of this. Some

21:14

of the biggest users of DuckDB

21:16

come from the Python ecosystem, and

21:18

which means that it's

21:21

being a stand-in for a

21:23

data frame, for example. And

21:26

one of the advantages of using DuckDB

21:28

is that it's really fast on aggregates. And

21:32

for the Python ecosystem, it helps with

21:34

standing in for a data frame to

21:36

be used with other ML

21:39

libraries, for example. So

21:41

that's like one part of the ecosystem. And the

21:43

other part of the ecosystem is for a data

21:46

engineer to be able to pull in data

21:48

from different sources, like you said, you

21:50

know, Postgres from CSV, and to

21:52

be able to join those different data

21:54

sets. Joins are

21:56

really good with DuckDB as well, and

21:58

to create transformed data sets

22:00

is also pretty useful.

22:03

And on the third ecosystem

22:06

for a data analyst who is writing SQL,

22:08

and one of the really nice aspects

22:11

of DuckDB is the SQL dialect

22:13

itself. It's pretty flavored

22:15

that you have a lot of

22:17

DuckDB functions that make data

22:19

cleaning easy, data transformation easy. For

22:22

example, we also have a dialect that says

22:25

from table, and that's just gonna show

22:27

you the table. Instead of

22:29

going select star from table, you can go

22:31

from table and that will, just

22:34

fetch data from that table. So there

22:36

are these flavors of the dialect for

22:38

DuckDB that make it nice. I

22:41

was also looking through the DuckDB website and

22:43

stuff, and I know it runs

22:45

on kind of all the major

22:47

platforms and architectures and you support

22:50

a variety of languages on it.

22:52

I'm curious would, cause I'm

22:54

asking a question to my own

22:56

interest selfishly, as Dan would say,

22:58

do you support kind

23:01

of embedded environments and kind of on

23:03

the edge, that kind of stuff where

23:05

you find it embedded and operating

23:08

where it's not necessarily on

23:10

a cloud server on one of the major platforms. Is

23:12

that a typical use case? That is one of the

23:16

good use cases for DuckDB. Since

23:18

it's the in-process model that

23:20

it has for running, DuckDB

23:23

can run wherever you run

23:25

Python or R or anywhere. And

23:28

they've also optimized it to

23:31

run in different architectures as well. So

23:33

this makes it possible. And to kind

23:36

of go beyond that, you can also run

23:38

it in the browser. So any edge environment,

23:40

you can run it. Of course, there's a

23:43

lot of optimization for, there are like a

23:45

lot of edge environments at the moment, not

23:47

everything is optimized to

23:49

run DuckDB, but I guess it's also

23:51

moving towards being run in every edge environment

23:54

as well. Some of our

23:56

listeners might be curious why, you

23:59

know, a person like. me as sort

24:01

of living day to day in the

24:03

AI world is thinking, is

24:05

super excited to talk about DuckDB. I

24:07

mean, certainly I have a past in

24:10

more broadly data science and this is

24:12

pain I felt over time, but also

24:14

there's a very relevant piece of this

24:16

that intersects with the needs of the

24:20

AI community more broadly and the

24:22

workflows that they're executing.

24:26

One of those is where I started getting into

24:29

this is in these dashboard-killing AI apps that

24:35

people are trying to build in the sense

24:37

that like, hey, another

24:39

pain of mine as a data scientist in

24:41

my life is building dashboards because you always

24:43

build them and they never answer

24:45

the questions that people actually have. And

24:48

so there's this real desire to

24:51

have a natural language question input

24:53

and then you can then compute

24:55

very quickly the answer

24:57

to that natural language question by using

24:59

the LLM to generate a SQL query

25:02

to a number of data sources. But

25:05

then when you start thinking about, oh, well,

25:07

now I have these CSV files that people

25:10

have uploaded into a chat interface, or I

25:12

have these types of databases that I need

25:15

to connect to or have this data in

25:17

S3 buckets and my answer could come from

25:19

these different places, all of a

25:21

sudden this kind of rich SQL

25:23

dialect that you talked about that's very

25:26

quick and can run with

25:29

a standardized API across

25:31

those sources becomes incredibly

25:34

intriguing for me. Transparently,

25:36

that's how I sort of like got into

25:38

this is I'm like thinking

25:41

of all of these sources of

25:43

data that I could answer questions out of

25:45

using an LLM, but how do

25:47

I standardize a fast

25:49

interface to all of these diverse sets

25:51

of data and also do it in

25:53

a way that doesn't, you know, is

25:56

easy to use from a developer's perspective.

25:59

But I also know that you all

26:01

see much more than I do and

26:04

maybe that is an entry point that

26:06

you're seeing. I'm wondering if one

26:08

of you could talk a little bit more

26:10

broadly of how the problems that DuckDB is

26:12

solving and the problems that your customers are

26:15

looking at are intersecting

26:17

with this rapidly developing world

26:19

of AI workflows. I

26:21

mean, one way to describe DuckDB

26:24

is it's the SQLite for analytics.

26:28

So it is basically

26:30

a very easy

26:32

way, a very developer friendly way to

26:34

achieve what you just described. If I

26:37

want to create a demo for

26:39

my new text to SQL model,

26:42

if I use DuckDB for it, I can

26:45

even make completely like

26:47

wasm based demo out of it,

26:49

for example. I don't have

26:52

any issues with CSV

26:54

upload. There might be

26:56

databases where I have to specify the

26:59

delimiter of the file that the user uploads. So

27:01

I would have to show a dialogue to my

27:03

user where he says, oh, that's comma separated and

27:05

it has a header row and

27:08

so on. With DuckDB, it

27:10

just works. So it

27:12

takes away some of the edges

27:14

you might have with other databases.

27:17

And on top of that, as you said,

27:19

it integrates with different

27:21

storage backends like it can read

27:24

from S3, it can read from

27:26

HTTP. When

27:28

I see an interesting file on, let's

27:30

say, Hugging Face or GitHub, I

27:33

just run read CSV from

27:35

this URL and I have the data

27:37

set locally in my CLI or in

27:39

my Python. Furthermore, when I

27:41

have a say,

27:44

a Python environment, I start

27:46

a Colab notebook, right?

27:48

And I create some data frames. With

27:51

DuckDB, I can just read those

27:53

data frames. I've seen very cool demos

27:55

of people basically using text

27:58

to SQL for, yeah, for

28:00

analytics on pandas data frames.

28:02

And under the hood, it's just

28:05

DuckDB sitting there and

28:07

basically reading straight from those

28:09

pandas data frames, which by the way, is

28:11

one of the other benefits of shared

28:14

memory of in process. It's

28:17

not only for fetching results,

28:19

it's also for reading data straight from

28:21

the process. So in that case from

28:23

pandas. That's very

28:26

exciting. I'm happy to talk more about

28:28

text to SQL. We have

28:30

had a project about that at

28:32

MotherDuck. But yeah, yeah,

28:35

and maybe also, before

28:37

we get into maybe some of those

28:40

stories, I think that that's

28:42

one side of it is like the

28:44

integration of this analytics piece into AI

28:47

workflows. But then also, if I'm not

28:49

mistaken, there is sort of vector search

28:52

capabilities within DuckDB as well. I don't

28:54

know if one of you could speak

28:56

to that. Yeah, that's one of the

28:58

exciting aspects of DuckDB as well. So

29:01

if I could take

29:03

a step back and think about other

29:05

ecosystems where let's say Postgres has been

29:07

shining a lot. Postgres has exploded into

29:09

the kind of possibilities that you can

29:11

do because it has kind of like

29:13

an amazing extension mechanism, where

29:15

you could add extensions and capabilities.

29:17

And in a similar

29:21

way, DuckDB has an extension mechanism

29:23

that you have access to the

29:25

internal workings of DuckDB. And you

29:27

could add more workflows

29:30

on top of what DuckDB can

29:32

do, right? DuckDB has these capabilities

29:34

of doing vector search, for example,

29:36

and it also has hybrid search,

29:39

where you also have full

29:41

text search, and vector search

29:43

that you could put together to create

29:45

hybrid search. One of the ways it

29:47

does this is that it has a really nice data type.

29:50

I can go into the rabbit hole of the

29:52

inner workings of how they make this happen, which

29:54

is also pretty exciting. But one

29:57

of the things that makes this possible is

29:59

that they provide an array data type where you

30:01

can have an array of floating

30:04

points, and then you can store this

30:06

as a data type, and then that

30:09

eventually becomes an embedding vector that you

30:11

can do cosine similarity against. So that

30:14

is to do like an embedding-based search. Then

30:16

you can also have full-text search where

30:19

you can create an

30:22

inverted index of keywords to

30:24

your documents, and you can search across your

30:27

keywords to find your ideal

30:29

documents and rank them according to the score. And then

30:32

you could fuse both of these

30:34

scores from embedding search and from full-text

30:37

search to have like a hybrid search.
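A rough, pure-Python sketch of that fusion idea — not DuckDB's internals; the documents, the two-dimensional "embeddings", the query, and the 50/50 score weighting are all made up for illustration:

```python
import math
from collections import defaultdict

docs = {
    1: "ducks love analytics in the pond",
    2: "postgres extensions add capabilities",
    3: "duckdb does analytics and vector search",
}

# Toy 2-d "embeddings" (a real system stores model-produced float arrays).
embeddings = {1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.8, 0.3]}
query_vec = [1.0, 0.2]
query_terms = ["analytics", "search"]

def cosine(a, b):
    # Cosine similarity between two vectors, as used for embedding search.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Inverted index: keyword -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# Full-text score: how many query terms each document contains.
fts_score = {d: sum(d in index[t] for t in query_terms) for d in docs}
# Embedding score: cosine similarity of each document to the query vector.
emb_score = {d: cosine(embeddings[d], query_vec) for d in docs}

# Hybrid search: fuse both scores (here a simple 50/50 weighted sum).
hybrid = {d: 0.5 * fts_score[d] + 0.5 * emb_score[d] for d in docs}
best = max(hybrid, key=hybrid.get)
print(best)  # 3
```

Document 3 wins because it matches both keywords and sits closest to the query vector; real systems often use fancier fusion such as reciprocal rank fusion, but the shape is the same.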

30:40

So yeah, so all of these are possible, and

30:42

they're very accessible. Well,

30:45

there's no shortage

30:47

of helpful AI tools

30:50

out there, but

31:00

using these AI tools means you got

31:02

to switch back and forth, back and

31:04

forth between yet one more tool. So

31:07

instead of simplifying your workflow, it just

31:09

gets more complicated, but that's not how

31:11

it works when you're using Notion. Notion

31:13

is the perfect place to organize

31:16

lots of stuff, tasks, tracking your

31:18

habits, writing beautiful docs, collaborating with

31:20

your team, knowledge bases, and the

31:23

more content you add to Notion,

31:25

the more this cool thing called

31:27

Notion AI can personalize all of

31:30

the responses for you. Unlike generic

31:32

chatbots, Notion AI already has the

31:34

context of your work. Plus, it

31:36

has multiple knowledge sources. It uses

31:39

AI knowledge from GPT-4 and Claude,

31:41

and that helps you chat about

31:43

any topic. And here's the kicker.

31:46

Now in beta, Notion AI can

31:48

search across Slack discussions, Google Docs,

31:50

Sheets, Slides, and even more tools

31:53

like GitHub and Jira. Those are

31:55

coming soon. And unlike

31:57

specialized tools or legacy suites

31:59

that have you bouncing between

32:01

different applications, Notion is seamlessly

32:03

integrated, infinitely flexible, and beautifully

32:05

easy to use. So you

32:07

are empowered to do your

32:09

most meaningful work inside Notion.

32:12

From small teams to massive Fortune

32:14

500 companies, these

32:16

teams, both small and large, use

32:19

Notion to send less email, cancel

32:22

more meetings, save time searching

32:24

for their work, and they

32:26

reduce spending on tools, which

32:28

helps everyone stay on the

32:30

same page. You can try

32:32

Notion for free today by

32:35

going to notion.com/practicalai. That's

32:37

all lowercase: notion.com/practicalai

32:39

to try the powerful,

32:42

easy to use Notion

32:44

AI today. And

32:46

of course, when you use our

32:48

link, you're supporting our show, and

32:51

I know you love that. Again

32:53

notion.com/practicalai. So,

33:05

Till, you were starting to get

33:07

into even some of the things

33:10

now that you're doing at Mother Duck on

33:12

top of DuckDB. I'm wondering,

33:14

hopefully we can get to some of

33:16

those use cases or the

33:18

things that you've been doing with

33:20

customers or internally. But I'm wondering before

33:23

we do that, I

33:25

see also this

33:27

story about DuckDB's

33:30

efficiency, but with this

33:32

multiplayer aspect as

33:35

part of what you're doing at MotherDuck. So,

33:37

maybe one of you could describe, now

33:40

I think we have a sense of what DuckDB is, and

33:43

it's this free thing that is open and

33:45

I can pull down, I can install, I

33:47

can run it very quickly, run it on

33:50

my laptop, run it in my browser, do

33:52

these analytics queries. So, now kind of

33:55

describe maybe a little bit of how

33:57

you're taking that further with Mother Duck,

33:59

and how you're thinking about some of

34:01

the enterprise use cases. I

34:04

like to describe Motherduck

34:06

as giving your

34:09

DuckDB a cloud companion.

34:11

So it's easy

34:14

to think or

34:16

to associate, okay, we

34:19

bring DuckDB to the cloud, which

34:21

is one way we describe ourselves as

34:24

well, to associate that with

34:26

we provide infinite scale up in the

34:28

cloud. You give us a workload and

34:30

we start however

34:32

many hundred ducks in

34:35

the background that, in a

34:37

task-like fashion, let's say,

34:39

process your data concurrently. But

34:42

actually, one

34:46

of the hypotheses that Motherduck

34:49

is based on or that the company

34:51

was founded on is that actually single

34:54

node compute, which means one

34:56

DuckDB database with

34:59

today's cloud hardware

35:01

actually gets you very, very,

35:03

very far. So

35:06

when your local compute resources

35:10

reach their limit, you

35:12

have

35:14

single cloud instances with up

35:17

to, how much is it, 24 terabytes

35:19

of memory, that's

35:21

relatively big data. So

35:24

that's one aspect, right? So scaling up

35:27

with one cloud-companion DuckDB.

35:29

Another aspect is

35:32

that collaboration. So once

35:34

you're connected to a cloud

35:36

instance, you can have

35:38

shared context with other users in

35:41

your organization, you can create

35:43

shared data sets, you

35:45

can have shared notebooks, and

35:48

so on and so forth. And with that,

35:51

of course, comes all the enterprise SOC

35:53

2 kind of things that

35:55

some of the enterprise customers require

35:58

to adopt a tool like this. Thank

36:00

you. I'm curious if you

36:02

could, uh, that you really

36:05

captured my imagination with that, uh,

36:07

that description. And so like, because, you

36:09

know, by drawing, for instance, with kind

36:11

of, you know, the old school postgres

36:13

things that people would do with that.

36:15

And you just talked about having many

36:18

DuckDB instances operating

36:20

concurrently, you know, what

36:22

kinds of problems and kind of, you

36:24

know, grounding it in a, in a practical

36:26

way for, from a user's perspective, what kind

36:29

of problems, uh, do you see

36:31

people solving with that kind of architecture, um, and

36:34

that new capability that they may not have

36:36

historically had over the years with previous database

36:39

capabilities on other platforms. What new

36:41

sets of concerns can they address

36:44

now with those? I would

36:46

come from the perspective on this, that,

36:48

um, there are a lot

36:50

of companies out there that when they

36:53

want to go to the cloud

36:55

with analytics workload, they have relatively

36:57

limited choices. One of

36:59

those choices is, uh, like snowflake or

37:02

data bricks. And they

37:04

of course are optin those systems are

37:06

optimized for big data scale. So,

37:09

but then one

37:11

of our observations is that a lot

37:13

of companies actually don't have

37:15

that amount of data when

37:18

they run queries or they might have big

37:20

data, but the queries they are running,

37:23

uh, only access a very small subset

37:25

of the data. For example, you

37:27

know, you run, um, monthly reports,

37:30

they don't touch your entire

37:32

historic dataset. So those

37:35

companies, um, might

37:37

want to have something that is easier first, easier

37:40

to use, easier to

37:42

set up. And that's also more cost

37:44

efficient than other existing solutions. One

37:47

of the things that we haven't touched

37:49

upon in this yet is kind

37:52

of how MotherDuck and DuckDB

37:54

go hand in hand with like the

37:56

remote and the local aspect where you

37:59

have, on your local and

38:01

your remote, the same client, so

38:03

that you're running the same thing.

38:05

It's easy to go from one place to the other doing

38:08

the same thing. What Motherduck

38:10

also provides is a dual execution

38:12

where your local DuckDB,

38:14

if you're running it locally, can

38:17

communicate with your remote Motherduck

38:20

and execute seamlessly between

38:22

both. For example, a

38:24

query where you have a table in

38:27

your local DuckDB and you want to

38:29

join it with a remote DuckDB, you

38:31

can join both of these tables

38:34

together to run an aggregate. Then

38:36

there's a query optimization that we

38:38

run where we transfer

38:40

the data which was required from the remote

38:43

to your local or from your local to

38:45

remote and execute it intelligently in

38:47

a way, if I could say that. This

38:50

opens up new opportunities in the

38:53

dual execution aspect of running your

38:55

local and the remote with the

38:58

same client. I'm curious, and this is a

39:00

selfish question: as you're doing that and you

39:02

have the local version and the remote version,

39:05

the connection between the two there. What

39:08

does that look like? Is it something

39:10

that if they're widely separated, if Motherduck's

39:12

in the Cloud and I'm out on

39:15

a device that's not Cloud-based, is

39:17

that efficient communication? How do you all handle

39:19

those different types of use cases? One

39:23

of the principles of this dual execution

39:25

is to reduce

39:27

the amount of data that

39:29

has to be transferred as much as possible.
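A toy sketch of that planning rule — the function name and byte sizes are invented, and MotherDuck's actual optimizer considers much more than raw table size — but it captures the idea of shipping the smaller side of a join:

```python
# Toy sketch of the dual-execution planning rule: move the smaller table
# to the side holding the bigger one, so the join transfers minimal data.
# Sizes are in bytes; names and numbers are illustrative only.

def plan_join(local_table_bytes: int, remote_table_bytes: int) -> str:
    if local_table_bytes <= remote_table_bytes:
        return "upload local table, join remotely"
    return "download remote table, join locally"

# A ~1 TB dataset on S3 vs. a small table on a laptop:
decision = plan_join(local_table_bytes=5_000_000,
                     remote_table_bytes=1_000_000_000_000)
print(decision)  # upload local table, join remotely
```

Flipping the sizes flips the decision, which is exactly the "execute intelligently between local and remote" behavior described here.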

39:31

One of the use cases, for example, is

39:33

I have a really large dataset

39:36

on S3 and I want

39:38

to join it with a small table that

39:40

I have on my notebook. In

39:44

that case, an optimizer, query

39:47

optimizer will make the

39:49

decision to, instead of downloading

39:51

the one terabyte dataset to

39:53

your local device and doing the join there,

39:56

to instead upload your small local file

39:58

to the cloud and run the join there.

44:00

It's pretty cool

44:02

what you've talked about today in terms

44:04

of what is possible for us. How

44:06

are you thinking about the future? What

44:09

are the new cool things that you

44:11

have in mind? I often say when

44:13

you're not necessarily working hard on a

44:15

problem, but you're chilling out at the

44:17

end of the day and your

44:19

mind is just wandering in free form and

44:21

you're thinking, boy, what if we could do

44:23

this? I could imagine that and I can

44:25

see a path forward to get there. How

44:28

are each of you thinking about

44:30

Mother Duck and Duck DB in

44:33

terms of what the future might offer

44:35

if you want to

44:37

get out there and wax poetic a little

44:39

bit and it doesn't have to

44:41

be grounded in current work, but more in imagination

44:43

and aspiration? One of the things that

44:45

I really like about the current state of AI is

44:49

how good the local models are, the small models

44:51

that you can run locally. There's

44:53

a great ecosystem out there building on top

44:55

of that. One of

44:57

the things that I see with the

44:59

local models is that, of course, they hallucinate. But

45:02

to prevent hallucination, you can use a

45:04

really nice RAG mechanism to put context

45:06

into those local models. These

45:09

local models could be on the edge as well. It could be

45:11

on your local laptop. It could be on

45:14

the edge. Knowledge

45:16

bases are essentially created

45:18

to prevent these hallucinations.

45:21

One wasteful aspect of creating

45:24

knowledge bases is that everybody's

45:26

creating very similar knowledge bases.

45:29

What if there could be a mechanism where

45:31

we could share these knowledge bases?

45:34

A user could create a knowledge base and

45:36

they could share a knowledge base. One of

45:38

the imaginative worlds that I've

45:40

dreamt up is how Mother Duck could

45:42

be there to do these kind

45:44

of shareable knowledge bases where you

45:46

essentially have a world of

45:49

remote knowledge bases out there in your

45:51

remote tables. Then you have

45:53

a local DuckDB client that helps you

45:55

pull a knowledge base that you want,

45:58

use the local knowledge base, and augment

46:00

your local model with the

46:03

relevant context for your current

46:05

question. And then when you don't want the

46:07

knowledge base, you could also drop the knowledge

46:09

base. And that's like having a remote

46:11

knowledge-base repository, and pulling

46:14

whatever you want. This is like

46:16

one of the dreams that I

46:19

think about how Motherduck and DuckDB

46:21

could be useful for this. And

46:24

another aspect of talking

46:27

about knowledge bases and RAG applications

46:29

is that not all

46:32

applications and workflows require

46:34

a real-time database to

46:36

build agents on top of them. And

46:38

some of these agents could be running

46:40

as background agents that do some workflow

46:43

once every day. And instead

46:45

of having a real-time database for that, what

46:47

if you could provide a very lightweight analytical

46:50

engine that's quite cheap to run locally as well?

46:52

And that could also, you could offload some

46:55

work to the remote cloud. So

46:57

this is another thing that keeps me excited

47:00

at night to think about what could

47:02

be these kind of use cases. But

47:04

yeah, these are the two use cases that I

47:06

am quite excited about. Yeah, I

47:08

mean, maybe I can add

47:11

two things. One

47:13

thing that actually connects to that is

47:18

bringing AI and machine

47:21

learning capabilities more into

47:23

the database. So one of the

47:25

things we've seen in the

47:27

past is that the inference costs

47:29

of language models have

47:31

dropped quite significantly compared

47:33

to two years ago. It's

47:36

now, I think, only

47:38

2% of the price

47:40

for inference with GPT-4o

47:43

mini compared to GPT-3. And

47:47

that actually makes it possible to

47:49

run language model

47:51

inference on your tables and

47:55

also to do things like embedding

47:58

compute on your tables. SQL

48:00

is just a really, really convenient user interface

48:02

for that. So we added this embedding function

48:05

some time ago that works really well together

48:07

with a vector search. So you can basically

48:10

do embedding based search only

48:12

in SQL. Now we're adding

48:14

the prompting capabilities so you can do

48:17

language model based data wrangling in your

48:19

database and that together with

48:22

local models and this

48:24

hybrid execution model. We say, okay, we

48:26

do part of the work locally. Maybe

48:28

if you have a GPU, do part

48:30

of the embedding inference locally. If

48:32

you want to do it faster, do it in the cloud

48:35

with a few A100s. And

48:38

again, everything is in SQL.
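To sketch what row-wise embedding and prompt functions amount to, here is a small Python illustration; embed and prompt are stubs standing in for real model calls, and the table and the SQL in the comment are hypothetical rather than MotherDuck's actual API:

```python
# Sketch of LLM-style functions applied row-wise over a table, the way an
# embedding() or prompt() SQL function would be. Both model calls are stubs.

rows = [
    {"id": 1, "review": "great battery life"},
    {"id": 2, "review": "screen cracked fast"},
]

def embed(text: str) -> list[float]:
    # Stub embedding: cheap character-count features instead of a real model.
    return [float(len(text)), float(text.count(" "))]

def prompt(instruction: str, text: str) -> str:
    # Stub "LLM": a keyword rule standing in for model inference.
    return "positive" if "great" in text else "negative"

# Roughly: SELECT id, embedding(review), prompt('classify', review) FROM rows;
enriched = [
    {"id": r["id"],
     "embedding": embed(r["review"]),
     "sentiment": prompt("classify sentiment", r["review"])}
    for r in rows
]
print(enriched[0]["sentiment"], enriched[1]["sentiment"])  # positive negative
```

The appeal of doing this in SQL is that the per-row model call becomes just another scalar function, so the database can decide whether each call runs locally or in the cloud.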

48:42

That's awesome. Yeah, well, thank you both

48:44

for taking time out of your analytics,

48:47

AI, database work to come

48:49

talk to us. This has

48:51

been super amazing. And

48:53

I would definitely encourage people out there,

48:55

please, please, please go try out some

48:57

things. Try out some examples

48:59

with DuckDB. Check out the Mother Duck

49:01

website and some of the great blog

49:04

post content that they have there, examples

49:06

or things that they're doing. Check it

49:08

out because it's definitely a really

49:11

wonderful thing that you can add into your

49:14

AI stack and think about and experiment with.

49:16

So thank you so much, Till and Aditya,

49:18

for joining. It's been a pleasure. Thank you

49:20

guys for having us. Thank you, guys. It

49:22

was pretty awesome to be here. Thanks

49:55

again to our partners at fly.io

49:57

to our beat freak in residence, Breakmaster

49:59

Cylinder.
