Vinoth Chandar - The Future of Open Data Lakehouses

Released Tuesday, 1st April 2025

Episode Transcript

0:00

How's it going? Good. All right, let's do this. Let's do this. Yeah, so it's good to see you again. How have things been?

0:09

It's been good. If I remember correctly, we met once at that data engineering conference in India, right? I think. Yeah, yeah. So, yeah. Yeah, it's been good. Lots of things happening last year. Continuing to build, you know, Onehouse and the lakehouse and whatnot. So, yeah, love to dig in and just kind of get into it. Yeah, super cool.

0:34

Yeah, you've been up to a lot. I guess I'll kick things off first. For people who don't know who you are, do you want to give a quick intro?

0:45

Folks, my name is Vinoth, and I currently founded a startup called Onehouse, which essentially provides an open data lakehouse as a foundation for data infrastructure. We call it the Universal Data Lakehouse, and it should become clear why as the show goes on. My background before this: I've worked in data infrastructure for quite a bit. Right before this I was at Confluent; I was a principal engineer working across Kafka, streaming, Connect, and a bunch of different areas there. What brings me to the lakehouse space is my work before that, at Uber. We built the world's first data lakehouse, which should probably be a fun bit of trivia at some point as the space builds up. It wasn't called the lakehouse back then; we called it a transactional data lake. But we built the first production data lakehouse operated at scale, with open data formats across multiple engines, each for its own use case, and through this came the Apache Hudi project. I continue to lead that project in the Apache Software Foundation as PMC chair, and I'm now also involved with another project in the foundation, Apache XTable (incubating), which is bringing interoperability and brokering peace across the open data ecosystem.

Before that, I led key-value storage at LinkedIn. That was a fun experience for me: back in the day we built this key-value store called Voldemort, a Dynamo-style store, and I was the tech lead for that. Like I said, it was a good experience scaling a system like that for a very popular website like LinkedIn. And before that, I was at Oracle doing database replication, CQL, Streams, GoldenGate replication software, things like that. So that's kind of my background.

2:38

Okay, so just dabbling in databases. Just kidding.

Yeah, I generally describe myself as a one-trick pony; I actually don't know anything beyond this. That's actually how I look at it. Even Onehouse I started because, okay, I've been fortunate to be part of these very large, behemoth data companies, LinkedIn and Uber, and when I was thinking about what to do next, I actually didn't know anything else. It felt like I was going to go somewhere else and build the same data platform again. And then I saw people building the same data platform over and over again, duct-taping open source projects together. So we said, hey, the industry does not have a data platform that comes prepackaged, a bundled platform if you will, built purely from open source foundations, that can provide the same ease of use as you would expect from cloud managed services. That's actually how I even got going. So you could say that my background led me to this. Yeah, right.

3:47

So yeah, that's awesome. And then, I guess, what's interesting to me is that you worked at Uber on what you called a transactional data lake, I believe is what you said. What were the insights behind that? Around what year was that?

4:01

This was 2016, when we built it. We opened the project up in early 2017, after we got most of our critical data onto it.

4:15

Yeah. Because before that, data lakes had been popular for a bit, and then, I don't know, they didn't disappear, but they weren't as popular. Then, toward the end of the last decade, I started reading some of the Uber blogs that were sort of hinting this data lake was coming back, but it was resembling more of a database, and so it's kind of interesting. What were some of the insights that led you to want to build something like that?

4:47

Yeah, to be honest, we were just trying to solve a business problem. We weren't trying to build a new industry category or start a company, nothing like that. The problem we had was pretty common. You had a database; and remember, when we started this at Uber, like I mentioned, I had just been working on NoSQL key-value stores and everything at LinkedIn before that; that was primarily what I did. So the operational data was scaling out: we were moving away from purely relational databases, and plenty of companies needed these operational databases, which are scale-out databases now. And we had streaming data. Another thing I had front-row seats to was the rise of Kafka at LinkedIn, tuning the Kafka clusters, the JVM, and all that back in the day. So there's a lot of data, right?

5:47

The problem we ran into at Uber was that it was a highly real-time business. Literally, weather or many other factors could just change the dynamics of how the business operates in real time. And we had the scale-out database, which was storing all our trips, transactions, all of the core business data, and we wanted to just ingest it downstream to a warehouse. We had an on-prem warehouse at that point, and the warehouse was another specialized database, right? So you just do database-to-database replication. It was fine until it couldn't fit: either the scale, or it was too closed, so we couldn't fit new use cases on top of our warehouse without running many instances of the warehouse or making parallel copies of the data. So we needed a scale-out compute processing architecture, which is what the MapReduce data lakes of the world already did at that point; Spark was just coming up then. And you had HDFS or S3. So you had scale-out compute, and you had scale-out storage.

6:54

But what you needed now were all these database-y primitives that the warehouse had and you did not have on the lake. So we borrowed all those transactional primitives. Our on-prem warehouse had updates. It had a notion of an index; they called it projections. And it had all these different ways of handling writer and reader concurrency at scale, and it actually had a bunch of services that continuously optimized the data. So it had a database runtime, right, that was actually managing the data and presenting clean tables for you to consume. That's why we said: okay, we have a data lake; we're going to retain its core compute and storage scaling aspects and just borrow the transactional aspects. And we called it a transactional data lake. That's kind of what we did.

7:45

And it helped us ingest all this high-scale data, whether high-scale NoSQL database change logs, RDBMS change logs, or high-scale streaming data, into one central layer of data. From there you could send it downstream to various data marts, if you will; I don't think anybody uses these terms anymore, but you could build real-time data marts, for example, or one for data science. We could even offload the ETLs from our warehouse, which was pretty expensive, to the lake, and then just copy over the final serving of the rolled-up tables; we could move our data modeling to the lake and do it in a higher-scale fashion. We built this as the central, I would say maybe, watering hole where the whole company can come for clean data, and from there it goes to many different use cases. That's kind of the overall architecture.
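(For illustration, not from the episode: a minimal sketch of what this upsert-style ingestion looks like with Apache Hudi on Spark today. The table name, fields, and path are hypothetical; the options used are Hudi's standard Spark datasource options.)

```python
# Minimal Hudi upsert sketch (PySpark). Assumes the Hudi Spark bundle is on
# the classpath; table name, fields, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-cdc-ingest")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# A batch of change records pulled from an upstream database's change log.
changes = spark.createDataFrame(
    [("trip-001", "completed", "2016-07-01 10:05:00"),
     ("trip-002", "created", "2016-07-01 10:06:00")],
    ["trip_id", "status", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    # Record key: uniquely identifies a row, so updates land in place.
    "hoodie.datasource.write.recordkey.field": "trip_id",
    # Precombine field: when two versions of a record collide, the one with
    # the larger value wins.
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# Each write is a transactional commit; readers only see committed data.
(changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lake/trips"))
```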

8:50

Right. And now that paradigm has kind of taken over; I guess it's now called the lakehouse. So walk me through the evolution, from what you just described to maybe Hudi, and then to today.

9:04

Yeah, yeah, that's actually been interesting. So honestly, when we were building it, at the core of it we had to solve three problems. We had to provide a way for you to mutate data. Data lakes were a bunch of files that you throw into a distributed file system or a cloud storage bucket, and then you somehow deal with it; that's how it was. So we needed to bring some transactional boundaries, that's first, and then, while doing that, you needed to provide the ability to do fast updates. And then the one thing that we did very early on, from the first version, was give a change log on the other side. We wanted it to be like a database table: just like you could CDC from an upstream database, you should be able to CDC from this lakehouse table as well, and then build downstream.
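(For illustration, not from the episode: a minimal sketch of the "change log on the other side" idea, using Hudi's incremental query on Spark; the path and commit time are hypothetical.)

```python
# Minimal sketch of "CDC from the lake": an incremental query returns only
# the records that changed after a given commit, like tailing a database
# change log.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-read").getOrCreate()

incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    # Only records committed after this instant (a prior Hudi commit time).
    .option("hoodie.datasource.read.begin.instanttime", "20170101000000")
    .load("s3://my-bucket/lake/trips")  # hypothetical table path
)

# Feed just the delta to the next table downstream instead of rescanning
# the whole dataset.
incremental.show()
```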

9:59

It was actually super controversial when we built it. We open sourced it early 2017, because it seemed like a general enough problem, and we had conviction internally in our team at Uber that, oh yeah, this is where it will go. For a year or so, it was: okay, this is a nerdy thing these Uber engineers built. It wasn't much more than that. But slowly, after a year, people started running into similar issues. You see a lot of companies that had a lot of transactional data suddenly realize: I fulfill an order today and the package gets returned; or the payment won't complete, you retry the card again, try an alternate payment. Data is mutating. So slowly people started realizing, okay, data is mutating, and this is an efficient way to handle that.

10:50

So a community started forming. The main thing that pushed this into general use, to me, was the GDPR thing, because suddenly that made everybody realize, okay, you can't just dump a bunch of data. It generalized this problem from a CDC-and-updates problem to: no, no, this is an overall data management problem. Whether you have CDC or append-only data models, you need to be able to delete stuff. And if you need to delete stuff within large amounts of data, you need all these things: you need an index, you need all the management, you need to mutate, and you need to be able to produce a change log to propagate it downstream. And that suddenly generalized.
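(For illustration, not from the episode: a minimal sketch of a GDPR-style delete on a Hudi table; names and paths are hypothetical.)

```python
# Issue the keys to erase as a write with the "delete" operation; Hudi's
# index locates the affected files, so the whole table is not rewritten.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-gdpr-delete").getOrCreate()

# Records of the user who requested erasure.
to_delete = spark.createDataFrame(
    [("user-42", "2018-05-25 00:00:00")], ["user_id", "updated_at"]
)

(to_delete.write.format("hudi")
    .option("hoodie.table.name", "user_profiles")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save("s3://my-bucket/lake/user_profiles"))
```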

11:41

That was 2017 and 2018, as I described it. 2019 is when Databricks open sourced Delta Lake, and the lakehouse paper and all of that happened in 2019. And then Amazon: we did an integration with Hudi, and Hudi was bundled into EMR; it still comes pre-installed on all these AWS services. And then a whole bunch of companies started building these transactional data lakes; you can actually find this terminology in a lot of AWS blogs and whatnot. So for a while it was "transactional data lake." You know how this space goes; there are a lot of marketing terms. To talk about the same thing, we invent some ten terms. It's like that.

12:27

But then, even when we started the company... I couldn't start the company for a couple of years, I didn't have a green card or whatever, so we could probably have started before. But anyhow, we got started in 2021, and originally our company name was actually Infinity Lake. Then we went out and talked to people, and they were like, what's a data lake? Everybody thought they had a data lake already. And we were trying to say, no, no, that's not a real data lake: you don't have good schemas, you really can't consume this data. It's the classic swamp-versus-lake type of thing. Then we actually realized that what Databricks had done was really good: they had given it a new name. And as an engineer, I gained my first appreciation for marketing at that point, which was: yeah, this needs a new name.

13:25

That's when we announced Onehouse; we were the second vendor in the market after Databricks to say, okay, we are a lakehouse provider. Now there's a big crew; everybody's jumped onto that bandwagon from there. And there is also the table format conversation, which was famously kick-started, I think, with most of the momentum coming from Snowflake. That's caused a whole bunch of confusion right now: what's a table format, what's a lakehouse, do I just do open table formats, and then what do I do? It's been pretty interesting for me.

14:08

Obviously there is the open source side: Hudi, Iceberg, Delta. We're continuing to compete healthily there; we are innovating, and we just had our release where we have indexes and whatnot, and we can go deeper into that. That's one side of the coin, the open source innovation side. But on the other side, I think something like Iceberg support on Snowflake, for example, is great, because it finally opens up the Snowflake compute engine to data that is not in Snowflake's proprietary format. Same for BigQuery.

14:43

So overall I would say it's moved to a healthy place now, where open data, or what used to be called external tables back in the warehouse world, is a mainstream thing. For example, BigQuery and Snowflake and Redshift are all improving to make sure performance is good on external tables versus native ones. It took a couple of years, I don't know why, for Snowflake to get there, but BigQuery did it within six months of launch. So it's definitely in a healthier place now, with open table format support in all of these warehouses, and they're generally supported across any open source query engine or Spark platform.

15:25

We are in a really good position as an industry, I think, to start thinking about what this new-world data architecture should be, because now the customers are saying: I don't want to store the data five times. If you want to go as far, I don't think it's even good for the planet in some sense: how many copies do you want to store, how many servers do we want to run? It makes no sense. The volume of data is so high that we should be using the right engine for the right workload. Even if it's 30% better, that means it's going to save you money and lower your compute footprint, which is generally better. So we need to figure out how to move from here to that architecture where the data remains the foundation. Let me put it this way: instead of trying to sell a warehouse or compute engine software top-down, we should think about the data going up, and that is the position we hold. Engines come and go; there will be new engines all the time, but your data is the constant across them. The bytes that you write today should be readable five years from now. So how do we set ourselves up for that? That, to me, is where the future challenge lies.

16:52

That's super interesting. I like that vision a lot. We'll come back to that in a second, but, by the way, congrats on the 1.0. That's awesome. What happened there?

17:04

Oh yeah, that's a good one, and it touches upon the whole database-y aspect of it. So 1.0 for us is more about this: the project's been around from 2017, a good six, seven years now, and since then we've had a whole bunch of first-mover disadvantages, if you will. For example, a bunch of these compute frameworks weren't even ready to take these kinds of changes back in the day when we did all these things. So for us, 1.0 is a reimagination of the project, broadly, right from storage through concurrency control. And we've tried to apply the lessons we learned from the past few years of working with so many companies in open source running these mega data lakes.

Just to summarize: one, we realized a lot of the processing is columnar, but the storage is not, in a true sense, because people have wide tables in which only a few columns are changing. So we made the storage actually very efficient: it can encode partial updates now, which cuts down significantly the amount of data that you store, rewrite, store, rewrite. That's an example of a fundamental storage change.
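(For illustration, not from the episode: one way partial updates surface to users in recent Hudi releases, via a partial-update merge payload; the 1.0 storage-level encoding itself is internal. Names and paths are hypothetical, and the payload class named here is an assumption based on what recent releases ship.)

```python
# Update only the columns that changed on a wide table. With a partial-update
# merge payload, fields left null in the incoming record keep their stored
# values instead of being rewritten.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("hudi-partial-update").getOrCreate()

schema = StructType([
    StructField("trip_id", StringType()),
    StructField("status", StringType()),
    StructField("fare", DoubleType()),
    StructField("updated_at", StringType()),
])

# Only "status" changed; "fare" is null, so the stored fare is preserved.
updates = spark.createDataFrame(
    [("trip-001", "completed", None, "2016-07-01 10:07:00")], schema
)

(updates.write.format("hudi")
    .option("hoodie.table.name", "trips_wide")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    # Assumed class name: merges non-null incoming fields into the stored
    # record rather than replacing the whole row.
    .option("hoodie.datasource.write.payload.class",
            "org.apache.hudi.common.model.PartialUpdateAvroPayload")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-bucket/lake/trips_wide"))
```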

18:20

The second thing we realized: we call it the lakehouse, and it gets compared to a database and all that. But this is not an RDBMS. It's not an online store, and lakehouses are not used to build apps, consumer-facing or even internal ones. At the end of the day, we run jobs, essentially. There are only long-running transactions here. When I was at Oracle, if a transaction took more than a minute, we would classify it as a long-running transaction or something; here, everything is a long-running transaction. And the concurrency control we had before, optimistic concurrency control, basically says: I'm going to assume there's no concurrency, and if there is, I retry. That retrying wastes compute cycles, because these jobs run for minutes and hours in a lot of cases; that's just a lot of work to retry. So we came up with what we call non-blocking concurrency control, which incorporates some techniques from stream processing, which has matured. It's a different kind of concurrency control that lets multiple writers continuously write without blocking on each other. I think it's more suited to the kinds of workloads a lakehouse runs. That's two.
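(For illustration, not from the episode: a sketch of the writer configuration involved, with the concurrency-mode value taken from the Hudi 1.0 release notes as I understand them; treat the exact keys and the bucket-index pairing as assumptions.)

```python
# Two independent jobs can keep writing the same merge-on-read table without
# taking a table-level lock on each other; conflicting records are resolved
# at merge time instead of failing and retrying a whole job.
writer_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Hudi 1.0's non-blocking mode (vs. OPTIMISTIC_CONCURRENCY_CONTROL).
    "hoodie.write.concurrency.mode": "NON_BLOCKING_CONCURRENCY_CONTROL",
    # Non-blocking writes are documented to pair with the bucket index.
    "hoodie.index.type": "BUCKET",
}
# Each concurrent writer passes these options:
#   df.write.format("hudi").options(**writer_options).mode("append").save(path)
```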

19:31

Number three: indexes. We've introduced database-like secondary indexes to the lakehouse. It still has the same architecture: the indexes and the data and metadata sit on S3 or cloud storage, and compute and storage are decoupled. But now you're able to get massive speedups on things like point lookups. It still won't be the same performance as your Postgres, but that's not the goal. It narrows things down and unlocks new use cases: you can do a needle-in-a-haystack search for a few transactions in a large table, you can do joins more effectively, things like that. So we introduced a bunch of indexes.
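(For illustration, not from the episode: a sketch of creating a secondary index through Spark SQL, following the syntax in the Hudi 1.0 docs as I understand it; the table and column names are hypothetical.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-secondary-index").getOrCreate()

# Build a secondary index on a non-key column of an existing Hudi table.
spark.sql("CREATE INDEX idx_rider ON trips USING secondary_index(rider_id)")

# Point lookups on rider_id can now prune down to the few files containing
# matching records instead of scanning the whole table.
spark.sql("SELECT * FROM trips WHERE rider_id = 'rider-123'").show()
```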

20:16

And the fourth thing is a lot more intelligence we've absorbed into storage. One thing people use Hudi a lot for is dealing with late-arriving data and processing it. For example, downstream you get events out of order: say you get the order-created event after the order has already been processed. If you just process the data in the order it arrives, you're going to lose the fact that the order was processed; you're just going to say this order is in the created state. So we pushed that intelligence into storage, where Hudi can actually understand and resolve records by a business field; this is called event-time processing in the streaming world. We've incorporated some of these things that people routinely need to write these kinds of processing pipelines, because people constantly do things like run a main writer that is pumping new results while there's a backfill writer, or something that's deleting records. What if you update some value and then delete it at around the same time, and you don't want the delete to apply, since the updated value flips and invalidates the delete condition? So we introduced a lot of intelligence to deal with record-level merges for these kinds of practical scenarios that people run into.
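(For illustration, not from the episode: a sketch of event-time resolution using the ordering field, assuming the event-time-ordering merge behavior that recent Hudi releases default to; names and paths are hypothetical.)

```python
# A late "created" event arrives after the order was already stored as
# "processed". Because merging is resolved by the business timestamp, not
# the arrival order, the processed state survives.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-event-time").getOrCreate()

late_batch = spark.createDataFrame(
    [("order-9", "created", "2024-01-01 09:00:00")],  # late arrival
    ["order_id", "state", "event_ts"],
)

(late_batch.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    # Ordering field: the record with the highest event_ts wins the merge.
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-bucket/lake/orders"))
# If order-9 is already stored as ("processed", "2024-01-01 09:05:00"),
# this late event loses the merge and the stored state stays "processed".
```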

21:44

So it's a very exciting release. We landed most of the storage changes in this one, and we are continuing toward a 1.1 where we also slowly rewrite the software layer on top, because there really is a lot of software in the stack, not just the table format layer. So we're rewriting a lot of that too, like I mentioned, taking advantage of newer APIs in, let's say, Spark or Flink, that we didn't have when we originally created the project. That's very exciting for me personally.

22:16

How do you know... I've always wanted to ask somebody this: how do you know when you've hit 1.0, the major version number?

Yeah, great question. At least for us, we didn't think of this as a typical enterprise-software 1.0 type of thing. I get that question a lot. We've had backwards compatibility and all these different things, and the format's been stable for a long, long time. For us, we bumped it up just to tell everybody: hey, this is a major change; we've changed the core metadata log, the core transactional log. That's how we did it. A lot of projects do it from the standpoint of "the format is now considered stable and deemed stable, so you can go use it"; I think the Iceberg project did it that way, for example. There's no one rule around this.

Right, but to your point, for enterprise software it is commonly the latter; that's sometimes how people perceive it: it's not stable until you hit 1.0. Interesting.

23:26

And then I noticed in the release notes there was something called the data lakehouse management service, or something like that, kind of at the top of the pyramid there. What's that?

Yeah, great question. So one thing that's been bugging me for a while is that I feel we are, in slow motion, trying to build a database. That's what I feel. Because if you think about it, what's the lakehouse architecture? Essentially: open formats, fine; data and metadata on scalable storage; and stateless compute running on Kubernetes or YARN or one of these resource managers. That's basically what it is, right? But there's no reason this cannot be packaged as a piece of installable software, where I can run docker compose up or something and bring it up. See, if you want to install a warehouse, you can probably download ClickHouse or something, unzip it, install it, and there's a server you can send queries to and interact with. There's no such thing for a lakehouse. Essentially you take, say, Hudi on Spark, you write pipelines, you need to register tables to a catalog, and then you need to do some ten things to get a coherent set of functionality together: ingest, write, transform your data, query your data, index and optimize, with some monitoring on top. There is no real database experience.

So here's essentially what's going on right now, if you look at something like Snowflake and Iceberg as an example. You have the warehouse, the DBMS or warehouse compute stack, which does all of what I described and is packaged really well as managed software; then you have an open table format and a closed table format, or storage format. That's the level we're at. But this DLMS is arguing for a whole open stack that I can take and deploy in my Kubernetes, and it behaves like a database, like how you would run a Cassandra or a ClickHouse or some distributed cluster in your environment, but it's built on top of the lakehouse architecture and its technical foundations. It's a packaging problem, actually, in my opinion. And we're probably missing a few components: a high-performance metadata layer and a caching layer are still missing from this.

25:53

So if you look at our 1.0 RFC, when we started designing indexing and all these different things I talked about, we baselined on a blueprint, a general architecture for a database, and you can see the standard components, and which ones are missing. But we've still not packaged it as a database, right? We're still handing out jars, libraries that you pull into other compute frameworks; that's the second layer in that pyramid. I think people need an easier way to stand the open source lakehouse up. That's going to broaden the reach of the lakehouse and make it the staple thing you start with, instead of going to a warehouse because it's easy to use and then taking on a migration project later. Basically, it's an ease-of-use packaging problem that needs to be solved, in my opinion.

26:51

So was the intention, then, that it'll always be whatever variety of open source you want first, pick your flavor, but under this sort of lakehouse umbrella? What's the vision for that?

The vision is that you need to be able to have a one-click, open source lakehouse that you can download and run. And the experience you get should be just like, let me pick something, Postgres or MySQL. You install the thing, there's a server running on a port, and you're able to point your data frame at it: you can write a Python data frame or something, send logical plans, execute them, and get results back, or speak SQL over standard ODBC. And while you're doing all this, there are, let's say, the equivalents of the Postgres daemons or the MySQL buffer threads under the hood, making sure tables are well optimized, all of that happening for you automatically. Today we do a whole bunch of this; for example, the Hudi Spark or Flink writers will do some of it for you, but it's not really packaged this way. So a lot of times people say, I don't know how to tune these table services and stuff, but without them you're not going to get good performance. That's why I say it's a packaging problem.

So we need the experience to be something like that, where you can interact with it, the underlying storage is lake storage, and everything else remains the same; it's open. I think the current way of interacting will remain there for a while, maybe even forever, but we just want an alternate way of interacting with the lakehouse that is far easier. Somebody who understands an RDBMS today cannot easily build a lakehouse; it's kind of hard. You need to duct-tape a lot of things together, and it can take you for a ride. We just want that easy experience for them. That's basically what I mean. At the end of the day, what has changed? Storage and compute separation, and columnar file formats. We solve a lot of the same database problems, indexing, writing, concurrency control; we just do them differently here, for good reasons. That's it. Otherwise, it's the same SQL queries or data frame programs that you're writing against.

29:18

It's an interesting vision, because of the way I view the lakehouse right now. If you were to take the current state and say this is how it's going to be from now till kingdom come, it's cool, but it also feels like we're not all the way there yet, because I still have to go find a query engine, and typically it's like, geez, which walled garden do I want to go partner with? Well, at least I get to use my data in an open format. So it's this kind of schizophrenic relationship with my data and the provider. I think what you outlined seems like a more complete vision toward a more open way of interacting with data. The way it's done now, like I said, feels like it's only part of the way there; I feel unsettled with it, for whatever reason.

30:14

Okay, this is a great talk, and there's a lot to unpack here. First, what I'd admit is that what I'm describing probably gets you 70% there; it's not fully going to address the thing you talked about. So let me separate these. One: the first problem you mentioned was that you still need to pick a query engine. This will solve that problem, in the sense that there is a SQL interface; your lakehouse can be installed with a command, you can have a Helm chart running, there's something running on a port that you can send SQL queries to, and it has a SQL query engine. We can solve that problem. But even these SQL engines, I think, need to evolve. If you want to eliminate the choice entirely, as in "I don't want to pick one, I just want to stick with this," then we need to solve a lot of hard compute-engine problems. For example, take warehouses: Snowflake, at least going by what I can know from their paper, does push-based processing, so they're good for lower latency on interactive queries, the classic warehouse workload. If you look at something like Spark, it shuffles data to disk between stages; that's great for pipelines, because pipelines need the resiliency to retry and whatnot. This is where I don't know when those gaps will converge. For it to fully converge to "this is the one open thing you need, start with this," it's going to be a while.

32:03

From an industry perspective, there's the walled-garden comment that you made. If you look at it right now, we draw all these diagrams: here is the open table format, or Iceberg, with read-write arrows everywhere. But actually only one side can write and everybody else can read, because there is a technical problem called the catalog. We've been very distracted by this table format debate. Here, at least, all the data is simply in Parquet with some metadata around it; it's a solved problem. We solved it with XTable; Delta tried to solve it with UniForm. But the catalog problem is a very gnarly one: for, let's say, Snowflake and BigQuery to safely write to a single table, they need to agree on a single catalog. There needs to be one party designated to do concurrency control across those two writers. I don't know how we're going to solve that problem, because it needs vendors to collaborate perfectly with each other. So until then, I feel, decouple the storage; that's kind of simple, and everybody will agree on that.

33:15

But also, the lock-in is not just in storage. I like that comment so much. The lock-in is not just in storage but in all your core compute. In any walled garden you're in, you still need to ingest your data, you need to transform and do your data modeling, build your fact and dimension tables or whatnot, then optimize it all centrally, do your GDPR deletions, do your compliance management, and so on. My take, at least our vision at Onehouse, is: no, separate that from the individual walled gardens, if you will, and centralize it. That has a lot of benefits: you do it once, you can support multiple catalogs, and then each team picks its stack. Your data scientist needs the entire Databricks product stack, the Python notebook and Spark, their caching and their R scripts and all that, while your data analyst just wants to be left alone inside Snowflake, because it's such an easy platform to go sift through data, and they can create downstream datasets from the data there. This is generally what I see as possible from where we are right now. For us to reach that ideal state, everybody needs to align on one catalog, and I don't know how to even technically solve that problem, because this is the service that every compute engine has to call at runtime to plan a query. If it doesn't run in the same zone and the same region, it's just not going to happen, engineering-wise, in my opinion.

34:57

So I feel that practically doing it this way, where the storage and the transformation, the common stuff that you repeat across these walled gardens, is decoupled, gives you the freedom to use whichever compute engine is great for the job: use a Presto-like engine for interactive query performance, use Spark for pipelines, or use Flink instead of Spark. It fosters a more open ecosystem where all of us can solve these hard problems independently, and new engines can emerge. There are startups doing FPGA acceleration for the common filter, project, and join operators, pushing them into hardware if you want, and there are projects doing GPU acceleration of the same workloads. It sets up a much more level playing field: new innovations can come up with access to the same data to prove out their value. The problem right now is that much of this data is locked into proprietary cloud data warehouses, so even if you and I had a great SQL engine today, how are we going to get access to all that data to prove it? It's a huge pain and a huge problem. To even do an evaluation, you need to export that data, which comes at a cost. You see what I'm saying?

Yeah, yeah, it makes a lot of sense.

36:16

Walk me through XTable. That's an interesting project.

Yeah. Maybe I'll start with the origins of the project, which is always an interesting way to understand it. What happened was, this project started from a Onehouse lens, because whatever I just told you about, decoupling the storage and the core transformation work and the optimization, is what we wanted at Onehouse. For example, we don't offer a data science platform like Databricks does, and we don't offer a new query engine; there's a market, and plenty of people doing that. Our job is to make sure the data gets into them. Where there was friction was, let's say a customer is using Databricks, which is a great data science platform, but look at it from an end-to-end use case: I want to ingest a lot of data very quickly and then get it in front of my data scientists. We are really good at this; it's pretty well known industry-wide that there are a few workloads that we, as in Onehouse and Hudi, do really well, like CDC ingestion; we have all these indexes that other projects don't, and we do a great job when it comes to near-real-time, efficient ingestion, and so on. On the other side, the data science platform, like Databricks then, just understands Delta Lake. So how do you bridge this? Because as a customer, they want both the fast ingest and the easy, nice data science platform.

So this is how we started thinking about it. And the thing is, like I said, thankfully, between all these three projects, we all just store Parquet files. So what we did was: you write with one project, pick Hudi, Delta Lake, or Iceberg, you write as one, and then, with very low overhead, we can translate the metadata at that commit boundary into the two other projects as well. What this gives you is that the same data can be quickly ingested, have a table registered as an external Iceberg table in Snowflake, and the same physical copy of data registered into Unity Catalog as a Databricks Delta table. Your data scientists and your data analysts are happily consuming the same data while your data engineers do their engineering, and your data platform costs also go down. So this is how we created this project.
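(For illustration, not from the episode: a sketch of driving an XTable metadata sync. The config keys and the bundled-jar invocation follow the Apache XTable README as I understand it; the jar name, bucket, and table are hypothetical.)

```python
# Translate a Hudi table's commit metadata into Iceberg and Delta views of
# the same Parquet files. No data is copied; only table metadata is written.
import subprocess

config = """\
sourceFormat: HUDI
targetFormats:
  - ICEBERG
  - DELTA
datasets:
  - tableBasePath: s3://my-bucket/lake/trips
    tableName: trips
"""

with open("xtable.yaml", "w") as f:
    f.write(config)

# Run the sync; afterwards the table can be registered as Iceberg in
# Snowflake and as Delta in Unity Catalog, all pointing at one physical
# copy of the data.
subprocess.run(
    ["java", "-jar", "xtable-utilities-bundled.jar",
     "--datasetConfig", "xtable.yaml"],
    check=True,
)
```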

38:40

We originally open sourced it as OneTable; it was a feature in our platform, and there was a lot of interest in open sourcing it as a general thing. Later that year, this was 2023, we open sourced it with Azure and Google Cloud, and it's now powering the OneLake translation between Snowflake and Azure Databricks, or Fabric, inside Azure. Same problem, right? You write from Snowflake as Iceberg tables, a translation of the metadata happens, and you can read in Fabric as a Delta table. It's actually pretty cool, the amount of impact it's had in a short time: as you can see from this example, brokering peace between three giants with a small piece of conversion code has been pretty fascinating to see.

39:30

Also, personally for me, this was a thing: when Snowflake started the Iceberg support, it became a little bit of a Snowflake-versus-Databricks conversation. As somebody who wakes up to work in this space every day, brushes my teeth and thinks about lakehouses, for me it was kind of unfortunate, because we were starting to shrink-wrap these open table formats the way we would deal with closed table formats, which kind of defeats the purpose and the power these things can bring. The power in doing this is actually that open layer, the watering hole analogy I told you at the start. Saying "no, you have to pick either Iceberg or Delta Lake, or you're wrong" takes that away. For me personally, this flew in the face of that, because it basically said: no, you can do both; technically there's nothing limiting us from doing that. And it's actually moved the conversation to the real bottleneck here, which is the catalog. So it's actually had, I think, a pretty good industry impact. I would say it's not technically that complex compared to, let's say, Hudi and Delta themselves, but in terms of how it's strategically moved us toward interoperability, I'm very happy about the impact it's had. And we're now moving toward addressing the catalog interoperability problem in XTable broadly. We want to grow it as the interoperability peacekeeping force, if you will, to make sure the same physical copy of your data remains readable really well across multiple formats and catalogs. There's a lot of design work and RFCs going into the project right now toward that.

41:20

That's super cool. You mentioned Iceberg and Delta Lake, and I've wanted to ask you for a while: what was your reaction when Databricks bought Tabular last year? All of a sudden, that was the big news. What were your thoughts when that happened?

41:34

Honestly, I didn't know how much to make of it. I don't think many people would have thought it would go this way; they'd probably get bought by Snowflake, because that's kind of how the table was set before that. So when this happened, I was surprised like everybody else. But for us, from an open source standpoint, it doesn't change much. I am curious to see how they're going to integrate or consolidate both; there's still no really good clarity on that, like whether there's going to be only one project or not. I think of Delta more like Hudi, in the sense that it has a lot more upper-level stack, so Iceberg can be a table format that interops between engines, Hudi, Delta, the kind of database we talked about, anything really. But that part is not clear to me.

42:39

Beyond that, yeah, I am curious to see how they actually consolidate. But I could actually understand why they did it, in some sense. I can't speak for Databricks, but we were kind of in that camp in some sense, although we are very small; as you know, we are an organic, grassroots open source project. Our company has a ton of funding, but if you look at them as competitors to us, it's not even remotely close. What really happened here was this: if you look at 2022, when Databricks started all these benchmark wars and all of that, I think Snowflake did not have a lakehouse offering at all. Remember all the re:Invent ads, "which 30-year-old technology are you using?" They went very strong on the lakehouse angle. And Databricks was basically at the point saying: look, this is what Uber built, this is what Netflix built, this is what we built, this is a new kind of technology. Which is true, and which is now validated. But essentially, when Snowflake came into the picture, an interesting dynamic happened, which still fascinates me today. For everybody else, every other vendor, cloud provider, small vendor, it was like Christmas, because on one hand, by attacking Delta or something, you're able to counter Databricks, which was out there doing a bunch of things; and on the other hand, Snowflake is telling its customers, "I'm going to do this Iceberg thing," and by saying "I'm Iceberg-ready," if Athena is Iceberg-ready too, then maybe I can offload my Snowflake costs, reduce the bill. That's the interesting thing that happened, and then the entire ecosystem basically ganged up on Databricks for a while there, I feel. And we took a lot of crossfire from that; we were doing our thing around interoperability and building toward the technical and product vision of Onehouse, which has been pretty constant, but this set off a lot of narratives and whatnot. The one thing I realized is that a lot of the decisions in this space are actually based on these narratives, a lot more than I believed as just an engineer. I thought people decide bottom-up; they were like, no, a lot of decisions are a little more top-down. So I can see why they did it. My point in all of this is: it's the one marketing thing that was used against Databricks, sustained for over 18 months. It looks like a move to just consolidate around that.

45:22

Yeah, I think it's healthy in a way, right, having one. Look, we interop with Iceberg; we are fine with it. You can land data, you can read data, Delta tables, Iceberg tables, whatnot. But of course we stick to our guns on the technology: I can't suddenly change my position because of popularity and agree that an index buys you nothing. If you use an index, you will get faster performance; that's a technical fact. So I can't change my technical position because of this. But if everybody supports something, we'll play ball; we'll support it. We're not here to build another silo or something. That's simply my position around this.

46:05

And I would say, as a series B startup, I don't think any other series B startup's open source project has had the kind of marketing headwinds we've had. I'm happy with how the community has held up: if you look at GitHub stats, if you look at open source contributor metrics, we're still holding even, pretty much, in spite of all this marketing push. I'm very happy about how the technology is holding up on its own, and that's all I can do, right? I cannot control anything else.

46:45

That's interesting. This brings a question to mind. It doesn't seem like much has changed in terms of your approach with Hudi or XTable or Onehouse, given everything that's happened. I think some founders would be like, crap, okay, I need to pivot pretty hardcore, but it doesn't seem like that's the case. What's your true north? What do you consider the direction you have to go, no matter what?

47:13

Yeah, so here's what I think. Some of this actually comes from thinking about Onehouse as a "Hudi company," a common angle, because so many open-source startups are built like that. But we never wanted, or I never wanted, to start a Hudi company. Sure, we offer a ton of things for Hudi users, yes, but we started this to build an open data platform, what we, and I don't know if I have the words quite right yet, essentially call the universal data lakehouse. We believe the biggest difference between us and Databricks, for example, is saying: look, yes, the lakehouse is the way to go, and Delta-versus-Hudi technical differences aside, we're aligned there, but Spark is not the best engine to run everything. We think it's good for some use cases. So that's the true north for us, and this decoupling of the core compute that you need across these walled gardens is basically what we are after. So, like I said, these moves are actually good. If, for example, Delta and Iceberg converge in some way and become one, we just have one thing to support. That's it. And we will support new formats that come up. So as a company, we're not pivoting, we're not doing anything, because this vision is pretty set and it's strong. This is why we started the company. If you think about it, we built OneTable and all of this before any of this happened, right? It's not a reaction to this acquisition or something. It is just true to our principles: hey, the market has plenty of engines, and we are confusing everybody because everybody claims "I'm the best." You cannot all be the best, right? Because there are these fundamental tradeoffs you're making in the core of your system, and with these complex systems, once you've made a few of them, you cannot go back and do some other workload well. That's how it is; that's what the computer science says.

So we just want to build a platform that can ensure your data is open and you are not suddenly beholden to a walled garden, having to keep paying them for these core workloads. And why are we not that walled garden? Why will we not be the bad guy? We've addressed this even in our launch blog: for anything that we run on our platform, Hudi has open services that can do the same thing. So you can duct-tape all of this together yourself if you want, right? We want to give you the total freedom to build that unbundled data platform if you want, which is great if you have the team to go build it. If not, we can get you started on the right fundamentals, and we will keep your data right and optimized in the same exact way.
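For a flavor of those open table services: the same cleaning and clustering a managed platform would run can be enabled inline on any open-source Hudi writer. This is a minimal sketch with made-up values; the config keys follow recent Apache Hudi releases, so verify them for your version.

```python
# Sketch only: inline table services in the open-source Hudi writer.
# These extend the hypothetical hudi_options from the earlier sketch.
table_service_options = {
    # Automatically clean up old file versions after commits.
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.commits.retained": "10",
    # Periodically rewrite small files, sorted for faster reads.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.sort.columns": "id",
}

(df.write.format("hudi")
    .options(**hudi_options)             # base options from earlier
    .options(**table_service_options)
    .mode("append")
    .save("s3://my-bucket/lake/events"))
```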

And think about it: can people even easily do apples-to-apples tests between, say, Photon and Snowflake SQL today? Because you will load data one way here and another way there; you will not run clustering here, but you will run clustering there. We can ensure a level playing field where the data is optimized in the same exact way, and you can then bring multiple engines and figure out which product's stack of features and which cost-performance profile fits your need. So that's the north star for us, and that's why we are not panicking or doing anything reactive around this.

For Hudi, it's the same thing. We don't own Hudi; we have six out of 16 PMC members in the project. And as an Apache project, we should be focusing on how to make the overall ecosystem benefit. If there are ways for us to make Hudi, XTable, Iceberg, and Delta all work well together for the benefit of the community, we will not stand in the way; we will do it as things come up. We cleanly separate how we feel technically about the choices, and about solving the core technical problems in the lakehouse, from how the ecosystem should be. We don't want to build another silo, right? That's how we generally approach it.

So for us, the challenge really is: is this too idealistic a vision, or not? And how are we going to get this message across against vendors and competitors which are a thousand, ten thousand times bigger than us? That is the true challenge for us, I would say, and that's where we need to do a lot of work in getting people to truly understand this.

Yeah, that's awesome. That's a master class in just sticking to your guns, you know. Like I said earlier, I think the temptation for a lot of people would be to freak out, take a run around the room, and, you know, pivot galore. Like that scene in Silicon Valley, the TV show, where they're just pivoting.

Yeah, yeah. This is like one of those things I hang on to, like the middle-out compression, right? I mean, that's fact; the problem for you is that you can't suddenly pivot away from it. Like, say, for example: we land what we call the fastest Iceberg tables, or Delta tables, or Hudi tables, because we write as Hudi and you can read them as Delta or Iceberg or whatever today. So it combines the powers of both. That's the way we approach it.

I cannot argue against myself on that. Hudi is the only one that has record-level indexes; so if you have an index, will the writes be faster? Yes. How do I argue against myself and do something else? I would actually be doing the customers a disservice by slowing down their jobs. It might actually make us more money, because the jobs would run longer and we price for usage, but that's not the right way to approach this. And there's a good chunk of the community that understands these benefits. Our challenge is: how do we educate more people when you have such top-down marketing and messaging, lots of buzz if you will, from the big five, basically the three clouds plus Databricks and Snowflake, each saying one thing or the other? Yeah, that's the challenge. And like I said, as a founder, our fortunes are only improving. I have more money than I had three years ago in 2021. We have more product; we built more stuff; we moved further towards the open lakehouse than where we were. So I think it's progressing pretty well, if I evaluate us on that arc.

That's super cool. All right, but enough, great chatting with you.

Yeah, for sure. Thanks for taking up some time. Really glad we had a chance to chat, and congrats on the success.

It's awesome. Thank you very much.
