Episode Transcript
0:00
How's it going? Good. All right, let's do this. Let's do this. Yeah, so it's good to see you again. How have things been? It's been good. If I remember correctly, we met once at that data engineering conference in India, right? I think. Yeah, yeah. So, yeah. It's been good — lots of things happening last year. Yeah. Continuing to build, you know, Onehouse and the lakehouse and whatnot. So, yeah, love to dig in and just kind of get into it. Yeah, super cool.

0:34
Yeah, you've been up to a lot. I guess I'll kick things off first. For people who don't know who you are, do you want to give a quick intro?
0:43
Folks, my name is Vinoth, and I currently founded a startup called Onehouse, which essentially provides an open data lakehouse as a foundation for data infrastructure. We call it the Universal Data Lakehouse, and it should probably be clear why as the show goes on. My background before this: I've worked in data infrastructure for quite a bit. Right before this I was at Confluent — I was a principal engineer working across Kafka, streaming, Connect, and a bunch of different areas there. What brings me to the lakehouse space is my work actually before that, at Uber. We built the world's first data lakehouse, which should probably be fun trivia at some point as the space builds up. It actually wasn't called the lakehouse; we called it a transactional data lake. But we built the first production data lakehouse, operated at scale, on open data formats, across multiple engines, each for its own use case — and through this, the Apache Hudi project. I continue to lead the project in the Apache Software Foundation as PMC chair, and I'm now also involved with another project in the foundation, called Apache XTable (incubating), which is bringing interoperability and a brokering piece across the open data ecosystem.

2:06
And before that, I led key-value storage at LinkedIn. That was a fun experience for me back in the day: we built this key-value store called Voldemort — a Dynamo-style store — and I was tech lead for that. Like I said, it was a good experience, scaling a system like that for a very popular website like LinkedIn. And I started my career at Oracle, doing database replication — CDC streams, GoldenGate replication software, things like that. So that's kind of my background.
2:36
Okay, so just dabbling in databases. Just kidding.

2:38
Yeah, I generally describe myself as a one-trick pony. I actually don't know anything beyond this — that's actually how I look at it. Even Onehouse I started because I was like, okay: I've been fortunate to be part of these, you know, very large, behemoth data companies like LinkedIn and Uber. And when I was thinking about what to do next, I actually didn't know anything else. It felt like I was going to go somewhere else and build the same data platform again. And then I saw people building the same data platform over and over again, you know, duct-taping open source projects together. So we said, hey, the industry does not have a data platform that comes prepackaged — a bundled platform, if you will — built from purely open source foundations, that can provide the same ease of use you would expect from cloud managed services. So that's actually how I even got going. So you could say that my background actually led me to this. Yeah, right.
3:47
So yeah, that's awesome. And then, I guess, what's interesting to me is you worked at Uber on what you called a transactional data lake, I believe is what you said. What were the insights into that? Because I think — around what year was that?

4:01
This was 2016 that we built it. We opened up the project in Jan 2018, after we got all of our critical — most of our data — on the thing. Yeah.
4:15
Because before that, data lakes had sort of been popular for a bit, then they kind of — I don't know, they didn't disappear, but they weren't as popular. Then I started reading some of the Uber blogs around the end of the last decade that were sort of hinting this data lake was coming back, but it was sort of resembling more of a database, and so it's kind of interesting. What were some of the insights that led you to want to build something like that?
4:47
Yeah, to be honest, we were just trying to solve a business problem. We weren't trying to build a new industry category or start a company or anything like that. The problem we had was pretty common. So you had a database — and remember, when we started this at Uber, like I mentioned, I had just been working on NoSQL key-value stores and everything at LinkedIn before that; that was primarily what I did. So the operational data was scaling out, right? We were moving away from purely relational databases; plenty of companies needed these operational databases, which are scale-out databases now. And we had streaming data, right? Another thing I had front-row seats to was the rise of Kafka at LinkedIn — you know, tuning the Kafka clusters, the JVM, and all that, back in the day. So there's a lot of data, right?

5:45
So the problem we ran into at Uber was that it was a highly real-time business. Literally, weather or many other factors could just change the dynamics of how the business operates in real time. And we had the scale-out database, which was storing all our trips, transactions — all of the core business data. And we wanted to just ingest it downstream to a warehouse. We had an on-prem warehouse at that point, and the warehouse was another specialized database, right? So you just do database-to-database replication. And it was fine, until it couldn't fit — either at the scale, or it was too closed, so we couldn't fit new use cases on top of our warehouse without running many instances of the warehouse or making parallel copies of the data. So we needed a scale-out compute processing architecture, which is what the MapReduce data lakes of the world already did at that point, right? Spark was just coming up at that point. And then you had HDFS or S3 — so you had scale-out compute, you had scale-out storage, right? But what you needed now were all these databasey primitives that the warehouse had, which you did not have on the lakehouse — sorry, on the lake, right?

7:04
So we borrowed all these transactional primitives. Our on-prem warehouse had updates. It had a notion of an index — they called them projections. And they had all these different ways of coordinating, you know, handling writer and reader concurrency at scale, and actually had a bunch of services that continuously optimize the data. So they had a database runtime, right, that was actually managing the data, presenting clean tables for you to consume. So that's why we said: okay, we have a data lake; we're going to retain the core compute and storage scaling aspects of it, and just borrow the transactional aspects. And we called it a transactional data lake — that's kind of what we did. And it helped us ingest all this high-scale data — high-scale database NoSQL change logs, RDBMS change logs, high-scale streaming data — all into a central layer of data. Then you could send it downstream to various data marts, if you will — I don't think anybody uses these terms anymore — but you could build real-time data marts, for example, or you could build one for data science. We could even offload the ETLs from our warehouse, which was pretty expensive, to the lake, and then just copy the final serving of the rolled-up tables. We could move our data modeling to the lake and do it in a much higher-scale fashion. We built this as the central — I would say maybe a watering hole — where everyone in the company can come for clean data. And from there, it goes to many different use cases, right? That's kind of the overall architecture.
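To make that "transactional primitives on a lake" idea concrete, here is a minimal sketch of upserting change-log records into an Apache Hudi table with PySpark. The bucket path, table, and column names are hypothetical; the options follow Hudi's documented Spark datasource configuration.

```python
from pyspark.sql import SparkSession

# Minimal sketch: upserting CDC-style records into a Hudi table.
# Requires the Hudi Spark bundle on the classpath, e.g.
#   --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Pretend these rows came off a database change log.
updates = spark.createDataFrame(
    [("trip-001", "completed", "2016-07-01 10:05:00"),
     ("trip-002", "created", "2016-07-01 10:06:00")],
    ["trip_id", "status", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    # The record key lets Hudi index rows and update them in place.
    "hoodie.datasource.write.recordkey.field": "trip_id",
    # When the same key arrives twice, the larger value here wins.
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lake/trips"))  # hypothetical base path
```

The record-key plus precombine-field pair is what turns a pile of Parquet files into a mutable, database-like table.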
8:50
Right. And now that paradigm has kind of taken over, I would say — I guess it's now called the lakehouse. Yeah. So walk me through the evolution from what you just described to maybe Hudi, and then to today.
9:01
Yeah, yeah, that's actually been interesting. Honestly, when we were building it, at the core of it we had to solve three problems, right? We had to provide a way for you to mutate data — data lakes were a bunch of files that you throw into a distributed file system or a cloud storage bucket, and then you somehow deal with it; that's how it was. So we needed to bring some transactional boundaries; that's first. And then, while doing that, you needed to provide the ability to do fast updates, right? And then the one thing that we did very early on, from the first version, was give a change log on the other side. We wanted it to be like a database table: just like how we could CDC from an upstream database, you should be able to CDC from this lake — lakehouse, or table — as well, and then build downstream, right?
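That change log on the read side is what Hudi exposes as an incremental query; a hedged sketch, reusing the hypothetical table above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal sketch: CDC *out* of the lake table. A downstream job would
# normally checkpoint the last commit instant it processed; this one
# is made up.
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    # Only records committed after this instant time are returned.
    .option("hoodie.datasource.read.begin.instanttime", "20160701000000")
    .load("s3://my-bucket/lake/trips")
)
incremental.show()  # rows that changed since the given commit
```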
9:59
It was actually super controversial. We built it and ran it for a year or so, and we open sourced it early 2017, because we thought this seemed like a general enough problem, and we had conviction internally in our team that, oh yeah, this is where it will go. For a year or so, it was like: okay, this is a nerdy thing these Uber engineers built. It wasn't much more than that. But slowly, after a year, people started running into similar issues, right? You see a lot of companies who had a lot of transactional data suddenly realize: I fulfilled an order today, and my package gets returned; or the payment won't complete, you retry the card, try an alternate payment — data is mutating. So slowly people started realizing, okay, data is mutating, so this is an efficient way to handle that, and a community started forming.

10:50
The main thing that pushed this into the mainstream as a use case, to me, was the GDPR thing, because suddenly that made everybody realize you can't just dump a bunch of files. It generalized this problem from "this is a CDC update problem" to "no, no, this is an overall data management problem." Whether you have CDC or append-only data models, you need to be able to delete stuff. And if you need to delete stuff within large amounts of data, you need all these things: you need an index, you need all the management, you need to mutate, you need to be able to produce a new version, and you need a change log to propagate it downstream. So all of this suddenly generalized.
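A GDPR-style erasure, for instance, becomes just another write operation rather than a rewrite-everything job; a minimal sketch against the same hypothetical table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal sketch: erase specific records (say, one user's trips).
# Read the doomed rows, then write them back with the 'delete' op.
doomed = (
    spark.read.format("hudi")
    .load("s3://my-bucket/lake/trips")
    .filter("trip_id = 'trip-001'")
)

(doomed.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    # 'delete' removes the rows whose record keys match.
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save("s3://my-bucket/lake/trips"))
```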
11:41
So 2017, 2018 is what I just described. 2019 is when Databricks open sourced Delta Lake, and, you know, the lakehouse paper and all that happened. And then Amazon, basically — we did an integration with Hudi, and Hudi was bundled into EMR; it still is pre-installed on all these AWS services, right? And then a whole bunch of companies started building these transactional data lakes. You can actually find this terminology in a lot of AWS blogs and whatnot. So for a while it was "transactional" — you know how this space goes, it's a lot of marketing terms; to talk about the same thing, we invent some ten terms. It's like that.

12:29
But then, when we started the company — I couldn't start the company for a couple of years, I didn't have a green card or whatever, so we could have probably started before; but anywho, we got started in 2021 — our company name was originally Infinity Lake. And then we went out and talked to people, and they were like, what's a data lake? Everybody thought they had a data lake already. And we were trying to say, no, no, no, that's not a real data lake: you don't have good schemas, you really can't consume this data. It's the classic swamp-versus-lake type thing. Then we realized that what Databricks had actually done was really good: they had given it a new name. And as an engineer, I gained my first appreciation for marketing at that point — yeah, this needs a new name. So when we announced Onehouse, we were the second vendor in the market, after Databricks, to say: okay, we are a lakehouse provider. Right now it's like a big crew — everybody's jumped onto that bandwagon from there.

13:43
And there is also the table format conversation, which was famously kick-started by — I think most of the momentum around that has been from Snowflake, right? And that's caused a whole bunch of confusion right now: what's a table format, what's a lakehouse, do I just do open table formats, and so on. It's been pretty interesting for me. Obviously there's the open source side — Hudi, Iceberg, Delta — where we are continuing to compete healthily and we are innovating; we just had our release where we have indexes and whatnot, and we can go deeper into that. That's one side of the coin, the open source innovation side. But on the other side, I think, for example, something like Iceberg support on Snowflake is great, because it finally opens up the Snowflake compute engine to data that is not in Snowflake's proprietary format. Same for BigQuery.

14:43
So overall I would say it's moved to a healthy place now, where open data — or what used to be called external tables, back in the warehouse world — is a mainstream thing. For example, BigQuery and Snowflake and Redshift are all improving to make sure that performance is good on external tables versus native ones. It took a couple of years — I don't know why — for Snowflake to get there, but BigQuery did it in six months from launch. So it's definitely in a healthier place now, with open table format support in all of these warehouses, and they're generally supported across any open source query engine or Spark platform, right?

15:25
We are in a really good position as an industry, I think, to start thinking about what this new world data architecture should be, because now the customers are saying: I don't want to store the data five times. If you want to go that far — I don't think it's even good for the planet, in some sense. How many copies do you want to store? How many servers do we want to run? It makes no sense, with the volume of data being so high. We should be using the right engine for the right workload; even if it's like 30% better, that means it's going to save you money, lower your, you know, carbon footprint, lower your compute footprint — that's generally better. So we need to figure out how to move from here to that architecture. Let me put it this way: instead of trying to sell a warehouse or a compute engine top-down, we should think about data going up. And that is the position that we have, because the engines come and go — there'll be new engines all the time — but your data is the constant across them. The bytes that you write today should be readable five years from now, four years from now, right? So how do we set ourselves up for that? That, to me, is where the future challenge lies.
16:52
That's super interesting. I like that. I like that vision a lot. Um, we'll come back to that in a second, but, uh, by the way, congrats on the 1.0. Really, that's awesome. Um, what happened there?
17:04
Oh yeah, that's a good one, and this touches upon the whole, uh, databasey aspect of it, right? So 1.0, for us, is more about — we've been, the project's been around from 2017, a good six, seven years now, right? And we had a whole bunch of first-mover disadvantages, if you will; for example, a bunch of these compute frameworks weren't even ready to take these kinds of things back in the day, when we did all this. So for us, 1.0 is a reimagination of the project, broadly — right from storage to concurrency control. And we've tried to apply the lessons that we learned from the past few years of working with so many companies in open source that run these mega data lakes. So, just to summarize. One: we realized a lot of the processing is columnar, but the storage is not — in a true sense — because people have wide tables in which only a few columns are changing. So we made the storage actually very efficient; it can encode partial updates now. That cuts down significantly the amount of data that you store, rewrite, store, rewrite. So that's an example of a fundamental storage change.
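As a rough illustration of the partial-update idea — not necessarily the exact 1.0 storage encoding — Hudi has long exposed partial-update merging through a payload class; a hedged sketch on the same hypothetical table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: an update that only really carries one changed column
# of a wide table. PartialUpdateAvroPayload merges the incoming
# non-null fields over the stored record; the 1.0 storage layer goes
# further and encodes such deltas compactly, per the release notes.
partial = spark.createDataFrame(
    [("trip-002", "completed", "2016-07-01 10:30:00")],
    ["trip_id", "status", "updated_at"],
)

(partial.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.payload.class",
            "org.apache.hudi.common.model.PartialUpdateAvroPayload")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-bucket/lake/trips"))
```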
18:20
The second thing that we realized was — so we call it the lakehouse, and it gets compared to a database and all that, but this is not an RDBMS. This is not an online store; lakehouses are not even used to build apps, right, consumer-facing or even internal apps. At the end of the day, we run jobs, essentially. There are only long-running transactions, right? When I was at Oracle, if a transaction took more than a minute, we'd classify that as a long-running transaction or something. Everything is a long-running transaction here. And some of the concurrency control that we had before — optimistic concurrency control — basically says: okay, I want to assume there's no concurrency, and if there is, then I retry. That retrying wastes compute cycles, because these jobs run for hours, or minutes in a lot of cases; that's just a lot of work to retry. So we came up with what we call non-blocking concurrency control, which incorporates some techniques from stream processing, which has matured. It's a different kind of concurrency control that lets multiple writers continuously write without blocking on each other. I think this is more suited for the kind of workloads that a lakehouse runs. That's two.
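In configuration terms, that shows up as a writer-side concurrency mode in Hudi 1.0; a hedged sketch (the docs pair it with merge-on-read tables and a bucket index, so treat the exact option set as illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: one of several concurrent writers (say, a backfill
# job) writing the same table without table-level optimistic locking.
backfill = spark.createDataFrame(
    [("trip-003", "created", "2016-06-30 09:00:00")],
    ["trip_id", "status", "updated_at"],
)

(backfill.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    # Writers proceed without blocking; conflicts resolve at merge time.
    .option("hoodie.write.concurrency.mode",
            "NON_BLOCKING_CONCURRENCY_CONTROL")
    # Per the 1.0 docs this mode pairs with MOR tables + a bucket index.
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.index.type", "BUCKET")
    .mode("append")
    .save("s3://my-bucket/lake/trips"))
```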
19:31
Number three: indexes. We've introduced secondary indexes — database-like secondary indexes — to the lakehouse. Again, it still has the same architecture: the indexes and the data and metadata sit on S3 or cloud storage, and compute and storage are decoupled. But now you're able to get massive speedups on, you know, point lookups. It still won't be the same performance as your Postgres, right? But that's not the goal. It still, you know, narrows things down; it unlocks new use cases. You can do a needle-in-a-haystack search for a few transactions in a large table; you can do joins more effectively, stuff like that. So we introduced a bunch of indexes.
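Through Spark SQL that looks roughly like the examples in the Hudi 1.0 release materials; a hedged sketch with made-up names (and it assumes Hudi's Spark session extensions are enabled):

```python
from pyspark.sql import SparkSession

# Hedged sketch: a database-style secondary index on a non-key column.
# Assumes spark.sql.extensions includes
# org.apache.spark.sql.hudi.HoodieSparkSessionExtension and that a
# Hudi table named 'trips' is registered in the catalog.
spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE INDEX idx_status ON trips USING secondary_index(status)")

# A needle-in-a-haystack lookup can now prune files via the index
# instead of scanning the whole table.
spark.sql(
    "SELECT trip_id, status, updated_at FROM trips "
    "WHERE status = 'completed'"
).show()
```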
20:16
And the fourth thing is just a lot more of what we've absorbed. One thing people use Hudi a lot for is to deal with late-arriving data, and to process that. For example, say the downstream data — you get events out of order. You get an "order created" event after the order has already been processed. If you just process events in the order in which the data arrives, you're going to lose the fact that the order was processed; you're just going to say this order is in the "created" state, right? So with Hudi, we actually pushed that intelligence into storage, where we can understand and resolve records by a business field — this is called event-time processing in the streaming world. We've incorporated some of these things that people routinely need when writing these kinds of processing pipelines, because people constantly do things like: there's a main pipeline writer that is pumping new results while there's a backfill writer, or there's something that's deleting records. So what if you update some value and then delete it at the same time, and you don't want the delete to go through because the value flipped in a way that invalidates the delete condition, or something? So we introduced a lot of intelligence to deal with record-level merges for these kinds of practical scenarios that people run into, right?
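The precombine/ordering field from the earlier sketches is exactly this event-time resolution; a minimal sketch of the out-of-order orders example (table path and columns hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal sketch: late-arriving events resolved by event time, not
# arrival time. The 'processed' event carries a later event_time, so
# it wins even though the 'created' event shows up afterwards.
orders = spark.createDataFrame(
    [("order-9", "processed", "2024-01-01 12:05:00"),
     ("order-9", "created", "2024-01-01 12:00:00")],  # arrives late
    ["order_id", "state", "event_time"],
)

(orders.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    # Duplicate keys merge by the larger ordering value, so the stored
    # row stays 'processed' despite the late 'created' event.
    .option("hoodie.datasource.write.precombine.field", "event_time")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-bucket/lake/orders"))
```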
21:44
So it's a very exciting release. We landed most of the storage changes in this one, and we are continuing towards a 1.1, where we're also slowly rewriting the software layer on top — because there really is a lot of software in the stack, not just the table format layer. So we're also rewriting a lot of it, like I mentioned, taking advantage of newer APIs in, let's say, Spark or Flink, that we didn't have when we originally created the project. Right. So that's very exciting for me personally — to get to rewrite it.
22:16
How do you know — I've always wanted to ask somebody this — how do you know when you've hit the 1.0, like the major version number?
22:25
Yeah, great question. At least for us, we didn't think of this as a typical enterprise-software 1.0 type thing — I get that question a lot. We've had backwards compatibility and all these different things; the format's been stable for a long, long time. For us, at least, we bumped it up just to tell everybody: hey, this is a major change; we've changed the core metadata log that we have, the core transactional log. So that's how we did it. I think a lot of projects do it from the standpoint of "the format is now considered stable and deemed stable, so you can go use it" — I think the Iceberg project, for example, did it that way. So yeah, there's no one rule, I guess, around this. But to your point, for enterprise software it is commonly the latter, right? That's how they sometimes perceive it: it's not stable until you get 1.0.
23:24
That's interesting. And then I noticed in the release notes, too, there was something called the data lakehouse management service, or something like that — kind of at the top of the pyramid there. What's that?
23:37
Yeah, yeah, great question. So I think — one thing that's been bugging me for a while is I feel we are, in slow motion, trying to build a database. That's what I feel. Yeah. Because if you think about it, what's the lakehouse architecture? Essentially, open formats — that's fine — but it's data and metadata on scalable storage, and then compute is stateless, running on Kubernetes or YARN or one of these resource managers. That's basically what it is, right? But there's no reason that this cannot be packaged as a piece of installable software — you know, I can do docker compose or whatever, and bring something up, right? See, if you want to install a warehouse, you can probably download and click, unzip and install, and there's a server you can send queries to and interact with. There's no such thing for a lakehouse, right? Essentially, you take, say, Hudi on Spark, you write pipelines, you need to register it to a catalog — you need to do some ten things to get something that does this coherent set of functionality together, which is: ingest, write, transform your data, query your data, index and optimize, and then some monitoring. So there is no real database experience.

24:54
So we think, essentially, what's going on right now — take something like Snowflake and Iceberg as an example. You have the warehouse, right — the DBMS, or the warehouse compute stack, that does all of what I described, packaged really well as managed software; then you have an open table format and a closed table format, or storage format, right? So that's the level we're at. But this DLMS is arguing for: okay, we need a whole open stack that I can take and deploy in my Kubernetes, and it behaves like a database, right? It behaves like how you would run a Cassandra or a ClickHouse or some distributed cluster in your environment, but it's still built on top of the lakehouse architecture and its technical foundations. It's a packaging problem, actually, in my opinion — and we're probably a few components short: a high-performance metadata layer and a caching layer are still missing from this.

25:53
Yeah. If you look at our 1.0 RFC — when we started designing indexing and all these different things I talked about, we baselined on a blueprint, a general architecture for a database, and you see these standard components; and these are the ones that are missing. But we've still not packaged it as a database, right? We're still handing out jars, or libraries that you pull into other compute frameworks — that's the second layer in that pyramid. I think people need an easier way to start the lakehouse — the open source lakehouse — and I think that's going to broaden the reach of the lakehouse and make it the staple thing that you start with, instead of going to a warehouse because it's easy to use and then taking on a migration project later. Basically, it's an ease-of-use packaging problem that needs to be solved, in my opinion.
26:51
So was the intention, then, that it'll always be whatever variety of open source you want first — pick your flavor — but it's under this sort of lakehouse umbrella? Or, um, what's the vision for that?
27:05
The vision for that is: you need to be able to have a one-click, open source lakehouse, where you can download and click. And the experience you get is just like — let me pick something like a Postgres or MySQL. You install the thing, there's a server running on a port, you're able to point your data frame at it — you can write a Python data frame or something, send logical plans, execute them, and get results back — or SQL, standard ODBC, that you can interact with. And while doing all this, there are — let's say, the Postgres daemons, or these MySQL buffer threads — all these things under the hood making sure tables are well optimized; all that's happening for you automatically.

27:48
Today we do a whole bunch of this — for example, the Hudi Spark or Flink writers will do some of this for you — but it's not really packaged in this way. So a lot of times people are like, "I don't know how to tune these table services and stuff" — but without them, you're not going to get good performance. So that's why I say it's a packaging problem. We need the experience to be something like that, where you can interact with it; the underlying storage is lake storage, right, and it's open — everything else remains the same. I think the current way of interacting will remain there for a while, right? It's not going to go away, maybe even forever. But we just want to have an alternate way of interacting with the lakehouse which is far easier, because somebody who understands an RDBMS today cannot easily build a lakehouse. It's kind of hard: you need to duct tape a lot of things together, and it can, you know, take you for a ride. We just want that to be an easy experience for them. That's basically what I mean. At the end of the day, what has changed? What changes is the storage-compute separation, and then you have columnar file formats. We do a lot of the same database problems — indexing, writing, concurrency control — we do all that differently here, for good reasons. That's it. Otherwise, yeah, it's just the same SQL queries or data frame code programs that you're writing against.
29:18
It's an interesting vision, because the way I view the lakehouse right now — it feels like, if you take the current state and say this is how it's going to be from now till kingdom come... it's cool, but it also feels like we're not all the way there yet. Because I still have to go find a query engine, and typically it's like, geez, which walled garden do I want to go partner with? Well, I get to use my data in an open format... And so it's this kind of schizophrenic relationship with my data and the provider. So I think what you outlined seems like a more complete vision towards a more open way of interacting with data. Yeah, the way it's done now — like I said, it feels like it's part of the way there, but it doesn't seem like this is it. I feel unsettled, for whatever reason.
30:17
Yeah — so, okay, this is a great topic, and there's a lot to unpack here. So first, what I'd admit is: what I'm saying gets you probably 70% there, but it's not fully going to address the thing that you talked about. So let me separate these, right? One — the first problem you mentioned was around "I still need to pick a query engine." This will solve that problem, in the sense that there is a SQL interface: your lakehouse can be installed with a command, you can have a Helm chart running, there's something running on a port that you can send SQL queries to, and it has a SQL query engine, right? That solves that problem. But even these SQL engines, I think, need to evolve. If you want to eliminate the choice — "I don't want to pick one, I just want to stick with this" — then we need to solve a lot of technical compute-engine problems. For example, take even warehouses: Snowflake does — at least going by whatever I can know from their paper and such — push-based processing, right? So they're good for lower latency on interactive queries; that's the kind of warehouse workload that you want. If you look at something like Spark: Spark will shuffle data to disk between stages; it's great for pipelines, because pipelines need the resiliency to retry and whatnot, right? So this is where I feel — I don't know when those gaps will converge. For it to fully converge to "okay, this is the one open thing that you need, you can start with this" — it's going to be a while.

32:03
From an industry perspective, what I see is — the walled garden comment that you made, right? If you look at it right now, we draw all these diagrams: here is the open table format, or Iceberg — read, write, read, read, write, write. But actually, only one can write and everybody else can read, because there is a technical problem called the catalog. We've been very distracted with this table format thing. Here, at least, all the data is simply in Parquet; you have some metadata around it; it's a solved problem — we solved it with XTable, and Delta tried to, you know, solve it with UniForm. But the catalog problem is a gnarly one: for, let's say, Snowflake and BigQuery to safely write to a single table, they need to agree on a single catalog, right? There needs to be one guy who's designated to do concurrency control across those two writes. I don't know how we're going to solve that problem, because it needs vendors to perfectly collaborate with each other to be able to solve it. So until then — decoupling the storage, that's kind of simple; everybody will agree on that.

33:15
But also, the lock-in is not just in storage — I like that comment so much. The lock-in is not just in storage, but in all your core compute, right? And by core compute I mean: in any walled garden that you're in, you still need to ingest your data, you need to transform and do your data modeling, build your fact and dimension tables or whatnot, and then optimize this all centrally, do your GDPR deletions and your compliance management, blah blah blah. My take is — at least our vision at Onehouse is — separate that from the individual walled gardens, if you will, and centralize it. It has a lot of benefits: you do it once, you can support multiple catalogs, and then you can pick. Your data scientist needs the entire product stack, right — the Python notebook, and Spark for their caching, and their R scripts, and blah blah blah — while your data analyst, you know, just wants to be left alone inside Snowflake, because it's such an easy platform to go sift through data, and they can create downstream databases from the data there and whatnot. This is generally what I see as possible from where we are right now.

34:27
For us to reach that, you know, ideal state, everybody needs to align on one catalog, and I don't know how to even technically solve that problem — because this is the service that every compute engine has to call at runtime to plan a query. And, you know, if it doesn't run in the same zone and the same region... I mean, it's just not going to happen, engineering-wise, in my opinion, right? So I feel practically doing it this way — where the storage and the transformation, the common stuff that you repeat across these walled gardens, is decoupled — means you have the freedom to use whichever compute engine is great: use a Presto-like engine for interactive query performance, or use Spark for pipelines, or use Flink instead of Spark. It fosters a more open ecosystem where all of us can solve these hard problems independently, and new engines can emerge. There are startups doing FPGA acceleration for the common filter-project-join workloads — you can push it into the hardware if you want — and there are projects that do GPU acceleration of the same workloads. So it just sets up a much more level playing field: new innovations can come up, and they have access to the same data to prove out their value. The problem right now is much of this data is locked into proprietary cloud data warehouses; even if you and I had a great SQL engine today, how are we going to get access to all that data to be able to prove it? It's a huge pain and a huge problem, right? To be able to even do an evaluation, you need to export that data, which comes at a cost. You see what I'm saying, right?
36:13
Yeah, yeah, it makes a lot of sense. Walk me through XTable — that's an interesting project.
36:18
Yeah. So maybe I'll start with the origins of the project, which is always interesting to understand it from. What happened was — this project started from a Onehouse lens, because whatever I just told you about decoupling the storage and the core transformation work and the optimization is what we wanted at Onehouse. For example, we don't offer a data science platform like Databricks does, and we don't offer a new query engine — there's a market, there are plenty of people doing that, right? Our job is to make sure the data gets into those. So where there was friction was — okay, let's say Databricks: it's a great data science platform. But look at it from an end-to-end use case: I want to ingest a lot of data very quickly and then get it in front of my data scientists. We are really good at this, right — it's pretty much industry-wide that there are a few workloads that we (as in Onehouse, and Hudi) do really well, like CDC ingestion; we have all these indexes that other projects don't; we do a great job when it comes to near-real-time, efficient ingestion, blah blah blah.

37:35
On the other side, the data science platform — like Databricks — just understands Delta Lake, right? So now, how do you bridge this? Because as a customer, they want both the fast ingest as well as an easy, nice data science platform. So this is how we started thinking about it. And, like I said, thankfully, between all these three projects, we all just store Parquet files. So what we did was: okay, you write to one project — pick one of Hudi, Delta Lake, or Iceberg — you write as one, and then, with very low overhead, we can translate the metadata at that commit boundary into the two other projects as well. What this gives you is: the same data — we can quickly ingest it, have the table registered as an external Iceberg table in Snowflake, and the same physical copy of data registered into Unity Catalog as a Databricks Delta table. Your data scientists and your data analysts are happily consuming the same data while your data engineers are engineering, and your data platform costs are also going down. So this is how we created this project.
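Mechanically, XTable's bundled utilities jar reads one format's metadata and writes out the others'. Here is a hedged sketch of driving it from Python — the config keys follow the incubator-xtable README, but the jar version and table path are assumptions:

```python
import subprocess

# Hedged sketch: sync one physical copy of data (written as Hudi)
# into Iceberg and Delta metadata with Apache XTable's utilities.
# Config keys follow the incubator-xtable README; jar name and
# paths are hypothetical.
config = """\
sourceFormat: HUDI
targetFormats:
  - ICEBERG
  - DELTA
datasets:
  - tableBasePath: s3://my-bucket/lake/trips
    tableName: trips
"""

with open("xtable_config.yaml", "w") as f:
    f.write(config)

subprocess.run(
    ["java", "-jar", "xtable-utilities-0.2.0-bundled.jar",
     "--datasetConfig", "xtable_config.yaml"],
    check=True,
)
# Afterwards the same Parquet files carry Hudi, Iceberg, and Delta
# metadata, so each engine can register the table in its own catalog.
```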
38:40
We originally built it as OneTable — a feature in our platform — and there was a lot of interest in open sourcing it as a general thing. Later that year — this is 2023 — we open sourced it with, you know, Azure and Google Cloud. And it's now powering the OneLake translation between Snowflake and Azure Databricks, or Fabric, inside Azure. The same problem, right? You write from Snowflake as Iceberg tables, then there's a translation of metadata that happens, and you can read in Fabric as a Delta table.

39:14
It's actually pretty cool — the amount of impact it's had in a short time, where, as you can see in this example, it's a brokering piece between three giants, with a small piece of conversion code running. It's been pretty fascinating to see. And personally, for me, this was a thing where — when Snowflake started the Iceberg support, it became a little bit of a Snowflake-versus-Databricks kind of conversation. As somebody who wakes up to work in this space every day — brushes my teeth and thinks about lakehouses, right — for me it was kind of unfortunate, because we were starting to shrink-wrap these open table formats the way we would deal with closed table formats, which kind of defeats the purpose and the power that these things can bring.

40:06
The power in doing this is actually that open layer — the watering hole analogy that I told you at the start, right? It kind of takes that away if you say: no, no, you have to either pick Iceberg or Delta Lake, or you're wrong. For me, personally, this flew in the face of that, because it basically said: no, you can do both, because technically there's nothing limiting us from doing that. And it's actually moved the conversation to the real bottleneck here, which is the catalog, right? So it's actually had, I think, a pretty good industry impact. I would say it's not technically that complex compared to, let's say, Hudi and Delta themselves, but the impact it's had, in terms of strategically how it's moved us towards this interoperability — I'm very happy about that. And maybe about how we are moving towards addressing the catalog interoperability problem on XTable, broadly. We want to grow this as the interoperability — kind of a peacekeeping force, if you will — to make sure that the same physical copy of data remains readable really well across multiple formats and catalogs. So there's a lot of, you know, design work and RFCs and all of that going on in the project right now towards that.
41:18
That's super cool. You mentioned Iceberg and Delta Lake, and I've wanted to ask you for a while: what was your reaction when, last year, Databricks bought Tabular? All of a sudden, that was the big, big news. What were your thoughts when that happened?
41:34
I — actually, we don't know how much to read into it. But my honest thoughts are — I mean, generally, I don't think many people would have thought it would go this way; the expectation was they'd probably get bought by Snowflake, because that's kind of how the table was set before that. But yeah, when this happened, I was surprised, like everybody else. But for us, from an open source standpoint, it doesn't change much. I am curious to see how they're going to integrate or consolidate the two — there's still no really good clarity on that, right? Like whether there's going to be only one project, or... I think of Delta more like Hudi, in a sense: it has a lot more upper-level stack. So Iceberg can be a table format that interops between engines — Delta, Hudi, this database-like thing we talked about, anything really. But that part is not clear to me. Other than that, yeah, I am curious to see how they actually consolidate it. But I can actually understand why they did it, in some sense. I can't speak for Databricks, but we were kind of in that camp in some sense — although we are very small, as you know; we are an organic, grassroots open source project. Our company has a ton of funding, but if you look at them as competitors to us, it's not even remotely close.

43:06
But what really happened here was — if you look at 2022, when Databricks started all these benchmark wars and all of that, I think Snowflake did not have a lakehouse offering at all, right? They were — you know, if you remember all the re:Invent ads, like "which 30-year-old technology are you using?" — yeah, they went very strong on the lakehouse angle. Remember those? And Databricks was basically at that point saying: look, this is what Uber built, this is what Netflix built, this is what we built — this is a new way, a new kind of technology, right? Which is true, and which is now validated. But then, essentially, when Snowflake came into the picture, an interesting dynamic happened, which just fascinates me to this day. Everybody else — every other vendor, cloud provider, small vendor — for them it's like Christmas, right? Because on one hand, by attacking Delta or something, you're able to counter Databricks, who is, like, steamrolling, doing a bunch of things; but on the other hand, Snowflake is telling its customers, "I'm going to do this Iceberg thing," and then, by saying "I'm Iceberg-ready" — if Athena is Iceberg-ready, then, yeah, maybe I can offload my Snowflake costs, you know, reduce the bill. That's the interesting thing that happened, and then the entire ecosystem basically ganged up on Databricks for a while there, I feel. And I think we got a lot of crossfire from that — we were kind of doing our thing around interoperability and building towards this technical and product vision at Onehouse, which has been pretty constant, but this set off a lot of narratives and whatnot. And the one thing I realized is that a lot of the decisions in this space are actually based on these narratives, a lot more than what I believed as just an engineer. I was like, oh yeah, people decide on merit — no, a lot of decisions are a little bit more top-down. So I could see why they did it. My point of it all is: it's the one marketing thing that was used against Databricks, sustained for over 18 months. It looks like a move to just consolidate around that.

45:22
Yeah, and I think it's healthy in a way, right? We interop with Iceberg; we are fine. You can land data, you can read data, you can read Delta tables, Iceberg tables, whatnot. But of course, we stick to our guns: because of popularity, I can't suddenly change my position and agree otherwise. We have an index, and if you use an index, you will get faster performance — that's a technical fact. So I can't change my technical position because of this. But yeah, if everybody supports something, we'll play ball, we'll support it. We're not here to build another silo or something. So that's simply my position around this.

46:07
And I would say, you know, as a Series B startup — I don't think any other Series B startup or open source project has had the kind of marketing headwinds that we've had. I'm happy with how we've continued with the community: if you look at GitHub stats, if you look at open source contributor metrics, we're still holding pretty much even, in spite of all this marketing push. I'm very happy about how the technology is holding up on its own, and that's all I can do, right? I cannot control anything else.
46:45
That's interesting. This brings to mind a question. I mean, it doesn't seem like much has changed in terms of your approach with Hudi or XTable or Onehouse, given everything that's happened. I think some founders would be like, crap, okay, I need to pivot pretty hardcore — but it doesn't seem like that's the case. So what's your true north? What do you consider "this is the direction we have to go, no matter what"?
47:13
Yeah. So some of this actually comes from thinking about Onehouse as "a Hudi company" — that's a common angle, because so many open source startups are like that, and people are tuned into thinking that way. But we never wanted to — I never wanted to — start a Hudi company, right? Sure, we offer a ton of things for Hudi users, yes, but we started to build this open data platform — I don't know if I'm good with the words yet — but essentially what we call the Universal Data Lakehouse. We believe that the biggest difference between us and Databricks, for example, is saying: look, yes, the lakehouse is the way to go — Delta-versus-Hudi technical differences aside, we're aligned there — but Spark is not the best engine to run everything. We think various engines are good for different use cases. So that's the true north for us, and this decoupling of the core compute that you need across these walled gardens is basically what we are after. So, like I said, these developments are actually good for us: if, for example, Delta and Iceberg converge in some way and become one, we just have one thing to support — that's it — and we will support new formats that come up. So as a company, we're not pivoting, we're not doing anything, because this vision is pretty set, and it's strong. This is why we started the company. And if you think about it, we built OneTable and all of this before any of this happened, right? It's not a reaction to this acquisition or something. It's just true to our principles: the market has plenty of engines, and we're all confusing everybody — everybody claims "I'm the best," right? You cannot be the best, right? Because there are these fundamental tradeoffs that you make at the core of your system, and with these complex systems, once you make five of them, you cannot go back and do some other workload. That's how it is; that's what the computer science is.

49:17
So we just want to build a platform that can ensure that your data is open, and that you're not suddenly beholden to a walled garden, having to keep paying them for these core workloads. And why are we not that walled garden — why will we not be the bad guy? We've addressed this even in our launch blog: for anything that we run on our platform, Hudi has open services that can do the same thing. So you can duct tape all of this together yourself if you want, right? We want to give you the total freedom to build that unbundled data platform if you want — which is great if you have the team to go build it. If not, we can get you started on the right fundamentals; we will keep your data right and optimized in the same exact way. And think about it: can people even easily do apples-to-apples tests between Photon and, you know, Snowflake SQL today? Because you load it one way here, you load it one way there; you will not run clustering here, you will run clustering there. We can ensure a level playing field: data is optimized the same exact way, and you can now bring multiple engines and figure out what product stack features and what cost-performance fits your need, right? So that's the north star for us; that's why we are not panicking or doing anything reactive around this.

50:42
For Hudi, same thing, right? See, we don't own Hudi — we have six out of 16 PMC members in the project. And as an Apache project, we should be focusing on how to make the overall system benefit: if there are ways for us to make Hudi, XTable, Iceberg, Delta all work well together for the benefit of the community, we will not stand in the way, right? We will do it, if things come up. Yeah, we cleanly separate how we feel technically about the choices, and solving the core technical problems in the lakehouse, from how the ecosystem should be. We don't want to build another actual silo, right? Yeah. That's how we generally approach it. So for us, the challenge really is: is this too idealistic of a vision, or not? And how are we going to get this message across, with, you know, vendors and competition which are a thousand, ten thousand times bigger than us? That is the true challenge for us, I would say, and that's where we need to do a lot of work in getting people to truly understand this.
51:52
Yeah, that's awesome. That's a master class in sticking to your guns, you know. Like I said earlier, I think the temptation for a lot of people would be to freak out and run around the room and, you know, pivot galore — like that scene in Silicon Valley, the TV show, where they're just pivoting. So.
52:13
Yeah, yeah. Some of these things that I hang on to are like the middle-out compression, right? I mean, that's fact. The problem for you is, you can't suddenly pivot away from that. Right? Like, for example: we land what we call the fastest Iceberg tables, or Delta tables, or Hudi tables, because we write as Hudi, and you can read as Delta, Iceberg, or whatever, today, right? So it combines the powers of both — that's the way we approach it. I cannot argue against myself: Hudi is the only one that has record-level indexes, so if you have an index, will the writes be faster? Yes. How do I argue against myself and do something else? I would actually be doing the customers a disservice by slowing down their jobs. Actually, it might make us more money, because the job would run for a longer time and we price for usage and whatever — but that's not the right way to approach this, right? And there's a good chunk of the community that understands these benefits.

53:34
Our challenge is: how do we educate more people when you have such top-down, you know, marketing, or messaging — lots of buzz, if you will — from the big five, basically the three clouds and Databricks and Snowflake, saying one thing or the other? Yeah, that's the challenge. And like I said, as a founder, our fortunes are only improving: I didn't have money in 2021; I have more money now than I had three years ago; we have more product, we built more stuff, and we've moved more towards the open lakehouse than where we were. Right. So I think it's progressing pretty well, if I evaluate ourselves on that arc.
54:00
That's super cool. All right, enough — great chatting with you. Yeah, for sure. Thanks for taking up some time. Really glad we had a chance to chat, and congrats on the success. It's awesome. Thank you very much.