Episode Transcript
0:00
Hello and welcome to the Thoughtworks Technology Podcast. My name is Lily Ryan, I'm one of your regular hosts, and I am speaking to you from Melbourne, Australia. In today's conversation I'm joined by Ryan Collingwood and Andrew Jones. Ryan is a data strategist at Thoughtworks and is currently writing a book on data contracts that will be out in 2025, Ryan.

Yeah, Q1 2025. Fingers crossed.

Wonderful. And Andrew is the author of Driving Quality with Data Contracts, which was published in 2023. Andrew, would you mind introducing yourself?
0:39
Yeah, sure. Hi, everyone. I'm Andrew. I'm an independent data consultant, and I help organizations build data platforms that reduce risk and drive revenue. I've been doing that for a little while now, which is how I ended up coming up with data contracts a few years ago and writing the book that you mentioned.

That's wonderful, and as you may have guessed, we are here today to talk about data contracts. This is a topic that has been kicking around for quite some time and has recently, I think, come through quite strongly in the way that we're all focusing on data and a lot of our practices. So to talk about it today, and to get to grips with what data contracts are, where they fit into the software delivery lifecycle and all of that kind of stuff, Ryan and Andrew are going to be our expert guides. To kick us off: what is a data contract?
1:29
Yeah, sure. I think the easiest way to think about it, particularly if you're not familiar with data engineering and the data landscape, is that it's an API for data. The reason I came up with the idea originally was that we were getting data directly from databases, chucking it into a data warehouse, and trying to use that to drive data applications. As you know, if you're familiar with APIs, you really don't go direct to a database, because the database keeps changing: the schemas keep changing, the database is evolving all the time, and you don't want that to affect your downstream applications. And that's the common situation we were in. We wanted to use our data for more important things: more revenue-generating applications, more AI- and ML-based applications. And we couldn't do that because our data was not reliable enough. So I started thinking about why that was the case, and eventually I realized that the problem was the lack of interfaces. The fact that we were building on top of the database meant that we couldn't have reliable data streams from the operational applications. Coming from an engineering background, I started thinking about why that was, and eventually I came to the idea that we need some sort of API here, some sort of interface here. That's what was preventing us from building reliable applications on top of our data. So yeah, I like to think of it as an API for data, and the key word really is the interface. This is the interface to data that provides abstraction, provides a place for us to ensure that the transfer of data between one place and the other is reliable, more reliable than it was before.
3:07
How does this differ from straight-up API documentation?

Yeah, it's similar in a way. The difference really is the interface through which you provide the data. With an API, generally you're making calls back and forth, you're moving small amounts of data around, or you're making a call to start an action. With data, you generally need the entire dataset, for example if you're training a data science model, so that might be in a table in a data warehouse somewhere rather than behind an HTTP API, or it might be in a streaming application like Kafka or similar. You can use data contracts there, but generally you're moving around greater volumes of data, and the interaction between the consumer and the provider of that interface is a bit different compared to a standard HTTP or REST-based API.
3:51
Coming back to this topic of interfaces, it's interesting, Andrew, speaking about one of the fundamental differences with data compared to, say, the API specifications that we tend to think of for web services: the interface itself, and you mentioned a couple there like Kafka, or even your flavour of SQL, can change, and that's an implementation detail that is fairly fundamental. As much as people might want to have an omni-connector for every single type of database interface in the world, and there are things that take some of that pain away, there is still going to be some difference here and there. But coming back to the value, one of the values that I see around the data contract, again that word interface, is that what I really saw in data contracts was this interface: a pane of glass where both systems and people can look at the data and understand what the data is. And when I say data, it's data plus context, so information about the data. It's having that standardized, and not having to think about how we're going to describe the interface, because we figure that out ahead of time and then apply that pattern as time goes by. I don't know whether it's because I'm getting lazy or getting older, but I see real value in not having to think about how I'm going to do the thing while doing the thing, if I can just do the thing. That makes me happy. So yeah, that's a great call-out there around how, yes, we can think of it as an API for data, but fundamentally the nature of the problem does mean that there are going to be little differences here and there. So having something consistent that we can hold on to, which also gives us guidance around the things we should be describing as part of that information about our data, is only going to be a good thing.
5:50
Exactly that, right. When we talk about interfaces, we start off talking about the technical side, so it could be Kafka, it could be a database. But when you start describing the interface, like I say, you don't want to describe exactly how it's done; you want to describe what you want to achieve from it. So you start describing: I want to make data available to my consumers through, say, a streaming application. And then, because it's data, there are more things you need to describe about it. Maybe we want to describe how timely it is: is it going to be hourly or daily, or is it going to be real time? You might also want to describe what kind of data is in there. Is it personal data or not? How can this data be used? What can it be joined with? You can put all sorts of things in that contract to help describe the data even more. So, like I said, it's turning data plus context into information that can be consumed much more quickly and much more easily by the data consumers, so they can build their data applications much more quickly and much more reliably, and deliver greater value to the organization. I think that's particularly important as we're trying to do more with data in our organizations.
7:10
To give our listeners a sense of what you're talking about when it comes to the way that these things are defined, we have talked about the kinds of schemas that you might want. What does it actually look like in practice for someone to define one? There is, for example, the Open Data Contract Standard, which is a standard, but we know that there could be other standards, and organizations may want to define their own depending on their use cases. How do you actually, from a nuts-and-bolts perspective, put a data contract together? What does it look like?
7:40
Yeah, there is a standard, the Open Data Contract Standard, which I'm a little bit involved in as part of the steering committee there. And that's quite good, because it's a generalized standard that contains everything you could think about putting in a data contract. But I wouldn't necessarily say people should use that just off the bat, because when you think about how to define data contracts in your organization, every organization is a bit different, but more importantly, the implementation of data contracts needs to work for the people who are going to be completing your data contracts or using your data contracts. So, for example, where I worked previously, we defined data contracts in this kind of random language that many people might not have heard of called Jsonnet. It's like JSON with extensions, the configuration language that came out of Google. And that's not because it's the best way to implement data contracts. It's because our engineering teams, who we wanted to create these data contracts and manage these data contracts and own these data contracts, already used Jsonnet to define their APIs and used Jsonnet to define their infrastructure as code. So it made sense for us to define the contracts in the same way, because we wanted it to be clear that the contracts were at that sort of level, similar to your APIs, similar to your infrastructure as code, that they are as important as those things. And to reduce the friction for adoption, we wanted it to be as easy as possible for our software engineers to use. So that's why we used Jsonnet. I wouldn't recommend anyone else use Jsonnet unless they're already using it. But the key point there is, when you think about implementing data contracts in your organisation, think first about what you want to put in them, which might be quite similar to other organisations. You can use the standards for inspiration there: things like ownership, things like SLOs, things like schemas are going to be quite standard across organisations. But when it comes to implementing it in your organisation, think about how you want the data contract owners to complete and own their data contracts, how they're going to be using them, and start from there, rather than making them use Jsonnet or Avro or Protobuf or whatever it is we might want to use. Think about the user first.
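To make that concrete, here is a minimal sketch of the kind of information a data contract might capture, written as a plain Python dictionary rather than in any particular standard or tool. The field names (owner, schema, slos, usage and so on) are illustrative assumptions, not the Open Data Contract Standard's actual keys.

```python
# An illustrative data contract definition. The keys are hypothetical, but the
# shape is typical: ownership, a schema, SLOs, and context about usage.
orders_contract = {
    "name": "orders",
    "owner": "checkout-team",  # who is accountable for this data
    "description": "One row per completed customer order.",
    "schema": [
        {"name": "order_id", "type": "string", "required": True},
        {"name": "amount", "type": "decimal", "required": True},
        {"name": "email", "type": "string", "personal_data": True},
    ],
    "slos": {
        "freshness": "hourly",      # how timely consumers can expect the data to be
        "availability": "99.9%",
    },
    "usage": "Marketing use requires consent; see the privacy policy.",
}


def required_fields(contract: dict) -> list[str]:
    """Return the field names consumers can always rely on being present."""
    return [field["name"] for field in contract["schema"] if field.get("required")]


if __name__ == "__main__":
    print(required_fields(orders_contract))  # ['order_id', 'amount']
```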
10:16
Yeah, I definitely resonate with that. In my context, when I first encountered data contracts, the number of people who had serious technical chops were definitely in the minority. And I'll just say it: my first attempt at a data contract was essentially an Excel spreadsheet, because for the people I needed to get knowledge out of, that's what they understood, that's what they were comfortable with. And yeah, it's not necessarily my favourite way to structure data, but it is a start. It was good enough for me to get going, and then gradually over time I managed to find something that worked a little bit better for me but was still within their tolerance, something they could work with and that we could then collaborate on together. But yeah, I definitely agree: identify your audience. And perhaps it's also a moment to talk about the general parties involved: the data producers and the data consumers. You may find yourself in the situation I was in initially, where I was both the consumer and, effectively, the producer, working in a centralized data team, which is perhaps a topic we can briefly chat about in a second. But you want to recognize who your parties are and what their level of comfort is. And I think it's great in your situation, Andrew, that you had an existing set of tooling that worked for people, that they were comfortable with, because when we're changing the way people work, I do believe people have a tolerance for the amount of stuff you can throw at them at once before they start getting uncomfortable, and it varies from person to person. So if reusing things people are already comfortable with gives you a little bit more wiggle room to make people uncomfortable elsewhere, then that's a smart move.
12:02
You're talking a lot about making sure that you meet people where they are within the organization, and ensuring that whatever your data contracts are, they're things that people can actively use and understand and be part of creating. What kind of maturity does an organization need to have as a prerequisite for getting value out of data contracts, or making them work in the first place? In one sense, to me, it would imply the existence of databases, but we also know that many organizations have to start with spreadsheets, and that makes a lot of sense: the world runs on Excel in a lot of cases, and that's something we have to work with. So what kind of situation does an organization need to be in to get benefit from a data contract and make it work for them?
12:48
It's not one-size-fits-all, but certainly the problem that led me to data contracts was that I was at a point in an organization where, after much inspection of the data platforms and data systems, which I had inherited, there was still this perception of there being a data problem. And yeah, there were some things that needed some uplift, but what it really came down to in the end, as career-limiting as it may sound, is that data is often a side effect of things happening, right? And if things are happening, but not happening in a consistent manner, so if there are three different ways to process a refund, that means there could be at least three different expressions of the data associated with that event, then you're going to have data quality challenges, right? And so we got to the stage at this organization where it was like, all right, let's have a conversation about what our processes are. Let's standardize some things. There had been rapid growth; people had to adapt and invent as they went. And sometimes that's necessary: sometimes you just have to make it up, figure it out, get things done. But then, certainly, technologists are so familiar with the idea of technical debt, and at some point you have to pay down the debt that you've accumulated. And just like technical debt, there is process debt.
14:16
And so that's what initially led me to this thought of: all right, we're having this conversation about what our processes are, here is an opportunity to codify all this great information and this agreement that we're reaching into some sort of thing that can be reused. And as a recovering business analyst, in my past lives I had exposure to things like data dictionaries and what have you. Also having some software development exposure, I thought, well, if we do this right, if we do this in a way that is structured and parsable, we can use this. This can be a living document. It doesn't have to be a document that's correct at the launch of the thing and then immediately becomes out of date once we have our first SEV1 and do an in-place fix, and then the documentation and the implementation just diverge, right? If the documentation can be part of the implementation, part of the solution, so that it evolves and lives with it, well, wow, that's a document that's worth something. So to come back to the question of where you should be as an organization: I think if you are at a point of having data quality challenges, and if you are confronting data quality challenges, it's also worth having a conversation about whether our processes are actually what we need them to be. You can tackle the data quality on its own, but I feel that without having that conversation around "is the activity that generates this data what we expect it to be, and are we agreed?", you're just kicking the ball down the road.
15:52
Yeah, so that would be the big call-out there: if you are serious about this, are you willing to have a conversation about what it is that we do that generates this data? And are we agreed, are we aligned, on what those processes are? Because we've briefly touched on it, but within the parlance of data contracts, and I really love the way you framed this in your book, Andrew, there's the conversation around data producers and data consumers. Data producers are people that are producing data; it's there in the name. Consumers obviously want to consume it. But the real gotcha is that often data producers don't know, or don't recognize, or, if you want to be uncharitable, don't care, though I'd say that's the minority of cases. I think it's more that they don't know or don't recognize that they're producing data. Because if you're dealing with a department of your company that's in procurement, and you ask them, what's your role, what's your function, they're not going to say it's to enter purchase orders. They're going to say something like, it's to source materials at a great price so that we can have a better margin. It might be something like that, right? They're not going to say it's something related to data entry. Yet capturing that data, things like capturing purchase orders, is vitally important for the rest of the business, because it's all a series of interconnected value chains, right? And if it's not done, or it's poorly done, it makes these other really important tasks really difficult. So that's why I say, where does an organization need to be? I think it needs to be at that point of a reckoning where, yes, maybe it's born out of a concern around data quality, but to really make a difference there must be an openness to have a conversation about what our process is. And I think underlying it, and this is where it touches the social aspect, is being prepared to develop empathy. Describing that scenario of the data producer way upstream and the data consumer way downstream, I do believe that if we help those people who are upstream have empathy for the people who are downstream from them, they can see that this information, which is perhaps a chore for them to fill in and capture, makes a real material difference to these people downstream.
18:26
I want to come back to that issue around empathy in a minute, Ryan, because I think it's something that's really worth digging into. Andrew, from your point of view and in your experience, what has it been like working with a variety of different contexts to make data contracts work? And what have you learned about it over the time you've been maturing these ideas for yourself?
18:50
Yeah, it's a good question. I think, like I said, many organizations struggle with data quality. The ones looking at data contracts to help them change are organisations that are now using data for something more important than it was before. It's part of their strategic goal, whether that's using data to create products, to drive revenue, or to differentiate themselves in the market. It's that realization that data is key to the business that gives those organizations the, I don't want to say the freedom, but the ability, the chance, to look again at how they're doing data and really do a root cause analysis on why they can't achieve this now. What is the problem we're trying to solve here to achieve our goal? And then you go back to the data quality issues, you go back to how you're sourcing the data and whether there's an interface around that, and eventually you get to data contracts and better data, and they go from there back up the stream to work out why the data isn't what it needs to be.
20:00
How do you get to that point in a conversation where you help people to the realization that it is a data quality issue? Because as far as I can see, and in a lot of my experience, it has certainly been the case that data quality issues are the root of many different types of business problems. But because they are so deeply interconnected, because they're the byproducts of other actions that create the data, and because data is so deeply interconnected with everything a business does, it can be really hard to identify that as a particular issue. And when you do, it's also kind of difficult to get buy-in from people who are not data engineers or data scientists to participate in a conversation like this, and to see it as something that is worth investing their time in, and worth investing their time in evolving and maintaining over a longer period. So in your experience, Andrew, how do you get to a point with a business like that, where you really need to come to that realization collectively in order for a data contract to work?
21:06
Yeah, it's really all about communication. The problem data teams often have is that they sit in a part of the business, maybe under IT or even under finance, where they're generally kind of hidden, and they have been almost suffering in silence with this. They are expected to get data from a variety of sources, at a variety of quality levels, and turn that into reporting. And I think that was kind of okay while the results weren't that important. I mean, reporting is important, but if it fails, the business carries on running. Now that they're doing more important things, they need to, or they have the ability to, start having these conversations with those producing the data and really highlight the issues they're having. Why are they not able to turn this data around quickly? Why does the pipeline keep failing? And if it keeps failing, how can we build whatever strategic goal we want to build onto that data? So when people ask questions like that, they need to have the ability, or they need to feel confident enough, to raise issues and say, well, this is the reason: it's because the upstream data keeps changing, or the data entry is not correct, or whatever it might be. They need to have these conversations. And that's what we really spent a lot of time doing. We started off talking about data contracts in a technical sense, but really, in my experience, a lot of it is about communication. When I started doing data contracts, I had a great product manager I worked with, and between us we spent so much time talking to all parts of the business, whether that's the software engineers, or the CTO, or the people in between. It doesn't really matter who, but we were always communicating the challenges we were having: repeating that message in different ways to different audiences, explaining why we had to change things in one part of the business to achieve things in our part of the business, the data part of the business. So really, it's all about communication.
23:14
Ryan, I've seen you, in the past when we've discussed this topic, say that data contracts are a document for both humans and machines. That's something I would love to hear a bit more about from you, particularly as it relates to the way that we can evolve data contracts now that they are becoming a bit more of a mainstream topic that people really want to engage with.
23:39
Yeah, as I mentioned previously, I see a data contract as bundling, yes, the data, but really information about the data. "Data" is a word that has been used so often it feels like it's losing its meaning, but this is where I see a lot of the value of the data contract: it's a document that describes what the data is. And what I mean by that is the meaning of the data. One of my bugbears around the discourse, particularly amongst data engineering folks, and God bless them, I love them, they're my tribe, is that we tend to get really fixated on schema, and that's often where the conversation ends. Anyone outside of the data engineering world does not particularly care about schema; they don't understand schema. They're far more interested in the semantics: what does this data mean? How can I use it? How am I allowed to use it? What were the conditions under which this information was captured? Because think of, say, a table of email addresses, and you have no other context about why this table of email addresses exists. I mean, PII, our Spidey senses start tingling, but let's overlook that for a second, right? If you just know that it's a table with string or varchar-type data that looks like email addresses, that doesn't really tell you a whole lot. But if you then learn, oh, hold on, these email addresses represent people who have told us, "please stop emailing me", that's a very different conversation. Because if you didn't know that, and perhaps you were misbehaving, you might take that list of email addresses and go, fantastic, these must belong to customers who want to hear from us, let me send them some good news about the fantastic things we're doing. But this dataset represents people who do not want to hear from you, right? And this is the kind of information that is often, as I say, not bundled with the data.
25:52
So this is another important element of data contracts: they allow us to speak for the data in ways the data can't necessarily speak for itself. And that's a really important aspect of how data contracts are both for the benefit of machines and for the benefit of humans. Where the benefit for the machines comes in is that, yes, we can put in the things we know and love, like schema. We can also put in things like range checks and pattern matches that we expect to hold, and validate them and build them into our pipelines, right? And certainly, if you're transmitting data between different systems and different programming languages, I just this afternoon went off on a bit of a tangent about how, by virtue of being descended from JavaScript, JSON Schema, if you don't take the time and care to be really specific with your numbers, can give you some really less-than-optimal experiences, because of JavaScript's, let's say, relaxed approach to data types compared to some other programming languages, right? And don't get me started on equality. So there's additional information we can put into the data contract that gives us some safeguards and some security around systems and interoperability between systems, but also, importantly, gives us safety and, I guess, a degree of comfort around interoperability for people: how can I use this, where can I use this, what were the conditions under which this was captured? So when I think of it as being for both humans and machines, it's again that interface: this tells you where you can access the data, this tells you why you can access the data, and other really important things like that.
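As a rough illustration of the machine-readable checks Ryan describes, here is a small Python sketch showing how pattern and range checks declared alongside a contract's schema could be validated in a pipeline. The check names and structure are assumptions for illustration, not any specific tool's API.

```python
import re

# Hypothetical checks that might be declared in a data contract alongside the
# schema: a pattern match for email addresses and a numeric range for amounts.
CHECKS = {
    "email": {"pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    "amount": {"min": 0, "max": 100_000},
}


def validate_row(row: dict, checks: dict = CHECKS) -> list[str]:
    """Return human-readable violations of the contract's checks for one record."""
    errors = []
    for field, rule in checks.items():
        value = row.get(field)
        if value is None:
            errors.append(f"{field}: missing")
            continue
        if "pattern" in rule and not re.match(rule["pattern"], str(value)):
            errors.append(f"{field}: does not match expected pattern")
        if "min" in rule and isinstance(value, (int, float)) and value < rule["min"]:
            errors.append(f"{field}: below minimum {rule['min']}")
        if "max" in rule and isinstance(value, (int, float)) and value > rule["max"]:
            errors.append(f"{field}: above maximum {rule['max']}")
    return errors


if __name__ == "__main__":
    print(validate_row({"email": "someone@example.com", "amount": 42}))  # []
    print(validate_row({"email": "not-an-email", "amount": -5}))         # two violations
```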
27:50
Yeah, I agree. It should be human readable and machine readable, and machine readable allows you to do the things you spoke about, Ryan. We can go even further than that. We realized this quite early on, but once you have it in a machine-readable format as well, you can do all sorts of things and really build a whole data platform around it. We can do things like change management around the schema: is it a breaking change or is it not? Have a CI check and prevent breaking changes from making it to production in the first place, really removing a whole class of incidents we used to have. We can categorise data, whether it's personal data or not, and make sure we no longer hold data we shouldn't have. And really, in the last five or six years we've been using data contracts, we haven't found anything we couldn't express in a data contract that we could then implement at the platform level. Even simple things like doing backups of data, how long the backups should be kept for, and making sure that's done: all people have to do in their data contract is say, keep backups for 30 days, and the tooling just takes care of it. They don't need to know exactly how it's been backed up or where it's been backed up to; it just happens. A lot of these ideas come from platform engineering, which some of your listeners may be familiar with, and it's very powerful. Once you have this machine-readable and human-readable contract, there's no limit to what you can automate through it, which really helps with data governance as well.
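A sketch of the kind of CI check Andrew mentions: compare a proposed contract against the currently published version and fail the build if the change would break consumers. The contract shape is the same hypothetical dictionary used in the earlier sketches, not a real tool's format.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Return reasons why the new contract would break existing consumers."""
    old_fields = {f["name"]: f for f in old["schema"]}
    new_fields = {f["name"]: f for f in new["schema"]}
    problems = []
    for name, field in old_fields.items():
        if name not in new_fields:
            problems.append(f"field removed: {name}")
        elif new_fields[name]["type"] != field["type"]:
            problems.append(
                f"type changed for {name}: {field['type']} -> {new_fields[name]['type']}"
            )
    # Depending on your compatibility rules, adding a new required field can
    # also be breaking (existing producers won't populate it); flag it here.
    for name, field in new_fields.items():
        if name not in old_fields and field.get("required"):
            problems.append(f"new required field: {name}")
    return problems


if __name__ == "__main__":
    published = {"schema": [{"name": "order_id", "type": "string", "required": True}]}
    proposed = {"schema": [{"name": "order_id", "type": "int", "required": True}]}
    issues = breaking_changes(published, proposed)
    if issues:
        # In CI, this non-zero exit would stop the change from reaching production.
        raise SystemExit("Breaking change(s) detected: " + "; ".join(issues))
```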
29:17
And coming back to that thing around interfaces: you describe your intent rather than having to be explicit. When you think of the things that have endured, just take our favourite, SQL, right? You declare your intent. You don't tell it exactly how to go and manipulate the bits and bytes on the hard drive; you say, go get me this thing from this thing and do this aggregation. And in my mind, those are the interfaces that endure. So if you have something you can evolve that is very much about allowing people to say "this is my intent and these are the things that I care about", and then abstracting the implementation details, or, even better yet, catering for a variety of implementation details. That's the other thing I think is important not to overlook: from this one document you can, if you need to, cater for a variety of implementations. Perhaps you have a very federated data capability around your organization, and perhaps there's a very compelling reason for team A to implement a piece of functionality that team B has in a very specific way. And that's okay, as long as they can agree on the interface.
30:31
What is the future, or the desired end state, for data contracts in general, now that we're seeing a lot of attention on them and more adoption? Andrew, I think you're probably the person who's put more thought into this than anyone else in the world, and I'd like to know where your thoughts are trending when it comes to what happens next in the current environment.
30:56
Yeah, that's a good question. I think data contracts will continue, because it's a very simple idea: apply interfaces to data, describe the data, and use that description to power a platform. A very simple idea, but very powerful. And I think we'll continue to do that. I'd like to think the standard will continue to take off and evolve as well, the Open Data Contract Standard we spoke about earlier, because what I find quite often is that we have a data contract in one format, and we're constantly converting it to a different format to integrate with a different tool. So we convert it to Protobuf to configure Kafka, or whatever it might be, or convert it to some sort of JSON document to integrate with a data catalog. It would be great if we could just write it in the standard format, and then you get a data catalog, you get tooling that helps you with authentication and privacy and governance and all those kinds of things, and you just plug it into your data platform. That would be really cool if we can achieve it, and I think that's what we're working towards with the data contract standard. So yeah, I think data contracts keep growing as an idea. And I think it will be driven by what I was saying: a lot of companies struggle with data quality, and that has a great cost to the organization in terms of how much data engineering time is spent there and how many incidents you have. At the same time, people want to do more with their data and achieve more with their data. With both of those things being true, we're going to start applying more discipline to our data, and data contracts help supply that discipline.
32:38
a supply of that discipline. Both
32:40
of you have had a lot
32:42
of thought put into this recently
32:44
with a variety of things that
32:46
you've been writing. We mentioned at
32:48
the top of this. Ryan, you're
32:50
working on a book right now
32:52
about data contracts. Do you want
32:54
to talk a little bit about
32:56
that? What you're hoping will come
32:58
from that and where people can
33:00
find it when it's ready? No
33:02
No pressure! So I'm writing a book; I may yet live to regret that decision. No, it's certainly been a growth opportunity, and I will certainly be happy when it is done. But it's been a great opportunity to grow and learn. So what am I trying to do with this book? Well, what I'm looking to do is build upon the work that Andrew and others have done. I alluded to it earlier: I feel that the conversation around data contracts started within data engineering and software development circles, and that's where it needed to start, and I think that's a valid place for it, but what I'm looking to do is broaden the conversation. As I mentioned, I am a recovering business analyst, and when I initially approached data contracts I saw that it was an interface, it was declarative, and it reminded me of a number of things that I'd gotten a lot of value out of in my career as a business analyst: things like data dictionaries, things like Cucumber-style, behaviour-driven-development-type requirement specifications. And I thought, yes, this is fantastic, because of all the things we spoke about earlier: the idea that it's an interface for people and machines, that it is part of a solution rather than separate from the solution. But I also see it as a way to bring in some disciplines that I feel have maybe been, I won't say missing, but it certainly feels like we had an understanding a couple of years ago around how to build a three-tier application, across disciplines. We knew it from a database perspective, we knew it from a software engineering perspective, we knew it from a requirements management perspective. Everyone had had the benefit of seeing the pattern a good few times: yeah, we know what we're doing. Then things changed rapidly.
34:57
We had microservices, we had this idea of big data, which has now just become, well, does it fit on a MacBook? Yeah, okay, it's just data. But there have still been some shifts around, again, that conversation about what the data means: what do we desire, what do we want from this data? And so what I'm hoping to do with this book is, yes, talk to practitioners, the data engineers and software engineers, but also reach out to people from that requirements management space, the business analysts and the data analysts, and then also to leadership, to say: again, you may come to this seeking answers around data quality, right? But data is a side effect of things happening. Have we had a conversation about how those things happen in our organization? Do we agree? Because if we can't agree on that, then we're going to struggle to agree on just about everything else that flows from it. So that's really where I'm looking to engage with my book: to make the circle bigger. I think it's time. The data engineers and the software engineers have been talking about it for a while, and I think we need to stretch out and bring some more people in. For something that encompasses an entire organization, we've got to involve the entire organization. So I'm really glad to see a lot more conversation going on about how we can bring different people from different disciplines into that space.
36:31
Andrew, you also recently wrote a white paper about data contracts. Could you talk a bit about that?

Yeah, that's right. I wrote a relatively short white paper, really just to give an introduction to data contracts for people who haven't heard of them before, and to touch on the power of the idea, why it's important, and what problems it can solve in your organization. I published that recently on the website; you can get to it at dc101.io, so DC for data contracts, with the numbers 101, dot io. And if you're interested in data contracts based on this conversation and what we've been talking about today, that could be a good next step to go a little bit deeper and really understand the problems it can solve in your organization.
37:17
Hopefully, throughout this conversation, we've piqued your interest in the entire topic. I want to thank you both so much, Ryan Collingwood and Andrew Jones, for joining me here today on the Thoughtworks Technology Podcast, and we will see you next time.