Data contracts: What are they and why do they matter?

Released Thursday, 14th November 2024
Episode Transcript


Lily Ryan (0:00): Hello and welcome to the Thoughtworks Technology Podcast. My name is Lily Ryan, I'm one of your regular hosts, and I am speaking to you from Melbourne, Australia. In today's conversation, I'm joined by Ryan Collingwood and Andrew Jones. Ryan is a data strategist at Thoughtworks and is currently writing a book on data contracts that will be out in 2025.

Ryan Collingwood (0:28): Yeah, Q1 2025. Fingers crossed.

Lily Ryan (0:30): Wonderful. And Andrew is the author of Driving Quality with Data Contracts, which was published in 2023. Andrew, would you mind introducing yourself?

Andrew Jones (0:39): Yeah, sure. Hi, everyone. I'm Andrew. I'm an independent data consultant, and I help organizations build data platforms that reduce risk and drive revenue. I've been doing that for a little while now, which is why I ended up coming up with data contracts a few years ago and writing the book that you mentioned.

Lily Ryan (0:56): That's wonderful, and as you may have guessed, we are here today to talk about data contracts. This is a topic that has been kicking around for quite some time and has recently, I think, come through quite strongly in the way that we're all focusing on data and in a lot of our practices. So to talk about it today, and to get to grips with what data contracts are, where they fit into the software delivery life cycle and all of that kind of stuff, Ryan and Andrew are going to be our expert guides. To kick us off: what is a data contract?

Andrew Jones (1:27): Yeah, sure. I think the easiest way to think about it, particularly if you're not familiar with data engineering and the data landscape, is that it's an API for data. The reason I came up with the idea originally was that we were getting data directly from our databases, chucking it into a data warehouse, and trying to use that to drive data applications. As you know if you've worked with APIs, you don't really go direct to a database, because the database keeps changing: the schemas keep changing, the database keeps evolving all the time, and you don't want that to affect your downstream applications. That's the common situation we're in. We want to use our data for more important things, use it for more revenue-generating applications, use it for more AI- and ML-based applications, and we couldn't do that because our data was not reliable enough. So I started thinking about why that was the case, and eventually I realized that the problem was the lack of interfaces. The fact that we were building on top of the database meant that we couldn't have reliable data streams from the operational applications. Coming from an engineering background, I started thinking about why that was the case, and eventually I came to the idea that we need some sort of API here, some sort of interface here. That's what was preventing us from building reliable applications on top of our data. So yeah, I like to think of it as an API for data, and the key word really is the interface. This is the interface to the data that provides abstraction, that provides a place for us to ensure that the transfer of data between one place and the other is reliable, more reliable than it was before.

Lily Ryan (3:05): How does this differ from straight-up API documentation?

Andrew Jones (3:07): Yeah, it's similar in a way. The difference really is the interface that you provide data through. With an API, generally you're making a call back and forth, you're moving a small amount of data around, or you're making a call to start an action. With data, you generally need the entire dataset, for example if you're training a data science model, so that might be a table in a data warehouse somewhere rather than a sort of HTTP API, or it might be in a streaming application like Kafka or similar. You can use data contracts there, but generally you're moving around greater volumes of data, and the interaction between the consumer and the provider of that interface is a bit different compared to a standard HTTP or REST-based API.

Ryan Collingwood (3:51): Kind of coming back to this topic of interfaces. It's interesting, Andrew, hearing you speak about one of the fundamental differences with data compared to, say, the API specifications we tend to think of for web services: the interface itself. You mentioned a couple there, like Kafka, or even your flavor of SQL. That can change, and that's an implementation detail that is fundamental. As much as people might want to have an omni-connector for every single type of database interface in the world, and there are things that take some of that pain away, there is still going to be some difference here and there. But coming back to the value: one of the values that I see around the data contract, again that word interface, is that in my mind, what I really saw in data contracts was this interface, as in, it is a pane of glass where both systems and people can look at the data and understand what the data is. And when I say data, it's data plus context, so information about the data. It's having that standardized, and not having to think about how we're going to describe the interface because we figured that out ahead of time, and then we apply that pattern as time goes by. I don't know whether it's because I'm getting lazy or getting older, but I see real value in not having to think about how I'm going to do the thing while doing the thing, if I can just do the thing. That makes me happy. So yeah, that's a great call-out there around how, yes, we can think of it as an API for data, but fundamentally the nature of the problem does mean that there are going to be a few differences here and there. And so having something consistent that we can hold on to, and that also gives us guidance around the things we should be describing as part of that information about our data, is only going to be a good thing.

Andrew Jones (5:47): Yeah, exactly right. And when we talk about interfaces, we start off talking about the technical side, so it could be Kafka, it could be a database. But when you start describing the interface, like I say, you don't want to describe exactly how it's done; you want to describe what you want to achieve from it. So you start describing: I want to make data available to my consumers through, say, a streaming application. And then, because it's data, there are more things you need to describe about it. Maybe we want to describe how timely it is: is it going to be hourly, is it going to be real time? You might also want to describe what kind of data is in there. Is it personal data? Is it not personal data? How can this data be used? What can it be joined with? You can put all sorts of things into that contract to help describe the data even more. So like I said, it's turning data plus context into information that can be consumed much more quickly and much more easily by the data consumers, so they can build their data applications much more quickly, as well as much more reliably, and deliver greater value to the organization. I think that's particularly important as we're trying to do more with data in our organizations.
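To make that concrete, here is a minimal sketch of the kinds of entries Andrew is describing, written as a plain Python dictionary. The field names and values are illustrative assumptions only, not taken from the episode or from any particular standard.

```python
# Illustrative only: the keys and values below are assumptions, not a real standard.
orders_contract = {
    "name": "orders",
    "owner": "checkout-team",                 # who is accountable for this data
    "interface": {                             # how consumers access it
        "type": "kafka_topic",
        "location": "orders.v1",
    },
    "schema": [
        {"field": "order_id", "type": "string",  "required": True},
        {"field": "amount",   "type": "decimal", "required": True},
        {"field": "email",    "type": "string",  "pii": True},   # flagged as personal data
    ],
    "slo": {"freshness": "hourly"},            # how timely consumers can expect it to be
    "usage": "Billing and revenue reporting; not for marketing outreach.",
}
```

Everything the speakers mention, such as ownership, timeliness, whether a field is personal data and how the data may be used, becomes a named entry that both people and tooling can read, rather than tribal knowledge.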

Lily Ryan (7:07): To give our listeners a sense of what you're talking about when it comes to the way that these things are defined, we have talked about the kinds of schemas that you might want. What does it actually look like in practice for someone to define one? There is, for example, the Open Data Contract Standard, which is a standard, but we know that there could be other standards, and organizations may want to define their own, depending on their use cases. So how do you actually, from a nuts-and-bolts perspective, put a data contract together? What does it look like?

Andrew Jones (7:40): Yeah, there is a standard, the Open Data Contract Standard, which I'm a little bit involved in as part of the steering committee there. And that's quite good, because it's a fairly general standard that contains everything you could think about putting in a data contract. But I wouldn't necessarily say people should use that straight off the bat, because when you think about how to define data contracts in your organization, the organization is a bit different, but more importantly, the people who are going to be completing your data contracts or using the data contracts: the implementation of data contracts needs to work for them. So, for example, where I worked previously, we defined data contracts in this kind of random language that many people might not have heard of called Jsonnet. It's like JSON with extensions, a configuration language that came out of Google. And it's not because it's the best way to implement data contracts. It's because our engineering teams, who we wanted to create these data contracts and manage these data contracts and own these data contracts, were already using Jsonnet to define their APIs and to define their infrastructure as code. So it made sense for us to define data contracts in the same way, because we wanted it to be clear that data contracts sit at the same sort of level as your APIs and your infrastructure as code, that they are as important as those things. And to reduce the friction for adoption, we wanted it to be as easy as possible for our software engineers to use. So that's why we used Jsonnet. I wouldn't recommend anyone use Jsonnet for this unless they are already using Jsonnet. But the key point there is, when you think about implementing data contracts in your organization, think first about what you want to put in them, which might be quite similar to other organizations, so you can use the standards for inspiration there: things like ownership, things like SLOs, things like schemas are going to be quite standard across organizations. But when it comes to implementing it in your organization, think about how you want the data contract owners to complete and own those data contracts, how they're going to be using them, and start from there, rather than making them use, say, Jsonnet or Avro or protobuf or whatever it is we might want to use. Think about the user first.
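One way to read "use the standard for inspiration, but design for your own contract owners" is to keep the set of required entries small and check for them automatically wherever teams already submit their definitions. Below is a minimal sketch in Python, assuming contracts end up parsed into dictionaries; the choice of required keys is an example, not a rule from any standard.

```python
# Hypothetical check that a submitted contract carries the entries this
# organization has decided to require. Key names are assumptions.
REQUIRED_KEYS = ("name", "owner", "schema", "slo")

def missing_entries(contract: dict) -> list[str]:
    """Return the required entries a draft contract is still missing."""
    return [key for key in REQUIRED_KEYS if not contract.get(key)]

draft = {"name": "orders", "schema": [{"field": "order_id", "type": "string"}]}
print(missing_entries(draft))  # ['owner', 'slo'] -> ask the producing team to fill these in
```

Whether this runs as a CI step, a pre-commit hook, or a spreadsheet macro matters less than meeting the contract owners in the tooling they already use.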

Ryan Collingwood (10:13): Yeah, that definitely resonates with me. In my context, when I first encountered data contracts, the people with serious technical chops were definitely in the minority. And I'll just say it: my first attempt at a data contract was essentially an Excel spreadsheet, because that is what the people I needed to get knowledge out of understood. That's what they were comfortable with. And yeah, it's not necessarily my favorite way to structure data, but it is a start. It was good enough for me to get going, and then gradually, over time, I managed to find something that worked a little bit better for me but was still within their tolerance, that they could work with and that we could then collaborate on. So yeah, I definitely agree: identify your audience. And perhaps it's also a moment to talk about the general parties involved, the data producers and the data consumers. You may find yourself in the situation I was in initially, where I was both the consumer and effectively the producer, working in a centralized data team, which is perhaps a topic we can briefly chat about in a second. But you sort of have to recognize who your parties are and what their level of comfort is. And then I think it's great in your situation, Andrew, where you had an existing set of tooling that worked for people and they were comfortable with it, because when we're changing the way people work, I do believe that people have a tolerance for the amount of stuff that you can throw at them at once before they start getting uncomfortable, and it varies by person. And so if reusing things that people are already comfortable with gives you a little bit more wiggle room to make them uncomfortable elsewhere, then that's a smart move.

Lily Ryan (11:59): You're talking a lot about making sure that you meet people where they are within the organization, and ensuring that whatever your data contracts are, they are things that people can actively use and understand and be part of creating. What kind of maturity does an organization need to have as a prerequisite for getting value out of data contracts, or making them work in the first place? In one sense, to me, maybe it would imply the existence of databases, but we also know that many organizations have to start with spreadsheets, and that makes a lot of sense. The world runs on Excel in a lot of cases, and that's something we have to work with. So what kind of situation does an organization need to be in to get benefit from a data contract and make it work for them?

Ryan Collingwood (12:46): It's not one-size-fits-all, but certainly the problem that led me to encountering data contracts was that I was at a point in an organization where, after much inspection of the data platforms and data systems, which I had inherited, there was still this perception of there being a data problem. And yeah, there were some things that needed some uplift, but really what it came down to in the end was, as career-limiting as it may sound: often data is a side effect of things happening, right? And if things are not happening in a consistent manner, so if there are three different ways to process a refund, that means there could be at least three different expressions of the data associated with that event, "process a refund", and you're going to have data quality challenges. And so I got to the stage at this organization where it was like, all right, let's have a conversation about what our processes are. Let's standardize some things. There had been rapid growth; people had to adapt and invent as they went. And sometimes that's necessary. Sometimes you've just got to make it up, you've got to figure it out, you've got to get things done. But then, certainly, I think technologists are so familiar with the idea of technical debt: at some point, you have to pay down the debt that you've accumulated. And just like technical debt, there is process debt. So that's what initially led me to this thing of: all right, we're having this conversation about what our processes are. Here is an opportunity to codify all this great information and this agreement that we're reaching into some sort of thing that can be reused. And as a recovering business analyst, in my past lives I'd had exposure to things like data dictionaries and what have you, and also having some software development exposure, I thought, well, if we do this right, if we do this in a way that is structured and parsable, we can use this. This can be a living document. This doesn't have to be a document that's correct at the launch of the thing and then immediately becomes out of date once we have our first SEV1 and do an in-place fix, and then the documentation and the implementation just diverge. If the documentation can be part of the solution, so that it evolves and lives with it, well, wow, that's a document that's worth something.

So to come back to the question of where you should be as an organization: I think if you are at the point of having data quality challenges — and if you are considering data quality challenges, it's also worth having a conversation about whether our processes are actually what we need them to be — because you can tackle the data quality on its own, but I feel that without having that conversation around "is the activity that generates this data what we expect it to be, and are we agreed?", you're just kicking the ball down the road. So that would be the big call-out there: if you are serious about this, are you willing to have a conversation about what it is that we do that generates this data, and are we agreed, are we aligned, on what those processes are? Because, and we've briefly touched on it, within the parlance of data contracts — and I really love the way that you framed this in your book, Andrew — the conversation is around data producers and data consumers. Data producers are people that are producing data; it says that in the name. Consumers obviously want to consume it. But the real big gotcha is that often data producers don't know, or don't recognize, or if you want to be a realist, don't care — though I'd say that's the minority of cases; I think it's more that they don't know or don't recognize — that they're producing data. Because if you're dealing with a department of your company that's in procurement, and you ask them, what's your role, what's your function, they're not going to say it's to enter purchase orders. They're going to say something like, it's to source materials at a great price so that we can have a better margin. It might be something like that, right? They're not going to say that it's something related to data entry. But capturing that data, things like capturing purchase orders, is vitally important for the rest of the business, because it's all these series of interconnected value chains. And if it's not done, or it's poorly done, it makes this other really important task really difficult. So that's why I say, where does an organization need to be? I think it needs to be at that point of a reckoning around: yes, maybe it's born out of a concern around data quality, but to really make a difference there must be an openness to have a conversation about what our process is. And underlying it, and this is where it touches the social aspect, is being prepared to develop empathy. Describing that scenario of the data producer way upstream and the data consumer way downstream: I do believe that if we help those people who are upstream have empathy for the people who are downstream from them, to go, this information that is perhaps kind of a chore for you to fill in and capture has a real material impact on these people downstream.

Lily Ryan (18:24): I want to come back to that issue around empathy in a minute, Ryan, because I think it's something that's really worth digging into. Andrew, from your point of view and in your experience, what has it been like working with a variety of different contexts to make data contracts work? And what have you learned about it over the time you've been maturing these ideas for yourself?

Andrew Jones (18:48): Yeah, it's a good question. I think, like I said, many organizations struggle with data quality. The ones that are looking at data contracts, and that I'm trying to help, tend to be organizations where they are now using data for something more important — well, even more important than it was before. So it's part of their strategic goals, whether that's using data to create products, to drive revenue, or to differentiate themselves in the market. It's that realization that data is key to the business that gives those organizations the — I don't want to say the freedom, but the ability, the chance — to look again at how they're doing data and really do a root cause analysis on: why can we not achieve this now? What is the problem we're trying to solve here to achieve our goal? And then you go back to the data quality issues, you go back to how you're sourcing the data and whether there's an interface around that, and eventually you get to data contracts and better data, and you go from there back up the stream to work out why the data isn't what it needs to be.

Lily Ryan (19:58): How do you get to that point in a conversation where you help people to the realization that it is a data quality issue? Because as far as I can see, and in a lot of my experience, it has certainly been the case that data quality issues are at the root of many different types of business problems. But because they are so deeply interconnected — these are the byproducts of other actions that create the data — and because data is so deeply interconnected with everything a business does, it can be really hard to identify that as a particular issue. And when you do, it's also kind of difficult to get buy-in from people who are not data engineers or data scientists to participate in a conversation like this, and to see it as something that is worth investing their time in, and worth being involved in and maintaining over a longer period. So in your experience, Andrew, how do you get to a point with a business like that, where you really need to come to that realization collectively in order for a data contract to work?

Andrew Jones (21:04): Yeah, it's really all about communication. The problem data teams often have is that they sit in a part of the business, maybe under IT or even under finance, where they're generally kind of hidden, and they have been almost suffering in silence with this. They are expected to get data from a variety of sources, at a variety of quality levels, and turn that into reporting. And I think that was kind of okay when the results weren't that important — I mean, reporting is important, but if it fails, the business carries on running. Now that they're doing more important things, they need to, or they have the ability to, start having these conversations with those producing the data and really highlight the issues they're having. Like, why are they not able to turn this data around quickly? Why does the pipeline keep failing? And if it keeps failing, how can we build whatever strategic thing we want to build on top of that data? So when people ask questions like that, they need to have the ability, or to feel confident enough, to raise issues and say, well, this is the reason: it's because the stream keeps changing, or the data entry is not correct, or whatever it might be. They need to have these conversations. And that's what we really spent a lot of time doing. We started off talking about data contracts in a technical sense, but really, in my experience, a lot of it is about communication. When I started doing data contracts, I had a great product manager I worked with, and between us we spent so much time talking to all parts of the business — whether that's software engineers, or the CTO, or people in between, it doesn't really matter who — always communicating the challenges we were having, repeating that message in different ways to different audiences, explaining why we had to change things in one part of the business to achieve things in our part of the business, the data part of the business. So really, it's all about communication.

Lily Ryan (23:12): Ryan, I've seen you in the past, when we've discussed this topic, say that data contracts are a document for both humans and machines. That is something I would love to hear a bit more about from you, particularly as it relates to the way that we can evolve data contracts now that they are becoming a bit more of a mainstream topic that people really want to engage with.

Ryan Collingwood (23:37): Yeah. As I mentioned previously, I see a data contract as bundling — yes, this is about data, but this is really information about data. "Data" is a word that has been used so often that it feels like it's losing its meaning, but this is where I see a lot of the value of the data contract: it is a document that describes what the data is. And what I mean by that is the meaning of the data. One of my bugbears around the discourse, particularly amongst data engineering folks — and God bless them, I love them, they're my tribe — is that we tend to get really fixated on schema, and that's often where the conversation ends. Anyone outside of the data engineering world does not particularly care about schema; they don't understand schema. They are far more interested in the semantics: what does this data mean? How can I use it? How am I allowed to use it? What were the conditions under which this information was captured? Because if you think of, say, a table of email addresses, and you have no other context about why this table of email addresses exists — I mean, PII spidey-senses tingling, but let's overlook that for a second — if you just know that it is a table that has string or varchar type data that looks like email addresses, that doesn't really tell you a whole lot. But if you then learn, oh, hold on, these email addresses represent people that have told us "please stop emailing me", that's a very different conversation. Because if you didn't know that, and perhaps you were misbehaving, you might take that list of email addresses and go, oh, fantastic, these must belong to customers that want to hear from us, let me send them some good news about the fantastic things we're doing — but this dataset represents people that do not want to hear from you. And this is the kind of information that is often, as I say, not bundled with the data. So this is another important element of data contracts: they allow us to speak for the data in ways the data can't necessarily speak for itself.

And that's a really important aspect of how data contracts are both for the benefit of machines and for the benefit of humans. Where the benefit for the machines comes in is that, yes, we can put in these things that we know and love like schema, and we can also put in things like range checks and pattern matches that we expect to hold, and validate them and build them into our pipelines. And certainly, if you're transmitting data between different systems and different programming languages — I just this afternoon went off on a bit of a tangent about how, by virtue of being descended from JavaScript, JSON Schema, if you don't take the time and care to be really specific with your numbers, can give you some really less-than-optimal experiences, because of JavaScript's, let's say, relaxed approach to data types compared to some other programming languages. And don't get me started on equality. So there's additional information that we can put into the data contract that gives us some safeguards and some security around systems and interoperability between systems, but also, importantly, gives us safety and, I guess, a degree of comfort around interoperability for people: how can I use this, can I use this, where can I use this, what were the conditions under which this was captured? So that's what I mean when I think of it as being for both humans and machines. It's, again, that interface, as in: this tells you where you can access the data, this tells you why you can access the data, and other really important things like that.
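As a sketch of the machine-readable side Ryan describes, the expectations written into a contract — a pattern an email address must match, a range an amount must fall within — can be checked in a pipeline before data is published. This is illustrative only: the field names and rules below are assumptions, not a specific tool's API.

```python
import re

# Expectations that might be declared in a contract (assumed names and rules).
EXPECTATIONS = {
    "email": {"pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    "amount": {"min": 0, "max": 1_000_000},
}

def violations(record: dict) -> list[str]:
    """Return human-readable descriptions of the expectations a record fails."""
    problems = []
    for name, rules in EXPECTATIONS.items():
        value = record.get(name)
        if "pattern" in rules and not re.match(rules["pattern"], str(value or "")):
            problems.append(f"{name}={value!r} does not match the declared pattern")
        if "min" in rules and (value is None or not rules["min"] <= value <= rules["max"]):
            problems.append(f"{name}={value!r} is outside the declared range")
    return problems

print(violations({"email": "not-an-email", "amount": -5}))  # flags both fields
```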

Andrew Jones (27:48): Yeah, I agree. It should be human-readable and machine-readable, and machine-readable allows you to do the things that you spoke about, Ryan. We can go even further than that — we realized this quite early on — because once you have it in a machine-readable format, you can do all sorts of things and really build a whole data platform around it. So we can do things like change management around the schema: is it a breaking change or is it not? Have a CI check and prevent breaking changes from making their way to production in the first place, really removing a whole class of incidents we used to have. We can categorize data, whether it's personal data or not, and make sure we no longer hold data we shouldn't have. And really, in the five or six years we've been using data contracts, we haven't found anything we couldn't express in a data contract and then implement at the platform level. Even simple things like doing backups of the data, how long the backups should be kept for, and making sure that's done: all people have to do in their data contract is say, keep backups for 30 days, and the tooling just takes care of it. They don't need to know exactly how it's been backed up or where it's been backed up to; it just happens. A lot of these ideas come from platform engineering, which some of your listeners may be familiar with, but really it's very powerful. Once you have this machine-readable and human-readable contract, there's no limit to what you can automate through it, which really helps with data governance as well.
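Below is a minimal sketch of the kind of CI check Andrew describes: compare the previous and proposed versions of a contract's schema and flag removed or retyped fields as breaking. The contract layout is the same assumed dictionary shape as the earlier examples; a real platform would also act on other declared entries, such as a backup-retention period.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Flag schema changes that would break existing consumers of the data."""
    old_fields = {f["field"]: f["type"] for f in old["schema"]}
    new_fields = {f["field"]: f["type"] for f in new["schema"]}
    problems = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            problems.append(f"field removed: {name}")
        elif new_fields[name] != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new_fields[name]}")
    return problems

old = {"schema": [{"field": "order_id", "type": "string"}, {"field": "amount", "type": "decimal"}]}
new = {"schema": [{"field": "order_id", "type": "string"}, {"field": "amount", "type": "string"}]}

# In CI, a non-empty result would fail the build before the change reaches production.
print(breaking_changes(old, new))  # ['type changed: amount decimal -> string']
```

Purely additive changes, such as a new optional field, produce no findings here, which is usually the kind of evolution producers want to allow.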

Ryan Collingwood (29:14): And coming back to that thing around interfaces: what you describe is your intent, rather than having to be explicit about the how. When you think of the things that have endured — just take our favorite, SQL — you declare your intent; you don't tell it exactly how to go and manipulate the bits and bytes on the hard drive. You say, go get me this thing from this thing and do this aggregation. In my mind, those are the interfaces that endure. So if you have something you can evolve that is very much about allowing people to say "this is my intent and these are the things that I care about", and then abstracting away the implementation details — or, even better, catering for a variety of implementation details. That's the other thing I think is important not to overlook: from this one document you can, if you need to, cater for a variety of implementations. Perhaps you have a very federated data capability around your organization, and perhaps there's a very compelling reason for team A to implement a piece of functionality that team B has in a very specific way. And that's okay, as long as they can agree on the interface.

Lily Ryan (30:31): What is the future, or the desired end state, for data contracts in general, now that we're seeing a lot of attention on them and more adoption? Andrew, I think you're probably the person who's put more thought into this than anyone else in the world, and I'd like to know where your thoughts are trending when it comes to what happens next in the current environment.

Andrew Jones (30:54): Yeah, that's a good question. I think data contracts as an idea will continue, because it's a very simple idea — apply interfaces to data, describe the data, and use that description to power a platform. Very simple, but very powerful. And I think we'll continue to do that. I like to think the standard will continue to take off and evolve as well — the Open Data Contract Standard we spoke about earlier — because what I find quite often is that we have a data contract in one format, and we're constantly converting it to a different format to integrate with a different tool. So we convert it to protobuf to configure Kafka, or whatever it might be, or convert it to some sort of JSON document to integrate with a data catalog. What would be great is if we could just convert it to the standard format, and then you get a data catalog, you get tooling that helps you with authentication and privacy and governance and all those kinds of things, and you just plug it into your data platform. That would be really cool if we can achieve it, and I think that's what we're working towards with the standard. So yeah, I think data contracts keep growing as an idea. And I think it will be driven by what I was saying: a lot of companies struggle with data quality, and that has a great cost to the organization in terms of how much data engineering time is spent there and how many incidents you have. At the same time, people want to do more with their data and achieve more with their data. With those two things being true, we're going to start applying more discipline to our data, and data contracts help supply that discipline.

Lily Ryan (32:38): Both of you have put a lot of thought into this recently, with a variety of things that you've been writing. We mentioned at the top of this episode, Ryan, that you're working on a book right now about data contracts. Do you want to talk a little bit about that, what you're hoping will come from it, and where people can find it when it's ready?

Ryan Collingwood (33:00): No pressure! So I'm writing a book; I may yet live to regret that decision. No, it's certainly been a growth opportunity, and I will certainly be happy when it is done. But it's been a great opportunity to grow and learn. So what am I trying to do with this book? Well, what I'm looking to do is build upon the work that Andrew and others have done. I alluded to it earlier: I feel that the conversation around data contracts started within data engineering and software development circles, and that's where it needed to start — I think that's a valid place for it — but what I'm looking to do is broaden the conversation. As I mentioned, I am a recovering business analyst, and when I initially approached data contracts, I saw it was an interface, it was declarative, and it reminded me of a number of things that I'd gotten a lot of value out of in my career as a business analyst: things like data dictionaries, things like Cucumber-style, behavior-driven-development requirement specifications. And I thought, yes, this is fantastic, because of all the things that we spoke about earlier: the idea that it's an interface for people and machines, that it is part of a solution rather than separate from the solution. But I also see it as a way to bring in some disciplines that I won't say have been missing, but it certainly feels like we had an understanding a couple of years ago around how to build a three-tier application across disciplines. We knew it from a database perspective, we knew it from a software engineering perspective, we knew it from a requirements management perspective. Everyone had had the benefit of seeing the pattern a good few times — yeah, we know what we're doing. Then things changed rapidly. We had microservices, we had this idea of big data, which has now just become: well, is it big enough to fit on a MacBook? Yeah, okay, it's data. But there have still been some shifts around, again, that conversation about what the data means: what do we desire, what do we want from this data? And so what I'm hoping to do with this book is, yes, talk to practitioners — data engineers, software engineers — but also reach out to people from that requirements management space, the business analysts and the data analysts, and then also to leadership, to say: again, you may come to this seeking answers around data quality, but data is a side effect of things happening. Have we had a conversation about how those things happen in our organization? Do we agree? Because if we can't agree on that, then we're going to struggle to agree on just about everything else that flows from it. So that's really where I'm looking to engage with my book: to make the circle bigger. I think it's time. The data engineers and the software engineers have been talking about it for a while, and I think we need to stretch out and bring some more people in. For something that encompasses an entire organization, we've got to involve the entire organization. So I'm really glad to see a lot more conversation going on about how we can involve different people from different disciplines in that space.

Lily Ryan (36:28): Andrew, you also recently wrote a white paper about data contracts. Could you talk a bit about that?

Andrew Jones (36:35): Yeah, that's right. I wrote a relatively short white paper, really just to give an introduction to data contracts for people who haven't heard of them before, and to touch on the power of them, why they're important, and what problems they can solve in your organization. I published that recently on my website; you can get to it at DC101.io — DC for data contracts, then the numbers 101, dot io. And if you're interested in data contracts based on what we've been talking about today, that could be a good next step to go a little bit deeper and really understand the problems they can solve in your organization.

Lily Ryan (37:14): Hopefully, throughout this conversation, we've piqued your interest in the entire topic. So I want to thank you both so much, Ryan Collingwood and Andrew Jones, for joining me here today on the Thoughtworks Technology Podcast, and we will see you next time.
