Mastering System Design Interview: Capacity Estimates, Scaling Challenges, and Strategic Insights

Released Tuesday, 15th October 2024

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.


0:00

Hello everyone , welcome to Season

0:02

2 , Episode 2 of the Learn

0:04

System Design Podcast , with your host

0:06

, me , Ben Kitchell . This

0:09

week we are going to continue down

0:11

the primer of building a system . These

0:14

episodes are geared a little

0:17

more towards the system design interview

0:19

that you'll find at most

0:21

tech companies , especially if you're going into

0:23

a more senior candidate role

0:25

, but I also just want to make clear

0:27

that these are the exact same steps and considerations

0:30

that I try to take into account when designing

0:32

a new system , whether that's

0:34

a fresh product or something from the

0:36

ground up . If you haven't yet

0:38

, I definitely recommend listening to the last

0:41

episode , that's episode number eight

0:43

, before listening to this one , as we will

0:45

be talking a lot retrospectively

0:48

about the functional and non-functional requirements

0:50

, not just in this episode but the next

0:52

couple of episodes , because all

0:54

of these topics are so tightly coupled

0:56

together . They're very important and they

0:59

sort of reinforce what you're doing

1:01

and why you're doing it . Yeah

1:03

, it's just important to not get lost in the weeds on specific

1:05

topics . And

1:47

, in my personal opinion , today's topic

1:50

is not only the easiest to get caught

1:52

in , but it's also the easiest way

1:54

to just completely wreck an interview

1:56

. That topic , of course , is

1:59

capacity estimates . It

2:01

is almost a rite of passage as an engineer to find yourself in a 45-minute system design interview , spending about 40 of those minutes trying to do math and making

2:15

broad-stroke assumptions about

2:17

how much load you might have . Do I have 1.6

2:20

million daily active users ? Do I have 1.625 million daily active users ? You

2:28

know these sorts of things and it's nothing to be ashamed of . That's a part of this interview

2:30

and it's a part of the process to learn how

2:32

to actually handle these sorts of ideas

2:35

and what it actually means

2:37

to scale . But , honestly

2:39

, here's the little secret that I've learned :

2:41

the answer to how

2:44

many people are going to be using the product

2:46

or how much data throughput

2:48

you need to handle is

2:51

always going to be a lot , not

2:53

1.265 , not 1.5

2:56

. It's just going to be a giant number , one so giant that it doesn't actually help you that much . Because here's the catch : it doesn't really

3:06

matter , because the answer will always

3:08

be the giant number that

3:10

makes you feel like you need to focus on it and

3:13

instead of taking all your time focusing

3:15

on the arithmetic , today

3:17

I'm going to teach you how to take the size into consideration

3:22

, but in a more streamlined fashion

3:24

so that you don't get caught up

3:26

on it . Let's think

3:28

about writing an algorithm , for instance

3:30

, something you've probably

3:32

done a lot , whether it's in school

3:35

, at work or in an interview

3:37

. When you do that

3:39

, you don't try to estimate the

3:41

number of people that are going to be using your

3:43

system . You don't try and estimate how

3:46

many times a certain piece of data

3:48

will go through a loop or what

3:50

have you . You think about the worst-case scenario and keep that number relative . For example : this loop will take Big O of n time complexity , right ? Then

4:02

why , when designing a system , are we so mathematically precise , like figuring out how long

4:09

2,435 gigabytes

4:12

of video being read from sequential

4:14

memory will take if it's on a spinning

4:16

disk ? You know these sorts of things

4:19

are important , but the

4:21

specifics aren't that important . Instead

4:24

, let's focus on the crux

4:26

of the problem : our

4:29

estimations and our capacity

4:31

, not our specifics and

4:33

the arithmetic . So what then

4:35

do we estimate ? The amount

4:37

of read and write throughput in your system

4:40

, for instance , is important , and

4:42

the amount of storage your

4:44

system will need to hold .
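To make those two estimates concrete, here is a minimal sketch of the kind of napkin math being described. Every input is an assumed, rounded example figure, not a number from the episode:

```python
# Back-of-napkin capacity model; every input is a rounded assumption.
DAU = 1_000_000_000             # assumed daily active users
READS_PER_USER_PER_DAY = 20     # assumed feed loads per user per day
WRITES_PER_USER_PER_DAY = 2     # assumed uploads per user per day
OBJECT_SIZE_BYTES = 1_000_000   # ~1 MB per object, rounded

SECONDS_PER_DAY = 100_000       # really 86,400; rounded up to keep math easy

read_qps = DAU * READS_PER_USER_PER_DAY / SECONDS_PER_DAY
write_qps = DAU * WRITES_PER_USER_PER_DAY / SECONDS_PER_DAY
new_storage_per_day = DAU * WRITES_PER_USER_PER_DAY * OBJECT_SIZE_BYTES

print(f"reads: ~{read_qps:,.0f}/s")                      # ~200,000/s
print(f"writes: ~{write_qps:,.0f}/s")                    # ~20,000/s
print(f"new storage: ~{new_storage_per_day:.0e} B/day")  # ~2e+15, i.e. ~2 PB
```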

4:46

When dealing with the numbers , just keep it in

4:49

factors of a thousand , right

4:51

, because the more

4:53

you round , the easier

4:56

it is , and the bigger the number , the less

4:58

the specifics matter . If I

5:00

say I'm going to give you $110

5:04

, you might just tell

5:06

everyone oh yeah , Ben gave me $100

5:08

, right , the 10 doesn't really

5:11

matter , just like when you're doing

5:13

an algorithm . Is it big O

5:15

of n plus 2 ? Well

5:17

then , it's just big O of n , the plus 2 doesn't actually

5:20

matter . And so what I mean when I say stick to factors of a thousand : simply put , it's the difference in 153,670 people and 154,000 people . Right , those extra few hundred people are not going to bankrupt your company , they're

5:38

not going to break your system , and

5:40

it's a lot easier to do that sort

5:42

of back of napkin math with

5:44

nice round numbers . So

5:47

when , then , is it important to do those

5:49

quick calculations that I speak of

5:51

? Well , we'll get to that a little

5:53

later . For now , let's make sure we have

5:56

a few things memorized . These are

5:58

the important pieces of information you

6:00

should bring into any interview or

6:02

any calculation when you want to consider

6:04

your capacity . For some of

6:06

you it might feel like a refresher or common

6:09

knowledge , but for others it may

6:11

be the first time considering it , so I want to cover it regardless . That way , we're all on the same page going into

6:17

how to do this back of napkin math

6:20

and what's really important

6:22

about it . So the first

6:24

thing is to always remember your

6:27

scales of 1000 and how they relate

6:29

to data sizes . In

6:31

this case I'll be using bytes , but

6:33

these factors can technically

6:36

be applied to anything that is a metric

6:38

. When it comes to tech , when

6:41

we think of data sizes

6:43

, we usually describe them as bits and bytes , right ? But it is important to understand the levels of these sizes

6:50

relative to one another . For

6:52

every thousand increments , we use a different

6:55

prefix . So for

6:57

every thousand bytes , we have

6:59

a kilo , like a kilobyte . For

7:01

every 1 million bytes , which , you

7:04

might note , is 1000 squared , we

7:07

use mega , like megabyte , and

7:09

so on and so on . I'll get to the rest

7:11

in a minute . The important ones

7:13

to remember are that 1000

7:15

or less is just the unit . For example

7:17

, 560 bytes

7:20

of data and if you think

7:22

about it , a thousand raised to zero

7:24

is one , right ? A thousand raised to

7:26

one is a thousand , so

7:28

that would be a kilobyte . A thousand raised

7:31

to two would be a million . A

7:33

thousand raised to three would be a

7:35

billion . A thousand raised to four would

7:38

be a trillion , right . And so the

7:40

designations for those in

7:43

order would be just a byte

7:45

, a kilobyte , which is our thousand

7:47

, a megabyte

7:50

, which is a million , a gigabyte

7:52

, which is a billion , and then

7:54

terabyte , which is

7:57

a trillion , right .
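As a quick illustration of those prefixes (my own sketch, not something from the episode), you can round any byte count to the nearest power-of-a-thousand prefix:

```python
# Prefixes for successive powers of 1,000 (bytes, kilobytes, ...).
PREFIXES = ["B", "KB", "MB", "GB", "TB", "PB"]

def humanize(num_bytes: float) -> str:
    """Express a byte count using the nearest power-of-1,000 prefix."""
    for prefix in PREFIXES:
        if num_bytes < 1000:
            return f"{num_bytes:g} {prefix}"
        num_bytes /= 1000
    return f"{num_bytes:g} EB"  # beyond petabytes: exabytes

print(humanize(560))                # 560 B
print(humanize(1_350_000_000_000))  # 1.35 TB
```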

8:07

It's probably easier to do arithmetic with a thousand raised to three than with a number that has nine zeros , because once you take the exponent out , ignore it , perform your arithmetic , and then add it back in , you're getting a good idea of the scale without a lot of very complicated arithmetic . In actuality , when you do your capacity estimates , it shouldn't

8:29

take you longer than five

8:31

to 10 minutes of the interview . It's honestly

8:34

sometimes possible

8:36

to just say hey , interviewer

8:38

, I'm going

8:41

to skip over this for now . I know it'll

8:43

be a large number , I'll give it 1.5

8:46

based on past examples

8:49

, and sometimes the interviewer will

8:51

just say okay , yeah , no problem

8:53

, I want to know how you think , I want to know

8:55

how you would approach the problem . I don't

8:57

care whether or not you can add

8:59

a bunch of large numbers . But we can even go further than that , right ? So let's take

9:08

into consideration , you have 1.35 trillion bytes

9:10

, right ? It's a lot easier

9:12

to drop all those zeros and just have it be 1.35

9:16

and then say well , if

9:18

we want to 3x scale this , you can multiply

9:20

1.35 times 3 , rather

9:23

than some obscure number like 1,356,234,000,000 times 3 . Right

9:30

, one of those is going to take you significantly

9:32

less time to parse through and

9:34

do that back of napkin math if

9:36

it's necessary .
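As a worked example of this drop-the-zeros trick (illustrative numbers): pull the power of a thousand out, do the small multiplication, and re-attach the scale at the end.

```python
# 1.35 trillion bytes = 1.35 * 1000**4, i.e. 1.35 TB.
mantissa, power = 1.35, 4

scaled = mantissa * 3           # "3x the load" -> 4.05
result = scaled * 1000**power   # re-attach the scale at the very end

print(f"{scaled} TB")           # 4.05 TB
print(f"{result:.0f} bytes")    # 4050000000000
```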

9:38

The important part is that you can take those numbers

9:40

, and say , with reasonable confidence , this is 1.35

9:45

gigabytes or terabytes , and

9:47

know the difference

9:49

in scale between those numbers . And if the

9:52

interviewer presses you on what that actually is , you can tell

9:56

them oh , that's 1.35

9:58

trillion or 1.35

10:00

, you know quadrillion or what

10:03

have you . And it shows that you know what you're talking about , that you know the numbers , and that you're not just playing around . You're doing the calculations , but you're making it a lot easier for yourself . You're

10:17

working smarter . And

10:19

speaking on this concept

10:21

of working smarter , you

10:24

know , sometimes these tests can

10:26

take on specific constraints

10:29

and sometimes you need to think about a budget

10:31

. And that honestly brings

10:33

me to my next important factor , which we need for understanding

10:38

the dynamics of latency

10:41

across different constraints

10:43

on our system . You will

10:45

remember latency from

10:48

episode one and episode

10:50

three and also kind of across

10:53

our core episodes

10:56

and our non-functional

10:58

requirements from the last episode as well

11:00

. The comparisons I'm about

11:03

to give you will directly link

11:06

to the non-functional requirements

11:08

from before . So

11:10

, if you're keeping track , currently we are on

11:13

step three in this whole process

11:15

and we are already calling

11:17

back to step two . By

11:20

considering the latency and making

11:22

call-outs about specific hardware , you're

11:25

already checking this

11:27

non-functional requirement off the

11:29

list . So let's talk

11:31

about the hard numbers . Right , to

11:34

read one megabyte

11:36

from memory , the entire

11:39

process will take around a

11:41

quarter of a millisecond , which is pretty

11:44

fast . But , as you may or

11:46

may not know , memory

11:53

is temporary , so you can't just hold everything in it . You need

11:55

some sort of long-term storage , and so the next fastest thing for our

11:58

process is solid-state drives , and

12:00

fetching the exact same amount of data , one megabyte , from an SSD rather than from memory , which , if you remember , took a quarter of a millisecond , is actually a 4x slowdown . Yeah

12:17

, that's right . So fetching one megabyte

12:20

of data from an SSD actually takes an entire millisecond . But of course , solid-state

12:29

drives are more expensive than a more traditional spinning disk

12:31

drive . How much more expensive ? Well

12:33

, if we talked not that long ago

12:35

, it would have been $40

12:37

per gigabyte to $0.05

12:40

per gigabyte , but thanks

12:42

to innovation and beautifully minded

12:44

, wonderful people , today

12:47

it's more along the lines of $2 per

12:49

gigabyte to 5 cents per gigabyte

12:51

, which again , may not seem like

12:53

a lot , but when you get into petabytes

12:56

worth of data , it means a big , big bill .
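To see the size of that bill, run those same ballpark prices at petabyte scale (a sketch using the per-gigabyte figures just quoted):

```python
# Rough cost of storing one petabyte at the prices mentioned above.
GB_PER_PB = 1000**2          # 1 PB = 1,000,000 GB

ssd_cost = 2.00 * GB_PER_PB  # ~$2.00/GB -> $2,000,000
hdd_cost = 0.05 * GB_PER_PB  # ~$0.05/GB -> $50,000

print(f"SSD: ${ssd_cost:,.0f}, HDD: ${hdd_cost:,.0f}")
```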

13:00

Why , then , do most engineers design with

13:02

SSDs in mind when we don't

13:04

have these specific constraints ? Because

13:06

fetching one megabyte of data from

13:09

a spinning disk hard drive takes

13:11

a whopping 20 milliseconds , 20

13:14

times slower than an SSD

13:16

, 80 times slower than

13:18

fetching it from memory .
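Collected in one place, those rough per-megabyte read latencies look like this (a sketch; the 2 GB movie is just an example workload, not one from the episode):

```python
# Approximate time to read 1 MB sequentially, per the numbers above.
READ_1MB_MS = {
    "memory": 0.25,  # a quarter of a millisecond
    "ssd":    1.0,   # 4x slower than memory
    "hdd":    20.0,  # 20x slower than an SSD, 80x slower than memory
}

# Example workload: streaming a 2 GB movie (about 2,000 MB).
for medium, ms_per_mb in READ_1MB_MS.items():
    seconds = 2000 * ms_per_mb / 1000
    print(f"{medium}: ~{seconds:g} s")  # memory ~0.5 s, ssd ~2 s, hdd ~40 s
```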

13:21

Hard disks still have their place , though . Their continued use lies in the ability to utilize them as cold

13:28

storage . If your

13:30

data is not being accessed a lot or

13:33

it's not super important that it comes up immediately

13:35

, maybe you can load it in the background . Storing

13:38

data on spinning disks is perfectly fine

13:40

, often encouraged to save some

13:42

money . Honestly

13:44

, one clever way I have seen cold storage

13:47

in this way implemented is with user

13:49

data , which might seem a little

13:51

strange . But follow me , think

13:53

about a video game that's gone viral

13:55

. Everyone is playing it on day one and

13:57

they're super excited and everyone's logging in

13:59

constantly . They've created their login

14:02

, modified their character , and then a month later they're over

14:06

it . Some people might stick around

14:08

and when they log in they want that process

14:10

to be quick . You want to get them in game

14:12

as quickly as possible . Again

14:15

, see the Amazon reference from episode

14:17

one and how fast you lose money when

14:19

things are slow . But if you're someone

14:21

who hasn't logged on in a while maybe

14:24

over a year then you're a little

14:26

more patient with logging in . You have no point

14:28

of reference for how long it should take

14:30

to be logged in , so you have a little bit more flexibility in that sense . And

14:37

so what you can do is on the

14:39

back end you can have like a cron job

14:41

that checks for the last login

14:44

time for a user and if it's

14:46

been over say , a month , move

14:48

that user data from the SSD

14:51

database to the hard drive database

14:53

. If , for

14:55

whatever reason , they log back on , you

14:58

just move that data back to the SSD

15:00

and if they never

15:02

log on again , no worries , you

15:04

aren't being charged a ton of money to store it , and the data is always there .
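A minimal sketch of that background job, assuming a nightly cron trigger (the store names and the move_user_data helper are hypothetical, purely for illustration):

```python
import datetime

INACTIVITY_CUTOFF = datetime.timedelta(days=30)

def move_user_data(user_id, source, dest):
    # Hypothetical helper: copy the record from one store to the other,
    # then delete the original. A real version would talk to both databases.
    print(f"moving {user_id}: {source} -> {dest}")

def demote_stale_users(users, now):
    """Nightly cron body: move long-inactive users to cheap cold storage."""
    for user in users:
        if now - user["last_login"] > INACTIVITY_CUTOFF:
            move_user_data(user["id"], source="ssd_db", dest="hdd_db")

def promote_on_login(user):
    """If a returning user's data was demoted, move it back to the SSD."""
    if user["in_cold_storage"]:
        move_user_data(user["id"], source="hdd_db", dest="ssd_db")
```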

15:10

The next piece of information I want you to memorize

15:13

is the rough size of everyday data in terms of storage capacity . Have

15:18

you ever thought about how much space

15:20

the things you interact with on a daily basis

15:22

take up ? Let's talk

15:24

about a company like Netflix . They

15:27

get roughly 100 million videos

15:29

streamed a day . That

15:32

in itself is a gigantic number

15:34

, even if you're just talking about pictures

15:36

. But you know , Netflix of

15:38

course works with video and

15:41

the rough size of a two-hour movie

15:43

on average is about one to two

15:45

gigabytes , and that's not 4K

15:47

. If we're talking about 4K

15:49

high-res movies , we're looking

15:51

somewhere in 10 to 20 gigabytes

15:54

apiece . And again , as

15:56

we talked about before , the number here doesn't

15:58

technically matter . Netflix

16:00

simply deals with a lot of data

16:02

. So having these rough estimates

16:05

is handy when trying to think intelligently

16:07

about the amount of data you're dealing

16:10

with . So if

16:12

a two-hour movie is one

16:14

to two gigabytes , if

16:16

you think about a small book worth of text

16:18

or a high-res photo , you're

16:20

looking more around a megabyte , whereas a medium-resolution photo can get as small as around 100 kilobytes .
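As a cheat sheet, here are the ballpark sizes from this section in one place (rough figures, not exact):

```python
# Rough sizes quoted above, in bytes (1000-based units).
ROUGH_SIZES = {
    "medium-res photo":   100 * 1000,    # ~100 KB
    "small book of text": 1 * 1000**2,   # ~1 MB
    "high-res photo":     1 * 1000**2,   # ~1 MB
    "two-hour HD movie":  2 * 1000**3,   # ~1-2 GB
    "two-hour 4K movie":  20 * 1000**3,  # ~10-20 GB
}
```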

16:29

So it's safe to say if you're building something

16:31

like Netflix versus building something

16:33

like Instagram , for instance , you're

16:35

going to approach it in a different

16:37

way . And

16:40

for our final piece of information

16:42

that I want you to remember , I want to talk

16:44

about the rough sizes of a company's

16:46

operations . This is to help

16:48

you memorize the scale at which you'll

16:50

need to think about the data being processed

16:53

, that sort of throughput , for

16:56

instance , designing a system that will handle

16:58

the same load as like a social

17:00

media network . You're looking around

17:03

a billion daily active

17:05

users . We talked

17:07

about Netflix and how they stream 100 million

17:09

videos a day . That's very important as well

17:12

. And Google , it fields

17:14

around 100,000 queries

17:16

per second , and

17:19

building an app like Wikipedia means

17:22

storing data somewhere in the neighborhood

17:24

of like 100 gigabytes if it's uncompressed

17:27

. So try to remember these

17:29

numbers so that if you go into an

17:31

interview and they say , design Netflix

17:33

or design Wikipedia

17:35

, design Google you know these very

17:37

common questions that you get . You

17:40

can already know : okay , not only am I going to need to handle a billion daily active users , but those billion daily active users will be uploading possibly a megabyte's worth of photos each , and also reading a megabyte's worth of photos , or more , if

18:06

they have a feed . A feed pulls in a lot of their friends , and each one of those friends has a one-megabyte photo , so now you're sending a billion daily active users photos times the number of friends they might have on average . But

18:26

again , with these sorts of quick maths , it's a lot easier to say : well , one billion , okay , that's just one . And so if it's one megabyte , okay , then roughly one megabyte times a billion people , we're looking in the neighborhood of a petabyte worth of data flooding through our

18:46

system . And okay , now then

18:48

, is it a read system or is it a write system

18:50

? Well , more people are doing reading on

18:53

Instagram and looking at pictures than

18:55

they are uploading . So , okay , I need

18:57

to focus on making sure I can handle that

18:59

throughput of having a petabyte worth of data going out a day and being read

19:05

from my system .
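Putting the whole feed estimate together (a sketch; the 200-friend average is an assumed figure, not one from the episode):

```python
# Napkin math for an Instagram-like feed, in rounded powers of 1,000.
DAU = 1000**3            # ~1 billion daily active users
PHOTO_BYTES = 1000**2    # ~1 MB per photo
AVG_FRIENDS = 200        # assumed average number of friends in a feed

upload_bytes_per_day = DAU * PHOTO_BYTES                 # ~1e+15 B written, ~1 PB
feed_bytes_per_day = upload_bytes_per_day * AVG_FRIENDS  # ~2e+17 B read

print(f"writes: ~{upload_bytes_per_day:.0e} B/day")  # 1e+15
print(f"reads:  ~{feed_bytes_per_day:.0e} B/day")    # 2e+17
print(f"read:write ~{feed_bytes_per_day // upload_bytes_per_day}:1")  # 200:1
```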

19:08

And so , finally , I want to give

19:10

you a reminder about the common mistakes

19:12

to avoid and how to approach this

19:14

step when designing a system . If

19:17

you're taking an interview , you might be able

19:19

to brush past this step a bit

19:21

. As I said , you can do this by saying something like : the system will share similarities to Netflix , so I want to consider around 100 million videos at two gigabytes apiece . Sometimes that's enough for the interview

19:31

, but they might push for a little extra

19:34

math and you want to give it some thought and

19:36

do those calculations like I just did

19:38

with a social media app like Instagram . But avoid trying to calculate , say , how many hard drives it will take to hold it , or getting too low-level and too specific with the numbers you're using

19:51

. On the other hand , if

19:54

you are designing a system in a real world

19:56

scenario , your capacity considerations

19:58

are important . You should consider

20:01

the cost of hardware . You should consider using

20:07

an SSD versus an HDD

20:09

or both . Sometimes

20:11

using a spinning

20:14

disk drive is a lot cheaper . Sometimes

20:16

it's easier to host your own servers

20:18

than to use the cloud . Sometimes

20:21

it's a lot more expensive . It just

20:23

depends on your situation . Regardless

20:26

, remember to focus on the

20:28

crux of the problem . If this

20:30

is primarily a video-based service

20:32

, like Netflix , you don't need

20:34

to worry about calculating the size of

20:36

text for descriptions or the

20:39

size of the avatar

20:41

for the user and what that would

20:43

mean , right ? Focus on the big things : the videos

20:46

, the users consuming those videos

20:48

, how they're consuming them , and

20:50

work from there . At

20:53

the end of the day , the elements of

20:55

capacity estimates that

20:57

are always good to remember are your core facts

21:00

: think about your numbers in factors of a thousand . Keep things

21:04

high level and focus

21:06

on what your system should be doing , not

21:09

necessarily how many specific

21:12

times it should be doing it . It

21:14

will almost always be impossible

21:16

to take every little thing into consideration

21:19

, especially in an interview , so

21:22

just try and focus on the crux of the

21:24

problem , not all the little things that

21:26

might pop up . It

21:30

is perfectly fine to make small mistakes . No one is judging you on your ability

21:32

to multiply a couple of numbers

21:34

together . Instead , I want

21:36

to know that you have a good

21:38

idea of scale and a rough size

21:40

of the data that's being handled , and that

21:42

you can take that into consideration

21:45

. From there , we can start talking

21:47

about how to scale the system , how to work with

21:49

it , et cetera , et cetera . Next

21:52

episode , we're going to be focused on steps

21:54

four and five . We'll be talking about

21:56

DB and API design

21:58

. These are very

22:00

important things . They help you flesh out your models and understand what an API will look like and how the data will flow through your

22:09

system . I want to give a special

22:11

thank you to everyone that reached

22:13

out . Of course , Antonio Lettieri

22:15

, you had a great call out on our load

22:17

balancers episode , which I greatly appreciate

22:20

. Gamesply and BeerX , you guys have been killing it on the Discord , just making everyone feel welcome , and I greatly appreciate that . We

22:29

also got a couple pieces of fan mail . Unfortunately , I can't reply and I can't see your name . So to the wonderful person in Del Mar , California , and the other wonderful person in the United Kingdom : thank

22:43

you so so much for the feedback and thank

22:45

you for the fan mail . And

22:47

, yeah , if you want to have

22:50

a more specific shout-out , feel free to send

22:52

me an email . And finally

22:54

, the biggest thank yous to everyone

22:56

on the Patreon : Jake Mooney , Charles Cazals , Eduardo Muth-Martinez

23:11

. Thank you so so much for everything and for supporting us on Patreon . Later

23:17

this month , I'm hoping to release a special episode

23:20

just for everyone on Patreon

23:22

. You guys are still getting the episodes a week early

23:24

, but I also want to do a special episode

23:26

just for you , focusing on authentication

23:28

. We did have a couple of people

23:30

vote on the poll asking for

23:33

an authentication episode , so I

23:35

want to do that special episode and eventually

23:37

I will release it on the main channel

23:39

, but it might not be for a month

23:41

or so , so they're getting a special

23:43

thing over there . So very much appreciated to

23:46

all of you . I will also be posting

23:48

a new poll very soon on Patreon , probably

23:50

around the time this goes out , asking

23:53

for what specific

23:55

interviews you guys want

23:57

me to start tackling . This has all just

23:59

been a primer , but I actually want

24:01

to talk about specific interviews , how

24:03

I would approach them , do some

24:05

research on the best ways to approach them

24:08

and sort of flesh that out for you guys

24:10

. So , definitely

24:12

, please go to the Patreon

24:15

, become a supporter if you have the

24:17

means . If you just enjoy listening , it would mean the world to me if you provided some feedback , sent an email , told a friend , anything like that . It

24:28

all means the world to me , so I very much appreciate

24:30

it . If you would like to suggest

24:32

specific topics that you want

24:34

me to jump on , feel

24:37

free to drop me an email at learnsystemdesignpod@gmail.com . Remember

24:42

to include your name if you'd like a shout out , if

24:44

you would like to help support the podcast and help me pay my bills , please jump over to the Patreon . The music for the show is by Aimless Orbiter ; you can check out more at soundcloud.com/aimlessorbitermusic . And , with all that being said , this has been the Learn System Design podcast . Thank you .
