Episode Transcript
0:00
Hi listeners, welcome to No Priors. This week we're
0:02
speaking to Chelsea Finn, co-founder of Physical
0:04
Intelligence, a company bringing general-purpose AI
0:06
into the physical world. Chelsea co-founded Physical
0:08
Intelligence alongside a team of leading researchers
0:11
and minds in the field. She's an
0:17
associate professor of computer science and electrical
0:19
engineering at Stanford University, and prior to
0:21
that she worked at Google Brain and
0:24
was at Berkeley. Chelsea's research focuses
0:26
on how AI systems can acquire general
0:28
purpose skills through interactions with the world.
0:30
So Chelsea, thank you so much for
0:32
joining us today on No Priors. Yeah,
0:34
thanks for having me. You've done a
0:36
lot of really important, storied work in
0:39
robotics, between your work at Google, at Stanford, etc.
0:43
So I would just love to hear
0:45
a little bit firsthand your background in
0:48
terms of your path in the world of
0:50
robotics, what drew you to it initially and
0:52
some of the work that you've done. ...in the
0:54
world, but at the same time
0:56
I was also really fascinated by
0:58
this problem of developing perception and
1:00
intelligence and machines and robots embody
1:02
all of that. And also
1:04
sometimes there's some cool math that
1:06
you can do as well that
1:08
keeps your brain active, makes
1:10
you think. And so I think
1:12
all of that is really fun
1:14
about working in the field. I
1:16
started working more seriously in robotics
1:18
more than 10 years ago at
1:20
this point at the start of
1:22
my PhD at Berkeley and we
1:24
were working on neural network control
1:27
trying to train neural networks that
1:29
map directly from image pixels to
1:31
motor torques on a
1:33
robot arm. At the time, this
1:35
was not very popular and we've
1:37
come a long way and it's
1:39
a lot more accepted in robotics
1:41
and also just generally something that
1:43
a lot of people are excited
1:45
about. Since that beginning point it
1:47
was very clear to me that
1:50
we could train robots to do
1:52
pretty cool things but that getting the
1:54
robot to do one of those things
1:56
in many scenarios with many objects was
1:58
a major major challenge. So 10 years
2:00
ago we were training robots to like
2:02
screw a cap onto a bottle and
2:05
use a spatula to lift an object
2:07
into a bowl and kind of do
2:09
a tight insertion or hang up like
2:11
a hanger on a clothes rack. And
2:13
so pretty cool stuff, but actually getting
2:15
the robot to do that in many
2:18
environments with many objects, that's where a
2:20
big part of the challenge comes in
2:22
and I've been thinking about ways to
2:24
make broader data sets, train on those
2:26
broader data sets, and also different approaches
2:28
for learning, whether it be reinforcement learning,
2:31
video prediction learning, all those things. And
2:33
so, yeah, I moved on from there, was
2:35
at Google Brain in between my PhD
2:37
and joining Stanford, became a professor at
2:39
Stanford, started a lab there, did a
2:41
lot of work along all these lines.
2:44
and then recently started physical intelligence almost
2:46
a year ago at this point. So
2:48
I've been on leave from Stanford for
2:50
that and it's been really exciting to
2:52
be able to try to execute on
2:54
the vision that we co-founders
2:57
collectively have and do it with a
2:59
lot of resources and so forth and
3:01
I'm also still advising students at Stanford
3:03
as well. That's really cool. And I
3:05
guess you started Physical Intelligence with four
3:07
other co-founders and an incredibly impressive team.
3:09
Could you tell us a little bit
3:12
more about what physical intelligence is working
3:14
on and the approach that you're taking?
3:16
Because I think it's a pretty unique
3:18
slant on the whole field and approach.
3:20
Yeah, so we're trying to build a
3:22
big neural network model that could ultimately
3:25
control any robot to do anything in
3:27
any scenario. And like a big part
3:29
of our vision is that in the
3:31
past, robotics has focused on like trying
3:33
to go deep on one application and
3:35
like developing a robot to do one
3:38
thing and then ultimately gotten kind of
3:40
stuck in that one application. It's really
3:42
hard to like solve one thing and
3:44
then try to get out of that
3:46
and broaden and instead we're really in
3:48
it for the long term to
3:51
try to address this broader problem of
3:53
physical intelligence in the real world. We're
3:55
thinking a lot about generalization and generalists, and
3:57
unlike other robotics companies we think that
3:59
being able to leverage all of the
4:01
possible data is very important. And this
4:04
comes down to actually not just leveraging
4:06
data from one robot, but from any
4:08
robot platform that might have six joints
4:10
or seven joints or two arms or
4:12
one arm. We've seen a lot of
4:14
evidence that you could actually transfer a
4:17
lot of rich information across these different
4:19
embodiments, which allows you to use that data.
4:21
And also if you iterate on your
4:23
robot platform, you don't have to throw
4:25
all your data away. I have faced
4:27
a lot of pain in the past
4:30
where we got a new version of
4:32
the robot. It's
4:34
a really painful process to try to
4:36
get back to where you were on
4:38
the previous robot iteration. So yeah, trying
4:40
to build generalist robots and essentially kind
4:42
of develop foundation models that will power
4:45
the next generation of robots in the
4:47
real world. That's really cool, because I
4:49
mean, I guess there's a lot of
4:51
sort of parallels to the large language
4:53
model world where you know really a
4:55
mixture of deep learning, the transformer architecture,
4:58
and scale has really proven out that
5:00
you can get real generalizability and different
5:02
forms of transfer between different areas. Could
5:04
you tell us a little bit more
5:06
about the architecture you're taking or the
5:08
approach or you know how you're thinking
5:11
about the basis for the foundation model
5:13
that you're developing? At the beginning we
5:15
were just getting off the ground,
5:17
trying to scale data collection and a
5:19
big part of that is that, unlike in
5:21
language, we don't have Wikipedia or an
5:24
internet of robot motions, and we're really
5:26
excited about scaling data on real robots
5:28
in the real world. This
5:30
kind of real data is what has
5:32
fueled machine learning advances in the past,
5:34
and a big part of that is
5:37
we actually need to collect that data,
5:39
and that looks like teleoperating robots in
5:41
the physical world. We're also exploring other
5:43
ways of scaling data as well, but
5:45
the kind of bread and butter is
5:47
scaling real robot data. We released something
5:50
in late October where we showed some
5:52
of our initial efforts around scaling data
5:54
and how we can learn very complex
5:56
tasks of folding laundry, cleaning tables, constructing
5:58
a cardboard box. Now where we are
6:00
in that journey is really thinking a
6:02
lot about language interaction and generalization to
6:05
different environments. So what we showed in
6:07
October was the robot in one environment
6:09
and it had data in that environment.
6:11
We were able to see
6:13
some amount of generalization, so it was
6:15
able to fold shorts that it had never
6:18
seen before, but the degree of generalization
6:22
was very limited, and you also couldn't
6:24
interact with it in any way. You
6:26
couldn't prompt it and tell it what
6:28
you want it to do beyond kind of
6:31
fairly basic things that it saw in
6:33
the training data. And so being able
6:35
to handle lots of different prompts in
6:37
lots of different environments is a big
6:39
focus right now. And in terms of
6:41
the architecture... We're using Transformers and we
6:44
are using pre-trained models, pre-trained vision language
6:46
models, and that allows you to leverage
6:48
all of the rich information on the
6:50
internet. We had a research result a
6:52
couple years ago where we showed that
6:54
if you leverage vision language models, then
6:57
you could actually get the robot to
6:59
do tasks that require concepts that were
7:01
never in the robot's training data but
7:03
were on the internet. Like one famous
7:05
example is that you can pass the
7:07
Coke can to Taylor Swift and the
7:10
robot has never seen Taylor Swift in
7:12
person, but the internet has lots of
7:14
images of Taylor Swift in it. And
7:16
you can leverage all of the information
7:18
in the pre-trained model and kind
7:20
of transfer that to the robot. We're
7:22
not starting from scratch and that helps
7:25
a lot as well. So that's a
7:27
little bit about the approach. Happy to
7:29
dive deeper as well.
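(A minimal, hypothetical sketch of the kind of interface described above: a policy built on a pre-trained vision-language backbone that maps camera images plus a language prompt to a short chunk of robot actions. The class and function names are illustrative, not Physical Intelligence's actual API, and the model call is mocked with random outputs.)

```python
from dataclasses import dataclass
from typing import List
import random

@dataclass
class Observation:
    base_image: bytes          # RGB frame from an external "base" camera
    wrist_images: List[bytes]  # RGB frames from wrist-mounted cameras
    prompt: str                # natural-language instruction, e.g. "fold the shirt"

@dataclass
class ActionChunk:
    # A short horizon of joint-space commands (here: a 7-DoF arm, a few timesteps).
    joint_targets: List[List[float]]

class VLMPolicy:
    """Hypothetical stand-in for a policy built on a pre-trained vision-language
    backbone. The backbone's internet-scale knowledge is what would let the policy
    refer to concepts (e.g. 'Taylor Swift') that never appear in robot data; here
    the forward pass is mocked."""

    def __init__(self, action_dim: int = 7, horizon: int = 10):
        self.action_dim = action_dim
        self.horizon = horizon

    def act(self, obs: Observation) -> ActionChunk:
        # A real model would encode obs.base_image, obs.wrist_images, and
        # obs.prompt with the VLM backbone, then decode continuous actions.
        targets = [[random.uniform(-1, 1) for _ in range(self.action_dim)]
                   for _ in range(self.horizon)]
        return ActionChunk(joint_targets=targets)

if __name__ == "__main__":
    policy = VLMPolicy()
    obs = Observation(base_image=b"", wrist_images=[b"", b""],
                      prompt="pick up the cup")
    chunk = policy.act(obs)
    print(len(chunk.joint_targets), "commands of dim", len(chunk.joint_targets[0]))
```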
7:31
That's really amazing. And then, um, what do you think
7:33
is the main basis then for really
7:35
getting to generalizability? Is it scaling data
7:38
further? Like, as you think through the common
7:46
pieces, people are spending a lot
7:48
of time on reasoning modules and other
7:51
things like that as well. So I'm
7:53
curious, like, what are the components that
7:55
you feel are missing right now? Yeah,
7:57
so I think the number one thing,
7:59
and this is kind of the boring thing,
8:01
is just getting more diverse robot data.
8:04
So for that release that we had
8:06
in late October last year, we collected
8:08
data in three buildings, technically. The internet,
8:10
for example, and everything that has fueled
8:12
language models and vision models, is way,
8:14
way more diverse than that because the
8:17
internet is pictures that are taken by
8:19
lots of people and text written by
8:21
lots of different people. And so just
8:23
trying to collect data in many more
8:25
diverse places and with many more objects,
8:27
many more tasks. So scaling the diversity
8:30
of the data, not just the quantity
8:32
of the data, is very important and
8:34
that's a big thing that we're focusing
8:36
on right now, actually bringing our robots
8:38
into lots of different places and collecting
8:40
data in them. As a side product
8:43
of that, we also learn what it
8:45
takes to actually get your robot to
8:47
be operational and functional in lots of
8:49
different places. And that is a really
8:51
nice byproduct because if you actually want
8:53
to get robots to work in the
8:55
real world, you need to be able
8:58
to do that. So that's the number
9:00
one thing, but then we're also exploring
9:02
other things, leveraging videos of people, again,
9:04
leveraging data from the web, leveraging pre-trained
9:06
models, thinking about... reasoning, although more basic
9:08
forms of reasoning, in order to, for
9:11
example, put a dirty shirt into a
9:13
hamper, if you can recognize where the
9:15
shirt is and where the hamper is
9:17
and what you need to do to
9:19
accomplish that task, that's useful, or if
9:21
you want to make a sandwich, and
9:24
the user has a particular request in
9:26
mind, you should reason through that request:
9:28
if they're allergic to pickles, you probably
9:30
shouldn't put pickles on the sandwich. Things
9:32
like that. So there's some basic things
9:34
around there, although the number one thing
9:37
is just more diverse robot data.
9:39
And then, I think alongside a lot of
9:41
the pursuit of taking that data, there's
9:43
really been an emphasis on releasing open
9:45
source models and packages for robotics. Do
9:47
you think that's the long-term path? Do
9:50
you think it's open core? Do you
9:52
think it's eventually proprietary models? Or how
9:54
do you think about that part of the
9:58
industry? Because it feels like there's a
9:58
few different robotics companies now each taking
10:00
different approaches in terms of either hardware
10:03
only, I mean, excuse me, hardware plus
10:05
software and they're focused on a specific
10:07
hardware footprint, there's software, and there's closed
10:09
source versus open source if you're just
10:11
doing the software. So I'm sort of
10:13
curious where in that spectrum, physical intelligence
10:15
lies. Definitely. So we've actually been quite
10:18
open. Not only have we open-sourced
10:20
some of the weights and released details
10:22
and technical papers, we've actually also been
10:24
working with hardware companies and giving designs
10:26
of robots to hardware companies. And some
10:28
people have actually, like, when I tell
10:31
people this, sometimes they're actually really shocked
10:33
that, like, what about the IP, what
10:35
about, I don't know, confidentiality and stuff
10:37
like that? And we've actually
10:39
made a very intentional choice around this.
10:41
There's a couple of reasons for it.
10:44
One is that we think that the
10:46
field, it's really just the beginning, and
10:48
these models will be so, so much
10:50
better, and the robots should be so
10:52
much better in a year, in three
10:54
years, and we want to support the
10:57
development of the research, and we want
10:59
to support the community, support the robots,
11:01
so that when we hopefully develop the
11:03
technology of these generalist models, the world
11:05
will be more ready for it, will
11:07
have better, like, more robust robots that
11:10
are able to leverage those models, people
11:12
who have the expertise and understand what
11:14
it requires to use those models. And
11:16
then the other thing is also, like,
11:18
we have a really fantastic team of
11:20
researchers and engineers, and really fantastic
11:23
researchers and engineers want to work at
11:25
companies that are open, especially
11:27
researchers, where they can get kind
11:29
of credit for their work and share
11:31
their ideas, talk about their ideas. And
11:33
we think that having the best researchers
11:36
and engineers will be necessary for solving this
11:38
problem. The last thing that I'll mention
11:40
is that I think the biggest risk
11:42
with this bet is that it won't
11:44
work. Like I'm not really worried about
11:46
competitors. I'm more worried that no one
11:48
will solve the problem. Oh, interesting. And
11:51
why do you worry about that? I
11:53
think robotics is very hard. And
11:55
there's been many, many failures in the
11:57
past. And unlike when you're recognizing
11:59
an object in an image, there's very
12:01
little tolerance for error. You can miss
12:04
a grasp on an object; the difference between making
12:08
contact and not making contact with an
12:10
object is so small, and it has
12:12
a massive impact on the outcome of
12:14
whether the robot can actually successfully manipulate
12:17
the object. And I mean, that's just
12:19
one example. There's challenges on the data
12:21
side of collecting data. Well, just anything
12:23
involving hardware is hard as well. I
12:25
guess we have a number of examples
12:27
now of robots in the physical world.
12:30
You know, everything from autopilot on a
12:32
jet on through to some forms of
12:34
pick-and-pack or other types of robots
12:36
in distribution centers, and there's obviously the
12:38
different robots involved with manufacturing, particularly in
12:40
automotive, right? So there's been a handful
12:43
of more constrained environments where people have
12:45
been using them in different ways. Where
12:47
do you think the impact of these
12:49
models will first show up? Because to
12:51
your point, there are certain things where
12:53
you have very low tolerance for error,
12:56
and then there's a lot of fields
12:58
where actually it's okay, or maybe you
13:00
can constrain the problem sufficiently relative to
13:02
the capabilities of the model that it
13:04
works fine. So where do you think physical intelligence will have the
13:06
nearest-term impact, or in general where the
13:08
field of robotics and these new approaches
13:11
will substantiate themselves? Yeah, as a company
13:13
we're really focused on the long-term
13:15
problem and not any one particular application,
13:17
because of the failure modes that can
13:19
come up when you focus on one
13:21
application, I don't know where the first
13:24
applications will be. I think one thing
13:26
that's actually challenging is that typically in
13:28
machine learning a lot of the successful
13:30
applications of like recommender systems, language models,
13:32
like image detection, a lot of the
13:34
consumers of the model outputs
13:37
are actually humans who could actually check
13:39
it and the humans are good at
13:41
the thing. A lot of the very
13:43
natural applications of robots is actually the
13:45
robot doing something autonomously on its own,
13:47
where it's not like a human consuming
13:50
the commanded arm position, for example, and
13:52
then checking it and then validating it
13:54
and so forth. And so I think
13:56
we need to think about new ways
13:58
of having some kind of tolerance for
14:00
mistakes or scenarios where that's fine or
14:03
scenarios where humans and robots can work
14:05
together. That's I think one big challenge
14:07
that will come up when trying to
14:09
actually deploy these and some of the
14:11
language interaction work that we've been doing
14:13
is actually motivated by this challenge where
14:16
we think it's really important for humans
14:18
to be able to kind of provide
14:20
input for how they want the robot
14:22
to behave and what they want the
14:24
robot to do, how they want the
14:26
robot to help in a particular scenario.
14:29
That makes sense. I guess the other
14:31
form of generalizability to some extent at
14:33
least in our current world is the
14:35
human form, right? And so some people
14:37
are specifically focused on humanoid robots like
14:39
Tesla and others under the assumption that
14:41
the world is designed for people and
14:44
therefore is the perfect form factor to
14:46
coexist with people. And then other people
14:48
have taken very different approaches in terms
14:50
of saying, I need something that's more
14:52
specialized for the home in certain ways
14:54
or for factories or manufacturing or you
14:57
name it. What is your view on
14:59
kind of humanoid versus not? On one
15:01
hand, I think that they're a little
15:03
overrated. And one way to practically look
15:05
at it is I think that we're
15:07
generally fairly bottlenecked on data right now.
15:10
And some people argue that with humanoids
15:12
you can maybe collect data more easily
15:14
because it matches the human form factor.
15:16
And so maybe it'd be easier to
15:18
mimic humans. And I've actually heard people
15:20
make those arguments, but if you've ever
15:23
actually tried to teleoperate a humanoid, it's
15:25
actually a lot harder to teleoperate than
15:27
a static manipulator or a mobile manipulator
15:29
with wheels. Optimizing for being able to
15:31
collect data, I think, is very important,
15:33
because if we can get to the
15:36
point where we have more data than
15:38
we could ever want, then it just
15:40
comes down to research and compute and
15:42
evaluations. And so that's
15:44
one of the things we're kind of
15:46
optimizing for, and so we're using cheap
15:49
robots, robots that we can
15:51
very easily develop teleoperation interfaces for, in
15:53
which you can do teleoperation very quickly
15:55
and collect diverse data, collect lots of
15:57
data. Yeah, it's funny. There was that
15:59
viral fake Kardashian video of her going
16:01
shopping with a robot following her around
16:04
carrying all of her shopping bags. When
16:06
I saw that, I really wanted a
16:08
humanoid robot to follow me around everywhere.
16:10
That'd be really funny to do that.
16:12
So I'm hopeful that someday I can
16:14
use your software to cause a robot
16:17
to follow me around to do things.
16:19
So, exciting future. How do you think
16:21
about the embodied model of development versus
16:23
not on some of these things? That's another sort
16:27
of trade-off that some people
16:30
are making or deciding between. A lot
16:32
of the AI community is very focused
16:34
on just like language models, vision language
16:36
models and so forth and there's like
16:38
a ton of hype around like reasoning
16:40
and stuff like that. Oh, let's create
16:43
like the most intelligent thing. I feel
16:45
like actually people underestimate how much intelligence
16:47
goes into motor control. Many, many years
16:49
of evolution is what led to us
16:51
being able to use our hands the
16:53
way that we do. And there are
16:56
many animals that can't do it,
16:58
even though they had so many years
17:00
of evolution. And so I think that
17:02
there's actually so much complexity and intelligence
17:04
that goes into being able to do
17:06
something as basic as make a bowl
17:09
of cereal or pour a glass of water.
17:11
And yeah, so in some ways I
17:13
think that actually like embodied intelligence or
17:15
physical intelligence is very
17:17
core to intelligence and maybe kind of
17:19
underrated compared to some of the less
17:22
embodied models. One of the papers that
17:24
I really loved over the last couple
17:26
years in robotics was your ALOHA paper
17:28
and I thought it was a very
17:30
clever approach. What is some of the
17:32
research over the last two or three
17:34
years that you think has really caused
17:37
this flurry of activity because I feel
17:39
like there's been a number of people
17:41
now starting companies in this area because
17:43
a lot of people feel like now
17:45
is the time to do it. And
17:47
I'm a little bit curious what research
17:50
you feel was the basis for that
17:52
shift and people thinking this was a
17:54
good place to work. At least for
17:56
us, there were a few things that
17:58
we felt like were turning points,
18:00
where it felt like the
18:03
field was moving a lot faster compared
18:05
to where it was before. One was
18:07
the SayCan work, where we found
18:09
that you can plan with language models
18:11
as kind of the high-level part and
18:13
then kind of plug that in with
18:16
a low-level model to get a model
18:18
to do long horizon tasks. One was
18:20
the RT-2 work, which showed that
18:22
you could do the Taylor Swift example
18:24
that I mentioned earlier and be able
18:26
to plug in kind of a
18:29
lot of the web data and get
18:31
better generalization on robots. A third was
18:33
our RT-X work, where we were actually
18:35
able to train models across robot
18:37
embodiments. We basically took all
18:39
the robot data that different research labs
18:42
had. It was a huge effort to aggregate
18:44
that into a common format and train
18:46
on it.
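(A rough illustration of the cross-embodiment idea described here: episodes from robots with different numbers of joints are padded into one shared layout so a single model can train on all of them. The field names and padding scheme are hypothetical, not the actual RT-X data format.)

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Episode:
    embodiment: str             # e.g. "6dof-arm", "bimanual-14dof"
    actions: List[List[float]]  # per-timestep joint commands, variable width
    language: str               # task annotation, e.g. "pick up the sponge"

def to_common_format(episodes: List[Episode], max_dim: int = 14) -> List[Dict]:
    """Pad every action vector to a shared width and keep the embodiment tag,
    so one model can consume data from robots with 6, 7, or 14 joints.
    (Real pipelines also align cameras, control rates, units, and so on.)"""
    out = []
    for ep in episodes:
        padded = [a + [0.0] * (max_dim - len(a)) for a in ep.actions]
        out.append({"embodiment": ep.embodiment,
                    "actions": padded,
                    "language": ep.language})
    return out

if __name__ == "__main__":
    eps = [Episode("6dof-arm", [[0.1] * 6], "pick up the sponge"),
           Episode("bimanual-14dof", [[0.0] * 14], "fold the shirt")]
    for row in to_common_format(eps):
        print(row["embodiment"], "-> action dim", len(row["actions"][0]))
```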
18:48
And we also, when we trained on that, we actually found that
18:50
we could take a checkpoint, send that
18:52
model checkpoint to another lab. halfway across
18:54
the country and the grad student at
18:57
that lab could run the checkpoint on
18:59
the robot and it would actually... more
19:01
often than not do better than the
19:03
model that they had specifically iterated on
19:05
themselves in their own lab. And that
19:07
was like another big sign that like
19:10
this stuff is actually starting to work
19:12
and that you can get benefit
19:14
by pooling data across different robots. And
19:16
then also like you mentioned I think
19:18
the ALOHA work and later the Mobile
19:20
ALOHA work was work that showed that
19:23
you can teleoperate and train
19:25
models to do pretty complicated dexterous manipulation
19:27
tasks. We also had a follow-up paper
19:29
with the shoelace tying, which was
19:31
a fun project because someone said that
19:33
they would retire if they saw a
19:36
robot tie shoelaces. So did they retire?
19:38
They did not retire. We need to
19:40
force them into retirement. Whoever that person
19:42
is, we need to follow up on
19:44
that. Yeah, so those were a few
19:46
examples. And so yeah, I think we've
19:49
seen a ton of progress in the
19:51
field. I also, it seems like after
19:53
we started Pi, that was also
19:55
kind of a sign to others that if
19:57
the experts are really willing to bet
19:59
on this, then maybe something will
20:02
happen. So, one thing that you all
20:04
came out with today from Pi was
20:06
what you call a hierarchical interactive robot
20:08
or Hi Robot. Can you tell us
20:10
a little bit more about that? So
20:12
this is a really fun project. There's
20:14
two things that we're trying to look
20:17
at here. One is that if you
20:19
need to do like a longer horizon
20:21
task, meaning a task that might take
20:23
minutes to do, then if you just
20:25
train a single policy to like output
20:27
actions based on images. Like if you're
20:30
trying to make a sandwich and you
20:32
train a policy that's just outputting the
20:34
next motor command, that might not do
20:36
as well as something that's actually kind
20:38
of thinking through the steps to accomplish
20:40
that task. That was kind of the
20:43
first component. That's where the hierarchy comes
20:45
in. And the second component is a
20:47
lot of the times when we train
20:49
robot policies, we're just saying, like, we'll
20:51
take our data, we'll annotate it and
20:53
say, like, this is picking up the
20:56
sponge, this is putting the bowl in
20:58
the bin, this segment is, I don't
21:00
know, folding the shirt, and then you
21:02
get a policy that can, like, follow
21:04
those basic commands of, like, fold the
21:06
shirt, or pick up the cup, those
21:09
sorts of things. But at the end
21:11
of the day, we don't want robots
21:13
just to be able to do that.
21:15
We want them to be able to
21:17
interact, to handle things like, maybe don't include
21:19
those, and maybe also be able to
21:22
interject in the middle and say like,
21:24
oh, hold off on the tomatoes or
21:26
something. It's actually kind of a big
21:28
gap between something that can just follow
21:30
like an instruction like pick up the
21:32
cup and something that could be able
21:35
to handle those kinds of prompts and
21:37
those situated corrections and so forth. And
21:39
so we developed a system that basically
21:41
has one model that takes as input
21:43
the prompt and kind of reasons
21:45
through it and is able to output the
21:47
next step that the robot should follow.
21:50
It's going to tell it, for example,
21:54
that the next thing will be pick
21:56
up the tomato, and then
21:58
a lower-level model takes as
22:00
input pick up the tomato and outputs
22:03
the sequence of motor commands for the
22:05
next, like, half second. That's the gist of it.
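(A minimal sketch of the hierarchical setup described here: a high-level model turns the user prompt plus the current image into the next language subtask, and a low-level policy turns that subtask plus the image into roughly half a second of motor commands. Both models are mocked with stand-in functions; the names and the 14-dimensional, 50 Hz action assumption are illustrative, not the actual Hi Robot implementation.)

```python
import random
from typing import List

def high_level_step(prompt: str, image: bytes) -> str:
    """Stand-in for the high-level model: reasons over the user prompt and the
    current image and emits the next language subtask. Mocked here."""
    if "allergic" in prompt or "no pickles" in prompt:
        return "pick up the tomato"   # skip the pickles, per the user request
    return "pick up the pickle"

def low_level_step(subtask: str, image: bytes, hz: int = 50) -> List[List[float]]:
    """Stand-in for the low-level policy: maps the subtask plus the current
    image to about half a second of motor commands (here, 14-D at 50 Hz)."""
    return [[random.uniform(-1, 1) for _ in range(14)] for _ in range(hz // 2)]

def run(prompt: str, steps: int = 4) -> None:
    image = b""  # placeholder for the latest camera frame
    for _ in range(steps):
        subtask = high_level_step(prompt, image)
        commands = low_level_step(subtask, image)
        print(f"{subtask}: sending {len(commands)} motor commands")
        # a real system would execute the commands, grab a new frame, and
        # let the user interject with a correction mid-episode

if __name__ == "__main__":
    run("make me a sandwich, I'm allergic to pickles")
```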
22:07
It was a lot
22:09
of fun because we actually got the
22:11
robot to make a vegetarian sandwich or
22:13
a ham and cheese sandwich or whatever.
22:16
We also did a grocery shopping example
22:18
and a table cleaning example and I
22:20
was excited about it first because it
22:22
was just like cool to see the
22:24
robot be able to respond to different
22:26
prompts and do these challenging tasks, and
22:29
second because it actually seems like
22:31
the right approach for solving the
22:33
problem. On the technical capabilities side one
22:35
thing I was wondering about a little
22:37
bit was: if I look at the
22:39
world of self-driving, there's a few different
22:42
approaches that are being taken, and one
22:44
of the approaches that is the more
22:46
kind of Waymo-centric one is really incorporating
22:48
a variety of other types of sensors
22:50
besides just vision. We have LIDAR and
22:52
a few other things
22:57
as ways to augment the self-driving capabilities
22:59
of a vehicle. Where do you think
23:01
we are in terms of the sensors
23:03
that we use in the context of
23:05
robots? So we've gotten very far just
23:07
with vision with RGB images even and
23:10
we typically will have one or multiple
23:12
external kind of what we call base
23:14
cameras that are looking at the scene
23:16
and also cameras mounted to each of
23:18
the wrists of the robot. We can
23:20
get very very far with that. I
23:23
would love it if we
23:25
could give our robot skin. Unfortunately, a
23:27
lot of the tactile sensors that are
23:29
out there are either far less robust
23:31
than skin, far more expensive, or very,
23:33
very low resolution. So there's a lot
23:36
of kind of challenges on the hardware
23:38
side there. And we found that actually
23:40
that mounting RGB cameras to the wrists
23:42
ends up being very, very helpful, and
23:44
probably giving you a lot of the
23:46
same information that tactile sensors can give
23:49
you. Because when I think about the
23:51
set of sensors that are incorporated into
23:53
a person, obviously to your point, there's
23:55
the tactile sensors, effectively, right? And then
23:57
there's heat sensors, there's actually a variety
23:59
of things that are incorporated that people
24:02
usually don't really think about much. Absolutely.
24:04
And I'm just sort of curious, like
24:06
how many of those are actually necessary
24:08
in the context of robotics versus not,
24:10
what are some of the things we
24:12
should think about, like, just if we
24:15
extrapolate off of humans or animals or
24:17
other, you know. It's a great question.
24:19
I mean, for the sandwich making, you
24:21
could argue that you'd want the robot
24:23
to be able to taste the sandwich
24:25
to know if it's good or not.
24:28
For smell it at least, you know.
24:30
Yeah, I've made a lot of arguments
24:32
for smell to Sergei in the past,
24:34
because there's a lot of nice things
24:36
about smell, although we've never actually attempted
24:38
it before. For example, and
24:40
I think like audio, for example, like
24:43
a human, if you hear something that's
24:45
unexpected, it can actually kind of alert
24:47
you to something. In many cases, it
24:49
might actually be very, very redundant with
24:51
your other sensors, because you might be
24:53
able to actually see something fall, for
24:56
example, and that redundancy can lead to
24:58
robustness. For us, it's currently not
25:00
a priority to look into these sensors,
25:02
because we think that the bottleneck right
25:04
now is elsewhere; it's on the data
25:06
front, on kind of the architectures
25:09
and so forth. The other thing I'll
25:11
mention is that actually, right now, most of
25:13
our policies do not
25:15
have any memory. They only look at
25:17
the current image frame. They can't remember
25:19
even half a second prior. And so
25:22
I would much rather add memory to
25:24
our models before we add other sensors.
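(A toy illustration of the memory point above: a policy that conditions only on the current frame versus a hypothetical variant that keeps a short buffer of recent frames. The class names and the 14-dimensional action are illustrative only, not how Physical Intelligence's models are built.)

```python
from collections import deque
from typing import Deque, List

class MemorylessPolicy:
    """Mirrors the limitation described above: the policy sees only the
    current camera frame, so it cannot remember even half a second prior."""
    def act(self, frame: bytes) -> List[float]:
        return [0.0] * 14  # placeholder action computed from this frame only

class ShortHistoryPolicy:
    """A hypothetical variant with memory: keep the last k frames in a buffer
    and condition the action on all of them."""
    def __init__(self, k: int = 10):
        self.history: Deque[bytes] = deque(maxlen=k)

    def act(self, frame: bytes) -> List[float]:
        self.history.append(frame)
        # a real model would encode every frame in self.history
        return [0.0] * 14

if __name__ == "__main__":
    p = ShortHistoryPolicy(k=5)
    for _ in range(8):
        p.act(b"")
    print("frames remembered:", len(p.history))  # capped at 5
```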
25:26
We can have commercially viable robots for
25:28
a number of applications without other sensors.
25:30
What do you think is a time
25:32
frame on that? I have no idea.
25:35
There are some parts of robotics that make it
25:37
easier than self-driving and some parts that
25:39
make it harder. On one hand, it's
25:41
harder because it's
25:43
just a much higher dimensional space. Even
25:45
our static robots have 14 dimensions,
25:48
seven for each arm. You need to
25:50
be more precise in many scenarios than
25:52
driving. We also don't have as much
25:54
data right off the bat. On the
25:56
other hand, with driving, I feel like
25:58
you kind of need to solve the
26:00
entire distribution to have anything that's viable.
26:03
You have to be able to handle
26:05
an intersection at any time of day
26:07
or with any kind of possible pedestrian
26:09
scenario or other cars and all that.
26:11
Whereas in robotics, I think that there's
26:13
lots of commercial use cases where you
26:16
don't have to handle this whole huge
26:18
distribution, and you also don't have as
26:20
much of a safety risk as well.
26:22
That makes me optimistic, and I think
26:24
that also, like, all the results in
26:26
self-driving have been very encouraging, especially, like,
26:29
the number of Waymos that I see
26:31
in San Francisco. Yeah, it's been very
26:33
impressive to watch them scale up by
26:35
usage. I think what I found striking about
26:37
the self-driving world is... there were
26:39
two dozen startups started roughly, I don't
26:42
know, 10 to 15 years ago around
26:44
self-driving. And the industry is largely consolidated,
26:46
at least in the US, and obviously
26:48
the China market's a bit different, but
26:50
it's consolidated into Waymo and Tesla, which
26:52
effectively were two incumbents, right? Google and
26:55
Tesla was an automaker. And then there's
26:57
maybe one or two startups that either
27:01
SPAC'd and went public or are
27:03
still kind of working in the area.
27:05
And then most of it's kind of
27:08
fallen off, right? And the set of
27:10
players that existed at that starting moment
27:12
has just consolidated. Do
27:14
you think that the main robotics players
27:16
are the companies that exist today? And
27:18
do you think there's any sort of
27:21
incumbency bias that's likely? A year ago,
27:23
like, it would be completely different. And
27:25
I think that we've had so many
27:27
new players recently. I think that the
27:29
fact that self-driving was like that suggested
27:31
that it might have been a bit
27:33
too early 10 years ago for it.
27:36
And I think that arguably it was,
27:38
like, I think deep learning has come
27:40
a long, long way since then. And
27:42
so I think that that's also part
27:44
of it. And I think that the
27:46
same with robotics, like if you were
27:51
to ask 10 years ago, or
27:51
even five years ago, honestly, I think
27:53
it would be too early. I think
27:55
the technology wasn't there yet. We might
27:57
still be too early. For all we
27:59
know, I mean, it's a very hard
28:02
problem. And like, how hard self-driving has
28:04
been, I think, is a testament
28:06
to how hard it is to build intelligence
28:08
in the physical world. In terms of
28:10
like major players, there's a lot that I've liked about the startup
28:15
environment, and a lot of things that
28:15
were very hard to do when I
28:17
was at Google. Google is an amazing
28:19
place in many, many ways, but like,
28:21
as one example, taking a robot off
28:23
campus was like almost a non-starter, just
28:25
for code security reasons. And if you
28:28
want to collect diverse data, taking robots
28:30
off campus is valuable. You can move
28:32
a lot faster when you're a smaller
28:34
company when you don't have... kind of
28:36
restrictions, red tape, that sort of thing.
28:38
The really big companies, they have a
28:41
ton of capital so they can last
28:43
longer, but I also think that there's,
28:45
they're going to move slower too. If
28:47
you were to give advice to somebody
28:49
thinking about starting a robotics company today,
28:51
what would you suggest they do or
28:53
where would you point them in terms
28:56
of what to focus on? I think
28:58
that actually like... trying to deploy quickly
29:00
and learn and iterate quickly. That's probably
29:02
the main advice, and try to, yeah,
29:04
like, actually get the robots out there,
29:06
learn from that. I'm also not sure
29:09
if I'm the best person to be
29:11
giving startup advice because I've only been
29:13
an entrepreneur myself for 11 months, but
29:15
yeah, that's probably the advice. Thank you.
29:17
Yeah, that's cool. I mean, you're running
29:19
an incredibly exciting startup. So I think
29:22
you have a full ability to suggest
29:24
stuff to people in that area for
29:26
sure. One thing I've heard a number of different
29:30
groups doing is really using observational data
29:32
of people as part of the training
29:35
set. How do you think about
29:35
that in the context of training robotic
29:37
models? I think that they can have
29:39
a lot of value, but I think
29:41
that by itself it won't get you
29:43
very far. And I think that there's
29:45
actually some really nice analogies you can
29:48
make where, for example, if you watch
29:50
like an Olympic swimmer race, even
29:52
if you had their strength, just their
29:54
practice at moving their own muscles to
29:56
accomplish what they accomplish
29:58
is like essential for being able to
30:01
do it or if you're trying to
30:03
learn how to hit a tennis ball
30:05
well you won't be able to learn
30:07
it by kind of watching the pros.
30:09
No. Maybe these examples seem a little
30:11
bit contrived because they're talking about like
30:14
experts. The reason why I make those
30:16
analogies is that we humans are experts
30:18
at motor control, low level motor control
30:20
already for a variety of things and
30:22
our robots are not. And I think
30:24
the robots actually need experience from their
30:26
own body in order to learn. And
30:29
so I think that it's really promising
30:31
to be able to leverage that form
30:33
of data, especially to expand on the
30:35
robot's own experience, but it's really going
30:37
to be essential to like actually have
30:39
the data from the robot itself.
30:42
Is that just general data that you're
30:44
generating around that, or would you
30:46
actually have it mimic certain activities, or
30:48
how do you think about the data
30:50
generation? Because you mentioned a little bit
30:52
about the transfer and generalizability. It's interesting
30:55
to ask, well, what is generalizable or
30:57
not, and what types of data are,
30:59
and things like that. I mean, when
31:01
we collect data, it's kind
31:03
of like puppeteering, like the original ALOHA
31:05
work, and then you can record both
31:08
the motions and the camera images, and so that
31:10
is the experience for the robot.
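(A toy sketch of what recording one teleoperation episode might look like: each control tick stores the camera frames and the commanded joint positions, with a language annotation for the whole episode. Field names are hypothetical, not the actual logging format.)

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Step:
    images: Dict[str, bytes]    # e.g. {"base": ..., "left_wrist": ...}
    joint_command: List[float]  # what the teleoperator commanded this tick

@dataclass
class TeleopEpisode:
    language: str               # annotation, e.g. "fold the shirt"
    steps: List[Step] = field(default_factory=list)

def record_step(episode: TeleopEpisode,
                images: Dict[str, bytes],
                joint_command: List[float]) -> None:
    """Append one control tick: the puppeteered joint command plus the camera
    frames observed at that instant. (A real logger would also store timestamps,
    gripper state, robot proprioception, and so on.)"""
    episode.steps.append(Step(images=images, joint_command=joint_command))

if __name__ == "__main__":
    ep = TeleopEpisode(language="fold the shirt")
    record_step(ep, {"base": b"", "left_wrist": b""}, [0.0] * 14)
    print(len(ep.steps), "steps recorded for:", ep.language)
```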
31:12
And then I also think that autonomous
31:14
experience will play a huge role just
31:16
like we've seen in language models after
31:18
you get an initial language model if
31:21
you can use reinforcement learning to have
31:23
the language model bootstrap
31:25
on its own experience, that's extremely valuable.
31:27
Yeah, and then in terms of what's
31:29
generalizable versus not, I think it all
31:31
comes down to the breadth of the
31:34
distribution. It's really hard to quantify or
31:36
measure how broad the robot's own experience is.
31:38
And there's no way to categorize the
31:40
breadth of the tasks, like how different
31:42
one task is from another, how different
31:44
one kitchen is from another, that sort
31:46
of thing. But we can at least
31:49
get a rough idea for that breadth
31:51
by like looking at things like the
31:53
number of buildings or the number of
31:55
scenes, those sorts of things. And then
31:57
I guess we talked a lot about
31:59
humanoid robots and other sorts of formats,
32:02
if you think ahead in terms of
32:04
the form factors that are likely to
32:06
exist in N years as this sort
32:08
of robotic future comes into play, do
32:10
you think there's sort of one singular
32:12
form or there are a handful? Is
32:15
it a rich ecosystem? Just like in
32:17
biology, like how do you think about
32:19
what's gonna come out of all this?
32:21
I don't know exactly, but I think
32:23
that my bet would be on something
32:25
where there's actually a... really wide range
32:28
of different robot platforms. I think Sergei,
32:30
my co-founder, likes to call it a
32:32
Cambrian explosion of different robot hardware types
32:34
and so forth. Once we actually
32:38
have the technology, the intelligence,
32:38
that can power all those different robots.
32:41
And I think it's kind of similar
32:43
to like, we have all these different
32:45
devices in our kitchen, for example, that
32:47
can do all these different things for
32:49
us. And rather than just like one
32:51
device that cooks the whole
32:54
meal for us. And so I think
32:56
we can envision like a world where
32:58
there's like one kind of robot arm
33:00
that does things in the kitchen
33:02
that has like some hardware that's optimized
33:04
for that and maybe also optimized for
33:06
it to be cheap for that particular
33:09
use case and another hardware that's kind
33:11
of designed for like folding clothes
33:13
or something like that, dishwashing, those sorts
33:15
of things. This is all speculation, of
33:17
course, but I think that a world
33:19
like that is something where, yeah, it's
33:22
I think different from what a lot
33:24
of people think about. In the book
33:26
The Diamond Age, there's sort of this
33:28
view of like matter pipes going into
33:30
homes and you have these 3D printers
33:32
that make everything for you. And in
33:35
one case you're like downloading schematics and
33:37
then you 3D print the thing and
33:39
then people who are kind of bootlegging
33:45
some of the stuff end up with
33:48
almost evolutionary-based processes to build hardware, and then
33:50
selecting against certain functionality is the mechanism
33:52
by which to optimize. But maybe you don't need
33:54
that much specialization if you have enough
33:56
generalizability in the actual underlying intelligence. I
33:58
think the world like that is very
34:01
possible. And I think that you can
34:03
make a cheaper piece of hardware
34:05
if you are optimizing for a particular
34:07
use case, and maybe it'd
34:09
also be a lot faster and so
34:11
forth. Yeah, obviously very hard to predict.
34:14
Yeah, it's super hard to predict because
34:16
one of the arguments for a smaller
34:18
number of hardware platforms is just supply
34:20
chain, right? It's just going to be
34:22
cheaper at scale to manufacture all the
34:24
sub components and therefore you're going to
34:27
collapse down to fewer things because easily
34:29
scalable, reproducible, cheap to make, etc. right?
34:31
If you look at sort of general
34:33
hardware approaches. So it's an interesting question
34:35
in terms of that tradeoff between those
34:37
two tensions. Yeah, although maybe we'll have
34:39
robots in the supply chain that can
34:41
manufacture any customizable device that you want.
34:43
It's robots all the way down. So
34:45
that's our future. Yeah. Well, thanks so
34:47
much for joining me today. It's a
34:49
super interesting conversation. We covered a wide
34:51
variety of things. I really appreciate your
34:53
time. Find us on Twitter
34:56
at no priors pod. Subscribe to our
34:58
YouTube channel if you want to see
35:00
our faces. Follow the show on Apple
35:02
podcast, Spotify, or wherever you listen. That
35:04
way you get a new episode every
35:06
week. And sign up for emails or
35:08
find transcripts for every episode at no
35:10
dash priors.com.