Episode Transcript
0:00
We just launched this experiment and we were very surprised to see that the hugely overparameterized model not only trains out of the box, with very nice training curves, but also doesn't overfit aggressively at all. And what we found empirically is that, just out of the box, using typical supervised training, we don't have to play with the hyperparameters or the optimizer, and you get very, very stable training. So this brings us to the question: is it worth it to spend so much money to gather a gigantic pre-training data set and spend months on many GPUs to produce those models, when at least for some applications it seems to be not much better than random?

0:49
Tufa AI Labs, based in Switzerland. They have an amazing team; you've seen many of the folks on the team. They acquired MindsAI, of course, and did a lot of great work on ARC. They're now working on o1-style models and reasoning and thinking and test-time computation. The reason you want to work for them is you get loads of autonomy, you get visibility, you can publish your research, and they are hiring ML engineers as well as a chief scientist. They really, really want to find the best possible person for this role, and they're offering a joining bonus. So if you're interested in working for them as an ML engineer or their chief scientist, get in touch with Benjamin Crouzier: go to tufalabs.ai and see what happens.
1:29
Originally, the main motivation was to see, okay, how much information do you gain by doing pre-training, right? Is this next-token prediction really making your network learn something about language and reasoning? And so we said, one way to compare this, at least empirically, is to just take a randomly initialized model and train it from scratch on a supervised task like sentiment prediction, sentiment analysis. In theory, because we have a very, very small training set, let's say 20,000 samples, and because the model has, like, 7 billion parameters, the pre-trained one will perform very nicely with a little bit of LoRA fine-tuning, because it already knows how to reason about the world, right? So maybe you just adjust a little bit to the specific task that you want, but since you have so much prior knowledge, you will solve the task very easily. But the random one will either overfit completely, because you have 7 billion parameters and only 20,000 training samples, or maybe it will not learn at all because the training dynamics will be completely unstable.
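To make the comparison concrete, here is a minimal sketch of the kind of setup being described, assuming a Hugging Face-style API; the backbone name, label count, and training details are illustrative placeholders rather than details from the episode.

```python
# Illustrative only: same architecture, trained on a small sentiment data set
# either from random initialization or from pre-trained weights.
from transformers import AutoConfig, AutoModelForSequenceClassification

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical choice of backbone

# Pre-trained weights, adapted to a 2-way sentiment head.
pretrained = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

# Same architecture, but randomly initialized: no next-token pre-training.
config = AutoConfig.from_pretrained(model_name, num_labels=2)
scratch = AutoModelForSequenceClassification.from_config(config)

# Both models would then be trained identically (e.g. with the standard
# Trainer) on ~20k labelled examples and their training/validation curves
# compared, which is the experiment described next.
```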
2:33
So we just launched this experiment and we were very surprised to see that the 7-billion-parameter, hugely overparameterized model not only trains out of the box, with very nice training curves, almost as if you were training on MNIST, but also doesn't overfit aggressively at all. It overfits less than if you just train an MLP on MNIST, basically. And this is very surprising. So from this we said, okay, actually, maybe there is a deeper question, which is how much implicit bias you have in those language models. Because we already knew from computer vision that, for example on ImageNet, you can have a 50-million-parameter model on a 1-million-sample data set, so you have this 50-to-1 ratio, and you have the implicit bias that prevents you from overfitting and lets you just solve the task, right? But still, it's 50 to 1, and that may already sound like a lot to, you know, a statistician. Now it's 7 billion to 20,000, so the ratio is gigantic, right? And yeah, to me it was very surprising that at the scale of this ratio you still learn something and do not overfit. This is very surprising because in vision, for example, transformers are known to overfit more easily than ResNets, so they seem, at least in vision, to have less implicit bias or implicit regularization. But at least with this type of next-token, causal architecture for LLMs, you don't seem to overfit easily to your data. So this was quite surprising.
3:56
Yeah, and we should bring in the name. So this was your workshop paper at the self-supervised learning workshop here at NeurIPS. Correct. "For perception tasks, is LLM pre-training by next-token prediction worth the cost?" So this is absolutely fascinating, right? We've been given this belief that we need to have these huge pre-trained models, trained on all the data on the internet, and it turns out that, certainly for discriminative tasks, things like classification rather than generation, you can actually just start from scratch with a fairly small model and sometimes get even better results.
4:29
Yeah, yeah, and even a small or even a large model: you just start from scratch and do this very simple supervised classification task, right? Okay, given this prompt, is it a good or bad sentiment, or what type of job is the prompt describing? You know, this type of, I would not call it reasoning, but more semantic classification. And it turns out that if you start from random, even with a small training data set, you will have performance that is sometimes as good as a pre-trained model. So this brings us to the question: is it worth it to spend so much money to gather a gigantic pre-training data set and spend months on many GPUs to produce those models? For some cases, for generation, all right, there is no question this is what you need to do: you have your next-token prediction, you learn how to generate samples. But at least for some applications it seems to be not much better than random. So it's quite interesting.
5:24
So what are the differences in the learned representations?

5:26
That's something we did not really look at, like the low-dimensional representation of what you learn. It's possible; some work tries to look at the attention entropy and those mechanistic-interpretability viewpoints on LLMs. So it would be interesting to see if you have this sort of neural-collapse thing that happens. Even if you are, like, a 7-billion-parameter model, maybe you end up learning a very, very, very simple sub-network that does the task, which would be like the lottery ticket hypothesis as well, and that naturally emerges from the training dynamics. Or is it really exploiting all the parameters? I think that's one thing we want to probe into more deeply to extend the workshop paper to a conference paper: what are the useful parameters, what did they learn, is each layer actually learning something, or maybe the first layers don't really learn anything and just the last few learn something? So yes, there are lots of open questions for this.
6:16
What does it tell us about the nature of understanding and maybe even intelligence? Because we think that the reason why these things understand is that they just have all of these representations of all of these different, you know, things in their experience. And now we can shortcut that, for want of a better word. What does that tell us?
6:38
Yeah, I think that's a good question. So in this case we only looked at very specific classification tasks. For example, for a job description, is it a good or bad sentiment? And this you are able to solve well, but you are not able to go out of distribution to solve a new type of question. For example, for this job description you cannot answer, okay, is this job paying more than that job? Because this was not present in the training data, right? So I think you get very good models, cheaply and quickly, from random initialization, but they will be very specialized. And I think the benefit of pre-training may come if you want to do more open-ended classification or reasoning. So it really depends on the type of application you want to solve, what your downstream task is, and how much you want to generalize to new scenarios. But at least now it shows that it's not the case that pre-training with next-token prediction is better for everything.
7:35
Five years ago, data scientists used to build specific models for doing everything. And now we're in this regime of: we need these really big models, we do in-context learning and maybe even some fine-tuning, and we get them to do fairly specific discriminative tasks. But now you're saying we should almost go back to where we were five years ago and start building specialized models again. Only now, rather than building classification models, we're still using the transformers and the LLMs, but we're making them do specific tasks.
8:04
Specific tasks: use this prior knowledge to have a nice architecture and a supervised data set for that, and just train on that from scratch. This is something that's probably going to work much better, but again you need to make sure that the downstream application will never go too far out of distribution. That's why it really depends on the application and the type of use cases that you have. But I think at least here it shows that there exist some tasks where next-token prediction is not the answer. And in fact, it's not just that it is not the answer: it's not better than random initialization, which is really sort of the worst-case scenario.
8:39
Interesting. I mean, from a fairness and bias point of view, a lot of people say that, you know, large language models are bad in a way because of the dominance of North American culture and so on. But you could also argue the converse, which is that the good thing about them is that they do have some awareness of values, you know, so we can fine-tune them to have guardrails and to sort of say the right thing and so on. Is that harder to do with this approach?
9:07
Yeah, so here, because you are in a fully supervised setting, you don't have as much flexibility to, let's say, change the behavior of your model, or it will have to take the form of supervised fine-tuning. And because you don't have a generative capability, it certainly restricts the type of interaction you can have with the model and how you can improve it, right? The output is just: okay, is it a good or bad sentiment? Instead of something that gives you a full answer, which you can then try to argue against and generate a fine-tuning data set from, it's just okay, good, bad, and that's it.
9:42
Another thing is training strategy. So, you know, the big players building these LLMs have lots of internalized knowledge: even the order in which you train the language models, everything is important. Certainly in the old days of basic models, you know, you would just throw in a load of data and no one really cared. So now, do people need to be thinking about that specialized knowledge, maybe thinking about curriculum learning and all of this kind of stuff?
10:11
Yeah, so this is a good point. We did a paper recently called "The Fair Language Model Paradox", where we show that when you do this next-token prediction, because you have some tokens that are very low-frequency, it's very hard to train on them and it takes very long training. So it's very wasteful, right? And the problem is that with this next-token prediction loss you need to really capture the whole distribution of tokens, and so you spend a lot of time on it. If the low-frequency tokens are not useful for solving your task, you actually don't need to capture them at all. So in terms of training dynamics, this is actually a much simpler problem in many cases. And what we found empirically is that, just out of the box, using typical supervised training, we don't have to play with the hyperparameters or the optimizer, and you get very, very stable training. So one thing that could also be interesting for future work is to see: is this something that is easier to optimize, and maybe that's why those 7-billion-parameter models can learn and not overfit on, like, 10,000 samples? And then it also suggests that maybe this on its own could be a better initialization for next-token prediction as well. This is very open, up in the air, but maybe you could think of a simpler supervised objective that would be a better pre-training solution, which you could then use for next-token prediction if you wanted to. But at least it would be a better starting point than random. So you would almost reverse the training.
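The rare-token point above can be checked with a small diagnostic along these lines; this is an illustrative sketch, not code from "The Fair Language Model Paradox", and the tensor names and bucketing scheme are assumptions.

```python
# Illustrative only: group the per-token cross-entropy by corpus frequency of
# the target token, to see whether low-frequency tokens are the slow ones.
import torch
import torch.nn.functional as F

def per_token_loss_by_frequency(logits, targets, token_counts, n_buckets=4):
    """logits: (batch, seq, vocab); targets: (batch, seq);
    token_counts: (vocab,) corpus frequency of each token id."""
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    # Bucket target tokens by corpus frequency (bucket 0 = rarest).
    freq = token_counts[targets.reshape(-1)].float()
    edges = torch.quantile(freq, torch.linspace(0, 1, n_buckets + 1)[1:-1])
    buckets = torch.bucketize(freq, edges)
    return {b: loss[buckets == b].mean().item() for b in range(n_buckets)}
```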
11:40
So we've spoken about two extremes. On one extreme we have pre-training, and you can use it for any downstream task. And on the other extreme, you know, you start from scratch with just one task. Is there an intermediate solution? What if I did this new approach but for multitask, let's say for five tasks?
12:00
Yeah, yeah, so that's a great question. If you really think about it, in the limit you could formulate this as next-token prediction or not. So in the extreme case you could just recover next-token prediction on one end, and on the other end you have what we have here: just one task, very coarse, high-level, predict if it's a good or bad sentiment or whatever. In between you have a huge spectrum that you can exploit, and if you can find, as you said, maybe five very different, representative tasks, this should be enough, or could be enough, to learn a representation that is as general as possible, and then you can use it for new tasks that come along. So I think the research question is how to design the minimum number of tasks so that you get as diverse a representation as possible. And of course we don't want to go to the extreme of just doing next-token prediction again. But this is a very, very nice research question, because if you have this spectrum and you can control where you want to be on it, then you can really make a per-use-case choice. It's not: okay, you are always here or always there. Tell me what you want to do, how many new tasks you expect your model to be exposed to, and I tell you where you need to be on this spectrum. So this could be very interesting as well.
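A hypothetical sketch of that intermediate point: one shared encoder with a handful of task-specific heads trained jointly. The number of tasks, class counts, and shapes are placeholders, not something specified in the episode.

```python
# Illustrative multitask setup: shared representation, several small heads.
import torch.nn as nn

class MultiTaskClassifier(nn.Module):
    def __init__(self, encoder, feat_dim, n_classes_per_task):
        super().__init__()
        self.encoder = encoder                      # shared representation
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, c) for c in n_classes_per_task])

    def forward(self, x, task_id):
        # Route the shared features through the head for the given task.
        return self.heads[task_id](self.encoder(x))

# e.g. five diverse tasks: sentiment (2 classes), topic (10), job type (20),
# language (50), formality (3):
# model = MultiTaskClassifier(encoder, feat_dim=768,
#                             n_classes_per_task=[2, 10, 20, 50, 3])
```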
13:19
Very cool, very cool. It does make me think, though: these models understand through naive statistical alignment, and is it possible that the benchmarks we use just don't capture it, you know, that the gap in understanding we've lost by moving away from the pre-trained models isn't being captured?
13:34
Yeah, I think because, especially in recent years, we focus a lot on generative methods, all the evaluation and the type of objectives we set for ourselves are really about good generation, right? Even if you want to answer a question, you need to generate a good explanation, you need to understand what the intermediate steps are. And I think the fact that we focus on generative models means that we completely bias the evaluation and the way we approach this thing, and maybe you could still have knowledge that is learned without being able to generate anything. So I think this is also something that could be interesting to look at, or at least keep in mind when we explore those models.
14:14
But philosophically, though, isn't generation analogous to thinking in some sense? So aren't models that generate smarter in some deep way?
14:20
Probably what you want to do is imagine what could be, but I don't think you want to do generation with very granular details, like next-token generation. Because if you think about it, even just in terms of classification tasks, you have a lot of different uncertainty depending on the token. If I start the sentence, "Okay, I saw this movie for ... minutes," there is no way you can tell what the next token after "for" was, right? You know it would be some period of time, maybe one hour, ten minutes, two hours, but do you really need to be able to generate the "52 minutes", or whatever the answer was, to understand that I was seeing a movie and therefore staying in one place for at least more than five seconds, right? So I think the token is way too granular. If you had a concept token, that's where you could start seeing, okay, this is meaningful, because that's closer to maybe what we do. But right now we are very, very, very low-level, because tokenization is a lossless compression, right? It is too close to the raw data. And yet we have it easy compared to computer vision, because in language you already work with a very compressed representation of knowledge; but still, the token is probably too low-level.
15:39
Well, that was a fascinating paper. Let's move on to your next one, "The Birth of Self-Supervised Learning: A Supervised Theory", and that was with Yann LeCun. Yes. And, yeah, basically you said that the observed differences between self-supervised learning and supervised learning are not due to the loss functions themselves but rather to the labelling of the data set used in training. Give us the elevator pitch.
15:59
Yeah, so basically what we show in this paper is that you can have a supervised objective, let's say least squares with the labels to make it simple, and you can turn this objective, which tries to map sample x_n to prediction y_n, into a self-supervised learning objective, which tries to compare samples with each other. So basically you go from saying "okay, this image is a car or a dog" to saying "are those two images the same or not", which is the self-supervised, joint-embedding type of world. And you can show that, whether you have labels or you have knowledge of this pairwise relationship, they actually learn the same representation, up to some symmetries that are irrelevant if you do linear probing. So the loss functions themselves, the SSL one or the supervised one, try to do the same thing. They just operate on a different view of the labelling: whether this image is that class, or whether those two images or two samples represent the same thing. Given that, the next question is: okay, how come self-supervised learning is able to generalize better than supervised learning? From this perspective, what you can say is that it's as if SSL were solving a supervised task where the labels are not about mapping all the cars to "car", but are very, very fine-grained labels, where in the limit each image is its own class, basically. If you think about supervised learning in this extreme setting, you also don't overfit to the task, because you don't collapse any image onto another one, and so, theoretically speaking, you can solve as many downstream tasks as you want. So this equivalence of losses at least brings a slightly new perspective: it's not really about the objective, it's more about how you design the SSL pipeline, how you say this sample is related to that sample. It's not the objective that makes you learn a better representation.
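A toy sketch of that relabelling idea, under the assumption of a least-squares-style match between sample similarities and a pairwise "same class" matrix; this is an illustration of the intuition, not the paper's exact construction.

```python
# Toy illustration: replace "which class is x_n" with a pairwise matrix G,
# where G_ij = 1 when samples i and j share a class, and train embeddings so
# that sample-to-sample similarities match G.
import torch

def pairwise_relation_matrix(labels):
    """labels: (n,) int class labels -> (n, n) 0/1 'same class' matrix."""
    return (labels[:, None] == labels[None, :]).float()

def ssl_style_loss(embeddings, G):
    """Least-squares match between pairwise similarities and G."""
    Z = torch.nn.functional.normalize(embeddings, dim=1)
    return ((Z @ Z.T - G) ** 2).mean()

# In the unsupervised case, G would instead come from prior knowledge, e.g.
# two augmented views (or two consecutive video frames) get G_ij = 1.
```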
17:44
Okay, and in the paper you were talking about how SSL can maximize the worst-case downstream task performance. Can you sketch that?
17:53
Yeah, so basically, if you think about all the possible realizations of downstream tasks, you could have some very coarse-scale ones: maybe different pictures of cars and buses, and you just want to say whether it is a car or a bus, so no details need to be encoded to solve that. But then you can have downstream tasks that require much finer-grained information. So the point is that you want to learn a representation such that, if you look at the distribution of downstream-task performance, you are able to be as good as possible on most of them. You don't want to be very good on some and then, in the tail, very bad on the majority of them. And from this you can try to say, okay, what would be the labelling that makes your worst case as good as possible, and from that you can say, okay, this is actually the labelling that self-supervised learning is implicitly doing.
18:47
How does the class balance affect the difference in the losses?

18:49
Oh yeah, this is a very good point. Actually, in a follow-up paper we are doing right now, we show that current SSL objectives assume class balancedness. This is something we already highlighted briefly in the "self-supervised learning assumes a uniform cluster prior" paper we did a couple of years ago: current SSL objectives assume a balanced representation of classes or concepts. This means that if you train on ImageNet, things work out very well, because concepts are sort of equally represented. But if you go to other data sets like iNaturalist, which are very heavy-tailed, then you have a huge bias in your representation. Until now, people did not really know how to solve this. One way people approach it is through data curation: they say, okay, I'm just going to remove the oversampled concepts to try to make the data more uniform, and then I do self-supervised learning on that. But because we now have this theoretical formulation and this equivalence of losses, we can use the exact same techniques that people used in supervised learning to re-weight depending on the frequency of classes. We can use that to come up with a new self-supervised learning loss that takes this imbalance into account. This type of thing is enabled by the mathematical formulation, and it's principled: for the way we do this weighting, you can prove that it is the right way to do it from this supervised theory. And this is really nice, because suddenly, from this seemingly naive connection, you can come up with a new generation of self-supervised learning models that can actually match what the real-world data distribution is like. So: non-uniform distributions of classes, and maybe even, if some samples are noisier than others, you can include that information as part of the SSL objective as well. Suddenly you have a new world of possibilities, and because there is this connection, you can actually prove, okay, this is the right way to do it, at least from this supervised-theory viewpoint.
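Continuing the toy sketch above, here is one assumption-laden illustration of how the standard supervised trick of inverse-class-frequency weighting could be carried over to a pairwise objective; it is not the loss from the follow-up paper.

```python
# Illustrative only: inverse-frequency weights applied to the pairwise loss,
# mirroring class-weighted supervised losses for imbalanced data.
import torch

def class_weighted_pairwise_loss(embeddings, labels):
    Z = torch.nn.functional.normalize(embeddings, dim=1)
    G = (labels[:, None] == labels[None, :]).float()   # 'same class' matrix
    counts = torch.bincount(labels)                     # class frequencies
    w = 1.0 / counts[labels].float()                    # per-sample weight
    W = w[:, None] * w[None, :]                         # pairwise weights
    return (W * (Z @ Z.T - G) ** 2).sum() / W.sum()
```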
20:43
You also pointed out a connection to VICReg.

20:45
Exactly. So basically what we show in the paper is that if you take a least-squares supervised type of objective and you turn it into an SSL one, what you obtain is basically VICReg. Then you have a few variations, it could be VICReg or W-MSE, depending on how you go from supervised to SSL, but you can show that depending on the type of supervised loss, you recover different types of SSL losses. If you look more at cross-entropy supervised learning, it is going to be more like a SimCLR type of loss, but you have this one-to-one correspondence. And this is also very nice, because in supervised learning, at least, you know when one loss may be preferred over another, and this has been studied for a long time, right, because supervised learning has been around forever, so now we can reuse those insights for self-supervised learning. To me this is also a very, very strong benefit of this: suddenly all the theory and, like, the thousands of papers that have been done in supervised learning, we can just take and apply in SSL. Another example is neural collapse, which has been proven in the supervised setting; now it applies, like, in five lines, in the SSL setting as well. So this connection really goes beyond just saying, okay, it's not the objectives that make SSL better. It's really tying those two huge communities together towards the goal of a single unified objective to learn representations. And this is nice too, because if you speak to people, they will think, okay, you have supervised learning on one side, SSL on the other side, and basically you are either in one camp or the other. But now what we show is that SSL is actually pretty much everything in representation learning, and supervised learning is just one realization of SSL. Then VICReg without labels is another one, and this one is another one. So you really get a better understanding of this relationship and of what representation learning is trying to do.
22:41
Galaxy-brain question incoming: could you combine SSL and supervised objectives in some way to improve generalization?
22:48
Yes, yes. So there is one paper, which is supervised contrastive learning. The way they do it is that they use the labels within a SimCLR framework to basically do fully supervised learning, but with a SimCLR objective. So first of all, we can show that this indeed makes sense and that we can explain the empirical results that they get. But actually we can do a little bit more than that. If you are in a semi-supervised setting, for example, it may not be clear how to combine those two losses anymore. Maybe you could say you take the two and weight them with a coefficient, but then you need to cross-validate and so on. Now, from this perspective, you can combine them in a very principled way, and you can understand which weighting makes sense depending on how many samples you have in one or the other. And you can use all the literature, again, from supervised learning for this setting as well. So this is something we can do very easily with this formulation as well.
23:45
Okay, so if SSL and supervised learning are two sides of the same coin, I mean, of course we can use this theoretical framework to design new forms of SSL frameworks, but, you know, is the distinction even relevant if they are the same thing?
24:00
I think it's not just two sides of the same coin: SSL is more general than supervised learning. So SSL could really be the more general objective for learning representations. The more prior knowledge you have, the more you know about your labels, and then SSL slowly becomes supervised learning through the labels that you use for the SSL objective. But then, because, as you said, you have this hierarchy, it does not really make sense to say you have either supervised learning or SSL. What makes sense is to ask: okay, what is this relation matrix, this pairwise matrix? If you build it from labels, it's supervised learning; if you build it from other a priori knowledge, for example, that two consecutive frames in a video basically have the same class, then you are more in an unsupervised SSL setting. But it's all about how you build this pairwise relation matrix; that's the main question.
24:51
Very cool. Right, let's move on to your next paper, "No Location Left Behind: Measuring and Improving the Fairness of Implicit Representations for Earth Data". So there are loads and loads of modeling frameworks now that do these implicit neural representations of geospatial Earth data, for things like climate modeling, resource allocation, environmental modeling. I was actually interviewing Johannes from NXAI, I don't know if you know him, but he's working on similar stuff.
25:15
Okay. So basically what we show is that when you want to model, let's say, temperature or precipitation to make it simple, and you want to learn, for example, an implicit neural representation, it means that you want a model such that if you give it a location and a date, it can predict what the temperature was there. If you have this type of implicit neural representation, it's very good, because if you learn a nice model, then you can actually interpolate those values, so maybe estimate what the temperature was in a part of the globe where you do not have a sensor. But you can also do extrapolation: if you assume you really learned the true physical model of the world, you could say, okay, what will the temperature be two years from now, right? It's useful for all sorts of applications. The thing is that when you do this nowadays, depending on the architecture and the different design choices that you make, you will maybe have a very good prediction on average, when you look at the aggregate performance around the world globe, but if you look, for example, around islands or coastal areas, your prediction is going to be very bad, almost random. This is something that can be very concerning, because if you use this type of model to decide about a policy that will affect a specific island, using the model's prediction is as good as using random guesses. So it can be very detrimental, and people need to be aware of those biases. What we found is that, for example, for this type of climate data, islands are often disregarded, as are coastal areas, basically regions where you have a big gradient in the type of data that you try to model.
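As a rough illustration of the kind of implicit neural representation being discussed, here is a small coordinate MLP fit only where observations exist; the architecture, shapes, and data are placeholders, not the paper's setup.

```python
# Illustrative coordinate-based INR: f(lat, lon, time) -> temperature,
# trained only on coordinates where sensor readings exist.
import torch
import torch.nn as nn

class EarthINR(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):          # coords: (n, 3) = (lat, lon, time)
        return self.net(coords).squeeze(-1)

model = EarthINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# coords/temps would come from sensor records; missing readings are simply
# never added to the training set, which is one advantage of an INR.
coords = torch.rand(1024, 3)            # placeholder observations
temps = torch.rand(1024)
for _ in range(100):
    opt.zero_grad()
    loss = ((model(coords) - temps) ** 2).mean()
    loss.backward()
    opt.step()
```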
26:51
How much of a responsibility do modelers have to detect these kinds of biases in the data?

26:53
So I think there are two components, as you said. One could be that the dynamics of the data you are trying to model are just harder near an island, or maybe even unpredictable because you don't have enough observations. So you have some uncertainties that you can probably never recover from with good design. But still, what we found here is that a lot of the bias comes from the architecture and how you encode those positions, the type of basis you use to do the prediction. So right now it seems that a big chunk of the bias comes from the architecture, but I totally agree that I don't think we can remove the bias entirely, because there are maybe just different types of uncertainty in different parts of the planet as well.
27:39
I mean, the world is a very, very complicated place. Realistically, to what extent can we mathematically model it?

27:46
Yeah, that's a good question. I think it depends on the type of horizon that you have and the type of data that you want to model. If you have a system that is much more chaotic, or that can vary very quickly without much change in the past observations, that's something current models have a very hard time with. If you want to predict something else, for example temperature in North America, not near the coastal area, so really inland, maybe that's where you have less gradient dynamics, things are a bit more stationary, especially through time, so then it can become much better. But I think at this point we don't have an architecture that is really able to understand that you have different physics, different dynamics models, in different parts of the globe. And because of this, you just pick what's best on average, and it means you miss out on a lot of details.
28:38
Can you tell us about some of the technical framework?

28:40
So one thing we showed, at least for this type of globe data representation, is that people use a Fourier basis to model the prediction. That is better than not using any basis at all, but it implies that the type of signal you're predicting is very stationary and not localized at all. And this is a very strong prior, right? It may be true for some things, but for other things like precipitation or temperature, where you have localized, very high gradients, it's a strong bias. And if you come from the signal processing community, you know very well that to get better localization you go from Fourier to wavelets. So that's one thing we did in this paper: we show that using wavelet bases to encode the data gives you better localization, and this removes some of the biases. Here it's more of a proof of concept that different design choices give you different bias trade-offs; wavelets are not the answer to everything, right? But I think the next step is to encode less and less a priori which basis to use, and let the model learn it from the data on its own. We are not yet at this point, at least for this type of climate data.
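The basis choice being discussed can be made concrete with a simplified Fourier-feature encoding like the one below; this is an assumed, generic variant, and the paper's wavelet construction is not reproduced here. A global sinusoidal basis like this is stationary and non-localized, which is exactly the prior being criticized for islands and coastlines.

```python
# Sketch of a Fourier-feature coordinate encoding (assumed, simplified).
# A localized (wavelet-style) basis would replace these global sinusoids to
# better capture sharp gradients near coastlines and islands.
import math
import torch

def fourier_features(coords, n_freqs=8):
    """coords: (n, d) in [0, 1] -> (n, 2 * d * n_freqs) sin/cos features."""
    freqs = (2.0 ** torch.arange(n_freqs)) * math.pi     # (n_freqs,)
    angles = coords[..., None] * freqs                    # (n, d, n_freqs)
    feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return feats.flatten(start_dim=1)
```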
29:53
How could it handle noisy or missing data?

29:55
That really depends on the type of model you use. For example, if you have an INR, then you will not use the missing data as part of your training pipeline, and that's one of the benefits. So if one of your sensors stopped recording during some years, you just don't include that as part of your training data, because you really control where you have the data and, when you have it, what the prediction should be.
30:21
So these Earth models are now informing policy around the world. Who should we hold accountable? I mean, is it the technology, is it the scientists who design the models, is it the policy makers who interpret the results?

30:34
I think it's very hard for the person who designs the model to know a priori what it is going to be used for. So I think it's more downstream: when you know clearly what you want to do with it, you should first set up a nice evaluation pipeline to make sure it's something you can actually use to make those decisions. Then you can report any issues with your model for people to improve on the design, but prior to that it's very hard to imagine what the model will be used for. So in an ideal setting you would wish there were no bias at all, but in practice, the world of possibilities being so large, it needs to be more of a feedback loop: you iterate until you have something that you can really trust, and then you can act on it.
31:20
Modeling data is very anthropocentric, right? So, you know, we focus on human populations and so on. Should we also focus on, you know, ecosystems and places that have nothing to do with humans?

31:31
Oh yeah, that's a great question, and in fact that's one of the big issues with a lot of these data sets, which are crowd-sourced: by definition, the amount of data that you get is proportional to the number of users you have in a given location. And this means you have a huge bias in what your model is learning and what your model is focusing on, which means you miss out on a lot of things. So that's one thing: crowdsourcing can give you a lot of data quickly, but it's very biased data. Then the question is, how much of this biased data should you have, versus maybe paying a lot more and capturing other parts of the globe? And maybe you could show that, under some specific conditions, with just 10% of the data being high quality and uniformly sampled, and the other 90% crowd-sourced, you can use those 10% to anchor your representation and then use all the data together. But there is a huge amount of research questions in that, because it's a very big source of bias.
32:35
And there's a bit of a policy question, but we are using these things, you know, to do resource allocation, right? Giving more resources to some populations might be taking them away from others, and then there's the fairness-over-time thing as well, which is that what is fair now might not be fair in a hundred years' time. So how should we think about it?
32:54
Yeah, that's a good question. I think this is also very application-dependent. If you want to predict where to build a house to solve some specific problem, maybe you don't really mind having bad predictions where there is no population anyway, because you are not going to build a house there. So in that case, maybe the crowd-sourced type of data is actually good, but it really depends on the type of application. And just one thing I would say regarding the point you made before: this type of bias is actually something that you have in computer vision as well. There is a very nice paper showing that most of the data we have, like in ImageNet, is from North America. And so maybe you reach, like, 90% state-of-the-art performance at predicting, for example, types of chairs, but only for North American models of those objects. When you start looking at types of cars or chairs from Africa or East Asia, the model performance is extremely bad. So this type of problem is something you have across modalities, and it's a very big issue.
34:09
It's always a pleasure and an honor to have you on the show. Thank you so much.

34:20
Thank you so much.