Prof. Randall Balestriero - LLMs without pretraining and SSL

Released Wednesday, 23rd April 2025

Episode Transcript

0:00

We just launched this experiment and then we were very surprised to see that the hugely overparameterized model not only trains out of the box, like, you have very nice training curves, but also it doesn't overfit aggressively at all. And what we found empirically is that we just, out of the box, use typical supervised training. We don't have to play with the hyperparameters or the optimizer, and you have very, very stable training. So this brings us to the question: is it worth it to spend so much money to gather a gigantic pre-training dataset and spend, like, months on many GPUs to produce those models, when at least for some applications it seems to not be much better than random?

0:49

Tufa Labs, based in Switzerland. They have an amazing team; you've seen many of the folks on the team. They acquired MindsAI, of course, and they did a lot of great work on ARC. They're now working on o1-style models and reasoning and thinking and test-time computation. The reason you want to work for them is you get loads of autonomy, you get visibility, and you can publish your research. They are hiring ML engineers, and they are also hiring a chief scientist; they really, really want to find the best possible person for this role, and there's a joining bonus. So if you're interested in working for them as an ML engineer or as their chief scientist, get in touch with Benjamin Crouzier, go to tufalabs.ai, and see what happens.

1:29

Originally, the main motivation was to see, okay, how much information you gain by doing pre-training, right? Is this next-token prediction really making your network learn something about language and reasoning? And so we said, one way to compare this, at least empirically, is to just take a randomly initialized model and train it from scratch on a supervised task like sentiment prediction, sentiment analysis. In theory, because we have a very, very small training dataset, let's say 20,000 samples, and because the model has, like, 7 billion parameters, the pre-trained one will perform very nicely with a little bit of LoRA fine-tuning, because it already knows how to reason about the world, right? So maybe you just adjust a little bit to the specific task that you want, but since you have so much prior knowledge, you will solve the task very easily. But the random one either will overfit completely, because you have, like, 7 billion parameters and only 20,000 training samples, or maybe it will not learn at all, because the training dynamics will be completely off.

2:33

So we just launched this experiment, and then we were very surprised to see that the 7-billion-parameter, hugely overparameterized model not only trains out of the box, with very nice training curves, almost like you are training on MNIST, but also doesn't overfit aggressively at all. It overfits less than if you just train an MLP on MNIST, basically, and this is very surprising. So from this we said, okay, actually, maybe there is a deeper question, which is how much implicit bias you have in those language models. We already knew from computer vision that, for example on ImageNet, you can have a 50-million-parameter model on a 1-million-sample dataset, so you have this 50-to-1 ratio, and you have the implicit bias that prevents you from overfitting and lets you just solve the task, right? But still, it's 50 to 1. That may already sound like a lot to a statistician, but now it's, like, 7 billion to 20,000, so the ratio is gigantic, right? And yeah, to me it was very surprising that at this scale, at this ratio, you still learn something that does not overfit. This is very surprising because in vision, for example, transformers are known to overfit more easily than ResNets, so at least in vision they seem to have less implicit bias or implicit regularization. But at least with this type of next-token, causal architecture for LLMs, you don't seem to overfit easily to your data. So this was quite surprising.

3:56

Yeah, and we should bring in the name. So this was your workshop paper at the self-supervised learning workshop here at NeurIPS. Correct. "For perception tasks, is LLM pre-training by next token prediction worth the cost?" So this is absolutely fascinating, right? We've been given this belief that we need to have these huge pre-trained models, trained on all the data on the internet, and it turns out that, certainly for discrimination tasks, so things like classification rather than generation, you can actually just start from scratch with a fairly small model and sometimes get even better results.

4:29

Yeah, yeah, and even a small or even a large model: you just start from scratch, you do this very simple supervised classification task, right? Okay, given this prompt, is it a good or bad sentiment, or what type of job is the prompt describing? You know, this type of, I will not call it reasoning, but more semantic classification. And it turns out that if you start from random, even if you have a small training dataset, you will have performance that is sometimes as good as a pre-trained model. So this brings us to the question: is it worth it to spend so much money to gather a gigantic pre-training dataset and spend, like, months on many GPUs to produce those models? For some cases, so for generation, all right, there is no question this is what you need to do: you have your next-token prediction, you learn how to generate samples. But at least for some applications it seems to not be much better than random. So it's quite interesting.

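To make the setup concrete, here is a minimal sketch of the kind of comparison being described: the same classification architecture trained either from random initialization or from pretrained weights, with an identical supervised loop. The model name, data and hyperparameters below are placeholders (the experiments discussed use models at the 7B-parameter scale and on the order of 20,000 labelled examples), so treat this as an illustration rather than the paper's actual code.

```python
# Sketch of the comparison described above: the same architecture trained for
# sequence classification either from random initialization or from pretrained
# weights, with an identical supervised loop. "gpt2" and the tiny batch below
# are placeholders; the experiments discussed use ~7B-parameter models.
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

model_name = "gpt2"                        # stand-in for a much larger causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

config = AutoConfig.from_pretrained(
    model_name, num_labels=2, pad_token_id=tokenizer.pad_token_id)

# (a) random initialization: same architecture, no pretraining
scratch_model = AutoModelForSequenceClassification.from_config(config)
# (b) pretrained backbone with a fresh classification head
pretrained_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, pad_token_id=tokenizer.pad_token_id)

texts = ["I loved this movie", "Worst film I have ever seen"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

# Identical supervised training step for both models.
for model in (scratch_model, pretrained_model):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    out = model(**batch, labels=labels)    # cross-entropy on the class labels
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Both variants see exactly the same labels and optimizer; the only difference is whether the backbone weights start from pretraining.
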

5:24

So what are the differences in the learned representations?

So that's something we did not really look at, like the low-dimensional representation of what you learn. It's possible; some work tries to look at the attention entropy and, like, those mechanistic interpretability viewpoints of LLMs. So it would be interesting to see if you have this sort of neural collapse thing that happens. So even if you have, like, 7 billion parameters, maybe you end up learning a very, very, very simple sub-network that does the task. It would be like the lottery ticket hypothesis as well. Does that naturally emerge from the training dynamics, or is it really exploiting all the parameters? I think that's one thing we want to probe into more deeply to extend the workshop paper to a conference paper. What are the useful parameters? What did they learn? Is each layer actually learning something, or maybe the first layers don't really learn anything and just the last few ones learn something? So yes, there are lots of open questions for this.

6:19

What does it tell us about the nature of understanding, and maybe even intelligence? Because we think that the reason why these things understand is that they just have all of these representations of all of these different things in their experience, and now we can shortcut that, for want of a better word. What does that tell us?

6:36

Yeah, I think that's a good question. So in this case we must look at very specific classification tasks. For example, for a job description, is it a good or bad sentiment? And that you are able to solve well, but you are not able to go out of distribution to solve a new type of question. For example, for this job description, you cannot answer, okay, is this job paying more than this job? Because this was not present in the training data, right? So I think you get very good models, cheaply and quickly, from random initialization, but they will be very specialized. And I think the benefit of having the pre-training may come if you want to do more open-ended classification or reasoning. So I think it really depends on the type of application you want to solve, what your downstream task is, and how much you want to generalize to new scenarios. But at least now it shows that it's not the case that pre-training with next-token prediction is just better for everything.

7:35

Five years ago, data scientists used to build specific models for doing everything. And now we're in this regime of: we need these really big models, and we do in-context learning and maybe even some fine-tuning, and we get them to do fairly specific discriminative tasks. But now you're saying we should almost go back to where we were five years ago and start building specialized models again. Only now, rather than building classification models, we're still using the transformers and the LLMs, but we're making them do specific tasks.

8:04

Specific tasks, yes: use the prior knowledge to have a nice architecture and a supervised dataset for that, and just train that from scratch. This is something that's probably going to work much better, but again you need to make sure that the downstream application will never go too far out of distribution. So that's why it really depends on the application and the type of use cases that you have. But I think at least here it shows that there exist some tasks where next-token prediction is not the answer. And in fact, it's not just not the answer; it's not better than random initialization, which is really sort of the worst-case scenario.

8:39

Interesting. I mean, from a fairness and bias point of view, a lot of people say that, you know, large language models are bad in a way, because there's the dominance of North American cultures and so on. But you could also argue the converse, which is that the good thing about them is that they do have some awareness of values, you know, so we can fine-tune them to have guardrails and to sort of say the right thing and so on. Is that harder to do with this approach?

9:07

Yeah, so here, because you are in a fully supervised setting, you don't have as much flexibility to, let's say, change the behavior of your model, or it will have to take the form of supervised fine-tuning. And because you don't have a generative capability, it certainly restricts the type of interaction you can have with the model and how you can improve it, right? Because the output is just: okay, is it a good or bad sentiment? It's not something that gives you a full answer that you can then try to argue against and generate a fine-tuning set from; it's just okay, good, bad, and that's it.

9:42

Another thing is training strategy. So, you know, the big players building these LLMs have lots of internalized knowledge around, you know, even the order in which you train the language models; everything is important. Certainly in the old days of, like, basic models, you just stick in a load of data and no one really cares. So now, do people need to be thinking about that specialized knowledge, maybe thinking about curriculum learning and all of this kind of stuff?

10:11

Yeah, so this is a good point. So we did a paper recently called "The Fair Language Model Paradox", where we show that when you do this next-token prediction, because you have some tokens that are very low frequency, it's very hard to train on them and it takes a very long training. So it's very wasteful, right? And the problem is that because you use this next-token prediction, you need to really capture the whole distribution of tokens, and so you spend a lot of time on that. If the low-frequency tokens are not useful to solve your task, you actually don't need to capture them at all. So in terms of training dynamics, this is actually a much simpler problem in many cases.

And what we found empirically is that we just, out of the box, use typical supervised training; we don't have to play with the hyperparameters or the optimizer, and you have very, very stable training. So that's one thing that could also be interesting for future work: to see whether this is something that is easier to optimize, and maybe that's why those, like, 7-billion-parameter models can learn and not overfit on, like, 10,000 samples. And then it also brings up other things: maybe this on its own could be a better initialization for next-token prediction as well. So this is very open, up in the air, but maybe you could think of a simpler supervised objective that would be a better pre-training solution, which you could then use for next-token prediction if you wanted to. But at least this would be a better starting point than random.

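A rough sketch of the diagnostic this "Fair Language Model Paradox" point suggests: instead of a single averaged loss, look at next-token cross-entropy separately for rare and frequent tokens. Everything below (vocabulary size, counts, logits) is a random placeholder, not data from the paper.

```python
# Rough diagnostic in the spirit of the point above: look at next-token
# cross-entropy separately for rare and frequent tokens instead of one
# averaged number. All tensors here are random placeholders.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 1000, 128, 8
logits = torch.randn(batch, seq_len, vocab_size)           # model outputs (placeholder)
targets = torch.randint(0, vocab_size, (batch, seq_len))   # next-token targets (placeholder)
token_counts = torch.randint(1, 10_000, (vocab_size,))     # corpus frequencies (placeholder)

# Unreduced per-token cross-entropy.
per_token_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1), reduction="none")

# Bucket the loss by the corpus frequency of each target token.
freq_of_target = token_counts[targets.reshape(-1)]
rare = freq_of_target < 1_000
print("mean loss on rare tokens:    ", per_token_loss[rare].mean().item())
print("mean loss on frequent tokens:", per_token_loss[~rare].mean().item())
```
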

11:38

So you'd almost reverse the training. So we've spoken about two extremes. On one extreme we have pre-training, and you can use it for any downstream task. And on the other extreme, you know, you start from scratch with just one task. Is there an intermediate solution? So what if I did this new approach but for multitask, let's say for five tasks?

12:00

Yeah, yeah, so that's a great question. So if you really think about it, in the limit you could formulate next-token prediction itself as a set of classification tasks: is the next token this one or not? So in the extreme case you could just recover next-token prediction on one end, and on the other end you have what we have here, so just one task, very coarse, high level: predict if it's a good or bad sentiment or whatever. In between you have a huge spectrum that you can exploit, and if you can find, as you said, maybe five very different, representative tasks, this should be enough, or could be enough, to learn a representation that is as general as possible, and then you can use this for maybe new tasks that come on the go.

So I think the research question is how to design the minimum number of tasks so that you have as diverse a representation as possible. And of course we don't want to go to the extreme of just doing next-token prediction again. But this is a very, very nice research question, because if you have this spectrum and you can control where you want to be on it, then you can really have a per-use-case choice. So it's not: okay, you are always here or always there. Tell me what you want to do and how many new tasks you expect your model to be exposed to, and I tell you where you need to be on this spectrum. So this could be very interesting as well.

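A minimal sketch of that middle ground, with hypothetical tasks and shapes: one shared backbone trained from scratch with a handful of supervised heads, somewhere between a single coarse task and full next-token prediction.

```python
# Sketch of the "few diverse tasks" middle ground: one shared backbone trained
# from scratch with a handful of supervised heads. The five tasks and their
# label counts are hypothetical placeholders.
import torch
import torch.nn as nn

task_label_counts = [2, 5, 3, 10, 4]     # e.g. sentiment, job type, ...
backbone = nn.Sequential(
    nn.Embedding(50_000, 256),
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True))
heads = nn.ModuleList(nn.Linear(256, k) for k in task_label_counts)

tokens = torch.randint(0, 50_000, (8, 64))               # batch of token ids (placeholder)
labels = [torch.randint(0, k, (8,)) for k in task_label_counts]

features = backbone(tokens).mean(dim=1)                  # pooled representation
loss = sum(nn.functional.cross_entropy(head(features), y)
           for head, y in zip(heads, labels))            # joint multi-task objective
loss.backward()
```
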

13:19

Very cool, very cool. It does make me think, though, that these models understand through naive statistical alignment, and is it possible that the benchmarks we use just don't capture it? You know, the gap in understanding that we've lost from moving away from the pre-trained models isn't being captured.

13:36

recent years we focus a lot on

13:39

generative methods, all the evaluation and the

13:41

type of objectives we put on ourselves

13:43

is really about good generation, right? Even

13:45

if you want to answer a question,

13:47

you need to generate a good explanation,

13:50

you need to understand what are the

13:52

intermediate steps, and I think the fact

13:54

that we focus on generative models means

13:56

that we completely bias, the evaluation and

13:58

the way we approach this thing, and

14:00

maybe you could have still knowledge that

14:03

is learned without being able to generate

14:05

anything. So I think this is also

14:07

something that could be interesting to look

14:09

at, at least keep in mind when

14:11

we explore those models. But philosophically though,

14:14

But philosophically, though, isn't generation analogous to thinking in some sense? So don't models that generate... aren't they smarter in some deep way?

14:20

Probably what you want to do is maybe imagine what could be, but I don't think you want to do generation with very granular details, like next-token generation. Because if you think about it, even just in terms of, like, classification tasks, you have a lot of different uncertainty depending on the token. If I start the sentence, okay, "I saw this movie for ... minutes", there is no way you can tell what the next token after "for" was, right? You know it would be, like, a time component, maybe it's one hour, 10 minutes, two hours, but do you really need to be able to generate the, I don't know, "52 minutes" or whatever the answer was, to actually understand that I was seeing a movie, and therefore I was staying in one place for at least more than five seconds, right? So I think the token is way too granular. And if you had a concept token, that's where you could start seeing, okay, this is meaningful, because that's closer to maybe what we do. But right now we are very, very, very low level, because tokenization is a lossless compression, right? So it is too close to the raw data. And yet we have it easy compared to computer vision, because you already work in language, which is a very compressed representation of knowledge. But still, the token is probably too low level.

15:39

Well, that was a fascinating paper. Let's move on to your next one. So, "The Birth of Self-Supervised Learning: A Supervised Theory", and that was with Yann LeCun. Yes. And yeah, basically you said that the observed differences between self-supervised learning and supervised learning are not due to the loss functions themselves, but rather the labelling of the dataset used during training. Give us the elevator pitch.

15:59

Yeah, so basically what we show in this paper is that you can have a supervised objective, let's say least squares with the labels to make it simple, and you can turn this objective, which tries to map sample x_n to prediction y_n, into a self-supervised learning objective, which tries to compare samples with each other. So basically you go from saying, okay, this image is a car or a dog, to saying, are those two images the same or not, which is, like, the self-supervised, joint-embedding type of world. And so you can show that, whether you have labels or you have knowledge of this pairwise relationship, they are actually learning the same representation, up to some symmetries that are irrelevant if you do linear probing.

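A toy sketch of that correspondence, under the simplifying assumptions stated in the comments (it is not the paper's exact derivation): labels induce a pairwise "same class or not" matrix, and a least-squares-style SSL objective can be written as matching the Gram matrix of the embeddings to it.

```python
# Toy version of the correspondence, not the paper's exact construction:
# class labels induce a pairwise "same class or not" matrix G = Y Y^T, and a
# least-squares-style SSL objective can be written as matching the Gram
# matrix of the learned embeddings to G.
import torch
import torch.nn.functional as F

n, d_in, d_emb, n_classes = 32, 16, 8, 4
x = torch.randn(n, d_in)
labels = torch.randint(0, n_classes, (n,))

Y = F.one_hot(labels, n_classes).float()   # supervised view: one-hot targets
G = Y @ Y.T                                # pairwise view: 1 iff same label

encoder = torch.nn.Linear(d_in, d_emb)
Z = encoder(x)

# "Compare samples with each other": push the embedding Gram matrix toward G.
ssl_style_loss = ((Z @ Z.T - G) ** 2).mean()
ssl_style_loss.backward()                  # gradients flow into the encoder

# In the limit where every sample is its own class, G is the identity matrix:
# no two samples are ever collapsed together, which is the SSL-like regime.
G_finest = torch.eye(n)
```
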

16:38

So the loss function in itself, the SSL one or the supervised one, tries to do the same thing; they just operate on a different view of the labelling: whether this image is that class, or whether those two images or two samples represent the same thing. So given that, then the next question is, okay, how come self-supervised learning is able to generalize better than supervised? And from this perspective, what you can say is that it's as if they were solving a supervised task where the labels are not about mapping all the cars to "car", but are very, very fine-grained labels where, in the limit, each image is its own class, basically. If you think about supervised learning in this extreme setting, you also don't overfit to the task, because you don't collapse any image onto another one. And so, theoretically speaking, you can solve as many downstream tasks as you want. So this equivalence of losses at least brings a slightly new perspective on the fact that it's not really about the objective; it's more about how you design the SSL pipeline, where you say, okay, this sample is related to this sample. It's not the objective that makes you learn a better representation.

17:44

Okay, and in the paper you were talking about how SSL can maximize the worst-case downstream-task performance. Can you sketch that?

17:53

Yeah, so basically, if you think about all the possible realizations of downstream tasks, you could have some very coarse-scale ones: maybe you have different pictures of cars and buses, and you just want to predict whether it's a car or a bus, so no details need to be encoded to solve this. But then you can have downstream tasks that are much more fine-grained. So the point now is that you want to learn a representation such that, if you look at the distribution of downstream-task performance, you are able to be as good as possible on most of them. You don't want to be very good on some and then, in the tail, very bad on the majority of them. And so from this you can try to say, okay, what would be the labelling that makes your worst case as good as possible, and from this you can say, okay, this is actually the labelling that self-supervised learning is implicitly doing.

18:47

How does the class balance affect the difference in the losses?

18:52

Oh yeah, so this is a very good point. Actually, in a follow-up paper we are doing right now, we show that current SSL objectives assume class balancedness. This is something we already highlighted quickly in the hidden uniform cluster prior in self-supervised learning paper we did a couple of years ago: current SSL objectives assume a balanced representation of classes or concepts. And this means that if you train on ImageNet, things work out very well, because concepts are sort of equally represented. But then if you go to other datasets like iNaturalist, which are very heavy-tailed, then you have a huge bias in your representation. So until now, people did not really know how to solve this. One way people approach it is through data curation: they say, okay, I'm just going to remove the oversampled concepts to try to make it more uniform, and then I do self-supervised learning on this.

But because we now have this theoretical formulation and this equivalence of losses, we can use the exact same recipes that people used in supervised learning to re-weight depending on the frequency of classes. We can use that to come up with a new self-supervised learning loss that takes this imbalance into account. This type of thing is enabled by the mathematical formulation, and it is principled: for the way we do this weighting, you can prove that it is the right way to do it from the supervised theory. And this is really nice, because suddenly, from this seemingly naive connection, you can now come up with a new generation of self-supervised learning models where you can actually match what the real-world data distribution is like. So non-uniform distributions of classes, and maybe, if you have some samples that are noisier than others, you can include that information as part of the SSL objective as well. So suddenly you have a new world of possibilities that opens up, and because there is this connection, you can actually prove, okay, this is the right way to do it, at least from this supervised theory viewpoint.

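As an illustration of the reweighting idea (not the exact loss from the follow-up paper being described), one can port the usual inverse-class-frequency weights from imbalanced supervised learning onto the pairwise terms of an SSL-style objective:

```python
# Illustration of the reweighting idea, not the exact loss of the follow-up
# paper: reuse the standard inverse-class-frequency weights from imbalanced
# supervised learning, but apply them to the pairwise terms of an SSL-style
# objective. Data and shapes are placeholders.
import torch
import torch.nn.functional as F

n, d_emb, n_classes = 32, 8, 4
labels = torch.randint(0, n_classes, (n,))
Z = torch.randn(n, d_emb, requires_grad=True)    # embeddings (placeholder)

Y = F.one_hot(labels, n_classes).float()
G = Y @ Y.T                                      # pairwise "same class" matrix

# Inverse-frequency weight per sample, as in class-imbalanced supervised learning.
class_counts = Y.sum(dim=0).clamp(min=1)         # occurrences of each class
w = 1.0 / class_counts[labels]                   # rare classes get larger weight
W = torch.outer(w, w)                            # weight for each (i, j) pair

weighted_loss = (W * (Z @ Z.T - G) ** 2).sum() / W.sum()
weighted_loss.backward()
```
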

20:43

You also pointed out a connection to VICReg.

Exactly. So basically, what we show in the paper is that if you have a least-squares supervised type of objective and you turn it into an SSL one, what you obtain is basically VICReg. Then you have a few variations; it could be VICReg or W-MSE, depending on how you go from supervised to SSL, but you can show that, depending on the type of supervised loss, you recover different types of SSL losses. If you look more at cross-entropy, the supervised learning is going to map to something more like a SimCLR type of loss, but you have this one-to-one correspondence.

21:18

And this is also very nice, because in supervised learning, at least, you know when one loss may be preferred compared to another one, and this has been studied for a long time, right, because supervised learning has been around forever. And so now we can reuse those insights for self-supervised learning. So this, to me, is also a very, very strong benefit of this result: suddenly all the theory, like the thousands of papers that have been done in supervised learning, we can just take and apply in SSL. Another example is neural collapse, which has been proven in the supervised setting; now it applies, in like five lines, in an SSL setting as well.

So this connection really goes beyond just trying to say, okay, it's not the objectives that make SSL better. It's really tying those two huge communities together towards a goal where you have a single, unified objective to learn representations. And this is nice too, because if you speak to people, they will think, okay, you have supervised learning on one side, SSL on the other side, and basically you are either in one camp or the other. But now what we show is that SSL is pretty much everything in representation learning, and supervised learning is just one realization of SSL; VICReg without labels is another one, and this one is another one. So you really have a better understanding of this relationship and of what representation learning is trying to do.

22:41

Galaxy-brain question incoming. Could you combine SSL and supervised objectives in some way to improve generalization?

22:48

Yes, yes. So there is one paper, which is supervised contrastive learning. The way they do it is that they use the labels within a SimCLR framework to basically do fully supervised learning, but with a SimCLR-style objective. So first of all, we can show that indeed this makes sense, and we can basically explain the empirical results that they have. But actually we can do a little bit more than that. If you are in a semi-supervised setting, for example, it may not be clear how to combine those two losses anymore; maybe you could say, you have the two and you have a coefficient to weight them, but then you need to cross-validate and so on. But now, from this perspective, you can combine them in a very principled way, and you can understand which weighting makes sense depending on how many samples you have of one kind or the other. And you can use all the literature, again, from supervised learning for this setting as well. So this is something we can do very easily with this formulation as well.

23:45

Okay, so if SSL and supervised learning are two sides of the same coin, of course we can use this theoretical framework to design new forms of SSL. But, you know, is the distinction even relevant if they are the same thing?

24:00

I think it's not just two sides of the same coin; SSL is more general than supervised learning. So really, SSL could be the more general objective to learn representations. The more prior knowledge you have, the more you know about your labels, and then SSL slowly becomes supervised learning through the labels that you use for the SSL objective. But then, because, as you said, you have this hierarchy, it does not really make sense to say you have either supervised learning or SSL. Rather, what makes sense is to ask, okay, what is this relation matrix, this pairwise matrix? If you build it from labels, it's supervised learning; if you build it from other, a priori knowledge, for example that two consecutive frames in a video basically have the same class, then you are more in an unsupervised, SSL setting. But it's all about how you build this pairwise relation matrix; that's the main question.

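A small sketch of that framing, with hypothetical data: the same pairwise relation matrix can be filled in either from labels or from a prior such as temporal adjacency in a video, and everything downstream can stay the same.

```python
# Sketch of the "it is all about the pairwise relation matrix" point, with
# hypothetical data: the same objective can consume a relation matrix built
# from labels (supervised) or from a prior such as "consecutive video frames
# show the same thing" (self-supervised).
import torch

n, n_classes = 8, 3

# Supervised: relation from labels.
labels = torch.randint(0, n_classes, (n,))
G_supervised = (labels[:, None] == labels[None, :]).float()

# Self-supervised: relation from prior knowledge about the data collection,
# here that frames i and i + 1 of the same video are related.
G_ssl = torch.eye(n)
for i in range(n - 1):
    G_ssl[i, i + 1] = G_ssl[i + 1, i] = 1.0

# The downstream machinery (a Gram-matching or joint-embedding loss) can be
# identical; only the relation matrix changes.
print(G_supervised)
print(G_ssl)
```
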

24:51

Very cool. Right, let's move on to your next paper: "No Location Left Behind: Measuring and Improving the Fairness of Implicit Representations for Earth Data". So there are loads and loads of modeling frameworks now that do these implicit neural representations of geospatial Earth data, for things like climate modeling, resource allocation, environmental modeling. I was actually interviewing Johannes from NXAI; I don't know if you know him, but he's working on similar stuff.

25:15

Okay. So basically what we show is that when you want to model, for example, let's say temperature or precipitation to make it simple, and you want to learn, for example, an implicit neural representation, it means that you want a model such that, if you give it a location and a date, for example, it can predict what the temperature was there. So if you have this type of implicit neural representation, it's very good, because if you learn a nice model, then you can actually interpolate those values, so maybe estimate what the temperature was in a part of the globe where you do not have a sensor. But you can also do extrapolation as well: if you assume you really learned the true physical model of the world, you could say, okay, what will the temperature be two years from now, right? So it's useful for all sorts of applications.

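For concreteness, here is a minimal sketch of such an implicit neural representation, with random placeholder data: a small MLP fit only where observations exist, then queried at arbitrary coordinates for interpolation or extrapolation. The architecture and sizes are illustrative assumptions, not the paper's model.

```python
# Minimal sketch of an implicit neural representation of the kind described:
# a small MLP mapping (latitude, longitude, time) to a value such as
# temperature, fit only where sensor observations exist and then queried
# anywhere. All data, sizes and hyperparameters are placeholders.
import torch
import torch.nn as nn

coords = torch.rand(1024, 3)            # (lat, lon, time), normalized placeholders
temps = torch.randn(1024, 1)            # observed temperatures (placeholder)
observed = torch.rand(1024) > 0.2       # points where a sensor actually recorded

inr = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 1))
opt = torch.optim.Adam(inr.parameters(), lr=1e-3)

for _ in range(100):                    # fit only on observed points
    pred = inr(coords[observed])
    loss = ((pred - temps[observed]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Interpolation / extrapolation: query a coordinate and date with no sensor.
query = torch.tensor([[0.5, 0.5, 1.1]])
print(inr(query))
```
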

26:01

The thing is that when you do this nowadays, depending on the architecture and the different design choices that you make, you will maybe have a very good prediction on average, when you look at the performance across the whole globe, but if you look, for example, around islands or coastal areas, your prediction is going to be very bad, almost random. So this is something that can be very concerning, because if you use this type of model to decide on a policy that will affect a specific island, using this model's prediction is as good as using random guesses. So it can be very detrimental, and people need to be aware of those biases. So what we found is that, for example, for this type of climate data, islands are often disregarded, and coastal areas, basically regions where you have a big gradient in the type of data that you try to model.

26:49

How much of a responsibility do modelers have to detect these kinds of biases in the data?

26:55

So I think there are, like, two components, as you said. So one could be that just the dynamics of the data you are trying to model are harder near islands, or maybe it's even unpredictable, because you don't have enough observations to do that. So you have some uncertainty that you can probably never recover from with good design. But still, what we found here is that a lot of the bias comes from the architecture and from how you want to encode those positions, the type of basis you use to do the prediction. So right now it seems that a big chunk of the bias comes from the architecture, but I totally agree that I don't think we can remove the bias entirely, because there are maybe just different types of uncertainty in different parts of the planet as well.

27:39

I mean, the world is a very, very complicated place. I mean, realistically, to what extent can we mathematically model it?

27:46

Yeah, so that's a good question. So I think it depends on the type of horizon that you have and the type of data that you want to model. If you have a system that is much more chaotic, or can vary very quickly without much change in the past observations, that's something that current models are having a very hard time with. If you want to predict something else, for example temperature in North America, not near the coastal areas but really inland, maybe that's where you have fewer gradient dynamics and things are a bit more stationary, especially through time, so then it can become much better. But I think at this point we don't have an architecture that is really able to understand that you have different physics, different dynamics models, in different parts of the globe. And so, because of this, you just fit what's best on average, and it means you miss out on a lot of details.

28:38

Can you tell us about some of the technical framework?

28:40

So one thing we showed, for example, at least for this type of globe data representation, is that people use a Fourier basis to model the prediction. And this is better than not using any basis at all, but what it means is that you imply the type of signal you're predicting is very stationary and not localized at all, and this is a very strong prior, right? So this may be true for some things, but for other things, like precipitation or temperature, where you have localized, very high gradients, it's a strong bias. And if you come from the signal processing community, you know very well that to get better localization you go from Fourier to wavelets. So that's one thing we did in this paper: we show that using wavelet bases to encode those data allows you to have better localization, and this removes some of the biases. And here it's more of a proof of concept that different design choices give you a different type of bias trade-off; wavelets are not the answer to everything, right? But I think the next step is to really be able to encode less and less a priori which basis to use, and let the model learn it from the data on its own. And we are not yet at this point, at least for this type of climate data.

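A sketch of what the basis choice means in practice, with illustrative frequencies: a Fourier-feature encoding of the coordinates assumes a stationary, poorly localized signal, which is exactly the prior being criticized; the paper's proposal is to swap in a localized (wavelet) basis instead.

```python
# What the basis choice means in practice: encode coordinates with a fixed
# Fourier-feature basis before the MLP. This assumes a stationary, poorly
# localized signal, which is the prior criticized above; the paper swaps in a
# localized (wavelet) basis instead. Frequencies and shapes are illustrative.
import torch

def fourier_features(coords: torch.Tensor, n_freqs: int = 6) -> torch.Tensor:
    # coords: (n, d) in [0, 1]; returns (n, d * 2 * n_freqs)
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi   # octave-spaced frequencies
    angles = coords[..., None] * freqs                # (n, d, n_freqs)
    feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return feats.flatten(start_dim=1)

coords = torch.rand(4, 3)               # (lat, lon, time) placeholders
print(fourier_features(coords).shape)   # encoded input for the INR's MLP
```
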

29:53

How could it handle noisy or missing data?

29:55

That depends really on the type of model you use. So for example, if you have an INR, then you will just not include the missing data as part of your training pipeline, and that's one of their benefits. So if one of your sensors stopped recording during some years, you just don't include that as part of your training data, because you really control where you have the data, when you have it, and what the prediction should be.

30:21

So these Earth models, they are now informing policy around the world. Who should we hold accountable? I mean, is it the technology, is it the scientists who design the models, is it the policy makers who interpret the results?

30:32

I think it's very hard for the person who designs the model to know a priori what it is going to be used for. So I think it's more downstream: when you know clearly what you want to do with it, you should first set up a nice evaluation pipeline to make sure that it's something you can actually use to make those decisions. And then you can report any type of bias of your model for people to improve on the design, but prior to that it's very hard to imagine what this model will be used for. So in an ideal setting you would wish that there were no bias at all, but in practice, the world of possibilities being so large, it needs to be more of a feedback loop, and then you iterate until you have something that you can really trust, and then you can act on it.

31:20

Modeling data is very anthropocentric, right? So, you know, we focus on human populations and so on. Should we also focus on, you know, just ecosystems and places that have nothing to do with humans?

31:31

Oh yeah, that's a great question, and in fact that's one of the big issues with a lot of the datasets, which are crowdsourced, because by definition the amount of data that you get is proportional to the number of users you have in a location. And this means you have a huge bias in what your model is learning and what your model is focusing on, which means you miss out on a lot of things. So I think that's also one thing: okay, crowdsourcing can give you a lot of data quickly, but it's very biased data. So then the question is, how much of this biased data should you use, versus maybe paying a lot more and capturing other parts of the globe that you should have? And maybe you could be able to show that, under some specific conditions, just having 10% of the data which is high quality and uniformly sampled, and then 90% which is crowdsourced, you can try to use those 10% to anchor your representation and then use all the data together. But there is a huge amount of research questions in that, because that's a very big source of bias.

of bias. And there's a bit of

32:35

a policy question, but we are using

32:37

these things, you know, to do resource

32:39

allocation, right? giving more resources to some

32:41

populations might be taking it away from

32:43

others and then there's the fairness over

32:46

time thing as well which is that

32:48

what is fair like now might not

32:50

be fair in a hundred years time

32:52

so how should we think about it?

32:54

Yeah that's a good question I think

32:57

this is also very application. If you

32:59

want to predict where to build a

33:01

house to solve some specific problem, maybe

33:03

you don't really mind having bad prediction

33:05

where there is no population anyway because

33:08

you are not going to build a

33:10

house there. So in this case, maybe

33:12

the crowdsourcing type of data is actually

33:14

good, but this could really be dependent

33:16

on the type of application. And just

33:19

And just one thing I would say regarding the point you made before: this type of bias is actually something that you also have in computer vision. There is a very nice paper showing that most of the data we have, like in ImageNet, is from North America. And so maybe you reach, like, 90% state-of-the-art performance at predicting, for example, types of chairs, but only for North American ones. And when you start looking at the types of cars or chairs you find in Africa or East Asia, the model performance is extremely bad. So this type of problem is something you have across modalities, and that's a very big issue.

34:09

It's always a pleasure and an honor to have you on the show. Thank you so much.

Thank you so much.
