The AI revolution is running out of data. What can researchers do?

Released Friday, 31st January 2025

Episode Transcript

This
is an audio long read from Nature. In this episode: The AI revolution is running out of data. What can researchers do? Written by Nicola Jones and read by me, Benjamin Thompson.

The internet is a vast ocean of human knowledge, but it isn't infinite. And artificial intelligence researchers have nearly sucked it dry. The past decade of explosive improvement in AI has been driven in large part by making neural networks bigger and training them on ever more data.

This scaling has proved surprisingly effective at making large language models, or LLMs, such as those that power the chatbot ChatGPT, both more capable of replicating conversational language and of developing emergent properties such as reasoning. But some specialists say that we are now approaching the limits of scaling. That's in part because of the ballooning energy requirements for computing. But it's also because LLM developers are running out of the conventional data sets used to train their models. A prominent study made headlines last year by putting a number on this problem.

Researchers at Epoch AI, a virtual research institute, projected that by around 2028, the typical size of a data set used to train an AI model will reach the same size as the total estimated stock of public online text. In other words, AI is likely to run out of training data in about three years' time. At the same time, data owners, such as newspaper publishers, are starting to crack down on how their content can be used, tightening access even more. That's causing a crisis in the size of the data commons, says Shane Longpre, an AI researcher at the Massachusetts Institute of Technology in Cambridge who leads the Data Provenance Initiative, a grassroots organisation that conducts audits of AI data sets. The imminent bottleneck in training data could be starting to pinch.

I strongly suspect that's already happening, says Longpre.

Although specialists say there's a chance that these restrictions might slow down the rapid improvement in AI systems, developers are finding workarounds. I don't think anyone is panicking at the large AI companies, says Pablo Villalobos, a Madrid-based researcher at Epoch AI and lead author of the study forecasting a 2028 data crash. Or at least they don't email me if they are, he says.

For example, prominent AI companies such as OpenAI and Anthropic, both in San Francisco, California, have publicly acknowledged the issue while suggesting that they have plans to work around it, including generating new data and finding unconventional data sources. A spokesperson for OpenAI told Nature, quote, we use numerous sources, including publicly available data and partnerships for non-public data, synthetic data generation and data from AI trainers, end quote.

Even so, the data crunch might force an upheaval in the types of generative AI model that people build, possibly shifting the landscape away from big, all-purpose LLMs to smaller, more specialized models.

LLM development over the past decade has shown its voracious appetite for data.

Although some developers don't publish the specifications of their latest models, Villalobos estimates that the number of tokens, or parts of words, used to train LLMs has risen 100-fold since 2020, from hundreds of billions to tens of trillions.

That could be a good chunk of what's on the internet, although the grand total is so vast that it's hard to pin down. Villalobos estimates the total internet stock of text data today at 3,100 trillion tokens.

Various services use web crawlers to scrape this content, then eliminate duplications and filter out undesirable content, such as pornography, to produce cleaner data sets. A common one, called RedPajama, contains tens of trillions of words. Some companies or academics do the crawling and cleaning themselves to make bespoke data sets to train LLMs. A small proportion of the internet is considered to be of high quality, such as human-edited, socially acceptable text that might be found in books or journalism.

The rate at which usable internet content is increasing is surprisingly slow.
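
Epoch AI's convergence projection can be imitated in spirit with quick arithmetic. In the sketch below, the 2024 baselines and growth rates are my own illustrative assumptions, loosely based on figures quoted in this piece, not Epoch AI's actual inputs:

```python
# Back-of-envelope projection: training-set size vs. total stock of
# public online text. All starting values are illustrative assumptions.

STOCK_2024 = 3_100e12     # ~3,100 trillion tokens of public text online
STOCK_GROWTH = 1.10       # stock grows by (at most) 10% per year
DATASET_2024 = 30e12      # training sets: "tens of trillions" of tokens
DATASET_GROWTH = 2.0      # training sets more than double annually

year, stock, dataset = 2024, STOCK_2024, DATASET_2024
while dataset < stock:
    year += 1
    stock *= STOCK_GROWTH
    dataset *= DATASET_GROWTH

print(year)  # with these toy numbers, the lines cross in 2032
```

With these deliberately crude inputs the crossover lands in the early 2030s; Epoch AI's more careful modelling of the usable, quality-filtered stock brings it forward to around 2028.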

Villalobos's paper estimates that it is growing at less than 10% per year, while the size of AI training data sets is more than doubling annually. Projecting these trends shows the lines converging around 2028.

At the same time, content providers are increasingly including software code or refining their terms of use to block web crawlers or AI companies from scraping their data for training. Longpre and his colleagues released a preprint last July showing a sharp increase in how many data providers block specific crawlers from accessing their websites. In the highest-quality, most often used web content across three main cleaned data sets, the number of tokens restricted from crawlers rose from less than 3% in 2023 to between 20% and 33% in 2024.

Several lawsuits are now under way, attempting to win compensation for the providers of data being used in AI training. In December 2023, The New York Times sued OpenAI and its partner Microsoft for copyright infringement. In April last year, eight newspapers owned by Alden Global Capital in New York City jointly filed a similar lawsuit.
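
The software code that providers use to turn crawlers away is typically a robots.txt file served at a site's root. A hypothetical example (the site is invented; GPTBot and CCBot are the real user-agent names of OpenAI's and Common Crawl's crawlers):

```text
# https://example-newspaper.com/robots.txt (hypothetical site)
User-agent: GPTBot      # OpenAI's training-data crawler
Disallow: /

User-agent: CCBot       # Common Crawl's crawler
Disallow: /

User-agent: *           # everyone else may still index the site
Allow: /
```

Compliance with robots.txt is voluntary, which is partly why providers are also tightening their legal terms of use.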

The counter-argument is that an AI should be allowed to read and learn from online content in the same way as a person, and that this constitutes fair use of the material. OpenAI has said publicly that it thinks The New York Times lawsuit is, quote, without merit, end quote.

If courts uphold the idea that content providers deserve financial compensation, it will make it harder for both AI developers and researchers to get what they need, including academics who don't have deep pockets. Academics will be most hit by these deals, says Longpre. There are many very pro-social, pro-democratic benefits of having an open web, he adds.

The data crunch poses a potentially big problem for the conventional strategy of AI scaling. Although it's possible to scale up a model's computing power or number of parameters without scaling up the training data, that tends to make for slow and expensive AI, says Longpre, something that isn't usually preferred.

If the goal is to find more data, one option might be to harvest non-public data, such as WhatsApp messages or transcripts of YouTube videos. Although the legality of scraping third-party content in this manner is untested, companies do have access to their own data, and several social media firms say they use their own material to train their AI models. For example, Meta in Menlo Park, California, says that audio and images collected by its virtual reality headset, Meta Quest, are used to train its AI. Yet policies vary. The terms of service for the video conferencing platform Zoom say the firm will not use customer content to train AI systems, whereas Otter.ai, a transcription service, says it does use de-identified and encrypted audio and transcripts for training.

For now, however, such proprietary content probably holds only another quadrillion text tokens in total, estimates Villalobos. Considering that a lot of this is low-quality or duplicated content, he says this is enough to delay the data bottleneck by a year and a half, even assuming that a single AI gets access to all of it without causing copyright infringement or privacy concerns. Even a ten-times increase in the stock of data only buys you around three years of scaling, he says.

Another option might be to focus on specialized data sets, such as astronomical or genomic data, which are growing rapidly. Fei-Fei Li, a prominent AI researcher at Stanford University in California, has publicly backed this strategy. She said at a Bloomberg Technology Summit last May that worries about data running out take too narrow a view of what constitutes data, given the untapped information available across fields such as healthcare, the environment and education.

But it's unclear, says Villalobos, how available or useful such data sets would be for training LLMs. There seems to be some degree of transfer learning between many types of data, says Villalobos. That said, I'm not very hopeful about that approach.

The possibilities are broader if generative AI is trained on other data types, not just text. Some models are already capable of training, to some extent, on unlabelled video or images. Expanding and improving such capabilities could open a floodgate to richer data. Yann LeCun, chief AI scientist at Meta and a computer scientist at New York University, who is considered one of the founders of modern AI, highlighted these possibilities in a presentation last February at an AI meeting in Vancouver, Canada. The 10-to-the-power-13 tokens used to train a modern LLM sound like a lot: it would take a person 170,000 years to read that much, LeCun calculates. But, he says, a four-year-old child has absorbed a data volume 50 times greater than this just by looking at objects during his or her waking hours. LeCun presented this data at the annual meeting of the Association for the Advancement of Artificial Intelligence.

Similar data richness might eventually be harnessed by having AI systems in robotic form that learn from their own sensory experiences.
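
LeCun's reading-time comparison is straightforward arithmetic. In this sketch the reading speed and daily reading hours are my own assumptions; only the 10-to-the-power-13 token corpus size comes from the talk:

```python
# How long would it take a person to read an LLM's training data?
# Reading speed and hours per day are assumed values, not LeCun's.

TOKENS = 1e13            # tokens used to train a modern LLM
TOKENS_PER_MINUTE = 250  # assumed human reading speed
HOURS_PER_DAY = 12       # assumed reading time per day

minutes = TOKENS / TOKENS_PER_MINUTE
years = minutes / 60 / HOURS_PER_DAY / 365
print(round(years))      # ~152,000 years with these assumptions
```

That lands on the same order as the 170,000-year figure; a slightly slower pace or fewer daily reading hours closes the gap.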

We're never going to get to human-level AI by training on language. That's just not happening, LeCun said.

If data can't be found, more could be made. Some AI companies pay people to generate content for AI training. Others use synthetic data generated by AI for AI. This is a potentially massive source. Earlier this year, OpenAI said it generates 100 billion words per day. That's more than 36 trillion words a year, which is about the same size as current AI training data sets. And this output is growing rapidly.

In general, specialists agree, synthetic data seem to work well for regimes in which there are firm, identifiable rules, such as chess, mathematics or computer coding. One AI tool, AlphaGeometry, was successfully trained to solve geometry problems using 100 million synthetic examples and no human demonstrations.
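
The appeal of rule-governed domains is that examples can be generated and checked mechanically, with no human labelling. A deliberately simple sketch (toy arithmetic rather than AlphaGeometry's formal geometry; everything here is my own construction):

```python
import random

def make_example(rng: random.Random) -> tuple[str, str]:
    """Generate one synthetic question-answer pair.

    The ground-truth answer comes from the rules of arithmetic
    themselves, so no human demonstration is needed.
    """
    a, b = rng.randint(0, 999), rng.randint(0, 999)
    return f"What is {a} + {b}?", str(a + b)

# A reproducible synthetic training set of 100,000 examples.
rng = random.Random(0)
dataset = [make_example(rng) for _ in range(100_000)]
```

AlphaGeometry's pipeline works on the same principle at far greater scale: generate formal geometry statements, derive proofs mechanically, and train on the results.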

Synthetic data are already being used in areas for which real data are limited or problematic. This includes medical data, because synthetic data are free of privacy concerns, and training grounds for self-driving cars, because synthetic car crashes don't harm anyone.

The problem with synthetic data is that recursive loops might entrench falsehoods, magnify misconceptions and generally degrade the quality of learning. A 2023 study coined the phrase model autophagy disorder, or MAD, to describe how an AI model might, quote, go mad, end quote, in this way. A face-generating AI model trained in part on synthetic data, for example, started to draw faces embedded with strange hash markings.

The alternative strategy is to abandon the bigger-is-better concept. Although developers continue to build larger models and lean into scaling to improve their LLMs, many are pursuing more efficient small models that focus on individual tasks. These require refined, specialized data and better training techniques.

In general, AI efforts are already doing more with less. One 2024 study concluded that, because of improvements in algorithms, the computing power needed for an LLM to achieve the same performance has halved every eight months or so. That, along with computer chips specialised for AI and other hardware improvements, opens the door to using computing resources differently.

One strategy is to make an AI model reread its training data set multiple times. Although many people assume that a computer has perfect recall and only needs to read material once, AI systems work in a statistical fashion that means rereading boosts performance, says Niklas Muennighoff, a PhD student at Stanford University and a member of the Data Provenance Initiative. In a 2023 paper published while he was at the AI firm Hugging Face in New York City, he and his colleagues showed that a model learnt just as much from rereading a given data set four times as by reading the same amount of unique data, although the benefits of rereading dropped off quickly after that.

Although OpenAI hasn't disclosed information about the size of the model or training data set for its LLM o1, the firm has emphasised that this model leans into a new approach: spending more time on reinforcement learning, the process by which the model gets feedback on its best answers, and more time thinking about each response.

Observers say this model shifts the emphasis away from pre-training with massive data sets and relies more on training and inference. This adds a new dimension to scaling approaches, says Longpre, although it's a computationally expensive strategy.

It's possible that LLMs, having read most of the internet, no longer need more data to get smarter. Andy Zou, a graduate student at Carnegie Mellon University in Pittsburgh, Pennsylvania, who studies AI security, says that advances might soon come through self-reflection by an AI. Now it's got a foundational knowledge base that's probably greater than any single person could have, says Zou, meaning it just needs to sit and think.

I think we're probably pretty close to that point, Zou says.

Villalobos thinks that all of these factors, from synthetic data to specialized data sets to rereading and self-reflection, will help. The combination of models being able to think by themselves and being able to interact with the real world in various ways, that's probably going to be pushing the frontier, he says.

To read more of Nature's long-form journalism, head over to
nature.com/news.