Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:01
Acast powers the world's best
0:04
podcasts. Here's the show that
0:06
we recommend. Hey
0:09
everybody, I'm Jen. I'm Jess, and we're Fat
0:12
Mascara, the only beauty podcast you need in
0:14
your life. We're beauty editors by day and
0:16
podcasters by night, and we've got all the
0:18
industry gossip for you, like insider product reviews
0:21
and advice you're not going to find online.
0:23
And the best part is interviews with the
0:25
most sought-after experts in the beauty biz: Charlotte
0:28
Tilbury, Jen Atkin and makeup artist Sir
0:30
John. That's just a taste of what you're
0:32
going to get on the Fat Mascara
0:34
podcast. So come hang with us. New episodes
0:36
drop every Tuesday and
0:38
Thursday. ACAST
0:41
helps creators launch, grow
0:44
and monetize their podcast
0:47
everywhere. acast.com This
0:49
is an audio long
0:51
read from Nature. In this
0:53
episode, the AI Revolution
0:55
is running out of
0:57
data. What can researchers
0:59
do? Written by Nicola
1:01
Jones and read by
1:03
me Benjamin Thompson.
1:08
The internet is a vast
1:10
ocean of human knowledge, but it
1:13
isn't infinite. And artificial
1:15
intelligence researchers have nearly
1:17
sucked it dry. The past
1:19
decade of explosive improvement
1:22
in AI has been driven
1:24
in large part by making
1:26
neural networks bigger and training
1:28
them on ever more data.
1:30
This scaling has proved surprisingly
1:32
effective at making large language
1:35
models, or LLMs, such as
1:37
those that power the
1:40
chatbot ChatGPT, both more capable
1:42
of replicating conversational language and
1:45
of developing emergent properties such
1:47
as reasoning. But some
1:49
specialists say that we are now
1:52
approaching the limits of scaling. That's
1:54
in part because of the
1:56
ballooning energy requirements for
1:58
computing. But it's also
2:00
because LLM developers are running
2:02
out of the conventional data
2:04
sets used to train their
2:07
models. A prominent study made
2:09
headlines last year by putting
2:11
a number on this problem.
2:13
Researchers at Epoch AI, a
2:15
virtual research institute, projected that
2:17
by around 2028, the typical
2:19
size of a data set
2:22
used to train an AI
2:24
model will reach the same
2:26
size as the total estimated
2:28
stock of public online text.
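To make the shape of that projection concrete, here is a minimal sketch of the calculation in Python. The starting values are illustrative assumptions, not Epoch AI's own inputs: a training corpus of a few tens of trillions of tokens that more than doubles each year, against an assumed effective stock of usable public text of a few hundred trillion tokens growing at under 10% a year (the article quotes a much larger raw total of 3,100 trillion tokens later, most of which is low quality or duplicated).

```python
# Minimal sketch of an Epoch AI-style projection. All starting values are
# illustrative assumptions for this example, not figures from the study.
training_tokens = 30e12   # assumed current training-set size (tens of trillions)
usable_stock = 300e12     # assumed effective stock of usable public text
train_growth = 2.0        # training sets more than double annually (per the article)
stock_growth = 1.10       # public text grows at under 10% a year (per the article)

year = 2024
while training_tokens < usable_stock:
    training_tokens *= train_growth
    usable_stock *= stock_growth
    year += 1
print(year)  # with these assumptions the curves cross around 2028
```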
2:30
In other words, AI is
2:32
likely to run out of
2:34
training data in about three
2:37
years' time. At the same
2:39
time, data owners, such as
2:41
newspaper publishers, are starting to
2:43
crack down on how their
2:45
content can be used, tightening
2:47
access even more. That's causing
2:49
a crisis in the size
2:52
of the data commons, says
2:54
Shane Longpre, an AI researcher
2:56
at the Massachusetts Institute of
2:58
Technology in Cambridge who leads
3:00
the Data Provenance Initiative, a
3:02
grassroots organisation that conducts audits
3:04
of AI data sets. The
3:07
imminent bottleneck in training data
3:09
could be starting to pinch.
3:11
I strongly suspect that's already
3:13
happening, says Longpre. Although
3:15
specialists say there's a chance
3:17
that these restrictions might slow
3:20
down the rapid improvement in
3:22
AI systems, developers are finding
3:24
workarounds. I don't think anyone
3:26
is panicking at the large
3:28
AI companies, says Pablo Villalobos,
3:31
a Madrid-based researcher at Epoch
3:33
AI and lead author of
3:35
the study forecasting a 2028
3:37
data crash, or at least
3:39
they don't email me if
3:42
they are, he says. For
3:44
example, prominent AI companies such
3:46
as OpenAI and Anthropic,
3:48
both in San Francisco, California,
3:50
have publicly acknowledged the issue
3:53
while suggesting that they have
3:55
plans to work around it,
3:57
including generating new data and
3:59
finding unconventional data sources.
4:01
OpenAI told Nature, quote,
4:04
we use numerous sources, including
4:06
publicly available data and partnerships
4:08
for non-public data, synthetic data
4:10
generation and data from AI
4:12
trainers, end quote. Even so,
4:15
the data crunch might force
4:17
an upheaval in the types
4:19
of generative AI model that
4:21
people build, possibly shifting the
4:23
landscape away from big all-purpose
4:26
LLMs to smaller, more specialized
4:28
models. LLM development over
4:30
the past decade has shown
4:33
its voracious appetite for data.
4:35
Although some developers don't publish
4:37
the specifications of their latest
4:40
models, Villalobos estimates that the
4:42
number of tokens, or parts
4:44
of words, used to train
4:47
LLMs, has risen 100-fold
4:49
since 2020, from hundreds of
4:51
billions to tens of trillions.
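As a quick check on that growth rate, a 100-fold rise over the four years from 2020 to 2024 implies roughly a tripling every year, which squares with the "more than doubling annually" figure quoted later in the piece:

```python
# Implied annual growth factor for a 100-fold rise over four years (2020-2024).
print(100 ** (1 / 4))  # ~3.16, i.e. training-token counts more than tripled per year
```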
4:54
That could be a good
4:56
chunk of what's on the
4:59
internet, although the grand total
5:01
is so vast that it's
5:03
hard to pin down.
5:05
Villalobos estimates the total internet
5:07
stock of text data today
5:10
at 3,100 trillion tokens. Various
5:12
services use web crawlers to
5:14
scrape this content, then eliminate
5:16
duplications and filter out undesirable
5:18
content, such as pornography, to
5:20
produce cleaner data sets.
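A toy sketch of that cleaning step, in Python: exact deduplication by hashing plus a crude keyword filter. Real pipelines (such as the one behind the RedPajama data set mentioned next) use far more sophisticated fuzzy deduplication and quality classifiers; the blocklist and documents below are invented for illustration.

```python
# Toy version of "eliminate duplications and filter out undesirable content".
# Everything here (the blocklist, the documents) is illustrative only.
import hashlib

BLOCKLIST = {"spam", "xxx"}  # stand-in for a real undesirable-content filter

def clean(documents):
    seen = set()
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        if any(word in doc.lower() for word in BLOCKLIST):
            continue  # drop undesirable content
        seen.add(digest)
        yield doc

docs = ["A useful article.", "A useful article.", "Buy xxx now!"]
print(list(clean(docs)))  # -> ['A useful article.']
```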
5:23
A common one, called RedPajama,
5:25
contains tens of trillions of
5:27
words. Some companies or academics
5:29
do the crawling and cleaning
5:31
themselves to make bespoke data
5:33
sets to train LLMs. A
5:36
small proportion of the internet
5:38
is considered to be of
5:40
high quality, such as human-edited
5:42
socially acceptable text that might
5:44
be found in books or
5:47
journalism. The rate
5:49
at which usable internet content
5:51
is increasing is surprisingly slow.
5:53
Villalobos' paper estimates that
5:55
it is growing at less
5:57
than 10% per year, while
5:59
the size of AI training
6:01
data sets is more than
6:03
doubling annually. Projecting these trends
6:06
shows the lines converging around
6:08
2028. At the same time,
6:10
content providers are increasingly including
6:12
software code or refining their
6:14
terms of use to block
6:16
web crawlers or AI companies
6:18
from scraping their data for
6:20
training.
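The "software code" in question is most often a robots.txt file listing which crawlers a site will and will not serve. Below is a hedged sketch: GPTBot and CCBot are real crawler user-agent names, but the file itself is a made-up example, checked here with Python's standard-library robots.txt parser.

```python
# Illustrative robots.txt that blocks two AI-related crawlers while allowing
# everything else, parsed with Python's built-in robotparser.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```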
6:22
Longpre and his colleagues released a preprint last
6:25
July showing a sharp increase
6:27
in how many data providers
6:29
block specific crawlers from accessing
6:31
their websites. In the highest
6:33
quality, most often used web
6:35
content across three main cleaned
6:37
data sets, the number of
6:39
tokens restricted from crawlers rose
6:41
from less than 3% in
6:44
2023 to between 20% and
6:46
33% in 2024. Several
6:48
lawsuits are now underway, attempting
6:50
to win compensation for the
6:52
providers of data being used
6:54
in AI training. In December
6:56
2023, the New York Times
6:58
sued OpenAI and its
7:00
partner Microsoft for copyright infringement.
7:03
In April last year, eight
7:05
newspapers owned by Alden Global
7:07
Capital in New York City
7:09
jointly filed a similar lawsuit.
7:11
The counter-argument is that an
7:13
AI should be allowed to
7:15
read and learn from online
7:17
content in the same way
7:19
as a person, and that
7:21
this constitutes fair use of
7:24
the material. OpenAI has
7:26
said publicly that it thinks
7:28
the New York Times lawsuit
7:30
is, quote, without merit, end
7:32
quote. If courts uphold the
7:34
idea that content providers deserve
7:36
financial compensation, it will make
7:38
it harder for both AI
7:40
developers and researchers to get
7:43
what they need, including academics
7:45
who don't have deep pockets.
7:47
Academics will be most hit
7:49
by these deals, says Longpre.
7:51
There are many very
7:53
pro-social, pro-democratic benefits of having
7:55
an open web, he adds.
7:57
The data crunch poses a
7:59
potentially big problem for the
8:02
conventional strategy of AI scaling.
8:04
Although it's possible to scale
8:06
up a model's computing power
8:08
or number of parameters without
8:10
scaling up the training data,
8:12
that tends to make for
8:14
slow and expensive AI, says
8:16
Longpre, something that isn't
8:18
usually preferred. If the goal
8:21
is to find more data,
8:23
one option might be to
8:25
harvest non-public data, such as
8:27
WhatsApp messages or transcripts of
8:29
YouTube videos. Although the legality
8:31
of scraping third-party content in
8:33
this manner is untested, companies
8:35
do have access to their
8:37
own data, and several social
8:39
media firms say they use
8:42
their own material to train
8:44
their AI models. For example,
8:46
Meta, in Menlo Park, California,
8:48
says that audio and images
8:50
collected by its virtual reality
8:52
headset, Meta Quest, are used
8:54
to train its AI. Yet
8:56
policies vary. The terms of
8:58
service for the video conferencing
9:01
platform Zoom say the firm
9:03
will not use customer content
9:05
to train AI systems. Whereas
9:07
Otter AI, a transcription service,
9:09
says it does use de-identified
9:11
and encrypted audio and transcripts
9:13
for training. For now, however,
9:15
such proprietary content probably holds
9:17
only another quadrillion text tokens
9:20
in total, estimates Villalobos.
9:22
Considering that
9:24
a lot of this is
9:26
low quality or duplicated content,
9:28
he says this is enough
9:30
to delay the data bottleneck
9:32
by a year and a
9:34
half, even assuming that a
9:36
single AI gets access to
9:38
all of it without causing
9:41
copyright infringement or privacy concerns.
9:43
Even a ten times increase
9:45
in the stock of data
9:47
only buys you around three
9:49
years of scaling, he says.
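That "three years" figure follows directly from the doubling rate quoted earlier: if training sets more than double each year, a ten-fold larger pool of data is worked through in log-base-2 of 10, roughly three and a bit, years. A one-line check:

```python
import math
# Years of annual doubling needed to work through a 10x larger stock of data.
print(math.log2(10))  # ~3.32 years
```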
9:51
Another option might be to
9:53
focus on specialized data sets,
9:55
such as astronomical or genomic
9:57
data, which are growing rapidly.
10:00
A prominent AI researcher at
10:02
Stanford University in California has
10:04
publicly backed this strategy. She
10:06
said at a Bloomberg Technology
10:08
Summit last May that worries
10:10
about data running out take
10:12
too narrow a view of
10:14
what constitutes data, given the
10:16
untapped information available across fields
10:19
such as healthcare, the environment
10:21
and education. But it's
10:23
unclear, says Villalobos, how available
10:25
or useful such data sets
10:27
would be for training LLMs.
10:29
There seems to be some
10:32
degree of transfer learning between
10:34
many types of data, says
10:36
Villalobos. That said, I'm not
10:38
very hopeful about that approach.
10:40
The possibilities are broader if
10:42
generative AI is trained on
10:45
other data types, not just
10:47
text. Some models are already
10:49
capable of training, to some
10:51
extent, on unlabeled video or
10:53
images. Expanding and improving such
10:56
capabilities could open a floodgate
10:58
to richer data. Yann LeCun,
11:00
chief AI scientist at Meta,
11:02
and a computer scientist at
11:04
New York University, who is
11:06
considered one of the founders
11:09
of modern AI, highlighted these
11:11
possibilities in a presentation last
11:13
February, at an AI meeting
11:15
in Vancouver, Canada. The 10-to-the-power-13
11:17
tokens used to train a
11:19
modern LLM sounds like a
11:22
lot. It would take a
11:24
person 170,000 years to read
11:26
that much, LeCun calculates.
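The 170,000-year figure is easy to reproduce with back-of-the-envelope assumptions; the words-per-token ratio, reading speed and hours per day below are my own illustrative choices, not LeCun's:

```python
# Rough check of the "170,000 years" claim for reading 10^13 tokens.
tokens = 1e13
words = tokens * 0.75               # assume ~0.75 words per token
minutes = words / 250               # assume 250 words per minute
years = minutes / 60 / 8 / 365.25   # assume 8 hours of reading per day
print(round(years))                 # ~171,000 years
```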
11:28
But, he says, a four-year-old child
11:30
has absorbed a data volume
11:32
50 times greater than this
11:35
just by looking at objects
11:37
during his or her waking
11:39
hours. LeCun presented this data
11:41
at the annual meeting of
11:43
the Association for the Advancement
11:45
of Artificial Intelligence. Similar
11:48
data richness might eventually be
11:50
harnessed by having AI systems
11:52
in robotic form that learn
11:54
from their own sensory experiences.
11:57
We're never going to get
11:59
to human-level AI by
12:01
training on language. That's just
12:03
not happening," LeCun said. If
12:06
data can't be found, more
12:08
could be made. Some AI
12:10
companies pay people to generate
12:12
content for AI training. Others
12:15
use synthetic data generated by
12:17
AI for AI. This is
12:19
a potentially massive source. Earlier
12:21
this year, OpenAI said
12:24
it generates 100 billion words
12:26
per day. That's more than
12:28
36 trillion words a year,
12:30
which is about the same
12:33
size as current AI training
12:35
data sets. And this output
12:37
is growing rapidly.
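The arithmetic behind that comparison is simple; a quick check, using the 100-billion-words-a-day figure quoted above:

```python
# 100 billion words a day, kept up for a year.
words_per_day = 100e9
print(words_per_day * 365)  # 3.65e13, i.e. more than 36 trillion words
```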
12:39
In general, specialists agree that synthetic data seem
12:42
to work well for regimes
12:44
in which there are firm
12:46
identifiable rules, such as chess,
12:48
mathematics or computer coding. One
12:51
AI tool, AlphaGeometry, was
12:53
successfully trained to solve geometry
12:55
problems using 100 million synthetic
12:57
examples and no human demonstrations.
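The reason rule-governed domains suit synthetic data is that the generator can also verify its own answers, so correct training pairs can be produced endlessly without human labelling. A deliberately simple sketch (AlphaGeometry's actual pipeline is far more elaborate):

```python
# Generate unlimited, automatically verified training examples for a
# rule-governed task. Purely illustrative; not AlphaGeometry's method.
import random

def synth_example():
    a, b = random.randint(1, 999), random.randint(1, 999)
    question = f"What is {a} + {b}?"
    answer = str(a + b)  # the rules of arithmetic supply the ground truth
    return question, answer

dataset = [synth_example() for _ in range(100_000)]  # cheap and label-free
print(dataset[0])
```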
13:00
Synthetic data are already being
13:02
used in areas for which
13:04
real data are limited or
13:06
problematic. This includes medical data
13:08
because synthetic data are free
13:11
of privacy concerns and training
13:13
grounds for self-driving cars because
13:15
synthetic car crashes don't harm
13:17
anyone. The problem with synthetic
13:20
data is that recursive loops
13:22
might entrench falsehoods, magnify misconceptions,
13:24
and generally degrade the quality
13:26
of learning. A 2023 study
13:29
coined the phrase model autophagy
13:31
disorder, or MAD, to describe
13:33
how an AI model might,
13:35
quote, go mad, end quote,
13:38
in this way. A face
13:40
generating AI model trained in
13:42
part on synthetic data, for
13:44
example, started to draw faces
13:47
embedded with strange hash markings.
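A much-simplified way to see the recursive-loop problem is to fit a simple statistical model to data, sample from the fit, refit on those samples, and repeat: sampling noise compounds from generation to generation, and the estimated distribution drifts away from the real one. This toy loop illustrates the general idea only; it is not the 2023 study's setup.

```python
# Toy "model autophagy" loop: each generation is trained only on the previous
# generation's synthetic output. Illustrative only.
import random, statistics

random.seed(0)
mu, sigma = 0.0, 1.0                                     # the real distribution
data = [random.gauss(mu, sigma) for _ in range(20)]      # small real sample
for generation in range(1, 31):
    mu = statistics.fmean(data)                          # "train" on current data
    sigma = statistics.stdev(data)
    data = [random.gauss(mu, sigma) for _ in range(20)]  # synthetic data only
print(round(sigma, 3))  # drifts away from the true value of 1.0 as noise compounds
```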
13:49
The alternative strategy is to
13:51
abandon the bigger-is-better
13:53
concept. Although developers continue to
13:56
build larger models and lean
13:58
into scaling to improve their
14:00
LLLMs, many are pursuing more
14:02
efficient small models that focus
14:05
on individual tasks. These require
14:07
refined specialized data and better
14:09
training techniques. In general, AI
14:11
efforts are already doing more
14:14
with less. One 2024 study
14:16
concluded that because of improvements
14:18
in algorithms, the computing power
14:20
needed for an LLM to
14:23
achieve the same performance has
14:25
halved every eight months or
14:27
so.
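Compounded, a halving every eight months is a steep curve; after two years the same performance needs roughly an eighth of the compute. A quick illustration:

```python
# Compute needed for fixed performance, relative to today, if requirements
# halve every eight months.
for months in (8, 16, 24):
    print(months, 0.5 ** (months / 8))  # 0.5, 0.25, 0.125
```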
14:29
That, along with computer chips specialised for AI and
14:31
other hardware improvements, opens the
14:34
door to using computing resources
14:36
differently. One strategy is to
14:38
make an AI model reread
14:40
its training data set multiple
14:43
times. Although many people assume
14:45
that a computer has perfect
14:47
recall and only needs to
14:49
read material once, AI systems
14:52
work in a statistical fashion
14:54
that means re-reading boosts performance,
14:56
says Niklas Muennighoff, a PhD
14:58
student at Stanford University, and
15:01
a member of the Data
15:03
Provenance Initiative. In a 2023
15:05
paper published while he was
15:07
at the AI firm Hugging
15:10
Face in New York City,
15:12
he and his colleagues showed
15:14
that a model learnt just
15:16
as much from re-reading a
15:19
given data set four times
15:21
as by reading the same
15:23
amount of unique data, although
15:25
the benefits of re-reading dropped
15:28
off quickly after that.
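"Re-reading" here simply means training for several epochs, that is, multiple full passes over the same data set. A minimal, purely illustrative loop (not the study's actual setup):

```python
# Multi-epoch training skeleton: four "re-reads" of the same data expose the
# model to as many tokens as one pass over four times as much unique data.
def train(update_step, dataset, epochs=4):
    for _ in range(epochs):          # each epoch is one full re-read
        for batch in dataset:
            update_step(batch)

tokens_seen = 0
def count_tokens(batch):
    global tokens_seen
    tokens_seen += len(batch)

corpus = [["some", "training", "tokens"]] * 1000   # stand-in data set
train(count_tokens, corpus, epochs=4)
print(tokens_seen)  # 4 passes x 3,000 tokens = 12,000 token exposures
```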
15:30
Although OpenAI hasn't disclosed information
15:32
about the size of its
15:34
model or training data set
15:37
for its LLM o1, the firm
15:39
has emphasised that this model
15:41
leans into a new approach,
15:43
spending more time on reinforcement
15:46
learning, the process by which
15:48
the model gets feedback on
15:50
its best answers, and more
15:52
time thinking about each response.
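OpenAI has not described o1's internals, so as a hedged illustration only: one generic way to spend more compute per response is best-of-n sampling, in which several candidate answers are generated and the one a scoring function prefers is kept. The functions below are dummy stand-ins.

```python
# Generic best-of-n sketch of "thinking longer" at inference time. This is an
# illustration of the idea, not a description of how o1 works.
import random

def best_of_n(generate, score, prompt, n=8):
    """Generate n candidate answers and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Dummy stand-ins so the sketch runs; a real system would call an LLM here and
# score candidates with a learned reward model.
answer = best_of_n(
    generate=lambda p: f"{p} -> candidate {random.randint(0, 99)}",
    score=len,
    prompt="example prompt",
)
print(answer)
```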
15:54
Observers say this model shifts
15:57
the emphasis away from
15:59
pre-training with massive data sets
16:01
and relies more on training
16:03
and inference. This adds a
16:06
new dimension to scaling approaches,
16:08
says Longpre, although it's
16:10
a computationally expensive strategy. It's
16:12
possible that LLMs, having read
16:15
most of the internet, no
16:17
longer need more data to
16:19
get smarter. Andy Zou, a
16:21
graduate student at Carnegie Mellon
16:24
University in Pittsburgh, Pennsylvania, who
16:26
studies AI security, says that
16:28
advances might soon come through
16:30
self-reflection by an AI. Now
16:33
it's got a foundational knowledge
16:35
base that's probably greater than
16:37
any single person could have,
16:39
says Zou, meaning it just
16:42
needs to sit and think.
16:44
I think we're probably pretty
16:46
close to that point, Zou
16:48
says. Villalobos
16:50
thinks that all of these
16:52
factors, from synthetic data to
16:54
specialized data sets, to re-reading
16:56
and self-reflection, will help. The
16:58
combination of models, being able
17:00
to think by themselves and
17:02
being able to interact with
17:04
the real world in various
17:06
ways, that's probably going to
17:08
be pushing the frontier, he
17:10
says. To
17:15
read more
17:19
of Nature's
17:22
long form
17:25
journalism, head
17:28
over to
17:31
nature.com/news. Hey
17:34
everybody, I'm Jen. I'm Jess, and we're
17:37
Fat Mascara, the only beauty podcast you
17:39
need in your life. We're beauty editors
17:41
by day and podcasters by night, and
17:43
we've got all the industry gossip for
17:45
you, like insider product reviews and advice
17:47
you're not going to find online. And
17:50
the best part is interviews with the
17:52
most sought-after experts in the beauty biz:
17:54
Charlotte Tilbury, Jen Atkin and makeup artist
17:56
Sir John. That's just a taste of
17:58
what you're going to get on the
18:00
Fat Mascara podcast. So come hang
18:03
with us. New
18:05
episodes drop every Tuesday
18:07
and Thursday.
18:09
ACAST helps creators launch, grow, and monetize
18:11
their podcast everywhere. acast.com