AI Industry Faces Data Limits For Large Language Model Training

Large language models have been trained on roughly 20% of the world's publicly available data according to projections analyzed by Forbes. New high-quality data is not being generated quickly enough to sustain current growth rates while restrictions on web data use have increased.

1 source · May 14, 6:45 AM · 2m read

Large language models have consumed roughly 20% of the world's publicly available data, according to an analysis of public projections reported by Forbes. New data is still created daily, but once low-quality or repetitive content is excluded, it does not arrive at the pace that current AI training demands.

Web sources have begun restricting access to their data for AI training, adding to supply constraints. Training larger models also requires substantial computing power and electricity, with data centers expanding and energy capacity becoming limited in parts of the United States.

This process has been compared to a Ponzi scheme that could lead to declining performance over time. The computing demands of ever-larger models have prompted a reassessment of development priorities. Rather than continuing to scale models through additional data and power, the focus could shift toward smaller, more targeted systems trained on carefully selected datasets.

One AI developer released models that delivered competitive results in certain areas while using significantly fewer computing resources. This approach demonstrated that resource limitations can encourage innovations in model architecture and training methods that reduce hardware and energy requirements.

Historical precedent exists for similar technology constraints. Computer clock speeds reached physical limits around three to four gigahertz more than 15 years ago, after which manufacturers improved performance through multi-core designs and other architectural changes rather than higher frequencies.

Current generative AI systems function primarily as advanced predictive tools for text, images and audio. These systems can produce confident but inaccurate responses when queries extend beyond their training data boundaries. Analytical AI applications, which have existed longer and reached greater maturity, offer alternatives for business use.

These include predictive analytics for demand forecasting and optimization tools that allocate resources based on inventory and customer patterns. Organizations may achieve better results by matching specific AI approaches to defined needs rather than pursuing general-purpose systems.

Not all data holds equal value for every objective, and targeted selection of training information can improve outcomes without requiring larger volumes.
