Under the Surface

How Much of the World’s Data Is Inside AI?

Welcome Back to XcessAI

When people talk about AI, they often assume it has access to “all the world’s data.” It feels that way when a chatbot explains quantum physics or drafts a marketing plan in seconds.

But the reality is very different. Today’s AI models are built on only a small fraction of global data. Understanding what’s in, what’s out, and why it matters is critical for leaders who want to use AI intelligently — and for individuals thinking about how to learn in this new era.

Quick Read

  • AI doesn’t train on “all the internet” — only a small, curated slice of global data.

  • Training corpora likely represent well under 1% of the world’s digital information.

  • What’s missing? Most enterprise “dark data,” regulated datasets, and real-time industrial streams.

  • One exception: educational material, which is widely available online — giving AI unusual strength as a self-learning tool.

  • For businesses, the lesson is clear: AI is powerful, but not omniscient. Advantage comes from combining generalist AI with your own specialist data.

The Myth of “All Data”

There’s a common misconception that AI knows everything. It doesn’t.

IDC estimates the world produced around 120 zettabytes of data in 2023, a number expected to triple by 2030. AI training datasets, by comparison, are measured in terabytes to petabytes. Since a petabyte is a millionth of a zettabyte, that’s not just under 1%; it’s a vanishingly small fraction of a percent.
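To put that scale in perspective, here’s a quick back-of-the-envelope check. The 10-petabyte corpus size is a hypothetical round number, since labs don’t publish exact training-set sizes, but the conclusion holds at any plausible scale:

```python
# Back-of-the-envelope scale comparison.
# The corpus size is a hypothetical round number; real figures aren't public.
world_data_zb = 120      # IDC estimate for 2023, in zettabytes
corpus_pb = 10           # hypothetical training corpus, in petabytes

pb_per_zb = 1_000_000    # 1 zettabyte = 1,000,000 petabytes
share = corpus_pb / (world_data_zb * pb_per_zb)

print(f"Corpus share of the world's data: {share:.8%}")  # ~0.00000833%
```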

The vast majority of the world’s data is never touched by AI training pipelines.

What AI Actually Trains On

So what’s in the mix?

  • Open web text: Wikipedia, blogs, news, forums, Common Crawl.

  • Books and research: public-domain literature, licensed academic sets.

  • Code: GitHub repositories, open-source libraries.

  • Images and video: curated archives, licensed datasets.

These sources are filtered to remove spam, duplicates, and low-quality text. Private data — emails, corporate documents, or your cloud drive — doesn’t make it in unless companies opt in through fine-tuning or custom integrations.
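What does that filtering look like? Pipelines differ between labs and the details aren’t public, but exact deduplication is one of the simplest steps. Here’s a minimal sketch with illustrative sample documents; real pipelines add near-duplicate detection, language identification, and quality classifiers:

```python
import hashlib

def dedupe(documents):
    """Drop exact duplicates by hashing normalized text."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "AI is transforming business.",
    "AI is transforming business.",   # exact duplicate, dropped
    "Quantum computing, explained.",
]
print(dedupe(docs))  # keeps 2 of the 3 documents
```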

How Much of the Public Web Is Inside AI?

Even within the open internet, coverage is less than most people imagine.

  • Common Crawl, one of the largest open web scrapes, contains about 250–300 billion pages — but that’s still only a portion of the live web.

  • Much of the internet is behind paywalls, login screens, or APIs (think Netflix, Bloomberg, academic journals). None of this is in the base training sets.

  • Estimates suggest models like GPT-4 and Claude are trained on single-digit percentages of the public web — heavily filtered for quality and deduplicated.

  • Entire domains, like scientific literature or specialized industry sites, may only appear if licensed or specifically included.

So even in the “public” world, AI has gaps. That’s why answers sometimes feel incomplete, generic, or biased toward mainstream sources.

And those gaps aren’t evenly distributed. In some areas — like law, education, or software — AI feels remarkably capable because large volumes of digitized material are available online. In others — such as healthcare, finance, or specialized sciences — coverage is far thinner, constrained by regulation, paywalls, or lack of digitization. This unevenness explains why AI can sound authoritative in one domain and strangely vague in another.

The Proprietary Data Question

One area of controversy is whether AI companies have trained on proprietary or copyrighted material without explicit permission.

  • In recent years, firms like OpenAI, Anthropic, and Google have signed licensing deals with content owners — Reddit, Shutterstock, Axel Springer, and others — to access high-quality data legally.

  • But earlier versions of large models may have ingested copyrighted text and images scraped from the web, which has led to ongoing lawsuits from news outlets, authors, and artists. Those cases are still unresolved.

  • The reality is evolving: the old “scrape first, apologize later” approach is giving way to negotiated data pipelines and licensing markets.

For business leaders, the message is simple: the data that fuels AI is not static, and the rules of ownership are still being defined.

What Gets Left Out

The bigger story is what’s missing:

  • Enterprise “dark data”: Estimates suggest 80–90% of corporate data is never analyzed, let alone shared.

  • Real-time operational data: IoT, industrial sensors, logistics streams.

  • Regulated or sensitive datasets: healthcare, finance, government records.

  • Non-digitized knowledge: oral cultures, proprietary training, paper archives.

In short, AI knows a lot about the public internet, but far less about the specialized, proprietary, or confidential worlds where business value is created.

The Education Exception

One fascinating outlier is educational material. Unlike corporate or industrial data, knowledge resources — textbooks, tutorials, academic papers, training manuals — are widely digitized and shared online.

That’s why today’s AI can explain algebra, draft essays, or teach you the basics of Python coding almost instantly. Models have absorbed a huge portion of the world’s general learning resources, making them powerful tools for self-education.

For individuals, this is transformative: the barriers to accessing structured knowledge have collapsed. While AI can’t replace lived experience or mentorship, it does offer a near-universal tutor — available at any time, on almost any subject.

Why It Matters

This imbalance — a sliver of total data, but heavy on public and educational content — has big implications:

  • Bias: AI leans toward English-speaking, developed-world perspectives.

  • Blind spots: Industry-specific nuances are often missing unless you bring your own data.

  • Opportunity: Companies can gain an edge by plugging proprietary datasets into AI, whether through fine-tuning or retrieval-augmented generation (RAG); a minimal sketch follows below.

The models are generalists. Your advantage comes from adding the specialist context.
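To make that concrete, here’s a minimal sketch of the RAG pattern mentioned above: retrieve the most relevant in-house snippets for a question, then pack them into the prompt sent to a general-purpose model. The documents and the word-overlap scoring are illustrative stand-ins; production systems use vector embeddings and a real model API.

```python
def retrieve(question, documents, k=2):
    """Rank documents by simple word overlap with the question."""
    q_words = set(question.lower().split())
    return sorted(documents,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(question, documents):
    """Assemble a grounded prompt from the top-ranked snippets."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {question}")

# Hypothetical proprietary snippets no public model saw in training
docs = [
    "Q3 churn in the enterprise segment rose 4% after the pricing change.",
    "Warranty claims are concentrated in the 2022 compressor line.",
    "The Lisbon plant logs defects per shift in a custom MES.",
]
print(build_prompt("Why did enterprise churn rise last quarter?", docs))
```

The point isn’t the toy scoring; it’s the pattern: the model’s general knowledge plus your retrieved context produces answers no off-the-shelf model could give on its own.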

The Data Power Dynamics

Another factor shapes what goes into AI beyond what’s public or private: who controls the richest datasets.

A handful of technology giants sit on vast reservoirs of search queries, social graphs, e-commerce activity, maps, and cloud logs. These are among the most valuable datasets in the world, and they’re largely inaccessible to outsiders.

This concentration creates two dynamics:

  • Asymmetry: A small number of firms can train frontier models with unparalleled breadth and depth.

  • Barriers for start-ups: New entrants often lack access to the same raw material, pushing them to innovate in niche domains, proprietary customer data, or smarter architectures instead.

The exact percentages are debated, but the principle is clear: data monopolies matter. The future of AI won’t just be shaped by algorithms, but by who controls the pipelines of information they learn from.

What This Means for Business Leaders

  • Don’t assume omniscience: AI won’t “know” your business unless you feed it your data.

  • Invest in data readiness: Cleaning, structuring, and tagging your information is the real unlock.

  • Leverage education: Encourage teams to self-train with AI. It’s one domain where models truly shine.

  • Seek differentiation: Use proprietary datasets to create insights competitors can’t replicate.

Closing Thoughts

AI hasn’t swallowed the world’s data. Far from it. It’s trained on a sliver — but a sliver that’s powerful enough to draft contracts, summarize research, and tutor millions of people.

That paradox is the real story: AI feels omniscient because it’s mastered what’s public, but its limits show where the next opportunities lie.

For leaders, that means combining AI’s generalist abilities with your own private data. For individuals, it means seizing the chance to learn faster than ever before.

The world’s data is too vast for any single model to capture. But in the gap between what’s included and what’s excluded, business and personal advantage is waiting to be claimed — by those who can connect the right data to the right questions.

Until next time,
Stay adaptive. Stay strategic.
And keep exploring the frontier of AI.

Fabio Lopes
XcessAI

P.S.: Sharing is caring, so pass this knowledge on to a friend or colleague. Let’s build a community of AI aficionados at www.xcessai.com.

Read our previous episodes online!
