Training data

By Paul Brock·Updated on 22-04-2026
TL;DR

Training data is the collection of texts, images or other examples an AI model learns patterns from before deployment.

The quality and breadth of training data determine LLM quality more than any other factor. GPT-5 is trained on trillions of tokens from web pages, books, code repositories, papers and (increasingly) synthetic data. The training cut-off, the moment the training set ends, determines what the model 'natively' knows without web search. For GEO this means: content that was published before the cut-off and is widely crawled shapes what AI says about a topic by default.

Example

An LLM with a September-2023 cut-off has no idea that the April 2024 Bitcoin halving led to ASIC shutdowns, unless you feed that in via RAG. That's why live retrieval and grounding are crucial for current information.
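The grounding step can be sketched in a few lines. This is a toy illustration, not a real RAG pipeline: the retriever is naive word overlap over an in-memory list, and the final `llm()` call is left out because it depends on your provider. The point is the shape: fetch a fresh document, then put it in the prompt so the model answers from post-cut-off facts it never saw in training.

```python
def retrieve(query, documents, k=1):
    """Rank documents by naive word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    def score(doc):
        return len(q_words & set(doc.lower().split()))
    return sorted(documents, key=score, reverse=True)[:k]

# Hypothetical fresh documents, e.g. pulled from a live web search.
fresh_docs = [
    "After the April 2024 Bitcoin halving, several older ASIC fleets were shut down.",
    "Ethereum switched to proof of stake in 2022.",
]

query = "What happened to ASICs after the 2024 halving?"
top = retrieve(query, fresh_docs)[0]

# The grounded prompt carries facts the model has no native knowledge of.
prompt = f"Answer using only this context:\n{top}\n\nQuestion: {query}"
```

In production the toy retriever would be replaced by a search index or vector store, but the prompt-assembly step stays essentially the same.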

Frequently asked questions

Can I remove my content from training data?

Partially. OpenAI and Anthropic honour opt-outs via robots.txt (disallow rules for GPTBot and ClaudeBot) for future training runs. Already-trained models can't selectively 'forget' what they've seen; that would require retraining from scratch.
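The opt-out itself is a plain robots.txt fragment. A minimal example blocking both crawlers site-wide (check each vendor's crawler documentation for the current user-agent tokens before relying on this):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

A `Disallow` on a specific path instead of `/` lets you exclude only part of a site while leaving the rest crawlable.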

