ChatGPT, Gemini, Copilot, and other AI tools can generate impressive sentences and paragraphs from a simple prompt, thanks to training on vast amounts of human-written text scraped from the internet. But as generative AI floods the web with synthetic content, a troubling possibility arises: future generations of AI models, trained in part on text that earlier models produced, could steadily degrade.
Model Collapse: A Growing Concern
Researchers argue that training large language models (LLMs) on their own outputs could lead to model collapse. In a recent study published in Nature, Ilia Shumailov and colleagues from the University of Oxford warn that this could result in biased outputs or, more worryingly, complete breakdowns into gibberish.
In their experiments, Shumailov and his team observed that training a language model on text it had already generated led to errors compounding over generations. The OPT-125m model, initially fine-tuned on Wikipedia articles, was fed its own output for further training. By the ninth generation, the model’s responses had become nonsensical — a prompt about 14th-century architecture devolved into a list of jackrabbits.
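To make the setup concrete, here is a minimal sketch of that kind of recursive fine-tuning loop in Python, using the publicly available facebook/opt-125m checkpoint from Hugging Face. The helper names (fine_tune, sample_corpus), the placeholder seed texts, the prompt, and the hyperparameters are illustrative assumptions, not the study's actual pipeline.

```python
# A rough sketch of recursive fine-tuning. Assumptions: helper names, prompts,
# and hyperparameters are illustrative; this is not the study's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

def fine_tune(model, texts, lr=5e-5):
    """One pass of causal language-model fine-tuning on a small text corpus."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for text in texts:
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model

def sample_corpus(model, prompts, max_new_tokens=128):
    """Generate a synthetic training corpus from the current model."""
    model.eval()
    texts = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model.generate(ids, do_sample=True, max_new_tokens=max_new_tokens)
        texts.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return texts

seed_texts = ["(human-written Wikipedia paragraphs would go here)"]  # placeholder
prompts = ["Write about 14th-century church architecture."]          # hypothetical prompt

corpus = seed_texts  # generation 0 learns from human-written text
for generation in range(9):
    model = fine_tune(model, corpus)
    corpus = sample_corpus(model, prompts)  # later generations learn only from model output
```

The key feature the loop mimics is that only the first generation ever sees human-written text; every later generation sees nothing but the previous generation's output.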
Why Does This Happen?
Shumailov compares the issue to the game of telephone, in which a message passed from person to person gets distorted. Each generation of a model makes small errors when it generates text; when the next generation trains on that output, the errors compound, slowly erasing the diversity and accuracy present in the original data. Rare patterns vanish first, biases deepen, and the text can eventually degrade into meaningless strings.
The Risks of Training AI on Its Own Data
Training AI models on their own generated data erodes rare but important information and can amplify bias. A model might, for instance, get very good at describing furry cats while losing the ability to describe hairless ones, simply because hairless cats appear far less often in the data. In the same way, minority voices, already underrepresented in training sets, could fade from AI-generated text as each successive generation drops their already-scarce examples.
Leqi Liu, an AI researcher at the University of Texas at Austin, points out that such collapse would make AI-generated content more homogeneous and more biased, stripping away the diversity of expression we usually want from text.
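A toy simulation, not taken from the Nature study, makes the furry-versus-hairless example above concrete: if each generation of a model learns word frequencies only from a finite sample of the previous generation's output, sampling noise alone can wipe out the rare category, and once it is gone it never comes back.

```python
# Toy illustration with hypothetical numbers: a two-word "world" where
# "hairless" is rare. Each generation fits its word frequencies to a finite
# sample drawn from the previous generation's model.
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.99, 0.01])  # P(furry), P(hairless) in the original human data
n = 500                         # "documents" each generation produces

for generation in range(1, 21):
    counts = rng.multinomial(n, probs)  # corpus the current model generates
    probs = counts / n                  # the next model learns only from that corpus
    print(f"generation {generation:2d}: P(hairless) = {probs[1]:.4f}")
# The estimate drifts with sampling noise; if any generation happens to sample
# zero "hairless" documents, the word can never reappear in later generations.
```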
How to Prevent Model Collapse
To avoid this, experts suggest that AI models should never be trained on AI-generated data alone. Keeping a healthy share of human-written text in every training set helps a model retain diversity and accuracy. It also helps to deliberately preserve the “tail of the distribution”, the rare or unusual data (descriptions of hairless cats, or writing from underrepresented groups) that recursive training erases first.
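One way to act on that advice, sketched below under assumed numbers: cap the share of synthetic text in each training set and always carry the full human corpus, rare examples included, forward. The 10 percent cap and the function name build_training_set are illustrative choices, not figures from the study.

```python
# A minimal sketch of mixing human and AI-generated text (the 10% synthetic
# cap is an assumed, illustrative figure).
import random

def build_training_set(human_texts, synthetic_texts, synthetic_fraction=0.1, seed=0):
    """Keep every human-written example and cap the synthetic share."""
    rng = random.Random(seed)
    # Admit only enough synthetic text to keep it at `synthetic_fraction` of the
    # final mix; all human text, including rare "tail" examples, is retained.
    max_synthetic = int(len(human_texts) * synthetic_fraction / (1 - synthetic_fraction))
    kept = rng.sample(synthetic_texts, min(max_synthetic, len(synthetic_texts)))
    mixed = human_texts + kept
    rng.shuffle(mixed)
    return mixed

# Example: 90 human paragraphs plus at most 10 synthetic ones.
human = [f"human paragraph {i}" for i in range(90)]
synthetic = [f"model paragraph {i}" for i in range(500)]
print(len(build_training_set(human, synthetic)))  # 100 texts, 10% synthetic
```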
Despite these risks, big AI companies have systems in place to monitor and correct for issues like data drift, which means the likelihood of catastrophic collapse in large-scale commercial models is relatively low. However, individuals attempting to build their own models on a smaller scale should be aware of these risks and take steps to mitigate them.
Using AI to train more AI carries real pitfalls, and the concern will only grow as model-generated text spreads across the web. Keeping training data diverse, accurate, and anchored in human-written content remains critical to preventing model collapse.