
AI Model Collapse

The mythological Ouroboros is an apt symbol for AI Model Collapse.

One aspect of modern Artificial Intelligence (AI) — see my blog about how this term is pretty much meaningless from a scientific point of view — that big tech and AI specialists don't want to talk about is the phenomenon of Model Collapse. In a nutshell, Model Collapse refers to a dramatic degradation in the performance of large language models (LLMs) and convolutional neural networks (CNNs) that occurs when they are trained on data produced by other LLMs and CNNs.

No Problem! No one would train on data like that, right?

It is tempting to think no one would train on data that could cause Model Collapse, but contamination is practically unavoidable. Consider the state of training data today. ChatGPT, Gemini, Leo, Sora, and nearly every other AI system deployed today are trained on data scraped from the World Wide Web, and for good reason: that is where the bulk of the data lives, much of it already labeled in some way by the humans who uploaded it, while purpose-built datasets are prohibitively expensive for most companies. Prior to 2024, billions of lines of text, images, videos, and audio files had been uploaded to the internet and partially annotated by their uploaders: a treasure trove of the world's images, videos, books, and blogs (plus a steady stream of pet photos) to train on. After 2024, billions of AI-generated texts, images, videos, audio files, and more are being uploaded annually as well. Any AI model trained on internet-sourced data after 2024 will therefore contain some fraction of AI-generated data in its training set, and that is exactly the cause of Model Collapse. So it is reasonable to expect AI models to start exhibiting symptoms of collapse from 2024 onward, and the better their output gets, the more of it will end up back in their own training data pool. It is a problem that is here to stay.
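The feedback loop described above can be illustrated with a toy simulation. To be clear, this is only a sketch, not how real LLMs are trained: the "model" here is just a Gaussian fitted by its mean and standard deviation, and each generation is trained solely on a small sample of its predecessor's output. Under those assumptions, the learned distribution's variance withers over successive generations, which is the statistical signature of Model Collapse.

```python
import random
import statistics

def train(data):
    """'Train' a toy model: fit a Gaussian (mean, stdev) to the data."""
    return (statistics.mean(data), statistics.stdev(data))

def generate(model, n):
    """Sample n synthetic data points from the fitted model."""
    mu, sigma = model
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(42)

# Generation 0 trains on "real" human-produced data.
real_data = [random.gauss(0.0, 1.0) for _ in range(1000)]
gen0 = train(real_data)

# Every later generation trains only on a small sample of the
# previous generation's output, simulating a web that steadily
# fills up with AI-generated content.
model = gen0
for generation in range(400):
    synthetic = generate(model, 20)
    model = train(synthetic)

print(f"generation   0: stdev = {gen0[1]:.3f}")   # close to 1.0
print(f"generation 400: stdev = {model[1]:.3f}")  # far smaller than gen 0
```

Real models fail in richer ways (the rare tails of the distribution vanish first), but the shrinking variance in this toy chain captures the core mechanism: each generation can only learn what the previous one happened to sample.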

Another concerning signal that has been growing this year is that major AI tech companies are stating they have exhausted the available data to train on. Of course they can still improve their labeling pipelines and tune their training regimens, but this is a pretty strong signal that they are becoming desperate for more data to ingest. That will make it even harder to keep AI-produced data out of their input pipelines.

I'm definitely not a technology doomsayer; human ingenuity typically keeps us one step back from the cliff. What people are calling AI today is pattern-classification and pattern-generation software built on deep neural networks, and it is here to stay. These systems will likely be a boon to creative processes and reduce the cost of things like stock photos, product descriptions, and advertising videos.

However, there are some pretty strong signals that current "AI" systems have peaked and may even start being plagued by Model Collapse in the coming years.


© 2024
