Over the last 30 years, the internet has accumulated the breadth of human knowledge and experience and placed it at our fingertips. You can find almost anything on the internet: the answer to any question, the text of any book, or a survey of the research in any field. AI enthusiasts and the tech companies poised to profit from AI promise that, having been trained on this breadth of knowledge, AI will deliver new ways of retrieving information from the internet to answer just about any question.
However, if AI chatbots truly succeed in wresting a significant share of our queries from search engines, they will kick off their own death spiral, taking much of the web with them. The success of AI will entail its own degradation.
The degradation of a dataset
Many AI-watchers have already pointed out that the popularity of AI chat products like OpenAI’s ChatGPT or Google’s Bard will trigger a self-reinforcing feedback loop. Marketers and spammers will saturate the web with AI-generated content. Future models will be trained in this new environment, meaning that researchers will inevitably scoop up a great deal of AI-generated content as they trawl the internet for new data to feed their models. New models, in other words, will be trained on the hallucinations of their forebears. Without vigorous fine-tuning and careful data sanitization, each generation of models will end up further removed from reality than the one before it.
This is scary enough on its own, but there is another way AI chatbots may create a vicious cycle, one that ends up rotting their training dataset: the internet itself.
The destruction of the web
Knowledge on the internet exists in a virtuous cycle: a user wants the answer to a particular question, so they ask a search engine. The search engine returns websites that answer the question, and when one of those websites gets traffic from the search, it monetizes that traffic in some way. The user gets their question answered, and the website gets rewarded for adding a source of new knowledge to the web.
AI disrupts this cycle. When a user asks their question directly to an AI chatbot, no one—other than the website hosting the chatbot—gets traffic from this query. Without traffic, there will no longer be an incentive for a publisher to host a website with the answer to a particular query.
LLMs have been fed high-quality data from across the internet. Their purpose is to internalize that data and regurgitate it in some other form, which inherently removes the data from its original context. Since the original reason for sharing the data was monetization, and since no monetization happens when the data is accessed indirectly through an LLM, the redirection of traffic from the original source to the LLM will snuff out that source.
Eventually, altruism will be the only reason left to post new knowledge on the internet. Websites hosting knowledge will blink out, starved of traffic, and with them will go the high-quality sources of training data that AI models depend on.
You might argue that products like Bing Chat include citations, which place data back into its context and keep the virtuous cycle of knowledge sharing going. While I would agree that this is a step in the right direction, the purpose of the chatbot is still to be the main presenter of the data, while the purpose of the chatbot’s citations is to build trust in the chatbot. If LLM-based chatbots improve to the point that we routinely trust them without following the link to their citations (much as most people interact with Wikipedia), the cycle is disrupted once again.
Until AI can be trusted not to hallucinate the answers to questions where the truth matters, I expect people will continue to use search engines rather than AI chatbots for their important queries. Without a systematic solution to the problem of hallucinations, it would be unwise to trust an AI over a search engine for the answer to an important question like “how many benadryls should I take for my allergies”. However, even if AI does find a way to guarantee truthful answers to most questions, this very use case will eat its own training data—and the internet.