The AI Generated Elephant in the Room
The last couple of weeks have seen the news aggregator and community forum Reddit engage in a virtual revolution between the company that owns the site and the unpaid moderators who foster its communities and commentary. There is a lot of speculation about what motivated the changes that led to this revolt, but the timing lines up with information coming to light that OpenAI, the company behind ChatGPT, used the Reddit API to train the impressive chatbots that have been shaking up every industry.
Reddit's reaction of locking down the site to prevent bot scraping set off a chain of events that has removed moderators from popular communities, pushed those communities to migrate to other sites, and may make the very training data Reddit hopes to sell less valuable. It's a bit early to say whether these actions will lead to a death spiral in users and engagement, but they have already made the internet a bit less useful and caused a 7% markdown in Reddit's valuation. That said, let's dig into what makes Reddit's data in particular so valuable.
Garbage In, Garbage Out
In computer science there is a notion of "garbage in, garbage out". This means that if you feed a system bad data, you will get bad results; there is no extra magic in the machine. Machine learning only amplifies bad data and mixes in some unearned confidence.
Our AI overlords are hungry for data, but there is growing research showing that when the training well is poisoned with AI-generated content, second-generation models start to break down. Additionally, generative AI is becoming so pervasive that it will soon be impossible to find training sets without AI-generated conversations mixed in. This could leave us with one generation of clean AI language models before the second generation, trained on content from an army of chatbot posters, poisons the well.
Reddit has been flooded for years with live humans who post incorrect information, instigate unrest, and occasionally offer valuable insights. The secret sauce has always been smaller communities with moderators who cultivate a strong culture and enforce community-agreed rules to filter through the noise. Bots spouting generic ChatGPT-style responses are not new, and strong communities can root them out. Throwing these moderators' goodwill out with the metaphorical bathwater in order to chase a legacy social media business model has been difficult to watch.
Processes as a Service
Every organization that survives is motivated to get very good at the loops that boost its business. Machine learning has shown itself to be very good at locking in those loops, whether good or bad. To train a model, you first have to be very good at a process; only after achieving success with your existing processes should you try to train an AI model to lock in those patterns.
We're starting to see businesses that are highly focused and specialized in one thing train AI models and sell those models as services. In a way, we've finally found a way to monetize business processes in a SaaS (Software as a Service) environment. The implication is that when you pick up these models, you are inserting another company's business processes into your own. All of the flaws in their workflows become your flaws as well.
Reddit had a super valuable moderation loop that generated highly filtered content, which is exactly why it was targeted by this first generation of AI training. Other companies like Stack Overflow and GitHub have their own process loops that contribute to their AI offerings. As we start bringing these models into our lives and businesses, it will be worth continuing to consider the communities where they were trained.
Letting the Elephant in
We have a lot to learn about where to effectively deploy Large Language Models and AI. That starts with not treating them as magic boxes that can solve our problems without our taking the time to understand the problem.
The industry is used to inviting consultants in to evaluate and reform internal processes. With these groups we build trust, assess effectiveness, and proceed in stages. AI models should probably be considered along similar lines: an opinionated set of electronic workers trained in a different environment, with a certain understanding of the world. It's going to take similar work and consideration before we invite the entire zeitgeist of these internet communities into our lives.