Stack Overflow and Reddit have announced that they will start charging companies for using their data to train artificial intelligence (AI) algorithms and chatbots, joining a growing number of websites seeking compensation for the content they provide.
The move is part of a broader strategy for generative AI, which requires large amounts of training data to improve the quality of AI applications such as ChatGPT and image generator Dall-E.
Stack Overflow, which has more than 20 million registered users, will begin charging large AI developers in the middle of 2023 for access to its 50 million questions and answers.
Stack Overflow joins Reddit in seeking compensation for its data
Stack Overflow CEO Prashanth Chandrasekar says that the potential additional revenue is vital to maintaining the quality of information on the site and ensuring that it can keep attracting users.
However, critics argue that fencing off valuable data could deter some AI training and slow the improvement of large language models (LLMs). These models are a threat to any service that people turn to for information and conversation.
Stack Overflow and Reddit are not alone in wanting a share of the profits from the use of their data.
The News/Media Alliance, a US trade group of publishers, including Condé Nast, which owns WIRED, recently unveiled principles calling on generative AI developers to negotiate any use of their data for training and other purposes and respect their right to fair compensation.
LLMs are used to generate strings of text based on word patterns learned from web pages, books, and other bodies of text in their training data. Besides ChatGPT, the programs make up the guts of search chatbots such as Microsoft Bing chat and Google’s Bard.
They underlie a growing number of applications that produce professional and creative copy in a flash. Their counterparts that generate AI-composed illustrations and videos draw on patterns from image datasets such as photos gathered from Pinterest and Flickr.
Data sets used in AI development are often built through unofficial means such as dispatching software that scrapes content from websites. In the US that is typically considered legal, though copyright issues and websites’ terms of use against the practice have left it in dispute.
Stack Overflow and Reddit offer downloadable “data dumps” or real-time data portals, known as APIs, that help software access their content. In Stack Overflow’s case, LLM developers are getting their hands on data through a mix of dumps, APIs, and scraping, Chandrasekar says, all of which today can be done for free.
Stack Overflow’s and Reddit’s approach
Stack Overflow and Reddit will continue to license data for free to some people and companies. Stack Overflow wants remuneration only from companies developing LLMs for big, commercial purposes.
Reddit CEO Steve Huffman told The New York Times that he didn’t want to give a freebie to the world’s largest companies.
“Crawling Reddit, generating value, and not returning any of that value to our users is something we have a problem with,” he said.
The move by Stack Overflow and Reddit to charge for their data could further delay the already uncertain timelines for turning a profit on large-scale AI systems.
Every AI developer is seeking to bring down the huge costs of developing these systems, which require enormous amounts of expensive computing power to run. Having to pay for data they once grabbed for free could make the process even more costly.
However, Chandrasekar says proper licensing will only help accelerate the development of high-quality LLMs. In Stack Overflow’s case, an assistant function could help guide people as they compose questions to post.
Wrapping Up
Stack Overflow’s decision to charge for its data marks a significant shift in the AI development landscape. It remains to be seen whether other websites will follow suit, but the move could help to ensure that AI developers have access to high-quality training data. However, charging for data could also deter some developers and slow the pace of progress in the field.
Ultimately, the success of the strategy will depend on how much AI developers are willing to pay for access to Stack Overflow’s and Reddit’s data. If prices are too high, developers may choose to look elsewhere for training data, which could limit the growth of AI applications and chatbots.
Despite these challenges, the move by Stack Overflow and Reddit to seek compensation for their data is an important step forward in the development of generative AI.
As the use of AI continues to grow, it will become increasingly important for companies to ensure that they are fairly compensated for the use of their content. This will help to ensure that AI development remains sustainable and that developers have access to the data they need to create the next generation of AI applications and chatbots.