AI developers are running out of data. Where can they get more?

OpenAI sparked a technological revolution with the debut of ChatGPT in November 2022, and millions of users worldwide were amazed by the chatbot’s ability to engage in human-like conversations on just about any topic they could dream up.

It kicked off a craze for AI that has only gotten more manic by the day, with every tech company worth its salt looking to get in on the act with their own generative AI models. We quickly saw a response from Google and Meta with their Gemini and Llama large language models, and Microsoft, which is already closely associated with OpenAI, has forged ahead in building its own models too.

Add to that the host of AI startups, ranging from Anthropic to Cohere to AI21 Labs and now DeepSeek, and it’s clear that the industry has become a mad free-for-all, with dozens of competing players scrambling to cash in on the insane level of demand for next-generation AI tools.

AI models are trained and built using vast amounts of data, and they need ever-increasing amounts of it to improve. To obtain this data, most AI developers go to the most obvious source of it – the public internet, where they freely scrape massive amounts of information.

Crawling and scraping

One thing that most people don’t realize is that there’s no easy place where you can go to just “download the internet”. So, what AI developers do is rely on tools known as “web crawlers”, which scour the world wide web, moving from link to link as they index all of the information they see within a database. Then they use “web scrapers”, which go through that database and download all the information it leads them to.
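
As a rough illustration of the two steps described above, here is a minimal Python sketch of a crawler that follows links to build an index and a scraper that then pulls the text from every indexed page. It is not how any particular company does it. The `FAKE_WEB` dictionary, the `example.com` URLs and the function names are all assumptions made for the example, and the "web" is an in-memory stand-in so that the code runs without any network access.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" (URL -> HTML). A real crawler would fetch
# pages over HTTP and respect robots.txt; this stand-in keeps the sketch offline.
FAKE_WEB = {
    "https://example.com/": '<a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a>',
    "https://example.com/a": '<p>Page A</p> <a href="https://example.com/b">B</a>',
    "https://example.com/b": '<p>Page B</p> <a href="https://example.com/">home</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class TextExtractor(HTMLParser):
    """Collects the visible text of a page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def crawl(seed, fetch):
    """Crawler: move from link to link (breadth-first) and index each URL once."""
    index, seen, queue = [], {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved, skip it
            continue
        index.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def scrape(index, fetch):
    """Scraper: go through the index and download the text of each page."""
    corpus = {}
    for url in index:
        html = fetch(url)
        if html is None:
            continue
        parser = TextExtractor()
        parser.feed(html)
        corpus[url] = " ".join(parser.chunks)
    return corpus


index = crawl("https://example.com/", FAKE_WEB.get)
corpus = scrape(index, FAKE_WEB.get)
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other, as the toy pages here do.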

Companies with immense resources, like Google and Microsoft, possess the money and the expertise to create these web crawlers and scrapers themselves, and that ability likely gives them an edge over rivals that cannot. The rest tend to turn to existing resources such as Common Crawl, a non-profit organization that crawls the web and compiles what it downloads into a massive, open-source database that’s updated every few months. Another resource is the Large-Scale Artificial Intelligence Open Network, known as LAION, which is full of links to images it finds on the web, along with any captions posted alongside them.

In addition, there are other nonprofits that have an interest in promoting the development of AI, such as the Allen Institute for AI. It works to compile open datasets for large language model developers, such as the Dolma database that contains more than three trillion tokens from various web pages, books, codebases, academic papers and encyclopedias found online.

Content creators push back

These databases are all created by web crawlers and scrapers, but there’s a lot of controversy over this common practice, as it raises questions about the accuracy and reliability of the models trained using this information. After all, there’s plenty of junk information, rumor and hearsay posted online. It has also led to disputes regarding copyright, with many content creators arguing that they should be compensated, given that AI is perceived as a threat to their livelihoods.

Some companies have tried to get around this by paying to access data. For instance, OpenAI has come to terms with news organizations such as Axel Springer and the internet forum Reddit, paying to access their content. Others, such as Meta, are using their own data, such as the millions of public posts on Facebook and Instagram, to train their models. Elon Musk, the owner of X, says his company is doing the same to train its Grok family of LLMs. Amazon has stated it will use voice data from customers who converse with its digital assistant Alexa.


However, these practices aren’t all that popular either, as many social media users are quite uncomfortable with the idea that their posts and comments are being used to train AI models.

There has been a lot of pushback, but AI developers are unable to quench their thirst for ever more data, given that it’s the lifeblood of their algorithms. As such, there are questions about where they can obtain the information they need to keep creating innovative new AI applications.

Synthesizing data might be a solution

One possible solution to this question might be “synthetic data”, which is information that’s artificially generated by machines that first consume enormous amounts of real-world data.

If you have plenty of real-world data to begin with, it’s possible to create as much synthetic information based on that as you’ll ever need, but there are questions about the quality of this artificial information. After all, it’s all sourced from real data made by humans, and if that source data is inaccurate or biased, the resulting synthetic information will likely magnify those issues.

As a result, the more synthetic information that’s used to train AI models, the worse their biases and inaccuracies will become, leading to more “hallucinations”, which are instances where AI makes mistakes or simply creates facts out of thin air.

If synthetic data is to become a viable solution to the soaring demand for training datasets, then there’s a need to ensure it meets a baseline of quality standards, which will only be possible if some kind of human input remains.

Boosting data quality with competition

This is where Fraction AI could make a difference. It’s the creator of a unique, blockchain-based protocol that has transformed the task of generating synthetic data into a competition, where human developers create AI agents that compete to generate new datasets. By creating a successful AI agent that excels in synthetic data creation, the developers can earn substantial rewards for their participation.

Fraction AI hosts regular competitions between AI agents, which compete to create the most accurate and reliable datasets according to the specific requirements of each contest. Entrants pay a fee in cryptocurrency to enter these competitions, but only the best performers are rewarded, pushing developers to create better AI agents.
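
To make the incentive concrete, here is a purely hypothetical Python sketch of how one round of a pay-to-enter, winners-only competition could be settled. It is not Fraction AI’s actual payout logic, which this article does not describe. The `Entry` class, the `settle_round` function, the even split between winners and the `staker_share` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    agent: str
    score: float  # quality score from a hypothetical judge; higher is better


def settle_round(entries, entry_fee, winners=1, staker_share=0.1):
    """Split the pooled entry fees among the top-scoring agents.

    All rules here are assumptions: every entrant pays the same fee, a fixed
    fraction of the pool is set aside (for example for stakers), and the
    remainder is divided evenly among the `winners` best entries.
    """
    pool = entry_fee * len(entries)
    staker_cut = pool * staker_share
    prize_pool = pool - staker_cut

    ranked = sorted(entries, key=lambda e: e.score, reverse=True)
    top = ranked[:winners]

    # Everyone starts at zero; only the best performers receive a payout.
    payouts = {e.agent: 0.0 for e in entries}
    for e in top:
        payouts[e.agent] = prize_pool / len(top)
    return payouts, staker_cut


entries = [
    Entry("agent_a", 0.91),
    Entry("agent_b", 0.78),
    Entry("agent_c", 0.85),
    Entry("agent_d", 0.60),
]
payouts, staker_cut = settle_round(entries, entry_fee=1.0, winners=2, staker_share=0.25)
```

In this toy round the two highest-scoring agents split the prize pool while the other two lose their entry fee, which is the pressure to improve that the competition format relies on.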

Builders can create these agents without any coding skills, simply by entering prompts, which makes the protocol accessible to anyone.

In addition, Fraction AI also relies on “stakers”, who stake ETH cryptocurrency tokens to secure the network. They too earn rewards for doing this, through a combination of a share of the competition fees, protocol fees and any revenue that comes from licensing its synthetic datasets.

The ingenious thing about Fraction AI is that it introduces a completely new approach to data labeling that should ensure it produces much better-quality synthetic information. Traditionally, data labeling has either been done by humans only, which is accurate but far too slow, or by AI models, which are much faster but less accurate.

Fraction AI allows humans to tell agents how they should be labeling data, so those agents can do it more accurately at much higher scales. It’s an approach that combines the advantages of both methods, and it provides value for all three ecosystem participants.

The builders, or creators of the AI agents, are rewarded for creating more effective agents, which ensures better quality data. Because only the best agents are rewarded, those whose agents lose are forced to improve their agents so they can start earning. Stakers get the opportunity to earn a regular yield on their investments while supporting the creation of more training data. Lastly, AI developers benefit from a continuous stream of new, high-quality synthetic data that can be used to train more capable AI models.

The need for humans in the loop

It’s a novel approach that shows real potential. Already, Fraction AI has demonstrated its ability to tweak a small multimodal LLM to enable it to perform on a par with OpenAI’s GPT-4, at a fraction of the cost of that larger model.

The protocol demonstrates the importance of ensuring that humans remain in the loop during the synthetic data creation process. Humans are one of the main reasons behind the early success of ChatGPT. While it was under development, OpenAI hired hundreds of workers to experiment with an early version of ChatGPT and provide feedback, which was then used to improve its performance. This ultimately had a transformative impact on the quality of the chatbot’s responses, sparking the mad scramble for AI that exists today.

As AI models become more pervasive and more sophisticated, the world is fast running out of reliable data. Synthetic data, created with humans in the loop, has emerged as one of the most viable solutions to this problem, and its importance to the AI industry will continue to grow.

Featured image credit: Maxim Berg/Unsplash