OpenAI reportedly used YouTube data in the development of GPT-4

To develop its advanced language model, GPT-4, OpenAI reportedly utilized a massive amount of YouTube video data.

The company is said to have transcribed over a million hours of video content.

This news arrives alongside a broader trend in the artificial intelligence (AI) industry, where tech giants are finding increasingly creative (and sometimes controversial) ways to gather the fuel their AI models crave – data.

Why YouTube whispers matter to AI

The New York Times recently shed light on this concerning development, just days after YouTube questioned whether videos on its platform had been used as a training data source for Sora.

So why turn to YouTube for training data? It’s simple, really. YouTube offers a practically limitless treasure trove of spoken language. Every vlog, unboxing video, and rambling tutorial includes human speech in all its diverse and messy glory. Since large language models like GPT-4 learn by ‘ingesting’ and analyzing huge quantities of text, transcribed audio from videos becomes invaluable fodder.

However, turning YouTube’s audio into usable training data raises complex questions. OpenAI’s speech recognition tool ‘Whisper’ played a crucial role in transcribing the vast amount of video material. This transcription process, though necessary, brings copyright and fair use considerations into focus.

Data, data everywhere… But is it okay for OpenAI to snare?

The quest for robust datasets to power AI is by no means unique to OpenAI. Tech giants across the board grapple with the same challenge. After all, AI models are notoriously data-hungry. The more diverse and high-quality the input data, the better equipped the models are to handle real-world complexity.

The pressure to find creative data sources is understandable. In OpenAI’s case, the company reportedly explored options like podcasts and audiobooks after facing a shortage of more conventional training materials in 2021. But this hunt for data has a potential downside – pushing the boundaries of what’s considered legally and ethically acceptable.

Image: OpenAI reportedly utilized over a million hours of YouTube video data to develop its advanced language model, GPT-4

The gray zone where AI data and copyright collide

YouTube has its own clear terms of service, which typically restrict how its content can be used. While ‘fair use’ provisions in copyright law do exist (with varying interpretations across countries), relying on them as justification for extensive data scraping can be a legal gamble.

The issue is far from straightforward. When tech companies use existing content to train their AI systems, questions arise:

Does this potentially limit the ability of the original content creators to profit from their work?
Are the creators sufficiently compensated if their material fuels the development of commercial AI tools?
Should there be clearer guidelines or regulations for large-scale training data collection?

AI’s big appetite raises even bigger questions

The OpenAI case highlights a broader trend – the insatiable need for data in the modern AI industry. As AI technologies get more sophisticated, ethical and legal concerns surrounding how training data is sourced will take center stage.

Whether it’s YouTube videos, code repositories, or other types of user-generated content, ensuring fair and responsible use of data will become crucial to maintaining public trust in this rapidly evolving technology.

Featured image credit: Zac Wolff/Unsplash

Is this open season on AI data?

Emre Çıtak

© 2021 TechBriefly is a Linkmedya brand.
