
Tech Companies Illegally Use YouTube Subtitles for AI Training, Including Top Creators and Institutions

Tech Companies Using YouTube Subtitles to Train AI Models: A Controversial Investigation

An investigation by Proof News, published in conjunction with Wired, found that several tech companies, including Anthropic, Nvidia, Apple, and Salesforce, used subtitles from more than 48,000 YouTube channels to train their AI models. The dataset includes content from top creators like MrBeast and Marques Brownlee, as well as prestigious institutions such as MIT and Harvard. The practice raises ethical concerns because YouTube prohibits harvesting content from its platform without permission.

The investigation found that the YouTube subtitles dataset used by these companies consists of 173,536 videos, including those from well-known sources like Khan Academy, The Wall Street Journal, NPR, the BBC, and popular late-night shows. The broader compilation that contains these subtitles also draws on other sources, such as the European Parliament, English Wikipedia, and emails from the Enron Corporation.

EleutherAI, a non-profit AI research lab, is among those responsible for scraping and disseminating the YouTube dataset. Wired reports that the lab did not respond to its requests for comment. The dataset is part of a larger compilation called The Pile, which is accessible to anyone on the internet with sufficient storage space and computing power. Tech giants like Apple, Nvidia, Salesforce, Bloomberg, and Databricks have publicly acknowledged using The Pile to train their AI models.

Marques Brownlee, a prominent tech creator, expressed his concerns about this issue on Instagram. He highlighted that Apple and other tech companies are training their AI models using data bought from third-party data scraping companies, some of which obtain data in questionable ways. Brownlee emphasized that Apple could technically claim they aren’t at fault for this practice.

Jennifer Martinez, a spokesperson for AI startup Anthropic, acknowledged the company’s use of The Pile but distinguished it from direct use of YouTube’s platform. Martinez stated that YouTube’s terms cover direct use of its platform, which is separate from use of The Pile dataset. She referred questions about potential violations of YouTube’s terms to the authors of The Pile.

Brownlee added that he personally pays for accurate manual transcriptions of his videos. In his view, the transcriptions used by tech companies are therefore paid content that has been stolen multiple times. The concern is shared by creators worldwide, who fear that their work will be exploited by AI without permission or compensation. Many creators have already filed lawsuits against tech companies over unauthorized use of their work.

Wired reports that The Pile is still accessible on file-sharing services, although it has been removed from its official download site. In response to this revelation, Proof News has developed a tool to search for creators in the YouTube AI training dataset, allowing individuals to check if their work has been included.

This investigation sheds light on the ethical implications of using YouTube subtitles to train AI models without permission. It raises questions about the responsibility of tech companies in ensuring they acquire data in a legal and ethical manner. Moreover, it highlights the need for stronger protections for creators and their intellectual property in the age of AI.
