A new investigation by Proof News and Wired has revealed that major technology companies, including Apple, Anthropic, Nvidia, and Salesforce, have been using a massive dataset of YouTube subtitles to train their AI systems.
The dataset, known as “YouTube Subtitles,” contains transcripts from over 170,000 videos across 48,000 channels, including content from popular creators like MrBeast and Marques Brownlee (MKBHD), as well as from major news outlets such as ABC News, BBC, and The New York Times. It contains only the subtitles extracted from these videos, not the video content itself.
This revelation has sparked significant controversy, as the data was reportedly collected without permission, in apparent violation of YouTube’s terms of service. Marques Brownlee, a well-known tech reviewer, highlighted the issue on social media, expressing concern about the unauthorized use of his and other creators’ content for AI training. He noted that while companies like Apple may not be directly at fault for scraping the data, they nonetheless benefit from the practice.
Apple has sourced data for their AI from several companies
One of them scraped tons of data/transcripts from YouTube videos, including mine
Apple technically avoids "fault" here because they're not the ones scraping
But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY
— Marques Brownlee (@MKBHD) July 16, 2024
The dataset in question is part of a larger collection called The Pile, created by the nonprofit EleutherAI. The Pile is an open-source dataset that includes various materials such as books, Wikipedia articles, and YouTube subtitles. Several tech giants have used this compilation to train their AI models. Apple, for instance, used The Pile to train its OpenELM model, which was announced shortly before the introduction of Apple Intelligence, a suite of AI-powered features set to launch with iOS 18.
The use of this dataset has raised ethical and legal questions. YouTube’s CEO, Neal Mohan, and Alphabet’s CEO, Sundar Pichai, have both stated that using YouTube content for AI training without permission violates the platform’s terms of service. Despite these assertions, companies like Apple and Nvidia have not publicly commented on their involvement with The Pile dataset.
Furthermore, this situation highlights a broader issue within the AI industry: the lack of transparency regarding the sources of training data. Companies often keep the details of their data sources under wraps, raising concerns about the potential misuse of content and the implications for content creators. This lack of transparency is not new. Earlier this year, OpenAI’s CTO, Mira Murati, avoided directly addressing whether YouTube videos were used to train the company’s AI tools, saying only that it used publicly available or licensed data.
The Proof News investigation also found that The Pile includes potentially problematic content, such as biases against certain genders and religious groups, as well as profanity. Despite these issues, companies like Salesforce have defended their use of the dataset, saying that it is publicly available and that they used it for academic and research purposes.
(via Wired)