More than 170,000 YouTube videos are part of a massive dataset that was used to train AI systems for some of the biggest technology companies, according to an investigation by Proof News, copublished with Wired. Apple, Anthropic, Nvidia, and Salesforce are among the tech firms that used the “YouTube Subtitles” data that was ripped from the video platform without permission. The training dataset is a collection of subtitles taken from YouTube videos belonging to more than 48,000 channels — it does not include imagery from the videos.
Videos from popular creators like MrBeast and Marques Brownlee appear in the dataset, as do clips from news outlets like ABC News, the BBC, and The New York Times. More than 100 videos from The Verge appear in the dataset, along with many other videos from Vox.
“Apple has sourced data for their AI from several companies,” Brownlee, known by his handle MKBHD, wrote in a post on X. “One of them scraped tons of data/transcripts from YouTube videos, including mine.” He added: “This is going to be an evolving problem for a long time.”
YouTube didn’t immediately respond to The Verge’s request for comment.
As part of its investigation, Proof News also released an interactive lookup tool. You can use its search feature to see if your content — or your favorite YouTuber’s — appears in the dataset.
The subtitles dataset is part of a larger collection of material from the nonprofit EleutherAI called The Pile, an open-source collection that also contains datasets of books, Wikipedia articles, and more. Last year, an analysis of one dataset called Books3 revealed which authors’ work had been used to train AI systems, and the dataset has been cited in lawsuits by authors against the companies that used it to train AI.
AI companies are rarely forthcoming about the data that goes into their AI systems, and how YouTube content specifically is being used has been a key question in recent months. In March, when OpenAI unveiled its powerful video generation tool, Sora, CTO Mira Murati repeatedly dodged questions about whether the system was trained on YouTube videos.
“I’m not going to go into the details of the data that was used, but it was publicly available or licensed data,” she told The Wall Street Journal at the time. When pressed by the Journal about YouTube content specifically, Murati said she “wasn’t sure about that.”
In previous interviews, YouTube CEO Neal Mohan has said that the use of video content to train AI — including transcripts — would violate the platform’s terms. And in May on an episode of Decoder, Google CEO Sundar Pichai agreed with Mohan’s assessment that if OpenAI had indeed trained Sora on YouTube content, it would have broken YouTube’s terms.
“We have terms and conditions, and we would expect people to abide by those terms and conditions when you build a product, so that’s how I felt about it,” Pichai said.