Apple’s recent publication of a technical paper has shed light on the models behind Apple Intelligence, the upcoming generative AI features for iOS, macOS, and iPadOS. The paper addresses concerns about Apple’s training practices, specifically refuting accusations that the company used private user data. According to Apple, Apple Intelligence was trained on a combination of licensed data, publicly available information, and data crawled by its web crawler, Applebot, and the company is emphatic that no private Apple user data was included in the training set.
One controversial data set mentioned in the paper is “The Pile,” which contains subtitles from YouTube videos. In July, it was reported that Apple had used this data set without the knowledge or consent of many YouTube creators. Apple clarified that the models trained on that data were never intended to power its AI features. Still, the incident underscores the importance of consent and transparency in sourcing training data.
The technical paper also details the Apple Foundation Models (AFM), first announced at WWDC 2024 in June, and highlights what Apple calls a responsible approach to sourcing their training data: publicly available web data, licensed data from undisclosed publishers, and open-source code from GitHub. Apple reportedly approached several publishers about multi-year deals to train models on their news archives. Training models on code without explicit permission remains a contentious issue among developers, though Apple says it made efforts to use only repositories with minimal usage restrictions.
To strengthen the AFMs’ math skills, Apple included math questions and answers from sources such as webpages, forums, blogs, tutorials, and seminars in the training set. It also incorporated human feedback and synthetic data to refine the models and mitigate undesirable behaviors such as toxicity. Apple emphasizes that its models are designed to help users with everyday activities while aligning with the company’s core values and responsible AI principles.
The paper itself offers few groundbreaking insights, and deliberately so: companies rarely disclose much about their training data, both for competitive reasons and because of potential legal exposure. The legality of training models on scraped public web data is currently the subject of debate and active lawsuits. Apple notes that it allows webmasters to block its crawler, but that leaves individual creators in a bind if their content is hosted on a site that declines to block Apple’s scraping.
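For webmasters who do want to opt out, the mechanism is the standard robots.txt file. Apple documents two user agents: Applebot, which crawls for features like Siri and Spotlight, and Applebot-Extended, which controls whether crawled content may be used to train Apple’s foundation models. A minimal sketch of a robots.txt that permits search crawling but opts out of model training might look like this (the rules shown are illustrative, not a recommendation):

```
# Allow Applebot to crawl the site for search features (Siri, Spotlight)
User-agent: Applebot
Allow: /

# Deny use of the site’s content for training Apple’s foundation models
User-agent: Applebot-Extended
Disallow: /
```

Because Applebot-Extended gates training use rather than crawling itself, a site can stay visible in Apple’s search features while keeping its content out of the training corpus. Individual creators publishing on platforms they don’t control, however, have no equivalent per-author switch, which is precisely the bind described above.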
Ultimately, the fate of generative AI models and their training methods will likely be decided in the courts. For now, Apple aims to position itself as an ethical player, prioritizing user privacy and responsible AI practices while navigating an evolving legal landscape.