Apple Expands Its Family of Small Models with Open DCLM Models on Hugging Face
Apple recently released a family of open DataComp for Language Models (DCLM) models on Hugging Face. Led by a team of researchers from Apple, the University of Washington, Tel Aviv University, and Toyota Research Institute, the DataComp project aims to design high-quality datasets for training AI models. The project uses a standardized framework, with fixed model architectures, training code, hyperparameters, and evaluations, to determine which data curation strategies produce the most performant models.
The DCLM release includes two main models: one with 7 billion parameters and one with 1.4 billion parameters. According to Vaishaal Shankar from the Apple ML team, these are the “best-performing” open-source models available. What sets them apart is that they are truly open source: the model weights, the training code, and the pretraining dataset are all publicly available.
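Because the pretraining corpus itself is published, it can be inspected directly from the Hugging Face Hub. The snippet below is a minimal sketch using the datasets library; the repository id mlfoundations/dclm-baseline-1.0-parquet and the text field name are assumptions based on the DCLM project’s public artifacts, so check the dataset card for the exact id and schema.

```python
from datasets import load_dataset

# Stream the DCLM baseline pretraining corpus instead of downloading it in full.
# NOTE: the repository id and the "text" field name are assumptions; verify them
# against the dataset card on the Hugging Face Hub before running.
dataset = load_dataset(
    "mlfoundations/dclm-baseline-1.0-parquet",
    split="train",
    streaming=True,
)

# Peek at a handful of documents from the curated corpus.
for i, example in enumerate(dataset):
    print(example["text"][:200])
    if i >= 4:
        break
```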
The 7 billion parameter model, trained on 2.5 trillion tokens using the OpenLM framework, has shown impressive results. It reaches 63.7% 5-shot accuracy on MMLU, a 6.6 percentage point improvement over MAP-Neo, the previous state of the art among open-data models, while using 40% less compute for training. It also performs comparably to leading open models such as Mistral-7B-v0.3, Llama 3 8B, Gemma, and Phi-3.
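Since the weights are hosted on the Hugging Face Hub, the 7B model can be tried with a few lines of code. The following is a minimal sketch assuming a repository id of apple/DCLM-7B and a standard causal-language-model interface through transformers; the model card may list a different id or additional dependencies (such as the OpenLM code), so treat these details as assumptions to verify there.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: the repository id below is an assumption; confirm it (and any extra
# dependencies, e.g. the OpenLM package) on the model card before running.
model_id = "apple/DCLM-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Run a short generation to sanity-check the checkpoint.
inputs = tokenizer("Machine learning is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```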
Moreover, the model’s performance improved further when its context length was extended to 8K through additional training on the same dataset using the Dataset Decomposition technique. These results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
Apple also released a smaller version of the model with 1.4 billion parameters. This model, trained jointly with Toyota Research Institute, delivers strong results on MMLU and on the DCLM Core and Extended evaluation suites. On the 5-shot MMLU test it scores 41.9%, outperforming other models in its category, including Hugging Face’s SmolLM.
The larger model is available under Apple’s Sample Code License, while the smaller one has been released under Apache 2.0, which allows commercial use, distribution, and modification. It’s important to note that these are early research releases and may exhibit biases present in their training data or produce harmful responses.
Apple’s release of these open DCLM models showcases its commitment to advancing AI technology and promoting collaboration within the industry. By providing access to the models’ weights, training code, and dataset, Apple is fostering innovation and encouraging further research in the field of language models.