
The Struggle to Define “Open Source AI” and the Role of the Open Source Initiative

The debate over open source versus proprietary software has now spilled over into the realm of artificial intelligence (AI). The New York Times recently published an article praising Meta CEO Mark Zuckerberg for embracing “open source AI,” but the reality is that Meta’s Llama-branded large language models are not truly open source. This raises questions about what exactly constitutes “open source AI” and how it should be defined.

The Open Source Initiative (OSI) has been working to address this issue for the past two years. Led by executive director Stefano Maffulli, the OSI has organized conferences, workshops, and panels, hosted webinars, and published reports in an effort to develop a definition for open source AI. However, applying traditional software licensing and naming conventions to AI is challenging.

Open source evangelist Joseph Jacks argues that there is no such thing as open source AI, because open source was created for software source code. Neural network weights, the numerical parameters a model learns during training, are not comparable to source code: they are not human-readable and cannot be debugged or audited the way software can. The principles of open source software do not apply to neural network weights in the same way.
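
To make the contrast concrete, here is a minimal sketch (toy sizes, made-up layer names) of what a set of released weights actually is: named arrays of floating-point numbers, with nothing resembling readable logic to inspect or patch.

```python
# Minimal sketch: a "weights release" is just named arrays of floats.
# Layer names and sizes are invented here; real models hold billions of values.
import numpy as np

rng = np.random.default_rng(0)
weights = {
    "layer_0.attention": rng.normal(size=(4, 4)),
    "layer_0.mlp": rng.normal(size=(4, 16)),
}

for name, matrix in weights.items():
    print(name, matrix.shape)
    print(matrix[:2, :2])  # opaque numbers; there is no source logic to read or debug
```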

To address this issue, Jacks and Heather Meeker of OSS Capital have proposed their own definition of “open weights” for AI. The challenge lies in agreeing on a definition when there is disagreement about whether open source AI even exists.

Maffulli acknowledges this challenge and admits that there was initial debate about whether to use the term “open source AI” at all. However, since the term was already in use, they decided to work towards a definition. This mirrors the broader debates within the AI field about what exactly constitutes AI.

The OSI, founded in 1998, has been a steward of the Open Source Definition (OSD) for over 25 years. The organization relies on sponsorships from companies like Amazon, Google, Microsoft, Cisco, Intel, Salesforce, and Meta. Meta’s involvement with the OSI is significant because the company claims to embrace open source AI while imposing restrictions on the use of its Llama models.

Meta’s language around its Llama models is ambiguous. The company initially referred to Llama 2 as open source, but it has since leaned on phrases like “openly available” and “openly accessible” to describe Llama 3, while still calling the models open source in some instances. There is also a potential conflict of interest, as Meta helps fund the steward of the Open Source Definition.

To diversify its funding and reduce reliance on corporate support, the OSI recently secured a grant from the Sloan Foundation. This grant of around $250,000 will help fund the organization’s efforts to develop an Open Source AI Definition. Maffulli believes this will improve the perception of the OSI’s independence from corporate interests.

The current draft of the Open Source AI Definition includes three parts: a preamble, the definition itself, and a checklist for open source-compliant AI systems. One of the challenges in defining open source AI is determining whether companies should make their training datasets available to others. Maffulli argues that it is more important to know how the data was labeled and filtered and to have access to the code used to assemble the dataset.
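
As a rough illustration of the kind of dataset-assembly code Maffulli is referring to, the sketch below filters and labels a raw corpus. The file names, fields, and filtering rules are hypothetical, but releasing a script like this documents how the data was selected and labeled even when the data itself cannot be shared.

```python
# Hypothetical sketch of dataset-assembly code: the filtering and labeling
# rules are illustrative, not any organization's actual pipeline.
import json
import re

def keep(example: dict) -> bool:
    """Filtering rules applied to every raw record."""
    text = example.get("text", "")
    if len(text) < 200:                              # drop very short documents
        return False
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", text):    # drop records with SSN-like strings
        return False
    return example.get("language") == "en"           # keep English only

def assemble(raw_path: str, out_path: str) -> None:
    """Read raw JSON-lines records, filter them, and write the labeled training set."""
    with open(raw_path) as src, open(out_path, "w") as dst:
        for line in src:
            example = json.loads(line)
            if keep(example):
                example["source_label"] = "web"      # provenance label added during assembly
                dst.write(json.dumps(example) + "\n")

# assemble("raw_corpus.jsonl", "training_set.jsonl")  # paths are illustrative
```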

While having access to the full dataset is desirable, it is not always practical or even possible due to confidentiality or copyright constraints. Additionally, techniques such as federated learning and differential privacy allow models to be trained without the underlying data being pooled in one place or individual records being exposed.
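
As one example of how that can work, the sketch below simulates federated averaging on a toy linear model with synthetic per-client data: each client trains locally and only model weights are sent to the server, never the raw records. The model, data, and hyperparameters are all assumptions made for illustration.

```python
# Minimal sketch of federated averaging (FedAvg) on a toy linear model.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, steps=10):
    """Run a few steps of gradient descent on one client's private data."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

# Each client holds its own data; only model weights ever leave the client.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

global_w = np.zeros(2)
for _ in range(20):
    # Server broadcasts global weights; clients train locally and return new weights.
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(local_ws, axis=0)        # server averages the returned weights

print("learned weights:", global_w)             # approaches [2.0, -1.0] without pooling data
```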

The fundamental difference between open source software and open source AI lies in replicability. Source code and binary code are two representations of the same program: compiling the source reproduces the binary. A training dataset and a trained model, by contrast, are distinct artifacts, and retraining on the same data does not guarantee the same model, because the training process involves statistical and random elements such as weight initialization and data ordering.
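
A minimal sketch of that point, assuming a tiny two-layer network and synthetic data: training twice on identical data with different random seeds produces different weights, unlike recompiling the same source code, which reproduces the same program.

```python
# Minimal sketch: same data, different random seeds, different trained weights.
import numpy as np

def train_tiny_net(seed, X, y, epochs=200, lr=0.1):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(2, 8))   # random initialization differs per seed
    W2 = rng.normal(scale=0.5, size=(8, 1))
    for _ in range(epochs):
        h = np.tanh(X @ W1)                   # forward pass
        pred = h @ W2
        err = pred - y                        # error for mean squared loss
        grad_W2 = h.T @ err / len(y)
        grad_W1 = X.T @ ((err @ W2.T) * (1 - h ** 2)) / len(y)
        W1 -= lr * grad_W1                    # gradient descent step
        W2 -= lr * grad_W2
    return W1, W2

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, :1] * X[:, 1:] > 0).astype(float)   # a simple nonlinear target

W1_a, _ = train_tiny_net(0, X, y)
W1_b, _ = train_tiny_net(1, X, y)
print(np.allclose(W1_a, W1_b))                # False: identical data, different model
```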

The checklist aspect of the Open Source AI Definition is based on a classification system called the Model Openness Framework (MOF). The MOF rates machine learning models based on their completeness and openness, requiring the release of specific components under open licenses.
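
To illustrate what checklist-based classification means in practice, a release could be evaluated roughly as follows. This is a hypothetical sketch, not the actual MOF specification; the component names and license lists are invented for the example.

```python
# Hypothetical checklist-style openness check (illustrative only, not the MOF spec).
OPEN_LICENSES = {"Apache-2.0", "MIT", "CC-BY-4.0", "CDLA-Permissive-2.0"}

REQUIRED_COMPONENTS = [
    "model weights",
    "training code",
    "data preprocessing code",
    "evaluation code",
]

def is_open_release(release: dict) -> bool:
    """Pass only if every required component is present and under an open license."""
    return all(release.get(component) in OPEN_LICENSES for component in REQUIRED_COMPONENTS)

release = {
    "model weights": "custom-restricted-license",   # restrictive terms fail the check
    "training code": "Apache-2.0",
    "data preprocessing code": "MIT",
    "evaluation code": "MIT",
}
print(is_open_release(release))   # False
```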

The OSI plans to launch the stable version of the Open Source AI Definition at the All Things Open conference in October. They will then embark on a global roadshow to gather input on defining open source AI. While some minor tweaks may be made, Maffulli believes they have reached a feature-complete version of the definition.

In conclusion, the debate over open source AI is complex and still ongoing. The OSI is working to develop a definition that addresses the unique challenges posed by AI. Their efforts to diversify funding and involve multiple stakeholders demonstrate their commitment to independence and transparency. As technology evolves, so too will the definition of open source AI.