The Impact of Code Data on Large Language Models: Improving Performance on Non-Code Tasks


Large language models (LLMs) are powerful AI models that are typically pre-trained on massive datasets containing both text and code. While code is crucial for training models designed for programming tasks, recent research has shown that including code in the pre-training data of models intended for non-code tasks can significantly improve their performance.

Researchers at Cohere have conducted a systematic investigation into the impact of code data in LLM pre-training on general performance beyond coding tasks. Their findings demonstrate the crucial role that code plays in enhancing the performance of LLMs across a wide range of tasks.

Understanding the Impact of Code

To understand the impact of code on general LLM performance, the researchers conducted a series of experiments. They considered factors such as the amount of code in the training data, where code is added during the training process, the quality of the code, and the scale of the models.

The researchers used a two-phase training process. In the first phase, they performed “continued pre-training,” where pre-trained models were further trained on new datasets with varying ratios of text and code. Then, in the “cooldown” phase, higher weights were given to higher-quality datasets during the final stages of training.
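
To make that setup concrete, here is a minimal sketch of how such a two-phase data mixture could be expressed in code. The dataset names, mixture ratios, and sampling helper are illustrative assumptions for this article, not the configuration Cohere actually used.

```python
# Minimal sketch of the two-phase setup described above. The mixture
# ratios, dataset names, and sampling helper are illustrative assumptions,
# not the exact configuration used in the Cohere study.
import random

# Phase 1: continued pre-training mixtures with varying text/code ratios.
PRETRAIN_MIXTURES = {
    "text_only": {"text": 1.0, "code": 0.0},
    "balanced":  {"text": 0.5, "code": 0.5},
    "code_only": {"text": 0.0, "code": 1.0},
}

# Phase 2: a "cooldown" mixture that upweights higher-quality sources
# during the final stage of training.
COOLDOWN_WEIGHTS = {
    "high_quality_text": 0.4,
    "high_quality_code": 0.4,
    "code_adjacent_data": 0.2,  # e.g. pull requests and commits
}

def sample_source(weights):
    """Pick a data source in proportion to its mixture weight."""
    sources = list(weights.keys())
    probs = list(weights.values())
    return random.choices(sources, weights=probs, k=1)[0]

# Example: decide which source each document in a small batch comes from,
# first under the balanced pre-training mixture, then under the cooldown mix.
pretrain_batch = [sample_source(PRETRAIN_MIXTURES["balanced"]) for _ in range(8)]
cooldown_batch = [sample_source(COOLDOWN_WEIGHTS) for _ in range(8)]
print(pretrain_batch)
print(cooldown_batch)
```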

The researchers evaluated the performance of the models at different scales, from 470 million to 2.8 billion parameters, using various benchmarks that measured the models’ abilities in world knowledge, natural language reasoning, and code performance.
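
The sketch below shows the general shape of such an evaluation sweep across model scales and benchmark categories; the benchmark groupings, model names, and placeholder scoring function are assumptions for illustration rather than the study's actual harness.

```python
# Illustrative evaluation sweep over the two model scales mentioned above.
# Benchmark groupings and the dummy scoring stub are assumptions for
# demonstration; they do not reproduce the study's actual harness.
MODEL_SCALES = ["470M", "2.8B"]
BENCHMARK_SUITES = {
    "world_knowledge": ["knowledge_qa"],
    "natural_language_reasoning": ["reasoning_qa"],
    "code": ["code_generation"],
}

def evaluate(model_name: str, task: str) -> float:
    """Stand-in for a real benchmark run; returns a placeholder score."""
    return 0.0  # replace with a call to an actual evaluation harness

results = {
    scale: {
        suite: [evaluate(f"model-{scale}", task) for task in tasks]
        for suite, tasks in BENCHMARK_SUITES.items()
    }
    for scale in MODEL_SCALES
}
print(results)
```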

The Benefits of Code for Non-Code Tasks

The experiments revealed that code consistently improved the performance of LLMs on non-code-related tasks. On natural language reasoning tasks, models trained on code consistently outperformed text-only models. Surprisingly, pre-training the model with 100% code data resulted in the best performance on these benchmarks.

For world knowledge tasks, a balanced mixture of code and text in the pre-training data led to the best performance. This suggests that performance on world knowledge tasks depends on a more balanced data mixture for initialization and a larger proportion of text during the continued pre-training stage.

On generative tasks, both code-only and balanced models outperformed text-only models, indicating that code data in the pre-training mix improves not only reasoning but also the quality of generated output.

The researchers also observed that the performance gains from adding code to pre-training data increased with model size. The improvements were most significant in world knowledge and code performance, with modest gains in natural language reasoning.

Limitations and Future Directions

The researchers acknowledge that their study focused on models with a parameter range of 470 million to 2.8 billion due to cost limitations. However, they believe that their findings should hold true for larger model sizes and token budgets.

Additionally, the researchers found that adding high-quality synthetic code to the pre-training data significantly boosted performance. This is particularly useful as it doesn’t rely on limited quantities of human-generated code.

Moreover, incorporating code-adjacent data, such as GitHub pull requests and commits, can further enhance the models’ abilities on reasoning tasks.

Implications for Enterprises

Incorporating code into the cooldown phase of training resulted in additional improvements in the LLM’s performance on non-code tasks. This finding is particularly relevant to enterprises, which often fine-tune models with their data rather than training from scratch.

The researchers recommend including code in the training mix, especially high-quality code from internal code bases and code-adjacent data, to achieve better performance during the cooldown phase.
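
As a rough illustration of that recommendation, the following sketch assembles a hypothetical cooldown mixture from enterprise sources; the source names, weights, and crude quality heuristic are assumptions, not values from the paper.

```python
# Hypothetical cooldown mixture built from enterprise-internal sources.
# Source names, the simple quality heuristic, and the weights below are
# illustrative assumptions, not values from the Cohere study.

def passes_quality_filter(code_snippet: str) -> bool:
    """Crude stand-in for a real code-quality filter: keep snippets that
    are non-trivial and appear to parse as Python."""
    if len(code_snippet.strip().splitlines()) < 3:
        return False
    try:
        compile(code_snippet, "<snippet>", "exec")
    except SyntaxError:
        return False
    return True

# Candidate sources an enterprise might blend into the cooldown stage.
cooldown_mix = {
    "internal_codebase": {"weight": 0.3, "filter": passes_quality_filter},
    "code_adjacent":     {"weight": 0.2, "filter": None},  # PRs, commits, reviews
    "high_quality_text": {"weight": 0.5, "filter": None},
}

# Weights should sum to 1.0 so each source is sampled in proportion.
assert abs(sum(s["weight"] for s in cooldown_mix.values()) - 1.0) < 1e-9
```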

Looking Ahead

As Cohere focuses on providing LLMs for enterprise applications, these findings are likely to influence its future model and product rollouts. The company may offer a wider range of models pre-trained on different mixtures of code and text, each tailored to specific types of tasks. Enterprises can then fine-tune these models with proprietary data to optimize performance for their specific applications.

The researchers believe that their findings will drive the release of more performant models and highlight the unexpected impact of code on performance gains outside of coding tasks. These insights are already shaping how state-of-the-art models are trained and deployed.