Scaling Large Language Models with Parameter Efficient Expert Retrieval: Improving Performance and Efficiency with Millions of Experts

Scaling large language models (LLMs) has become increasingly important in recent years, as it allows for improved performance and new capabilities. However, there are limits to how far a model can be scaled before running into computational and memory bottlenecks. Mixture-of-Experts (MoE) architectures have emerged as a solution to this challenge, as they route data to specialized “expert” modules instead of using the entire model capacity for every input. This allows LLMs to increase their parameter count while keeping inference costs low.

Despite the benefits of MoE, current techniques are limited to a relatively small number of experts. To address this, Google DeepMind has introduced Parameter Efficient Expert Retrieval (PEER), a novel architecture that can scale MoE models to millions of experts. PEER improves the performance-compute tradeoff of large language models, further enhancing their capabilities.

One of the challenges in scaling LLMs is the computational and memory requirements of the feedforward (FFW) layers in transformer blocks. These layers account for two-thirds of the model’s parameters and are a bottleneck when scaling transformers. MoE replaces the FFW with sparsely activated expert modules, each containing a fraction of the parameters of the full dense layer. By increasing the number of experts, MoE can increase the capacity of the LLM without increasing computational costs.
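To make the idea concrete, here is a minimal PyTorch sketch of a sparsely activated MoE layer standing in for a dense FFW. The router, expert sizes, and top-k value are illustrative assumptions for this sketch, not DeepMind's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFW(nn.Module):
    """Toy MoE layer: a dense FFW replaced by sparsely activated expert MLPs."""
    def __init__(self, d_model=512, d_expert=128, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)   # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.ReLU(), nn.Linear(d_expert, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)             # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                   # only the top_k experts run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(8, 512)                             # 8 tokens, d_model = 512
print(SparseMoEFFW()(tokens).shape)                      # torch.Size([8, 512])
```

Each token only passes through its top-k experts, so the compute per token stays close to that of a much smaller dense layer even as the total parameter count grows with the number of experts.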

Recent studies have shown that the optimal number of experts for an MoE model depends on factors such as the number of training tokens and the compute budget. MoEs have consistently outperformed dense models when these variables are balanced. Increasing the “granularity” of an MoE model, which refers to the number of experts, can lead to performance gains, especially when accompanied by an increase in model size and training data. High-granularity MoEs also enable models to learn new knowledge more efficiently and adapt to continuously changing data streams.

Current approaches to MoE have limitations, such as fixed routers that need to be readjusted when new experts are added. PEER addresses these limitations by replacing the fixed router with a learned index. This allows input data to be efficiently routed to a vast pool of experts without slowing down the system. PEER uses tiny experts with a single neuron in the hidden layer, enabling the model to share hidden neurons among experts and improve knowledge transfer and parameter efficiency.
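The sketch below illustrates those two ingredients, single-neuron experts and a learned retrieval index, using a simplified product-key lookup: two small sets of sub-keys are scored and combined so the layer never has to score every expert individually. It omits refinements of the actual PEER design such as multi-head retrieval, and all names and dimensions here are assumptions chosen for readability.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERSketch(nn.Module):
    """Each expert i is a single hidden neuron: a down vector u_i and an up vector v_i."""
    def __init__(self, d_model=256, num_experts=256 * 256, top_k=16, d_key=128):
        super().__init__()
        self.n_side = int(math.isqrt(num_experts))       # product keys: N = n_side * n_side
        self.top_k = top_k
        self.query = nn.Linear(d_model, 2 * d_key)       # learned index: query net + sub-keys
        self.sub_keys = nn.Parameter(torch.randn(2, self.n_side, d_key) * 0.02)
        self.u = nn.Embedding(num_experts, d_model)      # per-expert down-projection vectors
        self.v = nn.Embedding(num_experts, d_model)      # per-expert up-projection vectors

    def forward(self, x):                                # x: (tokens, d_model)
        q1, q2 = self.query(x).chunk(2, dim=-1)
        s1 = q1 @ self.sub_keys[0].T                     # (tokens, n_side)
        s2 = q2 @ self.sub_keys[1].T                     # (tokens, n_side)
        # take top candidates in each half, then combine: avoids scoring all N experts
        v1, i1 = s1.topk(self.top_k, dim=-1)
        v2, i2 = s2.topk(self.top_k, dim=-1)
        cand_scores = (v1[:, :, None] + v2[:, None, :]).flatten(1)          # (tokens, top_k^2)
        cand_ids = (i1[:, :, None] * self.n_side + i2[:, None, :]).flatten(1)
        scores, pos = cand_scores.topk(self.top_k, dim=-1)
        expert_ids = cand_ids.gather(1, pos)             # (tokens, top_k) retrieved experts
        g = F.softmax(scores, dim=-1)                    # router weights
        h = torch.einsum("td,tkd->tk", x, self.u(expert_ids))  # each expert's single neuron
        h = F.gelu(h) * g
        return torch.einsum("tk,tkd->td", h, self.v(expert_ids))

x = torch.randn(4, 256)
print(PEERSketch()(x).shape)                             # torch.Size([4, 256])
```

Because each expert contributes only one hidden neuron (a down-projection vector and an up-projection vector), the retrieved experts together behave like a small feedforward layer whose hidden units are assembled on the fly for every token.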

PEER can be added to an existing transformer model or used to replace an FFW layer. It is also related to parameter-efficient fine-tuning (PEFT) techniques, which adapt a model to a new task by modifying only a small subset of its parameters. PEER reduces the number of active parameters in the MoE layer, which lowers computation and activation memory consumption during pre-training and inference. Additionally, PEER has the potential to dynamically add new knowledge and features to LLMs at runtime.
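As a rough, hypothetical back-of-the-envelope comparison (using the toy dimensions from the sketches above, not figures reported by DeepMind), the gap between stored and active parameters looks like this:

```python
# Hypothetical parameter accounting with the sketch dimensions above;
# none of these numbers come from the PEER paper.
d_model, d_ff = 256, 1024
dense_active = 2 * d_model * d_ff                 # both dense FFW projections touch every token

num_experts, top_k, d_key, n_side = 256 * 256, 16, 128, 256
peer_total = 2 * num_experts * d_model            # all u_i and v_i vectors (stored, mostly idle)
peer_active = (
    2 * top_k * d_model                           # the top_k retrieved u_i / v_i vectors
    + d_model * 2 * d_key                         # query network
    + 2 * n_side * d_key                          # sub-key tables scanned during retrieval
)
print(f"dense FFW, active per token: {dense_active:,}")   # 524,288
print(f"PEER, total stored:          {peer_total:,}")     # 33,554,432
print(f"PEER, active per token:      {peer_active:,}")    # 139,264
```

The stored parameter count grows with the number of experts, while the per-token cost grows only with the number of retrieved experts, which is the tradeoff the architecture is built around.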

PEER was evaluated on language modeling benchmarks against baselines including transformer models with dense feedforward layers and other MoE architectures. The experiments showed that PEER models achieve a better performance-compute tradeoff, reaching lower perplexity with the same computational budget as their counterparts. Increasing the number of experts in a PEER model reduced perplexity further.

These findings challenge the belief that MoE models reach peak efficiency with a limited number of experts. PEER demonstrates that by applying the right retrieval and routing mechanisms, MoE can be scaled to millions of experts, reducing the cost and complexity of training and serving very large language models. This advancement in MoE architecture has implications for the future of AI and language understanding, opening up new possibilities for more powerful and efficient models.