Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models

Nov 25, 2024 | By Bud Ecosystem

As LLMs continue to grow, boasting billions to trillions of parameters, they offer unprecedented capabilities in natural language understanding and generation. However, their immense size also introduces major challenges related to memory usage, processing power, and energy consumption. To tackle these issues, researchers have turned to strategies like the Sparse Mixture-of-Experts (SMoE) architecture, which has emerged as a promising solution for more efficient model deployment.

The research paper Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models introduces a new method to optimise inference in large language models (LLMs). Building on the SMoE architecture, the paper tackles its remaining limitations: large total parameter counts and the substantial GPU memory they require. The authors present Efficient Expert Pruning (EEP), a novel approach that improves inference efficiency while delivering notable gains in accuracy. For instance, EEP raises accuracy on the SQuAD dataset from 53.4% to 75.4%, showcasing its effectiveness in enhancing model performance.

What is Sparse Mixture-of-Experts?

Research has shown that only a subset of a model’s parameters are essential for its performance on specific tasks. To streamline model inference, unnecessary parts of a neural network can be removed through a technique called pruning.

Sparse Mixture-of-Experts (SMoE) is a specialised architecture used in LLMs to boost efficiency during inference. Unlike dense LLMs that use all of their parameters for every input, an SMoE model activates only a small, targeted subset of parameters, a handful of "experts" chosen by a router, for each individual token. This selective activation speeds up processing while maintaining high performance, making the model more efficient overall.
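To make this concrete, here is a minimal sketch of what top-k routing in an SMoE layer can look like. The layer sizes, expert count, and module names are illustrative assumptions, not the configuration of any specific model.

```python
import torch
import torch.nn as nn

# Minimal SMoE layer sketch: a router scores all experts for each token,
# but only the top-k experts (here k=2) are actually run per token.
class SparseMoELayer(nn.Module):
    def __init__(self, hidden_dim=512, ffn_dim=2048, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(),
                          nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, hidden_dim)
        logits = self.router(x)                  # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e     # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out
```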

By engaging only a portion of the model’s parameters for each token, SMoE models greatly reduce compute demands. For instance, although the Mixtral 8×7B model contains about 47 billion parameters in total, only about 13 billion of them are active for any given token, yet it matches or outperforms the Llama 2 70B model and GPT-3.5 on many benchmarks. This targeted activation makes the model faster and more resource-efficient, enabling it to perform complex tasks without the computational cost typically associated with models of that size.
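A quick back-of-the-envelope calculation shows how the total and active parameter counts relate. The split between shared parameters (attention, embeddings) and per-expert parameters below is an assumption chosen only to roughly reproduce the published figures, not an official breakdown of Mixtral.

```python
# Back-of-the-envelope arithmetic for total vs. active parameters in an
# 8-expert model with 2 experts active per token. The shared/per-expert
# split is assumed for illustration.
shared_params = 1.7e9          # attention, embeddings, norms (assumed)
per_expert_params = 5.66e9     # one expert's FFN weights across all layers (assumed)
num_experts, active_per_token = 8, 2

total = shared_params + num_experts * per_expert_params
active = shared_params + active_per_token * per_expert_params
print(f"total ≈ {total / 1e9:.0f}B, active per token ≈ {active / 1e9:.0f}B")
# total ≈ 47B, active per token ≈ 13B
```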

Even though SMoE models use less compute per token, every parameter must still be stored, so they require a large amount of GPU memory. This makes them challenging to deploy on hardware with limited memory.

Additionally, high memory usage limits batch size: the memory available on a device caps how many requests can be processed at once. It is therefore important to find ways to make SMoE models smaller and more manageable without losing their effectiveness.

One effective approach is expert pruning for SMoE models, a form of structured pruning that removes entire experts to improve overall efficiency. Recent expert-pruning methods have achieved 25%-50% reductions in parameter count along with faster processing. However, pruning can cause a drop in performance or require additional fine-tuning, which demands a lot of GPU memory and compute. It is therefore crucial to develop pruning methods that effectively reduce the size of SMoE models while staying within the limits of available resources.

Efficient Expert Pruning

To tackle the aforementioned challenges, the authors developed a new technique called Efficient Expert Pruning (EEP). EEP is a gradient-free evolutionary strategy designed to optimise the pruning process for SMoE models. EEP involves two main steps: expert pruning and expert merging.

  1. Expert Pruning: This step removes less important "experts" to reduce the model's size and memory usage. EEP does this by searching for and keeping only the most important experts, without changing any of the model's other parameters.
  2. Expert Merging: After pruning, the knowledge from the removed experts is consolidated into the remaining ones, helping the model retain its performance even after shrinking. A toy sketch of both steps follows below.
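To illustrate the two steps, here is a toy sketch in which each expert is represented by a single weight matrix. The kept-expert selection and the merging coefficients are placeholders for the values that EEP's evolutionary search would actually find.

```python
import torch

# Toy illustration of EEP's two steps on a set of expert weight matrices.
num_experts, num_kept = 8, 2
experts = [torch.randn(16, 16) for _ in range(num_experts)]  # toy expert weights

# Step 1 - expert pruning: keep only the most useful experts.
# (Here simply the first two, standing in for a searched selection.)
kept_ids = [0, 1]

# Step 2 - expert merging: fold knowledge from all original experts into the
# kept ones via a (num_kept x num_experts) coefficient matrix.
merge_coeff = torch.full((num_kept, num_experts), 1.0 / num_experts)  # placeholder
merged = [sum(merge_coeff[j, i] * experts[i] for i in range(num_experts))
          for j in range(num_kept)]

# The pruned layer now stores only `merged` (2 experts instead of 8), and the
# router is restricted to route tokens among the kept experts.
```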

The team found that reducing the number of experts can actually improve performance; in their tests, pruning from 8 experts down to 2 led to a significant boost on some tasks. Beyond that, EEP offers several benefits:

  1. Improved Efficiency: Reducing the number of active experts (those actually used to process each token) makes the model more efficient. The authors observed that using fewer active experts speeds up inference, making the model faster.
  2. Better Generalisation: The new method works well across various types of data and tasks, including those that the model hasn’t seen before. It consistently performs better than other methods in their tests.
  3. Gradient-Free Approach: Unlike traditional methods that require lots of computation and memory to fine-tune the model, EEP doesn’t need these complex calculations. This makes it more practical and affordable to use.

The EEP method introduces an innovative approach for transferring knowledge from the original model to the pruned one by merging weights. This technique effectively reduces the size of large language models while maintaining their performance. It stands out for its efficiency, requiring less computational power and fewer resources, and is versatile enough to be applied across a wide range of tasks.

Evolutionary Strategies for Enhancing the Performance of SMoE Architectures

Evolutionary Strategies are a family of optimisation techniques inspired by natural evolution. They do not rely on gradient information (the signal that tells standard training how to adjust each parameter); instead, they evolve solutions over time through processes similar to biological evolution.

In Evolutionary Strategies, the process begins by generating a diverse set of possible solutions to the problem at hand. Each of these solutions is then evaluated to determine how well it performs. The best-performing solutions are selected to create new ones. This is done by mixing features from the top solutions, akin to combining traits in offspring, and introducing small random changes, or mutations, to explore new possibilities. This cycle of generating, evaluating, combining, and mutating solutions is repeated over several rounds, or generations, allowing the solutions to gradually improve and evolve.
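The loop below is a minimal, generic sketch of that cycle. The fitness function is a toy stand-in, and the population size, elite count, and mutation scale are arbitrary illustrative choices.

```python
import numpy as np

# Generic evolutionary-strategy loop: score a population of candidate
# solutions, keep the best, then recombine and mutate them to form the
# next generation.
def fitness(candidate):
    return -np.sum((candidate - 0.5) ** 2)        # toy objective: prefer values near 0.5

rng = np.random.default_rng(0)
population = rng.random((16, 8))                  # 16 candidates, 8 parameters each

for generation in range(50):
    scores = np.array([fitness(c) for c in population])
    elite = population[np.argsort(scores)[-4:]]   # select the 4 best candidates
    children = []
    for _ in range(len(population)):
        parents = elite[rng.choice(4, size=2, replace=False)]
        child = parents.mean(axis=0)                          # recombine two parents
        child += rng.normal(scale=0.05, size=child.shape)     # mutate
        children.append(child)
    population = np.stack(children)

best = population[np.argmax([fitness(c) for c in population])]
```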

The research team used an evolutionary strategy because it is well-suited to the large, complex search space of the router mapping and expert merging matrices in SMoE models, and it avoids expensive gradient computations entirely. This makes the optimisation practical and efficient for their setting.
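As a rough illustration of how such a search could be wired up, the sketch below treats a candidate as a merging-coefficient matrix and scores it with a gradient-free fitness. The fitness target here (matching the full layer's behaviour on a small calibration batch) is an assumed stand-in, not the task metrics used in the paper, and a full EEP candidate would also include the router mapping, which is omitted for brevity.

```python
import torch

# Toy EEP-style candidate and its gradient-free fitness. Each expert is a
# single weight matrix; `merge_coeff` is the (num_kept x num_experts) matrix
# the evolutionary search would optimise. Sizes and the fitness target are
# illustrative assumptions.
torch.manual_seed(0)
num_experts, num_kept, dim = 8, 2, 16
experts = [torch.randn(dim, dim) for _ in range(num_experts)]
calib = torch.randn(32, dim)                                   # calibration tokens
full_out = sum(calib @ W for W in experts) / num_experts       # reference behaviour

def fitness(merge_coeff):
    # Fold all original experts into `num_kept` merged experts...
    merged = [sum(merge_coeff[j, i] * experts[i] for i in range(num_experts))
              for j in range(num_kept)]
    # ...and reward candidates whose pruned layer stays close to the reference.
    pruned_out = sum(calib @ W for W in merged) / num_kept
    return -torch.mean((pruned_out - full_out) ** 2).item()    # higher is better

candidate = torch.rand(num_kept, num_experts)                  # one population member
score = fitness(candidate)  # plug this into an evolutionary loop like the one above
```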

Key Results and Observations

The key results and observations from the experiments are:

  1. The application of EEP while maintaining only 4 experts resulted in improved performance and cost-effectiveness for most tasks compared to using all available experts.
  2. When the model was constrained to just 2 experts, EEP frequently matched or exceeded the performance of the full model across five tasks.
  3. EEP is more effective than other methods for deciding which experts to keep. It maintains high accuracy and avoids problems that occur when too many experts are removed.
  4. After pruning experts, combining or merging the remaining experts results in substantial performance boosts. For example, improvements of 5%-7% were seen in certain tasks (like WIC, CB, and SQuAD), helping to retain important knowledge from the removed experts.
  5. EEP works well not just on smaller models but also on larger ones, including advanced versions of models like Mixtral and Qwen. This suggests that EEP is a versatile and effective method for optimising models of different sizes.

Efficient Expert Pruning (EEP) represents a major advancement in the field of AI model optimization. By addressing the dual challenges of memory usage and inference speed, EEP makes Sparse Mixture-of-Experts models more practical for deployment in real-world scenarios. The ability to prune experts effectively and achieve better or equivalent performance on downstream tasks marks a significant step forward in AI research and application.

Tags:

Efficient Expert Pruning

Evolutionary Strategies

SMoE models

Sparse Mixture-of-Experts

Bud Ecosystem

Our vision is to simplify intelligence—starting with understanding and defining what intelligence is, and extending to simplifying complex models and their underlying infrastructure.
