#2 Mergekit: Unlocking the Power of Large Language Model Fusion
In today's rapidly advancing field of artificial intelligence, the capabilities of Large Language Models (LLMs) are ever-increasing, demonstrating astonishing potential across various domains of natural language processing. However, a single pretrained model often excels in specific tasks or domains, making it challenging to cater to all needs comprehensively. To overcome this limitation, model merging technology has emerged, allowing us to combine the strengths of different models to create more powerful and versatile ones. mergekit is precisely such an open-source toolkit dedicated to the merging of pretrained large language models.
What is mergekit?
mergekit is a powerful toolkit developed by the Arcee-AI team, designed to help users merge different pretrained language models. Its core advantage lies in its out-of-core approach, which means that even with limited computational resources (e.g., CPU-only or minimal VRAM, as low as 8GB), users can perform complex and elaborate model merging operations. This design significantly lowers the barrier to entry for model fusion, enabling more researchers and developers to explore and leverage its potential.
Why Choose Model Merging?
Model merging is a technique that operates directly in the weight space of models, offering numerous benefits without the computational overhead of ensembling or the need for additional training:
Integrate Strengths: Fuse multiple specialized models, each excelling in specific domains or tasks, into a single, comprehensively capable general-purpose model.
Transfer Capabilities: Enable the transfer of capabilities between different models without requiring access to the original training data.
Optimize Trade-offs: Find the optimal balance between the behavioral characteristics of different models.
Enhance Performance: Effectively improve the overall performance of models while maintaining the same inference cost.
Foster Innovation: Explore and realize entirely new model capabilities through creative combinations of different models.
Unlike traditional ensembling methods (which require running multiple models at inference time), merged models have an inference cost comparable to that of a single model but often achieve comparable or even superior performance.
Core Features of mergekit
mergekit offers a rich set of features that establish it as a leader in the model fusion domain:
Broad Model Support: Supports a wide range of popular model architectures, including Llama, Mistral, GPT-NeoX, and StableLM.
Diverse Merging Algorithms: Includes various established merging methods and continuously incorporates new cutting-edge algorithms.
Flexible Execution Environment: Supports both CPU-only execution and GPU acceleration, catering to diverse hardware setups.
Efficient Memory Management: Employs lazy loading of tensors, significantly reducing memory consumption.
Interpolated Parameter Gradients: Inspired by Gryphe's BlockMerge_Gradient script, it implements interpolated gradient calculations for parameter values.
Piecewise Model Assembly: Allows for "Frankenmerging" – assembling models from specific layers of different parent models for finer-grained customization.
Mixture of Experts (MoE) Merging: Supports the merging of MoE models.
LoRA Extraction: Enables the extraction of LoRA (Low-Rank Adaptation) weights from merged models.
Evolutionary Merge Methods: Provides model merging strategies based on evolutionary algorithms.
Multi-Stage Merging: Supports complex multi-stage merging workflows via mergekit-multi.
Raw PyTorch Model Merging: Facilitates the merging of raw PyTorch models through mergekit-pytorch.
Furthermore, the Arcee team has launched a graphical user interface (GUI) built on mergekit, available for trial on the Arcee App and Hugging Face Spaces. This GUI further simplifies the model merging process, making it more accessible to a broader audience.
How to Get Started with mergekit?
Installing mergekit is straightforward. Simply clone its Git repository and install it using pip:
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .
If you encounter an error stating that setup.py or setup.cfg is not found, you may need to upgrade your pip version:
python3 -m pip install --upgrade pip
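After installation, you can check that the command-line entry points are available by printing the help text of the main script, which also lists all supported options:
mergekit-yaml --help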
The primary entry point for mergekit is the mergekit-yaml script. Users need to create a YAML configuration file that details all aspects of the merge operation and then execute the merge via the command line:
mergekit-yaml path/to/your/config.yml ./output-model-directory [--cuda] [--lazy-unpickle] [--allow-crimes] [... other options]
Upon completion, the merged model will be saved in the specified output directory. mergekit also automatically generates a README.md file containing basic model card information, making it convenient for users to share their creations on platforms like the Hugging Face Hub.
Summary
mergekit provides a powerful, flexible, and user-friendly solution for large language model fusion. With its extensive features and excellent support for resource-constrained environments, both seasoned AI researchers and developers new to the field can leverage mergekit to explore the limitless possibilities of model merging and create more innovative and practical language models. In the upcoming tutorial section, we will demonstrate step-by-step how to use mergekit to complete a simple model merging task.
mergekit Simple Step-by-Step Tutorial: Getting Started with Model Merging Easily
After understanding the powerful features of mergekit, let's work through a simple step-by-step tutorial on how to use mergekit to merge two pretrained models. This tutorial uses a basic merging scenario as an example to help you quickly grasp the core usage of mergekit.
Prerequisites:
You have successfully installed mergekit as described in the previous section.
You have a basic understanding of command-line operations.
Python and pip are installed on your computer.
Tutorial Goal:
We will attempt to merge two small, compatible pretrained language models. Here, we choose two models that can be merged using simple linear weight averaging as an example. Please note that the actual model selection needs to be determined based on your specific requirements and the compatibility of the models.
Step One: Select the Models to be Merged
First, you need to identify the pretrained models you wish to merge. These models are usually hosted on the Hugging Face Hub. For the simplicity of this tutorial, let's assume you have selected two models and know their paths on the Hugging Face Hub (e.g., username/model_a and username/model_b).
Important Note: The effectiveness of model merging largely depends on the characteristics of the selected models and their compatibility. Not all models can be successfully merged or produce desirable results. It is recommended to start with models that are structurally similar and task-related.
Step Two: Create the Merge Configuration File (config.yml)
The core of mergekit lies in its YAML configuration file. This file describes in detail how you want to perform the model merge. Let's create a file named config.yml and fill it with the following example content. We will use a basic merge method, such as linear (linear weighting) or slerp (spherical linear interpolation). Here, we use slerp as an example, as it generally provides a smoother transition between two models.
# config.yml
models:
  - model: HuggingFaceHubUsername/ModelNameA  # Replace with the actual path of the first model
  - model: HuggingFaceHubUsername/ModelNameB  # Replace with the actual path of the second model
merge_method: slerp
base_model: HuggingFaceHubUsername/ModelNameA  # slerp requires a base model
parameters:
  t: 0.5          # Interpolation factor: 0 keeps the base model, 1 keeps the other model, 0.5 is halfway
dtype: float16    # Or bfloat16, choose based on your hardware and model support
Configuration File Explanation:
models: A list that enumerates all models participating in the merge.
model: Specifies the path to a model. This can be a model identifier on the Hugging Face Hub or a path to a local model.
merge_method: Specifies the merging algorithm to use. slerp is a common method that performs spherical linear interpolation between the weight spaces of two models. Other options include linear (simple weighted average), dare_ties, passthrough, etc. You can consult the official mergekit documentation for more merge methods and their applicable scenarios.
base_model: For some merging methods (like slerp, or when a reference model is needed to handle vocabularies, special tokens, etc.), a base model must be specified. Usually, one of the participating models is chosen.
parameters.t: The interpolation factor for slerp. A value of 0.5 gives both models equal influence; values closer to 0 favor the base model, values closer to 1 favor the other model. (Methods such as linear use per-model weight parameters instead.)
dtype: Specifies the data type used during the merging process and for the output model, e.g., float16, bfloat16, or float32. Choosing an appropriate data type balances precision and VRAM consumption.
Please be sure to replace HuggingFaceHubUsername/ModelNameA and HuggingFaceHubUsername/ModelNameB with the actual Hugging Face Hub model paths you have chosen.
Step Three: Execute the Model Merge
After creating and saving the config.yml file, open your terminal or command-line interface, navigate to the directory containing the config.yml file, and run the following command:
mergekit-yaml ./config.yml ./merged_model_output --cuda
Command Explanation:
mergekit-yaml: The main execution script for mergekit.
./config.yml: Points to the configuration file you just created.
./merged_model_output: Specifies the output directory for the merged model. mergekit will save the merged model files, a copy of the configuration file, and a README.md file in this directory.
--cuda (optional): If your computer is equipped with a compatible NVIDIA GPU and CUDA is installed, adding this flag uses the GPU to accelerate the merging process. If you do not have a GPU or wish to run on the CPU, you can omit it.
Other optional parameters: mergekit-yaml also supports many other parameters, such as --lazy-unpickle (for low-memory environments) and --allow-crimes (allows some potentially unsafe but sometimes necessary merge operations). You can view all available options by running mergekit-yaml --help.
The merging process may take some time, depending on the size of the models, your computer's performance, and the selected merge method.
Step Four: Check the Merge Results
Once the command has successfully executed, you will find the merged model in the specified output directory (in this example, ./merged_model_output). The directory structure is typically as follows:
merged_model_output/
├── config.yml # Copy of the merge configuration
├── pytorch_model.bin # Or model.safetensors, the merged model weights
├── tokenizer_config.json
├── special_tokens_map.json
├── tokenizer.json # Or tokenizer.model
├── README.md # Auto-generated model description file
└── ... # Other model-related files
You can load and use this newly merged model just like any other Hugging Face Transformers model. The README.md file provides a basic model card, which you can edit to add more details about your merge experiment and then upload to the Hugging Face Hub to share with the community.
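As an illustration, here is a minimal sketch of loading the merged model with the Transformers library; the output path matches the tutorial, and the prompt text is just a placeholder:
# load_merged_model.py - minimal sketch of loading the merged model with Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./merged_model_output"  # output directory from the tutorial

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)

# Run a quick generation to confirm the merged model produces sensible output
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))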
Tutorial Summary
Through these four simple steps, you have successfully completed a model merge using mergekit. This is just the tip of the iceberg of mergekit's capabilities. It also supports more complex merging strategies, such as layer-wise merging (slices), multi-stage merging, and merging of Mixture of Experts models. We encourage you to consult the official mergekit GitHub repository and documentation to explore more advanced features and application scenarios.
Next, we will conduct a more in-depth analysis of the principles behind the key steps in the tutorial.
A Glimpse into the Principles of the mergekit Tutorial: Deeply Understanding the Mysteries of Model Merging
In the previous section, we quickly experienced the basic operational flow of mergekit through a simple tutorial. Now, let's go a step further and delve into the principles behind the key steps in the tutorial to better understand the process of model merging.
Principle One: Models as Collections of Parameters
The first step to understanding model merging is to recognize that modern deep learning models (especially Large Language Models) are essentially composed of a vast number of parameters. These parameters, usually in the form of weights and biases, are distributed across the different layers of the model, such as attention layers, feed-forward network layers, etc. It is the specific numerical values of these parameters that determine the model's behavior and capabilities.
When we talk about "merging models," we are actually performing mathematical operations on, or selections among, these parameters. mergekit provides a series of algorithms precisely to define how to handle corresponding parameters from different models.
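To make this concrete, here is a small illustrative sketch (not mergekit code) that loads two hypothetical same-architecture models and averages their corresponding parameter tensors, which is exactly the kind of element-wise operation a simple linear merge performs:
# toy_linear_merge.py - illustration of "models as collections of parameters"
# NOTE: the model names are the tutorial's placeholders; both models must share the same architecture.
import torch
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("username/model_a")
model_b = AutoModelForCausalLM.from_pretrained("username/model_b")

state_a, state_b = model_a.state_dict(), model_b.state_dict()

# P_merged = 0.5 * P_A + 0.5 * P_B for every floating-point tensor; other buffers are copied from model A
merged_state = {
    name: 0.5 * t + 0.5 * state_b[name] if t.is_floating_point() else t
    for name, t in state_a.items()
}

model_a.load_state_dict(merged_state)    # reuse model A's structure to hold the merged weights
model_a.save_pretrained("./toy_merged")  # the "merged model" is simply a new set of parameter values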
Principle Two: The Core Role of the config.yml Configuration File
The config.yml file is the brain of mergekit; it tells the tool exactly how to perform the merge. Let's review the key parts of the configuration file from the tutorial and their underlying principles:
The models List and Model Localization:
model: HuggingFaceHubUsername/ModelNameA: This line specifies the source of a model participating in the merge. mergekit downloads or loads the model's weight files and configuration files based on this identifier (which can be a path on the Hugging Face Hub or a local path). These files contain all of the model's parameter and structural information.
parameters (t: 0.5): The interpolation factor t is specific to the slerp method and can be understood as how far the merged parameters move from the base model toward the other model, i.e., each model's "influence" or "contribution." With the linear (linear weighted average) method, each model instead carries a weight that acts directly as a multiplicative factor on its parameters. For example, for two models A and B, the merged result of a parameter P is P_merged = weight_A * P_A + weight_B * P_B.
merge_method: slerp (Spherical Linear Interpolation):
Limitations of Linear Interpolation (Lerp): The simplest way to merge parameters is linear interpolation. Imagine two points (representing the parameter vectors of two models); linear interpolation draws a straight line between them and picks a point on this line as the merged result. However, for model parameters in high-dimensional space, direct linear interpolation can lead to a decline in the performance of the merged model because the parameter space is not always "flat." Simply averaging might cause the model to "fall into" a poorly performing region of the parameter space.
Advantages of Spherical Linear Interpolation (SLERP): SLERP provides a method for interpolation on a sphere. In high-dimensional space, it can be seen as interpolating along the "shortest arc" between two vectors. Compared to simple linear interpolation, SLERP is more likely to transition smoothly between the parameters of different models while preserving model characteristics (e.g., by normalizing the magnitudes of weight vectors), potentially producing a better-performing merged model. It attempts to find a harmonious midpoint while retaining the "essence" of each model. When mergekit implements SLERP, it treats the model's weight tensors as high-dimensional vectors and interpolates between these vectors. (A small illustrative sketch of SLERP follows this list.)
Role of base_model: In SLERP and other merge methods that require a reference point, the base_model provides an anchor or reference frame. For example, in vocabulary processing and special token alignment, the configuration of the base_model is prioritized to ensure the consistency of the merged model. For SLERP, it might also serve as the starting point or reference direction of the interpolation path.
dtype: float16 (Data Type):
Trade-off between Precision and Efficiency: Modern large models have an enormous number of parameters, and storing and computing these parameters requires substantial resources. float32 (single-precision floating point) offers higher precision but consumes more VRAM and computational resources. float16 (half-precision floating point) and bfloat16 (Brain Floating Point format) significantly reduce storage and computational overhead by sacrificing some precision. float16 has a smaller dynamic range and may encounter overflow or underflow issues during training or merging, while bfloat16 has a dynamic range similar to float32 but lower precision, making it well suited to deep learning workloads. mergekit allows users to specify the data type for the merging process and the output model to trade off precision against efficiency. Choosing float16 or bfloat16 can significantly reduce VRAM usage, making it possible to merge large models on consumer-grade hardware.
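To ground the SLERP discussion above, here is a small, self-contained sketch of spherical linear interpolation between two weight tensors, treating each tensor as a flattened high-dimensional vector. It illustrates the idea rather than reproducing mergekit's actual implementation, and it falls back to plain linear interpolation when the two vectors are nearly parallel:
# slerp_sketch.py - illustrative spherical linear interpolation between two weight tensors
import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Interpolate along the arc between tensors a and b (t=0 -> a, t=1 -> b)."""
    a32, b32 = a.float(), b.float()                  # upcast for numerical stability
    a_dir = a32 / (a32.norm() + eps)                 # unit vectors give the two directions
    b_dir = b32 / (b32.norm() + eps)
    cos_theta = (a_dir * b_dir).sum().clamp(-1.0, 1.0)
    theta = torch.arccos(cos_theta)                  # angle between the directions
    if theta.abs() < 1e-4:                           # nearly parallel: plain lerp is fine
        return ((1 - t) * a32 + t * b32).to(a.dtype)
    sin_theta = torch.sin(theta)
    w_a = torch.sin((1 - t) * theta) / sin_theta     # standard SLERP coefficients
    w_b = torch.sin(t * theta) / sin_theta
    return (w_a * a32 + w_b * b32).to(a.dtype)

# Example: interpolate halfway between two random "weight" tensors
p_a, p_b = torch.randn(4096), torch.randn(4096)
p_merged = slerp(0.5, p_a, p_b)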
Principle Three: The mergekit-yaml Execution Process
When you run the mergekit-yaml ./config.yml ./merged_model_output --cuda command, mergekit internally executes roughly the following flow:
Parse Configuration File: Reads config.yml to understand the user's merge intent, including which models to merge, which merge method to use, the parameters for each model, the target data type, etc.
Load Models: Based on the model paths in the configuration file, downloads/loads the weights and configurations of each model from the Hugging Face Hub or locally. A key feature of mergekit is its "out-of-core" processing and "lazy loading": it does not load all parameters of all models into memory or VRAM at once, but loads and processes data in chunks or layers as needed. This enables the merging of very large models on devices with limited memory/VRAM.
Parameter Alignment and Matching: The structures of different models may not be entirely identical, or parameter naming might differ. mergekit attempts to intelligently align and match corresponding parameter tensors from different models. For example, the query weights of the first attention layer of model A are matched with the query weights of the first attention layer of model B.
Execute Merge Algorithm: According to the specified merge_method (e.g., slerp), performs the corresponding mathematical operations on each pair of matched parameter tensors. For slerp, it calculates the spherical linear interpolation between the two parameter tensors. This process is done layer by layer, parameter by parameter. If the --cuda flag is used and a compatible GPU is available, these computations are performed on the GPU as much as possible to accelerate the process.
Construct New Model: Assembles the merged parameters into a new model structure, creating new model weight files (such as pytorch_model.bin or model.safetensors).
Process Vocabulary and Configuration Files: Merging the tokenizer and related configuration files (e.g., tokenizer_config.json, special_tokens_map.json) is also a crucial step. mergekit decides how to handle the vocabularies of different models based on the settings in config.yml (mergekit supports more detailed tokenizer configurations not covered in this tutorial). Common strategies include taking the union, using the base_model's vocabulary, or more complex token mapping. The goal is to produce a tokenizer that is compatible and fully functional with the merged model.
Save Output: Saves the merged model files, new vocabulary files, configuration files, and the auto-generated README.md to the user-specified output directory.
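To make the lazy, tensor-by-tensor idea concrete, here is a small general illustration using the safetensors library. It is not mergekit's internal code, the file paths are placeholders, and it simply averages each pair of tensors as it streams through them, so only one tensor per model is ever read into memory at a time:
# lazy_tensor_merge.py - illustration of out-of-core, tensor-by-tensor merging with safetensors
# NOTE: file paths are placeholders; this is a general illustration, not mergekit's internals.
import torch
from safetensors import safe_open
from safetensors.torch import save_file

path_a = "model_a/model.safetensors"
path_b = "model_b/model.safetensors"

merged = {}
with safe_open(path_a, framework="pt") as fa, safe_open(path_b, framework="pt") as fb:
    for name in fa.keys():                 # iterate over parameter names, one tensor at a time
        t_a = fa.get_tensor(name)          # only this tensor is read from disk
        t_b = fb.get_tensor(name)
        merged[name] = (0.5 * t_a + 0.5 * t_b).to(torch.float16)  # merge, keep only the result

save_file(merged, "merged/model.safetensors")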
Principle Four: The Importance of Model Compatibility
Although mergekit offers powerful merging capabilities, not all models can be simply merged to produce good results. The following factors affect the success rate and effectiveness of merging:
Similarity of Model Architecture: Generally, models with the same or highly similar architectures (e.g., different fine-tuned versions based on the Llama architecture) are easier to merge successfully. Merging models with vastly different architectures (e.g., a Transformer model and an RNN model) is usually infeasible or meaningless.
Task Relevance: Merging models trained on similar or related tasks is more likely to produce a general-purpose model that performs well on those tasks. Merging unrelated models can lead to knowledge conflicts and performance degradation.
Vocabulary and Embedding Layers: If the models' vocabularies differ significantly, or if the handling of special tokens is different, careful attention must be paid to aligning vocabularies and embedding layers during the merge. Otherwise, it may result in the model being unable to correctly understand input or producing meaningless output.
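As a practical aid, the short sketch below (model names are the tutorial's placeholders) compares the vocabularies and special tokens of two candidate models before merging:
# check_tokenizer_compat.py - quick tokenizer compatibility check before merging
from transformers import AutoTokenizer

tok_a = AutoTokenizer.from_pretrained("username/model_a")   # placeholder model names
tok_b = AutoTokenizer.from_pretrained("username/model_b")

vocab_a, vocab_b = tok_a.get_vocab(), tok_b.get_vocab()
print("vocab sizes:", len(vocab_a), len(vocab_b))
print("tokens only in A:", len(set(vocab_a) - set(vocab_b)))
print("tokens only in B:", len(set(vocab_b) - set(vocab_a)))
print("special tokens A:", tok_a.special_tokens_map)
print("special tokens B:", tok_b.special_tokens_map)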
Some advanced features of mergekit, such as layer-wise merging (slices), allow users to control more finely which parts of the models participate in the merge, which is very useful when dealing with models that are not entirely compatible.
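As an illustration, a layer-wise ("frankenmerge") configuration can look roughly like the sketch below, which stacks the lower layers of one model on the upper layers of another using the passthrough method; the model names and layer ranges are placeholders and must match the actual models' layer counts:
# slices_example.yml - illustrative layer-wise (frankenmerge) configuration
slices:
  - sources:
      - model: HuggingFaceHubUsername/ModelNameA
        layer_range: [0, 16]    # take the first 16 layers from model A
  - sources:
      - model: HuggingFaceHubUsername/ModelNameB
        layer_range: [16, 32]   # take the remaining layers from model B
merge_method: passthrough
dtype: float16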
Summary
Model merging is an exploratory field. mergekit, with its flexible configuration, diverse merging algorithms, and efficient resource utilization, opens the door for us to explore the boundaries of model capabilities. Understanding these basic principles will help you use mergekit more effectively, design more creative merging strategies, and have more reasonable expectations for the merge results. In practice, continuous experimentation and adjustment of the configuration file are key to improving merge effectiveness.