Avoid Merging LoRA Adapters into 4-bit Quantized LLMs
Understanding LoRA and QLoRA
LoRA (Low-Rank Adaptation) is a technique designed for efficient fine-tuning of large language models (LLMs) by introducing a minimal number of trainable parameters while keeping the bulk of the model unchanged. This method is particularly memory-efficient since only the newly added parameters are subject to training.
On the other hand, QLoRA enhances memory efficiency further by quantizing the underlying LLM before adding these trainable parameters. Typically, during QLoRA training, only the parameters of the adapter are stored. There are two primary methods for employing the adapter during inference:
- Loading it on top of the base LLM
- Merging it into the base LLM
Loading the adapter on top of the base model provides the advantage of easily swapping adapters without much hassle. Additionally, due to their small size, these adapters are simple to store and distribute.
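As a minimal sketch of what adapter swapping looks like with PEFT (the base model variable and the adapter repository names below are placeholders, not the models used later in this article):

from peft import PeftModel

# Attach a first adapter to an already-loaded base model (placeholder names).
model = PeftModel.from_pretrained(base_model, "user/adapter-task-a", adapter_name="task_a")
# Load a second adapter into the same wrapper and switch to it
# without reloading the base model.
model.load_adapter("user/adapter-task-b", adapter_name="task_b")
model.set_adapter("task_b")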
However, one might consider merging the LoRA adapter into the base LLM to simplify usage or conceal the adapter's presence. The original authors of LoRA have established that merging an adapter into the base model can be done without any loss in performance.
Yet, when it comes to QLoRA and quantized LLMs, the situation changes. In a prior analysis, I explored various techniques for merging adapters fine-tuned with QLoRA and found none to be flawless. At that time, merging an adapter directly into a 4-bit quantized LLM was unfeasible, necessitating dequantization first.
Now, with the latest updates to the PEFT library, we can merge a LoRA adapter directly into a 4-bit LLM. While this might seem advantageous, it raises important questions. Should we proceed with this approach? What implications does it carry?
In this article, I will elucidate the reasons why merging an adapter fine-tuned with QLoRA into a quantized LLM is not ideal. Through straightforward experiments, I will demonstrate that this merging process can lead to substantial performance deterioration.
For those interested in replicating my experiments, they can be found in this notebook:
Get the notebook (#27) - The Consequences of Merging a LoRA Adapter into a 4-bit LLM.
Analyzing the Architecture of a 4-bit Llama 2 7B Model
Let's examine the structure of a 4-bit Llama 2 7B model quantized with bitsandbytes NF4.
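A quick way to reproduce this inspection, assuming access to the meta-llama/Llama-2-7b-hf checkpoint, is to load the model with a 4-bit BitsAndBytesConfig and print it:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load Llama 2 7B quantized to 4-bit NF4 with bitsandbytes, then print the module tree.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map={"": 0},
)
print(model)  # attention and MLP projections appear as Linear4bit; lm_head stays Linear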
Quantization replaces nearly all of the model's linear layers with bitsandbytes "Linear4bit" modules, including:
- Self-attention modules: q_proj, v_proj, o_proj, and k_proj
- MLP modules: gate_proj, up_proj, and down_proj
It's worth noting that the "lm_head" remains unquantized and keeps a standard "Linear" layer.
By integrating LoRA into all MLP modules of this model, the architecture adapts as follows:
(Note: Only the initial part of the model’s architecture is displayed; the other MLP modules are omitted.)
In this configuration, three new elements appear inside the gate_proj module: lora_dropout and LoRA's two low-rank matrices, A and B.
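A configuration along these lines would produce that structure; the rank, alpha, and dropout values below are illustrative, not the exact hyperparameters of the adapter evaluated later:

from peft import LoraConfig, get_peft_model

# Add LoRA only to the MLP modules of the quantized model loaded above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
print(peft_model)  # each targeted Linear4bit now contains lora_dropout, lora_A, and lora_B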
During QLoRA fine-tuning, the parameters for A and B are trained. Unlike the model's other parameters, these are not subject to quantization. The QLoRA paper illustrates this as follows:
LoRA's parameters retain a precision of 16 bits while the base model operates at 4 bits.
Once merged, the LoRA parameters vanish from the architecture as they are integrated with the existing 4-bit model parameters. Post-merge, LoRA's parameters are effectively quantized (though not visibly distinct from the base model's parameters). Despite the high accuracy of quantization, it invariably results in some degree of information loss. Consequently, the merged model may underperform compared to the one achieved after QLoRA fine-tuning.
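Schematically, merging into a 4-bit layer has to perform something like the following; dequantize_nf4 and quantize_nf4 are placeholder names for the dequantization and re-quantization steps, not the actual PEFT or bitsandbytes functions:

# Schematic only: placeholder helpers, not the real library internals.
W = dequantize_nf4(W_nf4)                    # recover 16-bit weights from the 4-bit storage
W_merged = W + scaling * (lora_B @ lora_A)   # fold the adapter into the weight matrix
W_nf4_new = quantize_nf4(W_merged)           # re-quantize to 4 bits: this step loses information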
Moreover, the lack of quantization-awareness in LoRA's parameters contributes to this performance decline.
During the QLoRA fine-tuning process, the base model is dequantized to match the data type of LoRA's parameters. The QLoRA paper outlines this:
> We dequantize the storage data type to the computation data type to perform the forward and backward pass.
Here, the storage data type is NF4 (4-bit), while the compute data type is 16-bit (bf16 or fp16). During training, LoRA's parameters therefore only ever interact with the base model's weights in their dequantized 16-bit form.
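In other words, a QLoRA forward pass through an adapted layer looks roughly like this (again a schematic with a placeholder dequantize_nf4, not the actual kernel):

def qlora_forward(x, W_nf4, lora_A, lora_B, scaling):
    W16 = dequantize_nf4(W_nf4)  # NF4 storage dtype -> 16-bit compute dtype (bf16 or fp16)
    # The LoRA branch only ever sees 16-bit activations and 16-bit base weights.
    return x @ W16.T + scaling * ((x @ lora_A.T) @ lora_B.T)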
When merged into the 4-bit model, LoRA's parameters end up combined with 4-bit weights they never encountered during fine-tuning. Since they were not trained to compensate for the quantization error, the performance of the merged model becomes unpredictable.
Examining Performance: Loading vs. Merging a LoRA Adapter
For all the following experiments, I used Llama 2 7B as the base LLM. All models were evaluated on the test split of the timdettmers/openassistant-guanaco dataset with perplexity as the metric, where lower perplexity signifies better performance.
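The exact evaluation code is in the notebook; as a rough sketch, given a model and its tokenizer, perplexity on this dataset can be computed along these lines (details such as the maximum sequence length and token counting are assumptions):

import torch
from datasets import load_dataset

def perplexity(model, tokenizer, split="test", max_length=1024):
    # Average per-token negative log-likelihood over the split, exponentiated.
    data = load_dataset("timdettmers/openassistant-guanaco", split=split)
    total_nll, total_tokens = 0.0, 0
    model.eval()
    with torch.no_grad():
        for example in data:
            enc = tokenizer(example["text"], return_tensors="pt",
                            truncation=True, max_length=max_length).to(model.device)
            out = model(**enc, labels=enc["input_ids"])
            n = enc["input_ids"].numel() - 1  # labels are shifted internally
            total_nll += out.loss.item() * n
            total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))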
Initially, let's assess the performance of the base LLM, Llama 2 7B, on the evaluation dataset. Using the default parameters:
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map={"": 0})
This model achieves a perplexity of 5.174.
Next, we load the same model but apply quantization using bitsandbytes NF4:
compute_dtype = torch.float16  # 16-bit compute data type used for the dequantized matmuls
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # store the base model's weights in 4 bits
    bnb_4bit_quant_type="nf4",            # NormalFloat4 quantization
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map={"": 0})
This results in a perplexity of 5.488, indicating a performance drop, as expected due to information loss from quantization.
Next, we load a LoRA adapter fine-tuned on the train split of the timdettmers/openassistant-guanaco dataset on top of the base model, quantized with the same BitsAndBytesConfig:
from peft import PeftModel

model = PeftModel.from_pretrained(model, "kaitchup/Llama-2-7B-oasstguanaco-adapter")
This setup yields a perplexity of 3.612, which is a notable improvement over the base model without the adapter.
With the recent PEFT library update, we can now directly merge the adapter into the quantized LLM:
model = model.merge_and_unload()
The merged model reaches a perplexity of 5.181, a sharp regression from 3.612 that brings performance almost back to the level of the base model before QLoRA fine-tuning. Merging the adapter into the 4-bit model is clearly worse than simply keeping it loaded on top.
Conclusion: The Risks of Merging LoRA Adapters
We have identified two critical reasons for avoiding the merging of a LoRA adapter fine-tuned with QLoRA into a 4-bit LLM:
- Degradation of LoRA's parameters due to their quantization to 4 bits.
- Lack of quantization-awareness in LoRA's parameters.
Given these factors, I strongly advise maintaining your LoRA adapter loaded atop the LLM rather than merging it into the model.
If merging is essential, consider dequantizing the base LLM before proceeding, as outlined here. However, keep in mind that subsequent quantization of the merged model may result in significantly degraded performance.
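One common variant of this workaround is sketched below: rather than dequantizing the 4-bit weights in place, reload the base model in 16-bit precision, attach the adapter, and merge into the full-precision weights (repository names as used earlier; the output directory is arbitrary):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in 16-bit (no quantization), then merge the adapter into it.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map={"": 0}
)
merged = PeftModel.from_pretrained(base, "kaitchup/Llama-2-7B-oasstguanaco-adapter").merge_and_unload()
merged.save_pretrained("./llama-2-7b-oasst-merged")  # can be re-quantized afterward, with the caveat above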
Additionally, it’s feasible to fine-tune quantization-aware LoRA adapters that can be merged perfectly using a method known as QA-LoRA. My exploration of the QA-LoRA implementation revealed it was still a work in progress and required custom adjustments based on the LLM employed. It’s also important to note that the official QA-LoRA implementation is no longer available on GitHub.
To keep up with my research and findings, consider subscribing to my newsletter.
Practical Insights and Resources
The first video, "Deploy (Tiny) LLM to Production: Merge Lora Adapter, Push to HF Hub, Rest API with FastAPI & Docker," offers insights into deploying a small LLM to production, with practical steps for merging and using LoRA adapters.
The second video, "Finetune LLM using lora | Step By Step Guide | peft | transformers | tinyllama," provides a comprehensive step-by-step guide on fine-tuning LLMs using LoRA, showcasing techniques for optimizing performance.