Exploring QDoRA, QLoRA, and LoftQ: Training and Performance Insights

Chapter 1: Introduction to Fine-Tuning Techniques

LoRA is an efficient fine-tuning method: rather than updating the entire model, it trains small adapters on top of the frozen weights. With the emergence of QLoRA, it has become common to fine-tune these adapters on top of quantized large language models (LLMs). Several alternatives to QLoRA, including QDoRA, QA-LoRA, LQ-LoRA, and LoftQ, have been introduced to further improve fine-tuning of quantized models.

In this article, I compare and experiment with QDoRA, LoftQ, and QLoRA, benchmarking their performance and inference throughput. The analysis specifically addresses three configurations:

  1. Loading the adapter on top of the quantized model.
  2. Merging the adapter with the unquantized model.
  3. Merging the adapter followed by quantizing the model.

It’s important to note that merging an adapter before quantizing can lead to a noticeable decline in the model's accuracy. The article also outlines methods to recover much of this lost accuracy using various quantization algorithms.

For practical implementation, a notebook that includes QLoRA, QDoRA, and LoftQ fine-tuning and adapter merging is available here:

Get the notebook (#61) QLoRA, QDoRA, and LoftQ.

I utilized Mistral 7B as the foundational model and the openassistant-guanaco dataset for training and validation.

Chapter 2: Training the Adapters

To train the adapters using the three different methods, I followed the code shared in previous articles. The fine-tuning code is also available in the notebook (#61).

For QDoRA, I applied DoRA on the LLM quantized using bitsandbytes. Previously, applying DoRA on quantized layers was not feasible, but now Hugging Face PEFT supports this functionality.
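
As an illustration of what this looks like with PEFT, here is a minimal sketch of enabling DoRA on top of a bitsandbytes 4-bit model. The rank, alpha, and target modules below are illustrative assumptions, not necessarily the values used for the experiments.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config, device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)

# use_dora=True turns the LoRA adapter into a DoRA adapter;
# rank and target modules here are assumptions
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
    use_dora=True,
)
model = get_peft_model(model, peft_config)

The same configuration without use_dora=True gives the standard QLoRA setup.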

I trained each adapter for a single epoch with a batch size of 8. The learning curves for the training and validation losses are quite similar across all three methods. Theoretically, DoRA is expected to outperform LoRA, especially at smaller ranks, while LoftQ should yield better results than QLoRA thanks to its superior adapter initialization.
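
For reference, PEFT exposes LoftQ's initialization through the init_lora_weights argument of LoraConfig. The sketch below shows the general pattern, assuming a rank-16 adapter and 4-bit LoftQ; note that the base model is loaded unquantized here because LoftQ performs the quantization-aware initialization itself.

import torch
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

# Base model loaded without bitsandbytes quantization: LoftQ computes
# the quantization-aware initialization internally
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16, device_map={"": 0}
)

loftq_config = LoftQConfig(loftq_bits=4)  # 4-bit, matching QLoRA's NF4 setting

peft_config = LoraConfig(
    r=16,  # illustrative rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
    init_lora_weights="loftq",
    loftq_config=loftq_config,
)
model = get_peft_model(model, peft_config)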

Which method should you choose?

While DoRA and LoftQ can theoretically be used in tandem, no experiments have confirmed this yet, and Hugging Face PEFT does not currently support LoftQ with DoRA.

Section 2.1: Inference Speed Analysis

Now, let's evaluate whether one method provides faster inference than the others.

When loading the adapter on top of the quantized model, I executed the following code:

import time
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

model_name = "mistralai/Mistral-7B-v0.1"
adapter = "../qlora/checkpoint-1231"

# Must match the fine-tuning configuration; bfloat16 and FlashAttention-2
# are assumed here, adjust them to your own setup
compute_dtype = torch.bfloat16
attn_implementation = "flash_attention_2"

# NF4 quantization with double quantization, as used for fine-tuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

loading_start = time.time()
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={"": 0}, torch_dtype=compute_dtype, attn_implementation=attn_implementation
)

loading_adapter_start = time.time()
model = PeftModel.from_pretrained(model, adapter)

Set adapter to the path of the adapter you want to evaluate. I then benchmarked perplexity on the openassistant-guanaco test split and measured inference speed, running everything on a Google Colab L4 instance.
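
The exact benchmarking code is in the notebook. As a rough sketch of the kind of measurement involved, the snippet below (reusing model and model_name from the snippet above) estimates perplexity from the average token-level loss on a subset of the test split and computes a simple tokens-per-second figure for generation; the dataset identifier, subset size, and generation length are assumptions.

import math
import time
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
test_set = load_dataset("timdettmers/openassistant-guanaco", split="test")

# Perplexity: exp of the average cross-entropy loss over the test texts
losses = []
for example in test_set.select(range(50)):  # small subset for a quick estimate
    enc = tokenizer(example["text"], return_tensors="pt", truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        losses.append(model(**enc, labels=enc["input_ids"]).loss.item())
print("Perplexity:", math.exp(sum(losses) / len(losses)))

# Throughput: generated tokens per second for a single prompt
prompt = tokenizer("### Human: Explain quantization.### Assistant:", return_tensors="pt").to(model.device)
start = time.time()
output = model.generate(**prompt, max_new_tokens=200, do_sample=False)
elapsed = time.time() - start
print("Tokens/second:", (output.shape[1] - prompt["input_ids"].shape[1]) / elapsed)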

The baseline was Mistral 7B quantized with bitsandbytes, without any adapter loaded. All three adapter configurations reached a significantly lower perplexity than this baseline; note that lower perplexity indicates better performance.

However, inference with QDoRA was surprisingly slow, which suggests that the handling of DoRA's magnitude vector is not yet well optimized in PEFT. In contrast, QLoRA and LoftQ had comparable inference speeds that even exceeded the baseline, likely because the adapter's parameters remain unquantized and therefore require no dequantization during inference.

Chapter 3: Merging Adapters and Performance

Next, we will examine the performance when the adapter is merged into the base model.

For merging, I employed the same method detailed in this article. The code for merging is available in the notebook:

Get the notebook (#61).

It’s crucial to follow the correct procedure to avoid a significant decline in model performance compared to simply loading the adapter. This procedure, sketched in the code after the list, involves:

  1. Quantizing the base model as done for fine-tuning using bitsandbytes NF4 with double quantization and the same compute dtype.
  2. Dequantizing the model to match the compute dtype used in fine-tuning.
  3. Merging the adapter into the dequantized model.
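
A condensed sketch of these three steps is shown below. It assumes a recent transformers version that exposes a dequantize() method for bitsandbytes-quantized models; older versions require a manual dequantization loop, as done in the notebook. The output path is only an illustrative name.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

compute_dtype = torch.bfloat16  # must match the compute dtype used for fine-tuning
adapter = "../qlora/checkpoint-1231"

# Step 1: quantize the base model exactly as during fine-tuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config, device_map={"": 0}
)

# Step 2: dequantize back to higher precision (recent transformers versions
# expose dequantize() for bitsandbytes models; it modifies the model in place)
model.dequantize()

# Step 3: merge the adapter into the dequantized model and save the result
model = PeftModel.from_pretrained(model, adapter)
model = model.merge_and_unload()
model.save_pretrained("./mistral-7b-qlora-merged")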

For this analysis, we keep the merged model in half-precision, which makes it roughly four times larger than its 4-bit quantized counterpart.

After merging, I ran the same benchmarks as in the previous section. The merged models showed similar perplexity but were substantially faster, since nothing is quantized anymore: with parameters and activations in 16-bit precision, there is nothing to dequantize at inference time. Because merging folds the adapter, including DoRA's magnitude vector, into the base weights, the DoRA adapter also becomes usable at full speed once merged.

Section 3.1: Quantization After Merging

The merged model, though faster, is significantly larger. To get back to the memory footprint we had during fine-tuning, we must quantize it again.
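
Concretely, re-quantizing just means reloading the merged checkpoint with the same BitsAndBytesConfig; a minimal sketch, reusing the illustrative merged-model path from earlier:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# Reload the merged model with the same quantization settings as fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    "./mistral-7b-qlora-merged", quantization_config=bnb_config, device_map={"": 0}
)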

Following the merge, I applied the same bitsandbytes configuration used for fine-tuning and ran another benchmark. Unfortunately, the perplexity worsened significantly, though it remained better than the original model. This suggests the model might have lost some of its learned capabilities post-quantization.

I previously discussed potential reasons for this phenomenon: during fine-tuning, the adapter's parameters are not quantized, but after merging and subsequent quantization, they become quantized to 4-bit. This is the first time we observe the model's performance with quantized adapter parameters, leading to a notable accuracy decline.

To ascertain whether the 4-bit quantization was the issue, I ran additional experiments using other quantization algorithms. Instead of bitsandbytes, I employed AWQ for quantization and observed the results.
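
With the AutoAWQ library, quantizing the merged model looks roughly like the sketch below; the group size and other settings are common defaults rather than necessarily the ones used for the reported results, and the paths are again illustrative.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

merged_path = "./mistral-7b-qlora-merged"      # merged model saved earlier
quant_path = "./mistral-7b-qlora-merged-awq"

# Typical 4-bit AWQ settings; AWQ uses activation statistics from a
# calibration set to decide which weights to preserve most accurately
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(merged_path)
tokenizer = AutoTokenizer.from_pretrained(merged_path)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)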

Notably, AWQ yielded better perplexity than NF4, remaining close to the original perplexity before quantization.

So, what causes issues with bitsandbytes' NF4 quantization?

It’s possible that the adapter's parameters are outliers, adversely affected by this quantization approach, or that the implementation has flaws that fail to accommodate the adapter's parameters effectively. Conversely, AWQ is "activation-aware," automatically recognizing the importance of preserving the adapter's parameters from quantization.

Conclusion

Using consistent hyperparameters, QLoRA, QDoRA, and LoftQ yield comparable performance. Although QDoRA and LoftQ theoretically surpass QLoRA, the slowness of QDoRA makes LoftQ a more viable alternative.

Merging the adapter into the base model results in significantly faster models when left unquantized. However, when quantizing models post-merge, it’s advisable to avoid bitsandbytes NF4 in favor of alternatives like AWQ or GPTQ, which demonstrate superior results in this context.

To support my work, consider subscribing to my newsletter for more articles and tutorials on the latest advancements in AI.

Chapter 4: Video Resources

To enhance your understanding of these methods, check out the following videos:

This video covers how QLoRA facilitates fast and lightweight model fine-tuning.

In this tutorial, we dive into QLoRA's mechanics and how to fine-tune the Phi-2 model using quantized LoRA.

