Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU

April 15, 2024 by Sean Song.

Building on the previous blog, Fine-tune Llama 2 with LoRA, we delve into another Parameter-Efficient Fine-Tuning (PEFT) approach known as Quantized Low-Rank Adaptation (QLoRA). The focus is on using QLoRA to fine-tune the Llama-2 7B model on a single AMD GPU with ROCm. QLoRA makes this possible by easing the memory and compute limitations that normally stand in the way. The aim is to show how QLoRA can make open-source large language models more accessible.

QLoRA Fine-tuning #

QLoRA is a fine-tuning technique that combines a high-precision computing technique with a low-precision storage method. This helps keep the model size small while making sure the model is still highly performant and accurate.

How does QLoRA work?#

In short, QLoRA reduces the memory usage of LLM fine-tuning without compromising performance, in contrast to standard 16-bit model fine-tuning. Specifically, QLoRA employs 4-bit quantization to compress a pretrained language model. The language model parameters are then frozen, and a modest number of trainable parameters are introduced in the form of Low-Rank Adapters. During fine-tuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained language model into the Low-Rank Adapters. Notably, only the LoRA layers are updated during training. For a more in-depth exploration of LoRA, refer to the original LoRA paper.
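To make the data flow concrete, here is a minimal pure-Python sketch of a single linear layer with a LoRA adapter attached. It is an illustration under simplifying assumptions, not the PEFT or bitsandbytes implementation: the frozen weight is a plain list of floats standing in for the dequantized 4-bit weight, and the names `matvec`, `lora_forward`, `W_frozen`, `A`, `B_zero`, and `B_trained` are hypothetical.

```python
# Minimal sketch of a LoRA-augmented linear layer (illustrative only).
# y = W_frozen @ x + (alpha / r) * B @ (A @ x)
# W_frozen stands in for the dequantized base weight; only A and B would be trained.

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W_frozen, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W_frozen, x)
    update = matvec(B, matvec(A, x))      # (d_out x r) @ ((r x d_in) @ x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

d_in, d_out, r, alpha = 4, 3, 2, 2
W_frozen = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]
A = [[0.5, -0.5, 0.25, 0.0], [0.0, 1.0, 0.0, -1.0]]   # r x d_in
B_zero = [[0.0] * r for _ in range(d_out)]             # d_out x r, zero-initialised
x = [1.0, 2.0, 3.0, 4.0]

# With B at zero the adapter is a no-op, so training starts from the base model's behaviour.
y_start = lora_forward(W_frozen, A, B_zero, x, alpha, r)

# Once B is non-zero, the output shifts by the low-rank update while W_frozen is untouched.
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y_tuned = lora_forward(W_frozen, A, B_trained, x, alpha, r)
print(y_start, y_tuned)
```

The point of the sketch is that gradients only need to reach `A` and `B`; the base weight is read but never modified, which is why it can be stored in a compressed form.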

QLoRA vs LoRA#

QLoRA and LoRA represent two parameter-efficient fine-tuning techniques. LoRA operates as a standalone fine-tuning method, while QLoRA incorporates LoRA as an auxiliary mechanism to address errors introduced during the quantization process and to additionally minimize the resource requirements during fine-tuning.

Step-by-step Llama 2 fine-tuning with QLoRA #

This section walks through fine-tuning the 7-billion-parameter Llama 2 model on a single AMD GPU. QLoRA is what makes this feasible, because it substantially reduces the memory required for training.

For that, we will use the following setup:

Hardware & OS: See this link for a list of supported hardware and OS with ROCm.
Software:
- ROCm 6.1.0+
- PyTorch for ROCm 2.0+
Libraries: transformers, accelerate, peft, trl, bitsandbytes, scipy

In this blog, we conducted our experiment using a single MI250 GPU with the Docker image rocm/pytorch:rocm6.1.2_ubuntu22.04_py3.10_pytorch_release-2.1.2.

You can find the complete code used in this blog in the GitHub repo.

1. Getting started#

Our first step is to confirm that a GPU is available.

!rocm-smi --showproductname

========================= ROCm System Management Interface =========================
=================================== Product Info ===================================
GPU[0] : Card series: AMD INSTINCT MI250 (MCM) OAM AC MBA
GPU[0] : Card model: 0x0b0c
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: D65209
GPU[1] : Card series: AMD INSTINCT MI250 (MCM) OAM AC MBA
GPU[1] : Card model: 0x0b0c
GPU[1] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1] : Card SKU: D65209
====================================================================================
=============================== End of ROCm SMI Log ================================

If your AMD machine has more than one Graphics Compute Die (GCD) or GPU, let’s use only one of them.

import os
# Expose only the first GCD to this process
os.environ["HIP_VISIBLE_DEVICES"] = "0"

import torch
use_cuda = torch.cuda.is_available()
if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    device_count = torch.cuda.device_count()

__CUDNN VERSION: 2020000
__Number CUDA Devices: 1

We will start by installing the required libraries.

!pip install -q pandas peft==0.9.0 transformers==4.31.0 trl==0.4.7 accelerate scipy

Installing bitsandbytes#

ROCm needs a special version of bitsandbytes (bitsandbytes-rocm).

Install bitsandbytes using the following code.

git clone --recurse https://github.com/ROCm/bitsandbytes
cd bitsandbytes
git checkout rocm_enabled
pip install -r requirements-dev.txt
cmake -DCOMPUTE_BACKEND=hip -S .  # Use -DBNB_ROCM_ARCH="gfx90a;gfx942" to target specific gpu arch
make
pip install .

Check the bitsandbytes version.
At the time of writing this blog, the version is 0.43.0.
```
%%bash
pip list | grep bitsandbytes
```

Import the required packages.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer

2. Configuring the model and data#

Model configuration#

You can access Meta’s official Llama-2 model from Hugging Face after making a request, which can take a couple of days. Instead of waiting, we’ll use NousResearch’s Llama-2-7b-chat-hf as our base model (it’s the same as the original, but quicker to access).

# Model and tokenizer names
base_model_name = "NousResearch/Llama-2-7b-chat-hf"
new_model_name = "llama-2-7b-enhanced"  # You can give your own name for fine tuned model

# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

QLoRA 4-bit quantization configuration#

As outlined in the paper, QLoRA stores weights in 4 bits while computation occurs in 16-bit or 32-bit precision. This means that whenever a QLoRA weight tensor is used, the tensor is dequantized to 16-bit or 32-bit precision and then used in a matrix multiplication. Various compute types, such as float16, bfloat16, and float32, can be chosen. You can also experiment with different 4-bit quantization variants, including normalized float 4 (NF4) and pure float4 quantization. However, guided by theoretical considerations and empirical findings from the paper, the recommendation is to opt for NF4 quantization, as it tends to deliver better performance.
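The store-in-4-bit, compute-in-higher-precision idea can be sketched with a simplified blockwise quantizer. This is an illustration only: it uses a uniform 16-level grid as a stand-in, whereas NF4 uses a non-uniform codebook, and the helper names `quantize_block` and `dequantize_block` are hypothetical rather than bitsandbytes functions.

```python
# Simplified blockwise 4-bit quantization (illustrative; NOT the NF4 codebook).
# Each block stores 4-bit integer codes plus one floating-point scale (the block's absmax).

def quantize_block(block):
    """Map floats to integer codes 0..15 using the block's absolute maximum as scale."""
    absmax = max(abs(v) for v in block) or 1.0   # avoid division by zero for all-zero blocks
    codes = [round((v / absmax + 1.0) / 2.0 * 15) for v in block]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Recover approximate floats from the codes, as done before each matrix multiplication."""
    return [(c / 15 * 2.0 - 1.0) * absmax for c in codes]

weights = [0.02, -0.31, 0.17, 0.08, -0.05, 0.29, -0.12, 0.0]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

# The worst-case rounding error is half a quantization step, i.e. scale / 15.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The rounding error left behind by a step like this is the quantization error that the LoRA adapters are later trained to compensate for.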

In our case, we chose the following configuration:

4-bit quantization with NF4 type
16-bit (float16) for computation
Double quantization, which quantizes the quantization constants themselves in a second pass to save roughly 0.37 additional bits per parameter (the figure reported in the QLoRA paper)
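The saving from double quantization can be checked with a few lines of arithmetic. The block sizes and bit widths below are the ones reported in the QLoRA paper (64 weights per block, 32-bit constants, and a second level that stores constants in 8 bits with one 32-bit constant per 256 of them); whether a given bitsandbytes build uses exactly these values is an assumption.

```python
# Per-parameter overhead of quantization constants, using values reported in the QLoRA paper.
block_size = 64          # weights per first-level quantization block
second_block_size = 256  # first-level constants per second-level block

# Without double quantization: one 32-bit constant per block of 64 weights.
overhead_single = 32 / block_size

# With double quantization: 8-bit constants, plus one 32-bit constant per 256 of them.
overhead_double = 8 / block_size + 32 / (block_size * second_block_size)

saving = overhead_single - overhead_double
print(f"{overhead_single:.3f} -> {overhead_double:.3f} bits/param (saves {saving:.3f})")
```

For a 7-billion-parameter model, a saving of this size amounts to a few hundred megabytes.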

Quantization parameters are controlled through BitsAndBytesConfig (see the Hugging Face documentation) as follows:

Loading in 4 bits is activated through load_in_4bit.
The datatype used for quantization is specified with bnb_4bit_quant_type. Note that there are two supported quantization datatypes: fp4 (four-bit float) and nf4 (normal four-bit float). The latter is theoretically optimal for normally distributed weights, so we recommend using nf4.
The datatype used for the linear layer computations is specified with bnb_4bit_compute_dtype.
Nested quantization is activated through bnb_4bit_use_double_quant.

# Quantization Config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

Load the model and set the quantization configuration.

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    device_map="auto"
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

Dataset configuration#

We fine-tune our base model for a question-and-answer task using a small data set called mlabonne/guanaco-llama2-1k, which is a subset (1,000 samples) of the timdettmers/openassistant-guanaco data set. This data set is a human-generated, human-annotated, assistant-style conversation corpus that contains 161,443 messages in 35 different languages, annotated with 461,292 quality ratings. This results in over 10,000 fully annotated conversation trees.

# Dataset
data_name = "mlabonne/guanaco-llama2-1k"
training_data = load_dataset(data_name, split="train")

# check the data
print(training_data.shape)

# #11 is a QA sample in English
print(training_data[11])

(1000, 1)
{'text': '<s>[INST] write me a 1000 words essay about deez nuts. [/INST] The Deez Nuts meme first gained popularity in 2015 on the social media platform Vine. The video featured a young man named Rodney Bullard, who recorded himself asking people if they had heard of a particular rapper. When they responded that they had not, he would respond with the phrase "Deez Nuts" and film their reactions. The video quickly went viral, and the phrase became a popular meme. \n\nSince then, Deez Nuts has been used in a variety of contexts to interrupt conversations, derail discussions, or simply add humor to a situation. It has been used in internet memes, in popular music, and even in politics. In the 2016 US presidential election, a 15-year-old boy named Brady Olson registered as an independent candidate under the name Deez Nuts. He gained some traction in the polls and even made appearances on national news programs.\n\nThe Deez Nuts meme has had a significant impact on popular culture. It has become a recognizable catchphrase that people use to add humor to everyday conversations. The meme has also been used to satirize politics and other serious issues. For example, in 2016, a group of activists in the UK used the phrase "Deez Nuts for President" as part of a campaign to encourage young people to vote in the EU referendum. </s><s>[INST] Rewrite the essay in a more casual way. Instead of sounding proffesional, sound like a college student who is forced to write the essay but refuses to do so in the propper way. Use casual words and slang when possible. [/INST] Yo, so you want me to write a 1000-word essay about Deez Nuts? Alright, fine. So, this whole thing started on Vine back in 2015. Some dude named Rodney Bullard made a video where he would ask people if they knew a rapper, and when they said no, he would hit them with the classic line: "Deez Nuts!" People loved it, and it became a viral meme.\n\nNowadays, Deez Nuts is used for all kinds of stuff. You can throw it out there to interrupt someone or just to be funny. It\'s all over the internet, in music, and even in politics. In fact, during the 2016 US presidential election, a kid named Brady Olson registered as an independent candidate under the name Deez Nuts. He actually got some attention from the media and made appearances on TV and everything.\n\nThe impact of Deez Nuts on our culture is pretty huge. It\'s become a thing that everyone knows and uses to add some humor to their everyday conversations. Plus, people have used it to make fun of politics and serious issues too. Like, in the UK, some groups of activists used the phrase "Deez Nuts for President" to encourage young people to vote in the EU referendum.\n\nThere you have it, a thousand words about Deez Nuts in a more casual tone. Can I go back to playing video games now? </s>'}
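The sample follows the Llama 2 chat layout, where each turn is wrapped as `<s>[INST] instruction [/INST] response </s>` and turns are concatenated. If you want to prepare your own data in the same shape, a small helper along these lines could be used; `format_llama2_chat` is a hypothetical name, and the exact spacing is inferred from the sample rather than taken from an official specification.

```python
# Hypothetical helper that renders (instruction, response) turns in the layout seen in the sample.
def format_llama2_chat(turns):
    """Join turns as '<s>[INST] instruction [/INST] response </s>' segments."""
    return "".join(
        f"<s>[INST] {user} [/INST] {assistant} </s>" for user, assistant in turns
    )

example = format_llama2_chat([
    ("What is QLoRA?", "A memory-efficient fine-tuning method."),
    ("Name one benefit.", "It lowers the memory needed for fine-tuning."),
])
print(example)
```

The resulting string can be stored in a `text` column, which is the field the trainer is later pointed at through `dataset_text_field`.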

# tensorboardX is a dependency during training
!pip install tensorboardX

3. Start fine tuning#

To set your training parameters, use the following code:

# Training Params
train_params = TrainingArguments(
    output_dir="./results_modified",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=50,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
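As a quick sanity check on these settings, the number of optimizer steps can be estimated before training starts. The sketch below assumes one visible device and one training example per dataset row; the variable names are illustrative.

```python
import math

num_samples = 1000                 # rows in mlabonne/guanaco-llama2-1k
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_devices = 1                    # assumption: a single visible GCD
num_train_epochs = 1
save_steps = 50

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
checkpoints = total_steps // save_steps
print(effective_batch, steps_per_epoch, total_steps, checkpoints)
```

Under these assumptions the run takes 250 steps, with a loss value logged and a checkpoint written every 50 steps.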

Training with QLoRA configuration #

Now you can integrate LoRA into the base model and assess its additional parameters. LoRA essentially adds pairs of rank-decomposition weight matrices (called update matrices) to existing weights, and only trains the newly added weights.

from peft import get_peft_model

# LoRA Config
peft_parameters = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, peft_parameters)
model.print_trainable_parameters()

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
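The trainable-parameter count printed above can be reproduced by hand. The sketch below relies on the Llama-2-7B architecture (hidden size 4096, 32 decoder layers) and assumes that, because no `target_modules` were passed to `LoraConfig`, PEFT falls back to adapting the `q_proj` and `v_proj` attention projections; treat that default as an assumption for your installed PEFT version.

```python
hidden_size = 4096        # Llama-2-7B hidden dimension
num_layers = 32           # Llama-2-7B decoder layers
r = 8                     # LoRA rank from LoraConfig
target_modules = ["q_proj", "v_proj"]  # assumed PEFT default for Llama when none are given

# Each adapted 4096 x 4096 projection gets A (r x d_in) and B (d_out x r).
params_per_module = r * hidden_size + hidden_size * r
trainable = params_per_module * len(target_modules) * num_layers
all_params = 6_742_609_920             # "all params" from the printed output

print(trainable, f"{100 * trainable / all_params:.4f}%")
```

Doubling `r` or adding more target modules scales the trainable count linearly, which is a convenient way to budget adapter size.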

Note that LoRA adds only about 0.062% of the total parameter count, which is a tiny portion of the original model. These are the only parameters we’ll update through fine-tuning, as follows.

# Trainer with QLoRA configuration
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    peft_config=peft_parameters,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

# Training
fine_tuning.train()

The output looks like this:

[250/250 05:31, Epoch 1/1]
Step	Training Loss
50	1.557800
100	1.348100
150	1.277000
200	1.324300
250	1.347700
TrainOutput(global_step=250, training_loss=1.3709784088134767, metrics={'train_runtime': 335.085, 'train_samples_per_second': 2.984, 'train_steps_per_second': 0.746, 'total_flos': 8679674339426304.0, 'train_loss': 1.3709784088134767, 'epoch': 1.0})

# Save Model
fine_tuning.model.save_pretrained(new_model_name)

Checking memory usage during training with QLoRA #

During training, you can check memory usage by running the rocm-smi command in a terminal. The command produces output like the following, which reports memory and GPU utilization.

========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    50.0c           352.0W  1700Mhz  1600Mhz  0%   auto  560.0W  17%    100%
====================================================================================
=============================== End of ROCm SMI Log ================================
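The VRAM% column can be turned into an absolute figure once the memory capacity of the device is known. The sketch below assumes that one MI250 GCD exposes 64 GB of HBM2e (128 GB per card split across two GCDs); adjust `vram_per_gcd_gb` for other hardware.

```python
vram_per_gcd_gb = 64      # assumption: 64 GB per MI250 GCD
vram_percent = 17         # VRAM% column from the rocm-smi output

used_gb = vram_per_gcd_gb * vram_percent / 100
print(f"{used_gb:.2f} GB")
```

Under that assumption, the reading corresponds to roughly 10.88 GB in use during QLoRA training.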

To better understand QLoRA’s impact on training, the next section compares QLoRA, LoRA, and full-parameter fine-tuning quantitatively in terms of memory usage, training speed, and training loss.

4. Comparison between QLoRA, LoRA, and full-parameter fine tuning #

Our earlier blog, Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering, demonstrated fine-tuning of the Llama 2 model using both LoRA and full-parameter methods. Here we add the results obtained with QLoRA to give an overview of all three fine-tuning approaches.

Metric	Full-parameter	LoRA	QLoRA
Trainable parameters	6,738,415,616	4,194,304	4,194,304
Memory usage (GB)	128	83.2	10.88
Number of GCDs	2	2	1
Training time	3 hours	9 minutes	6 minutes
Training Loss	1.368	1.377	1.347

Memory usage:
- In the case of full-parameter fine-tuning, there are 6,738,415,616 trainable parameters, leading to significant memory consumption during the backpropagation stage of training.
- In contrast, LoRA and QLoRA introduce only 4,194,304 trainable parameters, accounting for a mere 0.062% of the trainable parameters in full-parameter fine-tuning.
- When monitoring memory usage during training, fine-tuning with LoRA uses only 65% of the memory consumed by full-parameter fine-tuning. QLoRA goes even further, reducing memory consumption to about 8.5%.
- This presents an opportunity to increase batch size and max sequence length, and to train on larger datasets, within the constraints of limited hardware resources.
Training speed:
- The results demonstrate that full-parameter fine-tuning takes hours to complete, while fine-tuning with LoRA and QLoRA concludes in minutes.
- Several factors contribute to this acceleration in training speed:
  - The smaller number of trainable parameters in LoRA translates to fewer derivative calculations and less memory needed to store and update the weights.
  - Full-parameter fine-tuning is more prone to being memory-bound, where data movement becomes a bottleneck for training. This is reflected in lower GPU utilization. Although adjusting training settings can alleviate this, it may require more resources (additional GPUs) and a smaller batch size.
Accuracy:
- In all three training runs, a notable reduction in training loss was observed, and the final training losses are closely aligned across the three fine-tuning approaches.
- In the original QLoRA paper, the authors report that the performance lost to imprecise quantization can be fully recovered through adapter fine-tuning after quantization. Our results are consistent with this observation.
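The percentages quoted in the discussion above follow directly from the comparison table. A short sketch of the arithmetic is shown below; the dictionary names are illustrative, and "3 hours" is taken as exactly 180 minutes. Keep in mind that the full-parameter and LoRA runs used two GCDs while the QLoRA run used one, so the ratios are indicative rather than a controlled benchmark.

```python
memory_gb = {"full": 128, "lora": 83.2, "qlora": 10.88}   # memory usage row of the table
minutes = {"full": 180, "lora": 9, "qlora": 6}            # "3 hours" taken as 180 minutes

memory_share = {k: 100 * v / memory_gb["full"] for k, v in memory_gb.items()}
speedup = {k: minutes["full"] / v for k, v in minutes.items()}

for name in ("lora", "qlora"):
    print(f"{name}: {memory_share[name]:.1f}% of full-parameter memory, "
          f"{speedup[name]:.0f}x shorter training time")
```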

5. Test the fine-tuned model with QLoRA #

# Reload model in FP16 and merge it with fine-tuned weights
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
from peft import LoraConfig, PeftModel
model = PeftModel.from_pretrained(base_model, new_model_name)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Now, let’s upload the model to Hugging Face, enabling us to conduct subsequent tests or share it with others. To proceed with this step, you’ll need an active Hugging Face account.

from huggingface_hub import login

# You need to use your Hugging Face Access Tokens
login("hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

# Push the model to Hugging Face. This can take a few mins depending on model size and your network speed.
model.push_to_hub(new_model_name, use_temp_dir=False)
tokenizer.push_to_hub(new_model_name, use_temp_dir=False)

Now we can test with the base model (original) and the fine-tuned model.

Test the base model#

# Generate Text using base model
query = "What do you think is the most important part of building an AI chatbot?"
text_gen = pipeline(task="text-generation", model=base_model_name, tokenizer=llama_tokenizer, max_length=200)
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])

<s>[INST] What do you think is the most important part of building an AI chatbot? [/INST]  There are several important aspects to consider when building an AI chatbot, but here are some of the most critical elements:

1. Natural Language Processing (NLP): A chatbot's ability to understand and interpret human language is crucial for effective communication. NLP is the foundation of any chatbot, and it involves training the AI model to recognize patterns in language, interpret meaning, and generate responses.
2. Conversational Flow: A chatbot's conversational flow refers to the way it interacts with users. A well-designed conversational flow should be intuitive, easy to follow, and adaptable to different user scenarios. This involves creating a dialogue flowchart that guides the conversation and ensures the chatbot responds appropriately to user inputs.
3. Domain Knowledge: A chat

Test the fine-tuned model#

# Generate Text using fine-tuned model
query = "What do you think is the most important part of building an AI chatbot?"
text_gen = pipeline(task="text-generation", model=new_model_name, tokenizer=llama_tokenizer, max_length=200)
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])

<s>[INST] What do you think is the most important part of building an AI chatbot? [/INST] The most important part of building an AI chatbot is to ensure that it is able to understand and respond to user input in a way that is both accurate and natural-sounding.

To achieve this, you will need to use a combination of natural language processing (NLP) techniques and machine learning algorithms to enable the chatbot to understand and interpret user input, and to generate appropriate responses.

Some of the key considerations when building an AI chatbot include:

1. Defining the scope and purpose of the chatbot: What kind of tasks or questions will the chatbot be able to handle? What kind of user input will it be able to understand?
2. Choosing the right NLP and machine learning algorithms: There are many different NLP and machine learning algorithms available, and the right ones will depend on the

You can now observe the outputs of the two models based on the given query. As anticipated, the two outputs exhibit slight differences due to the fine-tuning process altering the model weights.

Disclaimers#

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.
