Home
View our docs: https://docs.unsloth.ai/
pip install --upgrade --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
pip install --upgrade --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth-zoo.git

If you saved a LoRA adapter through Unsloth, you can also continue training using your LoRA weights. Note that the optimizer state will be reset. To also load the optimizer state and continue finetuning, see the next section.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "LORA_MODEL_NAME",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
trainer = Trainer(...)
trainer.train()
Add lm_head and embed_tokens. On Colab you may sometimes run out of memory for Llama-3 8B; if so, add just lm_head.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "lm_head", "embed_tokens",],
    lora_alpha = 16,
)
Then use 2 different learning rates - a 2-10x smaller one for the lm_head or embed_tokens, like so:
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    ...
    args = UnslothTrainingArguments(
        ...
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6, # 2-10x smaller than learning_rate
    ),
)
We now support training on responses / completions only! Try the following:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    ...
    args = TrainingArguments(
        ...
    ),
)

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(trainer)
You must first edit the Trainer to add save_strategy and save_steps. The example below saves a checkpoint every 50 steps to the folder outputs.
trainer = SFTTrainer(
    ...
    args = TrainingArguments(
        ...
        output_dir = "outputs",
        save_strategy = "steps",
        save_steps = 50,
    ),
)
Then in the trainer do:
trainer_stats = trainer.train(resume_from_checkpoint = True)
This will resume from the latest checkpoint and continue training.
To save to 16bit for vLLM, use:
model.save_pretrained_merged("model",tokenizer,save_method="merged_16bit",)model.push_to_hub_merged("hf/model",tokenizer,save_method="merged_16bit",token="")
To merge to 4bit to load on HuggingFace, first call merged_4bit. Then use merged_4bit_forced if you are certain you want to merge to 4bit. This is highly discouraged unless you know what you are going to do with the 4bit model (e.g. DPO training, or HuggingFace's online inference engine).
model.save_pretrained_merged("model",tokenizer,save_method="merged_4bit",)model.push_to_hub_merged("hf/model",tokenizer,save_method="merged_4bit",token="")
To save just the LoRA adapters, either use:
model.save_pretrained(...)
tokenizer.save_pretrained(...)
Or just use our builtin function to do that:
model.save_pretrained_merged("model",tokenizer,save_method="lora",)model.push_to_hub_merged("hf/model",tokenizer,save_method="lora",token="")
We save to .bin in Colab since it's around 4x faster, but set safe_serialization = None to force saving to .safetensors: model.save_pretrained(..., safe_serialization = None) or model.push_to_hub(..., safe_serialization = None).
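As a concrete sketch of the calls above (the folder and repo names are placeholders):

```python
# Force .safetensors instead of the faster .bin serialization
model.save_pretrained("lora_model", safe_serialization = None)
model.push_to_hub("hf/lora_model", safe_serialization = None, token = "")
```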
To save to GGUF, use the below to save locally:
model.save_pretrained_gguf("dir",tokenizer,quantization_method="q4_k_m")model.save_pretrained_gguf("dir",tokenizer,quantization_method="q8_0")model.save_pretrained_gguf("dir",tokenizer,quantization_method="f16")
To push to the hub:
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q8_0")
All supported quantization options for quantization_method are listed below:
# https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19
# From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html
ALLOWED_QUANTS = \
{
    "not_quantized"  : "Recommended. Fast conversion. Slow inference, big files.",
    "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
    "quantized"      : "Recommended. Slow conversion. Fast inference, small files.",
    "f32"     : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
    "f16"     : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
    "q8_0"    : "Fast conversion. High resource use, but generally acceptable.",
    "q4_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
    "q5_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
    "q2_k"    : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
    "q3_k_l"  : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
    "q3_k_m"  : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
    "q3_k_s"  : "Uses Q3_K for all tensors",
    "q4_0"    : "Original quant method, 4-bit.",
    "q4_1"    : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
    "q4_k_s"  : "Uses Q4_K for all tensors",
    "q4_k"    : "alias for q4_k_m",
    "q5_k"    : "alias for q5_k_m",
    "q5_0"    : "Higher accuracy, higher resource usage and slower inference.",
    "q5_1"    : "Even higher accuracy, resource usage and slower inference.",
    "q5_k_s"  : "Uses Q5_K for all tensors",
    "q6_k"    : "Uses Q8_K for all tensors",
    "iq2_xxs" : "2.06 bpw quantization",
    "iq2_xs"  : "2.31 bpw quantization",
    "iq3_xxs" : "3.06 bpw quantization",
    "q3_k_xs" : "3-bit extra small quantization",
}
First save your model to 16bit:
model.save_pretrained_merged("merged_model",tokenizer,save_method="merged_16bit",)
Then use the terminal and do:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
python llama.cpp/convert_hf_to_gguf.py FOLDER --outfile OUTPUT --outtype f16

Or follow the steps at https://rentry.org/llama-cpp-conversions#merging-loras-into-a-model using the model name "merged_model" to merge to GGUF.
You can try reducing the maximum GPU usage during saving by changing maximum_memory_usage.
The default is model.save_pretrained(..., maximum_memory_usage = 0.75). Reduce it to, say, 0.5 to use 50% of peak GPU memory or lower. This can reduce OOM crashes during saving.
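For example, a minimal sketch assuming the merged 16bit save path accepts the same maximum_memory_usage argument described above:

```python
# Use at most ~50% of peak GPU memory while merging and saving (default is 0.75)
model.save_pretrained_merged(
    "model", tokenizer,
    save_method = "merged_16bit",
    maximum_memory_usage = 0.5,
)
```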
First split your training dataset into a train and test split. Set the trainer settings for evaluation to:
new_dataset = dataset.train_test_split(test_size = 0.01)

trainer = SFTTrainer(
    args = TrainingArguments(
        fp16_full_eval = True,
        per_device_eval_batch_size = 2,
        eval_accumulation_steps = 4,
        eval_strategy = "steps",
        eval_steps = 1,
    ),
    train_dataset = new_dataset["train"],
    eval_dataset = new_dataset["test"],
    ...
)
This avoids OOMs and is somewhat faster, since there is no upcasting to float32.
If you want to stop the finetuning / training run when the evaluation loss stops decreasing, you can use early stopping via EarlyStoppingCallback.
As usual, set up your trainer and your evaluation dataset and procedure:
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    args = SFTConfig(
        fp16_full_eval = True,
        per_device_eval_batch_size = 2,
        eval_accumulation_steps = 4,
        output_dir = "training_checkpoints", # location of saved checkpoints for early stopping
        save_strategy = "steps",             # save model every N steps
        save_steps = 10,                     # how many steps until we save the model
        save_total_limit = 3,                # keep only 3 saved checkpoints to save disk space
        eval_strategy = "steps",             # evaluate every N steps
        eval_steps = 10,                     # how many steps until we do evaluation
        load_best_model_at_end = True,       # MUST USE for early stopping
        metric_for_best_model = "eval_loss", # metric we want to early stop on
        greater_is_better = False,           # the lower the eval loss, the better
    ),
    model = model,
    tokenizer = tokenizer,
    train_dataset = new_dataset["train"],
    eval_dataset = new_dataset["test"],
)
Then we add a callback:
from transformers import EarlyStoppingCallback

early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience = 3,    # How many steps we will wait if the eval loss doesn't decrease
                                    # For example the loss might increase, but decrease after 3 steps
    early_stopping_threshold = 0.0, # Can set higher - sets how much loss should decrease by until
                                    # we consider early stopping. For eg 0.01 means if loss was
                                    # 0.02 then 0.01, we consider to early stop the run.
)
trainer.add_callback(early_stopping_callback)
Then train the model as usual: trainer.train()
Assuming your dataset is a list of lists of dictionaries like the below:
[
    [{'from': 'human', 'value': 'Hi there!'},
     {'from': 'gpt',   'value': 'Hi how can I help?'},
     {'from': 'human', 'value': 'What is 2+2?'}],
    [{'from': 'human', 'value': "What's your name?"},
     {'from': 'gpt',   'value': "I'm Daniel!"},
     {'from': 'human', 'value': 'Ok! Nice!'},
     {'from': 'gpt',   'value': 'What can I do for you?'},
     {'from': 'human', 'value': 'Oh nothing :)'},],
]

You can use our get_chat_template to format it. Select chat_template to be any of zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth, and use mapping to map the dictionary values from, value etc. map_eos_token allows you to map <|im_end|> to EOS without any training.
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return {"text" : texts,}
pass

from datasets import load_dataset
dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
You can also make your own custom chat templates! For example, our internal chat template is shown below. You must pass in a tuple of (custom_template, eos_token) where the eos_token must be used inside the template.
unsloth_template = \
    "{{ bos_token }}"\
    "{{ 'You are a helpful assistant to the user\n' }}"\
    "{% endif %}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '>>> User: ' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '>>> Assistant: ' }}"\
    "{% endif %}"
unsloth_eos_token = "eos_token"

tokenizer = get_chat_template(
    tokenizer,
    chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)
Unsloth has a function called add_new_tokens which allows you to add new tokens to your finetune. For example, if you want to add <CHARACTER_1>, <THINKING> and <SCRATCH_PAD>, we can do the following:
model, tokenizer = FastLanguageModel.from_pretrained(...)

from unsloth import add_new_tokens
add_new_tokens(model, tokenizer, new_tokens = ["<CHARACTER_1>", "<THINKING>", "<SCRATCH_PAD>"])

model = FastLanguageModel.get_peft_model(...)
Note - you MUST always call add_new_tokens before FastLanguageModel.get_peft_model!
Unsloth natively supports 2x faster inference. All QLoRA, LoRA and non-LoRA inference paths are 2x faster. This requires no code changes or any new dependencies.
from unsloth import FastLanguageModel
from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer(["Your prompt here"], return_tensors = "pt").to("cuda") # tokenize your prompt
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
See https://github.com/googlecolab/colabtools/issues/3409
In a new cell, run the below:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
This section was authored by sebdg. It explains how parameters affect the finetuning process.
Adjusting the LoraConfig parameters allows you to balance model performance and computational efficiency in Low-Rank Adaptation (LoRA). Here’s a concise breakdown of key parameters:
r
- Description: Rank of the low-rank decomposition for factorizing weight matrices.
- Impact:
- Higher: Retains more information, increases computational load.
- Lower: Fewer parameters, more efficient training, potential performance drop if too small.
lora_alpha
- Description: Scaling factor for the low-rank matrices' contribution.
- Impact:
- Higher: Increases influence, speeds up convergence, risks instability or overfitting.
- Lower: Subtler effect, may require more training steps.
lora_dropout
- Description: Probability of zeroing out elements in low-rank matrices for regularization.
- Impact:
- Higher: More regularization, prevents overfitting, may slow training and degrade performance.
- Lower: Less regularization, may speed up training, risks overfitting.
loftq_config
- Description: Configuration for LoftQ, a quantization method for the backbone weights and initialization of LoRA layers.
- Impact:
- Not None: If specified, LoftQ will quantize the backbone weights and initialize the LoRA layers. It requires setting init_lora_weights='loftq'.
- None: LoftQ quantization is not applied.
- Note: Do not pass an already quantized model when using LoftQ, as LoftQ handles the quantization process itself.
use_rslora
- Description: Enables Rank-Stabilized LoRA (RSLora).
- Impact:
- True: Uses Rank-Stabilized LoRA, setting the adapter scaling factor to lora_alpha/math.sqrt(r), which has been proven to work better as per the Rank-Stabilized LoRA paper.
- False: Uses the original default scaling factor lora_alpha/r.
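To see how these LoRA settings fit together in practice, here is a minimal sketch of an Unsloth get_peft_model call; the values are illustrative defaults, not recommendations:

```python
from unsloth import FastLanguageModel

# Illustrative values only - tune r, lora_alpha, lora_dropout etc. for your task
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,              # rank of the low-rank decomposition
    lora_alpha = 16,     # scaling factor for the LoRA update
    lora_dropout = 0,    # dropout applied to the LoRA matrices
    use_rslora = False,  # True rescales by lora_alpha / sqrt(r) instead of lora_alpha / r
    loftq_config = None, # pass a LoftQ config to quantize the backbone and init the adapters
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```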
gradient_accumulation_steps
- Default: 1
- Description: The number of steps to accumulate gradients over before performing a weight update.
- Impact:
- Higher: Accumulate gradients over multiple steps, effectively increasing the batch size without requiring additional memory. This can improve training stability and convergence, especially with large models and limited hardware.
- Lower: Faster updates but may require more memory per step and can be less stable.
weight_decay
- Default: 0.01
- Description: Regularization technique that applies a small penalty to the weights during training.
- Impact:
- Non-zero Value (e.g., 0.01): Adds a penalty proportional to the magnitude of the weights to the loss function, helping to prevent overfitting by discouraging large weights.
- Zero: No weight decay is applied, which can lead to overfitting, especially in large models or with small datasets.
learning_rate
- Default: 2e-4
- Description: The rate at which the model updates its parameters during training.
- Impact:
- Higher: Faster convergence but risks overshooting optimal parameters and causing instability in training.
- Lower: More stable and precise updates but may slow down convergence, requiring more training steps to achieve good performance.
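Putting the training hyperparameters above together, here is a minimal sketch of a TrainingArguments setup (the batch size and accumulation values are illustrative; the rest are the defaults listed above):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size = 2,  # illustrative
    gradient_accumulation_steps = 4,  # effective batch size = 2 * 4 = 8, with no extra memory
    learning_rate = 2e-4,             # default above; lower is more stable but converges slower
    weight_decay = 0.01,              # default above; small penalty on weights against overfitting
    output_dir = "outputs",
)
```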
q_proj (query projection)
- Description: Part of the attention mechanism in transformer models, responsible for projecting the input into the query space.
- Impact: Transforms the input into query vectors that are used to compute attention scores.
k_proj (key projection)
- Description: Projects the input into the key space in the attention mechanism.
- Impact: Produces key vectors that are compared with query vectors to determine attention weights.
v_proj (value projection)
- Description: Projects the input into the value space in the attention mechanism.
- Impact: Produces value vectors that are weighted by the attention scores and combined to form the output.
o_proj (output projection)
- Description: Projects the output of the attention mechanism back into the original space.
- Impact: Transforms the combined weighted value vectors back to the input dimension, integrating attention results into the model.
gate_proj (gate projection)
- Description: Typically used in gated mechanisms within neural networks, such as gating units in gated recurrent units (GRUs) or other gating mechanisms.
- Impact: Controls the flow of information through the gate, allowing selective information passage based on learned weights.
up_proj (up projection)
- Description: Used for up-projection, typically increasing the dimensionality of the input.
- Impact: Expands the input to a higher-dimensional space, often used in feedforward layers or when transitioning between different layers with differing dimensionalities.
down_proj (down projection)
- Description: Used for down-projection, typically reducing the dimensionality of the input.
- Impact: Compresses the input to a lower-dimensional space, useful for reducing computational complexity and controlling the model size.
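If you want to check which of these projection modules actually exist in your loaded model before choosing target_modules, here is a small sketch (assuming model is already loaded via FastLanguageModel.from_pretrained):

```python
# Print every attention / MLP projection module described above
proj_names = ("q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj")
for name, module in model.named_modules():
    if name.endswith(proj_names):
        print(name, type(module).__name__)
```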
Read this 3-step guide, which details how to use llama.cpp to convert an Unsloth LoRA adapter to GGML (.bin) and use it in Ollama: https://medium.com/p/edadb6d9e0f0
This article was written by Sarin Suriyakoon.
This guide explains how to take the fine-tuned model we trained with Unsloth in a Google Colab training notebook and run it locally via the Ollama CLI.
This Ollama guide was written by Jed Tiotuico.
To successfully run the fine-tuned model, we need:
- Hugging Face account
- A base Unsloth model - for this guide, we have chosen unsloth/tinyllama as the base model
- A basic understanding of the Unsloth FastLanguageModel, in particular fine-tuning unsloth/tinyllama. We recommend the Unsloth Google Colab training notebooks on Hugging Face for more information on the training data
- The LoRA adapters that were saved online via the Hugging Face Hub
- A working local Ollama installation: as of writing, we used 0.1.32, but later versions should also work.
ollama --version
ollama version is 0.1.32
To recall, we provided some training code using the Unsloth FastLanguageModel. Note that we can log in to Hugging Face on Google Colab by setting our API token as a secret labeled "HF_TOKEN".
import os
from google.colab import userdata
hf_token = userdata.get("HF_TOKEN")
os.environ['HF_TOKEN'] = hf_token

We then run the CLI command below to log in:
!huggingface-cli login --token $HF_TOKEN

To check our token is working, run:
!huggingface-cli whoami

Below is sample training code from the Unsloth notebook:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/tinyllama", # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

Moreover, we used the training code below. We provided dataset and eval_dataset for our training data, which had only one text column.
from trl import SFTTrainer
from transformers import TrainingArguments
from transformers.utils import logging
logging.set_verbosity_info()

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = eval_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True, # Packs short sequences together to save time!
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 2,
        learning_rate = 2e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.1,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
trainer_stats = trainer.train()

Then, we should be able to run our inference, as shown below.
FastLanguageModel.for_inference(model)

inputs = tokenizer(["""<s>Q:What is the capital of France?A:"""], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 1000, use_cache = True)
print(tokenizer.batch_decode(outputs))

Lastly, below, we demonstrate how to save the model online via Hugging Face.
model.push_to_hub_merged("myhfusername/my-model", tokenizer, save_method = "lora")

Part of this guide is adapted from https://rentry.org/llama-cpp-conversions#setup
Clone the llama.cpp repository using
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

llama.cpp has Python scripts that we need to run, so we need to pip install its dependencies:
pip install -r requirements.txt
Now, let us build our local llama.cpp
make clean && make all -j
For anyone with NVIDIA GPUs:

make clean && LLAMA_CUDA=1 make all -j
2. Clone our Hugging Face base model and the LoRA adapters we uploaded to the Hugging Face Hub earlier using the push_to_hub_merged() function
From the llama.cpp folder let us clone our base model.
git clone https://huggingface.co/unsloth/tinyllama

Next, we clone our LoRA model:
git clone https://huggingface.co/myhfusername/my-model

We now need to convert both the base model and the LoRA adapters:
python convert.py tinyllama --outtype f16 --outfile tinyllama.f16.gguf
python convert-lora-to-ggml.py my-model

If the conversion succeeds, the last line of our output should be:
Converted my-model/adapter_config.json and my-model/adapter_model.safetensors to my-model/ggml-adapter-model.bin

Next, we use export-lora to merge the adapter into the base model, where:
- --model-base is the base gguf model
- --model-out is the new merged gguf model
- --lora is the adapter model
export-lora --model-base tinyllama.f16.gguf --model-out tinyllama-my-model.gguf --lora my-model/ggml-adapter-model.bin

Lastly, we quantize the merged model:
quantize tinyllama-my-model.gguf tinyllama-my-model.Q8_0.gguf Q8_0

Next, create a Modelfile for Ollama:

FROM tinyllama-my-model.gguf

### Set the system message
SYSTEM """You are a super helpful helper."""

PARAMETER stop <s>
PARAMETER stop </s>

Then create the Ollama model:

ollama create my-model -f Modelfile
ollama run my-model "<s>\nQ: \nWhat is the capital of France?\nA:\n"
This section covers support for the NVIDIA Pascal family of cards, specifically the P40 and P100.
Create three files (Dockerfile, unsloth_env_file.yml, and docker-compose.yml) with the contents provided below.
Ensure Docker and Docker Compose are installed on your system.
Install the NVIDIA Container Toolkit for GPU support if not already done.
Place all three files in the same directory.
Open a terminal and navigate to the directory containing these files.
Run the following command to build and start the container:
docker-compose up --build

Once the container is running, access Jupyter Lab by opening a web browser and navigating to http://localhost:8888.
# Stage 1: Base image with system dependencies
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 as base

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    vim \
    curl \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Install Miniconda only if it's not already installed
RUN if [ ! -d "/opt/conda" ]; then \
        wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh && \
        bash miniconda.sh -b -p /opt/conda && \
        rm miniconda.sh; \
    fi

# Set path to conda
ENV PATH /opt/conda/bin:$PATH

# Stage 2: Python environment setup
FROM base as python-env
COPY unsloth_env_file.yml unsloth_env_file.yml
RUN conda env create -f unsloth_env_file.yml
SHELL ["conda", "run", "-n", "unsloth_env", "/bin/bash", "-c"]

# Stage 3: Final image
FROM python-env as final

# Install Unsloth (This step is separate because it's likely to change more frequently)
RUN pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

ENV PATH /usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH /usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Set the working directory
WORKDIR /workspace

# Set the default command to run Jupyter Lab
CMD ["conda", "run", "--no-capture-output", "-n", "unsloth_env", "jupyter", "lab", "--ip=0.0.0.0", "--no-browser", "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''"]
name:unsloth_envchannels: -xformers -pytorch -nvidia -conda-forge -defaultsdependencies: -_libgcc_mutex=0.1=conda_forge -_openmp_mutex=4.5=2_gnu -aiohttp=3.9.5=py310h5eee18b_0 -aiosignal=1.2.0=pyhd3eb1b0_0 -anyio=4.2.0=py310h06a4308_0 -argon2-cffi=21.3.0=pyhd3eb1b0_0 -argon2-cffi-bindings=21.2.0=py310h7f8727e_0 -arrow-cpp=16.1.0=hc1eb8f0_0 -async-lru=2.0.4=pyhd8ed1ab_0 -async-timeout=4.0.3=py310h06a4308_0 -attrs=23.1.0=py310h06a4308_0 -aws-c-auth=0.6.19=h5eee18b_0 -aws-c-cal=0.5.20=hdbd6064_0 -aws-c-common=0.8.5=h5eee18b_0 -aws-c-compression=0.2.16=h5eee18b_0 -aws-c-event-stream=0.2.15=h6a678d5_0 -aws-c-http=0.6.25=h5eee18b_0 -aws-c-io=0.13.10=h5eee18b_0 -aws-c-mqtt=0.7.13=h5eee18b_0 -aws-c-s3=0.1.51=hdbd6064_0 -aws-c-sdkutils=0.1.6=h5eee18b_0 -aws-checksums=0.1.13=h5eee18b_0 -aws-crt-cpp=0.18.16=h6a678d5_0 -aws-sdk-cpp=1.10.55=h721c034_0 -babel=2.14.0=pyhd8ed1ab_0 -beautifulsoup4=4.12.3=py310h06a4308_0 -blas=1.0=mkl -bleach=4.1.0=pyhd3eb1b0_0 -boost-cpp=1.82.0=hdb19cb5_2 -bottleneck=1.3.7=py310ha9d4c09_0 -brotli-python=1.0.9=py310h6a678d5_8 -bzip2=1.0.8=h5eee18b_6 -c-ares=1.19.1=h5eee18b_0 -ca-certificates=2024.7.4=hbcca054_0 -certifi=2024.7.4=pyhd8ed1ab_0 -cffi=1.16.0=py310h5eee18b_1 -charset-normalizer=3.3.2=pyhd3eb1b0_0 -cuda-cudart=11.8.89=0 -cuda-cupti=11.8.87=0 -cuda-libraries=11.8.0=0 -cuda-nvrtc=11.8.89=0 -cuda-nvtx=11.8.86=0 -cuda-runtime=11.8.0=0 -cuda-version=11.8=hcce14f8_3 -cudatoolkit=11.8.0=h6a678d5_0 -datasets=2.19.1=py310h06a4308_0 -debugpy=1.6.7=py310h6a678d5_0 -decorator=5.1.1=pyhd3eb1b0_0 -defusedxml=0.7.1=pyhd3eb1b0_0 -dill=0.3.8=py310h06a4308_0 -entrypoints=0.4=py310h06a4308_0 -ffmpeg=4.3=hf484d3e_0 -filelock=3.13.1=py310h06a4308_0 -freetype=2.12.1=h4a9f257_0 -frozenlist=1.4.0=py310h5eee18b_0 -fsspec=2024.3.1=py310h06a4308_0 -gflags=2.2.2=h6a678d5_1 -glog=0.5.0=h6a678d5_1 -gmp=6.2.1=h295c915_3 -gmpy2=2.1.2=py310heeb90bb_0 -gnutls=3.6.15=he1e5248_0 -h11=0.14.0=pyhd8ed1ab_0 -h2=4.1.0=pyhd8ed1ab_0 -hpack=4.0.0=pyh9f0ad1d_0 -httpcore=1.0.5=pyhd8ed1ab_0 -httpx=0.27.0=pyhd8ed1ab_0 -hyperframe=6.0.1=pyhd8ed1ab_0 -icu=73.1=h6a678d5_0 -idna=3.7=py310h06a4308_0 -importlib-metadata=7.0.1=py310h06a4308_0 -importlib_metadata=7.0.1=hd8ed1ab_0 -importlib_resources=6.4.0=pyhd8ed1ab_0 -intel-openmp=2023.1.0=hdb19cb5_46306 -ipykernel=6.28.0=py310h06a4308_0 -ipython_genutils=0.2.0=pyhd3eb1b0_1 -jedi=0.19.1=py310h06a4308_0 -jinja2=3.1.4=py310h06a4308_0 -jpeg=9e=h5eee18b_2 -json5=0.9.25=pyhd8ed1ab_0 -jsonschema=4.19.2=py310h06a4308_0 -jsonschema-specifications=2023.7.1=py310h06a4308_0 -jupyter-lsp=2.2.5=pyhd8ed1ab_0 -jupyter_client=7.4.9=py310h06a4308_0 -jupyter_core=5.7.2=py310h06a4308_0 -jupyter_events=0.10.0=py310h06a4308_0 -jupyter_server=2.14.1=py310h06a4308_0 -jupyter_server_terminals=0.4.4=py310h06a4308_1 -jupyterlab=4.2.4=pyhd8ed1ab_0 -jupyterlab_pygments=0.3.0=pyhd8ed1ab_1 -jupyterlab_server=2.27.3=pyhd8ed1ab_0 -krb5=1.20.1=h143b758_1 -lame=3.100=h7b6447c_0 -lcms2=2.12=h3be6417_0 -ld_impl_linux-64=2.38=h1181459_1 -lerc=3.0=h295c915_0 -libabseil=20240116.2=cxx17_h6a678d5_0 -libboost=1.82.0=h109eef0_2 -libbrotlicommon=1.0.9=h5eee18b_8 -libbrotlidec=1.0.9=h5eee18b_8 -libbrotlienc=1.0.9=h5eee18b_8 -libcublas=11.11.3.6=0 -libcufft=10.9.0.58=0 -libcufile=1.9.1.3=0 -libcurand=10.3.5.147=0 -libcurl=8.7.1=h251f7ec_0 -libcusolver=11.4.1.48=0 -libcusparse=11.7.5.86=0 -libdeflate=1.17=h5eee18b_1 -libedit=3.1.20230828=h5eee18b_0 -libev=4.33=h7f8727e_1 -libevent=2.1.12=hdbd6064_1 -libffi=3.4.4=h6a678d5_1 -libgcc-ng=14.1.0=h77fa898_0 -libgomp=14.1.0=h77fa898_0 -libgrpc=1.62.2=h2d74bed_0 
-libiconv=1.16=h5eee18b_3 -libidn2=2.3.4=h5eee18b_0 -libjpeg-turbo=2.0.0=h9bf148f_0 -libnghttp2=1.57.0=h2d74bed_0 -libnpp=11.8.0.86=0 -libnvjpeg=11.9.0.86=0 -libpng=1.6.39=h5eee18b_0 -libprotobuf=4.25.3=he621ea3_0 -libsodium=1.0.18=h7b6447c_0 -libssh2=1.11.0=h251f7ec_0 -libstdcxx-ng=11.2.0=h1234567_1 -libtasn1=4.19.0=h5eee18b_0 -libthrift=0.15.0=h1795dd8_2 -libtiff=4.5.1=h6a678d5_0 -libunistring=0.9.10=h27cfd23_0 -libuuid=1.41.5=h5eee18b_0 -libwebp-base=1.3.2=h5eee18b_0 -llvm-openmp=14.0.6=h9e868ea_0 -lz4-c=1.9.4=h6a678d5_1 -markupsafe=2.1.3=py310h5eee18b_0 -mistune=2.0.4=py310h06a4308_0 -mkl=2023.1.0=h213fc3f_46344 -mkl-service=2.4.0=py310h5eee18b_1 -mkl_fft=1.3.8=py310h5eee18b_0 -mkl_random=1.2.4=py310hdb19cb5_0 -mpc=1.1.0=h10f8cd9_1 -mpfr=4.0.2=hb69a4c5_1 -mpmath=1.3.0=py310h06a4308_0 -multidict=6.0.4=py310h5eee18b_0 -multiprocess=0.70.15=py310h06a4308_0 -nb_conda_kernels=2.3.1=py310h06a4308_0 -nbclassic=1.1.0=py310h06a4308_0 -nbclient=0.8.0=py310h06a4308_0 -nbconvert=7.10.0=py310h06a4308_0 -nbformat=5.9.2=py310h06a4308_0 -ncurses=6.4=h6a678d5_0 -nest-asyncio=1.6.0=py310h06a4308_0 -nettle=3.7.3=hbbd107a_1 -networkx=3.3=py310h06a4308_0 -notebook=6.5.7=py310h06a4308_0 -notebook-shim=0.2.3=py310h06a4308_0 -numexpr=2.8.7=py310h85018f9_0 -numpy=1.26.4=py310h5f9d8c6_0 -numpy-base=1.26.4=py310hb5e798b_0 -openh264=2.1.1=h4ff587b_0 -openjpeg=2.4.0=h9ca470c_2 -openssl=3.3.1=h4bc722e_2 -orc=2.0.1=h2d29ad5_0 -overrides=7.4.0=py310h06a4308_0 -packaging=24.1=py310h06a4308_0 -pandas=2.2.2=py310h6a678d5_0 -pandocfilters=1.5.0=pyhd3eb1b0_0 -pillow=10.4.0=py310h5eee18b_0 -pip=24.0=py310h06a4308_0 -platformdirs=3.10.0=py310h06a4308_0 -prometheus_client=0.14.1=py310h06a4308_0 -prompt_toolkit=3.0.43=hd3eb1b0_0 -psutil=5.9.0=py310h5eee18b_0 -ptyprocess=0.7.0=pyhd3eb1b0_2 -pure_eval=0.2.2=pyhd3eb1b0_0 -pyarrow=16.1.0=py310h1128e8f_0 -pycparser=2.21=pyhd3eb1b0_0 -pysocks=1.7.1=py310h06a4308_0 -python=3.10.14=h955ad1f_1 -python-dateutil=2.9.0post0=py310h06a4308_2 -python-fastjsonschema=2.16.2=py310h06a4308_0 -python-json-logger=2.0.7=py310h06a4308_0 -python-tzdata=2023.3=pyhd3eb1b0_0 -python-xxhash=2.0.2=py310h5eee18b_1 -pytorch=2.1.0=py3.10_cuda11.8_cudnn8.7.0_0 -pytorch-cuda=11.8=h7e8668a_5 -pytorch-mutex=1.0=cuda -pytz=2024.1=py310h06a4308_0 -pyyaml=6.0.1=py310h5eee18b_0 -pyzmq=24.0.1=py310h5eee18b_0 -re2=2022.04.01=h295c915_0 -readline=8.2=h5eee18b_0 -referencing=0.30.2=py310h06a4308_0 -regex=2023.10.3=py310h5eee18b_0 -requests=2.32.3=py310h06a4308_0 -rfc3339-validator=0.1.4=py310h06a4308_0 -rfc3986-validator=0.1.1=py310h06a4308_0 -rpds-py=0.10.6=py310hb02cf49_0 -s2n=1.3.27=hdbd6064_0 -safetensors=0.4.2=py310ha89cbab_1 -send2trash=1.8.2=py310h06a4308_0 -setuptools=69.5.1=py310h06a4308_0 -six=1.16.0=pyhd3eb1b0_1 -snappy=1.1.10=h6a678d5_1 -sniffio=1.3.0=py310h06a4308_0 -soupsieve=2.5=py310h06a4308_0 -sqlite=3.45.3=h5eee18b_0 -stack_data=0.2.0=pyhd3eb1b0_0 -sympy=1.12=py310h06a4308_0 -tbb=2021.8.0=hdb19cb5_0 -terminado=0.17.1=py310h06a4308_0 -tinycss2=1.2.1=py310h06a4308_0 -tk=8.6.14=h39e8969_0 -tokenizers=0.19.1=py310hff361bb_0 -tomli=2.0.1=pyhd8ed1ab_0 -torchaudio=2.1.0=py310_cu118 -torchtriton=2.1.0=py310 -torchvision=0.16.0=py310_cu118 -tornado=6.4.1=py310h5eee18b_0 -tqdm=4.66.4=py310h2f386ee_0 -traitlets=5.14.3=py310h06a4308_0 -typing-extensions=4.11.0=py310h06a4308_0 -typing_extensions=4.11.0=py310h06a4308_0 -tzdata=2024a=h04d1e81_0 -urllib3=2.2.2=py310h06a4308_0 -utf8proc=2.6.1=h5eee18b_1 -webencodings=0.5.1=py310h06a4308_1 -websocket-client=1.8.0=py310h06a4308_0 -wheel=0.43.0=py310h06a4308_0 
-xformers=0.0.22.post7=py310_cu11.8.0_pyt2.1.0 -xxhash=0.8.0=h7f8727e_3 -xz=5.4.6=h5eee18b_1 -yaml=0.2.5=h7b6447c_0 -yarl=1.9.3=py310h5eee18b_0 -zeromq=4.3.5=h6a678d5_0 -zipp=3.17.0=py310h06a4308_0 -zlib=1.2.13=h5eee18b_1 -zstd=1.5.5=hc292b87_2 -pip: -accelerate==0.33.0 -asttokens==2.4.1 -bitsandbytes==0.43.2 -comm==0.2.2 -docstring-parser==0.16 -exceptiongroup==1.2.2 -executing==2.0.1 -gguf==0.9.1 -hf-transfer==0.1.8 -huggingface-hub==0.24.2 -iprogress==0.4 -ipython==8.26.0 -ipywidgets==8.1.3 -jupyterlab-widgets==3.0.11 -markdown-it-py==3.0.0 -matplotlib-inline==0.1.7 -mdurl==0.1.2 -parso==0.8.4 -peft==0.12.0 -pexpect==4.9.0 -prompt-toolkit==3.0.47 -protobuf==3.20.3 -pure-eval==0.2.3 -pygments==2.18.0 -rich==13.7.1 -sentencepiece==0.2.0 -shtab==1.7.1 -stack-data==0.6.3 -transformers==4.43.3 -trl==0.8.6 -tyro==0.8.5 -wcwidth==0.2.13 -widgetsnbextension==4.0.11
version: '3.8'

services:
  unsloth-env:
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./cache:/root/.cache
      - ./workspace:/workspace
    working_dir: /workspace
    ports:
      - "8888:8888" # For Jupyter Lab
    tty: true
    stdin_open: true
    build:
      context: .
      dockerfile: Dockerfile
