llm-jp/llm-jp-sft
This repository contains the code for supervised fine-tuning of LLM-jp models.

Requirements:
- Python: 3.10.12
- torch>=2.0.0 (must match your CUDA version)
- transformers>=4.34.0
- tokenizers>=0.14.0
- accelerate>=0.23.0
- trl>=0.7.2
- peft>=0.5.0
Install the necessary packages using `pip`:

```bash
pip install -r requirements.txt
```
To enable the `use_flash_attention_2` option, install FlashAttention as well:

```bash
pip install flash-attn --no-build-isolation
```
A sample dataset is provided in `data/`. A training example is structured as follows:

```json
{"text":"以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n日本で一番高い山は?\n\n### 応答:\n富士山"}
```

During training, the loss is computed only on the tokens that follow the "### 応答:" segment. For the example above, the loss is based on "富士山".
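This kind of response-only loss masking is commonly implemented with a completion-only data collator. The snippet below is a rough sketch using `trl`'s `DataCollatorForCompletionOnlyLM`, not necessarily identical to what `train.py` does internally; depending on the tokenizer, the response template may need to be passed as token ids rather than a string.

```python
# Rough sketch of response-only loss masking (illustrative, not a copy of train.py).
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-1.3b-v1.0")

# Tokens up to and including the response template are assigned the ignore index
# (-100), so only the response ("富士山") contributes to the loss.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### 応答:\n",
    tokenizer=tokenizer,
)

example = tokenizer(
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
    "### 指示:\n日本で一番高い山は?\n\n### 応答:\n富士山"
)
batch = collator([example])
print(batch["labels"])  # -100 everywhere except the response tokens
```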
Here is the command to train a model on the sample dataset.
```bash
python train.py \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --data_files data/example.jsonl \
    --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
    --output_dir results/
```
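After training, the fine-tuned model can be prompted with the same template it was trained on. The following is a minimal generation sketch using the standard `transformers` API; the output path and decoding settings are illustrative, not taken from the repository.

```python
# Minimal inference sketch (paths and decoding settings are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "results/"  # the --output_dir used during training
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

# Use the same prompt template as the training data, ending at "### 応答:\n".
prompt = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
    "### 指示:\n日本で一番高い山は?\n\n### 応答:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Strip the prompt tokens and decode only the generated response.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```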
We used the following datasets for fine-tuning.
- Jaster: A collection of data automatically converted from existing Japanese NLP datasets.
- Dolly: A Japanese translation of Dolly.
- OpenAssistant: A Japanese translation of the OpenAssistant Conversations Dataset.
NOTE: The datasets mentioned above are not public as of now. We're in the process of making them accessible. Stay tuned for updates.
For full-parameter fine-tuning of `llm-jp/llm-jp-1.3b-v1.0` on multiple GPUs, launch training through `accelerate` with the ZeRO-1 config:

```bash
accelerate launch --config_file configs/accelerate_config_zero1.yaml \
    train.py \
    --num_train_epochs 2 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --logging_steps 1 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
    --output_dir results/llm-jp-1.3b-instruct-full-jaster-dolly-oasst-v1.0
```
For `llm-jp/llm-jp-13b-v1.0`, use the ZeRO-3 config together with gradient checkpointing:

```bash
accelerate launch --config_file configs/accelerate_config_zero3.yaml \
    train.py \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --gradient_checkpointing \
    --logging_steps 1 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
    --output_dir results/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0
```
For multi-node training (eight nodes in this example), run the following command on every node. `$machine_rank` is a sequential number from 0 to 7 assigned to each node, and `$main_process_ip` is the IP address of the node with `$machine_rank=0`.
```bash
accelerate launch --config_file configs/accelerate_config_zero2.8node.yaml \
    --main_process_ip $main_process_ip \
    --main_process_port 29500 \
    --machine_rank $machine_rank \
    train.py \
    --num_train_epochs 2 \
    --per_device_train_batch_size 3 \
    --gradient_accumulation_steps 6 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --logging_steps 1 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
    --output_dir results/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0
```
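For illustration, one way to set the two variables on each node before launching; the IP address below is a placeholder, not a value from the repository:

```bash
# Hypothetical values for an 8-node run; 192.0.2.10 stands in for the rank-0 node's IP.
export main_process_ip=192.0.2.10
export machine_rank=0        # use 1..7 on the remaining nodes
# ...then run the accelerate launch command above on this node.
```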
To fine-tune `llm-jp/llm-jp-1.3b-v1.0` with LoRA (`--use_peft`) on a single GPU:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --use_peft \
    --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
    --output_dir results/llm-jp-1.3b-instruct-lora-jaster-dolly-oasst-v1.0
```
For `llm-jp/llm-jp-13b-v1.0` on a single GPU, add gradient checkpointing:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --gradient_checkpointing \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --use_peft \
    --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
    --output_dir results/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0
```
LoRA fine-tuning can also be run on multiple GPUs through `accelerate`:

```bash
accelerate launch --config_file configs/accelerate_config_zero1.yaml \
    train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --use_peft \
    --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
    --output_dir results/llm-jp-1.3b-instruct-lora-jaster-dolly-oasst-v1.0
```
And for `llm-jp/llm-jp-13b-v1.0`:

```bash
accelerate launch --config_file configs/accelerate_config_zero1.yaml \
    train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --use_peft \
    --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
    --output_dir results/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0
```
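With `--use_peft`, training typically saves a LoRA adapter rather than full model weights. The following is a minimal sketch for loading (and optionally merging) such an adapter with `peft`, assuming the adapter was written to the `--output_dir` shown above:

```python
# Minimal sketch: load a LoRA adapter produced with --use_peft (paths are illustrative).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "llm-jp/llm-jp-13b-v1.0"
adapter = "results/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

# Optionally fold the adapter into the base weights for standalone inference.
model = model.merge_and_unload()
```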
The `use_flash_attention_2` option in transformers v4.36 only supports models based on Llama and Falcon.
Example: LoRA fine-tuning of `llm-jp/llm-jp-7b` with FlashAttention-2 enabled:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --gradient_checkpointing \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --use_flash_attention_2 True \
    --use_peft \
    --model_name_or_path llm-jp/llm-jp-7b \
    --output_dir results/llm-jp-7b-instruct-lora-jaster-dolly-oasst-v1.0
```
To quantize a model with GPTQ, use the converter script:

```bash
python converter/gptq_converter.py \
    --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
    --dataset ptb \
    --output_dir results/llm-jp-13b-v1.0-gptq
```
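One possible way to load the quantized checkpoint afterwards is through the standard `transformers` API, assuming the converter writes a transformers-compatible GPTQ checkpoint and that `auto-gptq` and `optimum` are installed (neither is listed in the requirements above):

```python
# Sketch: load a GPTQ-quantized checkpoint (requires auto-gptq and optimum; illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_dir = "results/llm-jp-13b-v1.0-gptq"
tokenizer = AutoTokenizer.from_pretrained(quantized_dir)
model = AutoModelForCausalLM.from_pretrained(quantized_dir, device_map="auto")
```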