llm-jp/llm-jp-sft


This repository contains the code for supervised fine-tuning of LLM-jp models.

Requirements

Installation

Install the necessary packages using pip:

pip install -r requirements.txt

To enable the use_flash_attention_2 option, also install flash-attn:

pip install flash-attn --no-build-isolation

Dataset Preparation

A sample dataset is provided in data/. A training example is structured as follows:

{"text":"以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n日本で一番高い山は?\n\n### 応答:\n富士山"}

During training, the loss is computed only on the tokens that follow the "### 応答:" (response) marker. For the example above, the loss is computed on "富士山".
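
train.py handles this masking internally. As a rough illustration of the idea (a minimal sketch, not necessarily the exact implementation in this repository), the labels for the prompt portion can be set to -100 so that only the response tokens contribute to the cross-entropy loss:

# Minimal sketch of response-only loss masking; train.py may implement this
# differently (e.g. via a data collator). The prompt text is the sample above.
# Tokenizing prompt and response separately is a simplification.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-1.3b-v1.0")

prompt = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
    "### 指示:\n日本で一番高い山は?\n\n### 応答:\n"
)
response = "富士山"

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + response_ids
# -100 is the ignore index of the cross-entropy loss, so the prompt tokens
# are excluded and only the response ("富士山") is learned.
labels = [-100] * len(prompt_ids) + response_ids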

Training

Here is the command to train a model on the sample dataset:

python train.py \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --data_files data/example.jsonl \
    --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
    --output_dir results/
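
If you want to fine-tune on your own instruction–response pairs, they first need to be written in the same single-field JSONL format. A small helper like the following (hypothetical, not part of this repository) produces a file that can be passed to --data_files:

# Hypothetical helper: write instruction/response pairs in the prompt format
# shown in "Dataset Preparation" so the file can be passed to --data_files.
import json

PROMPT = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
    "### 指示:\n{instruction}\n\n### 応答:\n{response}"
)

pairs = [
    {"instruction": "日本で一番高い山は?", "response": "富士山"},
]

with open("data/my_dataset.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps({"text": PROMPT.format(**pair)}, ensure_ascii=False) + "\n")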

To Reproduce LLM-jp Models

Datasets

We used the following datasets for fine-tuning.

  • Jaster: A collection of data automatically transformed from existing Japanese NLP datasets.
  • Dolly: A Japanese translation of Dolly.
  • OpenAssistant: A Japanese translation of the OpenAssistant Conversations Dataset.

NOTE: The datasets mentioned above are not public as of now. We're in the process of making them accessible. Stay tuned for updates.

Full Parameter Supervised Fine-tuning

For the 1.3B model (single node; 8 A100 40GB GPUs)

accelerate launch --config_file configs/accelerate_config_zero1.yaml \
    train.py \
    --num_train_epochs 2 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --logging_steps 1 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
    --output_dir results/llm-jp-1.3b-instruct-full-jaster-dolly-oasst-v1.0

For the 13B model (single node; 8 A100 40GB GPUs)

accelerate launch --config_file configs/accelerate_config_zero3.yaml \
    train.py \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --gradient_checkpointing \
    --logging_steps 1 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
    --output_dir results/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0

For the 13B model (8 nodes; 64 A100 40GB GPUs)

Run the following command on all nodes. ($machine_rank is a sequential number from 0 to 7 assigned to each node, and $main_process_ip is the IP address of the node with $machine_rank=0.)

accelerate launch --config_file configs/accelerate_config_zero2.8node.yaml \
    --main_process_ip $main_process_ip \
    --main_process_port 29500 \
    --machine_rank $machine_rank \
    train.py \
    --num_train_epochs 2 \
    --per_device_train_batch_size 3 \
    --gradient_accumulation_steps 6 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --logging_steps 1 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
    --output_dir results/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0
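
For reference, the effective (global) batch size of each command above is the product of the per-device batch size, the gradient accumulation steps, and the total number of GPUs:

# Effective (global) batch size for the three full fine-tuning commands above.
def global_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    return per_device * grad_accum * num_gpus

print(global_batch_size(8, 8, 8))   # 1.3B, single node (8 GPUs)  -> 512
print(global_batch_size(1, 32, 8))  # 13B, single node (8 GPUs)   -> 256
print(global_batch_size(3, 6, 64))  # 13B, 8 nodes (64 GPUs)      -> 1152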

Fine-tuning with PEFT

For the 1.3B model (single node; single A100 40GB GPU)

CUDA_VISIBLE_DEVICES=0 python train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --use_peft \
    --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
    --output_dir results/llm-jp-1.3b-instruct-lora-jaster-dolly-oasst-v1.0

For the 13B model (single node; single A100 40GB GPU)

CUDA_VISIBLE_DEVICES=0 python train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --gradient_checkpointing \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --use_peft \
    --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
    --output_dir results/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0

For the 1.3B model (single node; 8 A100 40GB GPUs)

accelerate launch --config_file configs/accelerate_config_zero1.yaml \
    train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --use_peft \
    --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
    --output_dir results/llm-jp-1.3b-instruct-lora-jaster-dolly-oasst-v1.0

For the 13B model (single node; 8 A100 40GB GPUs)

accelerate launch --config_file configs/accelerate_config_zero1.yaml \
    train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --use_peft \
    --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
    --output_dir results/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0
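
When --use_peft is set, the output directory typically holds a LoRA adapter rather than full model weights. A minimal sketch for loading such an adapter for inference with the peft library (assuming the adapter was saved under the output_dir shown above) is:

# Sketch: load a LoRA adapter produced with --use_peft on top of its base model.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "llm-jp/llm-jp-13b-v1.0"
adapter_dir = "results/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0"

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(
    base_model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_dir)
# Optionally fold the LoRA weights into the base model for faster inference.
model = model.merge_and_unload()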

Using flash-attn

The use_flash_attention_2 option in transformers v4.36 is only supported for models based on Llama and Falcon.

For the 7B model (single node; single A100 40GB GPU)

CUDA_VISIBLE_DEVICES=0 python train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type cosine \
    --bf16 \
    --max_seq_length 2048 \
    --gradient_checkpointing \
    --data_files jamp.json janli.json jcommonsenseqa.json jemhopqa.json jnli.json jsem.json jsick.json jsquad.json jsts.json niilc.json dolly_deepl.json oasst_deepl.json \
    --use_flash_attention_2 True \
    --use_peft \
    --model_name_or_path llm-jp/llm-jp-7b \
    --output_dir results/llm-jp-7b-instruct-lora-jaster-dolly-oasst-v1.0
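
For reference, --use_flash_attention_2 asks transformers to use the FlashAttention-2 kernels when loading the model; outside of train.py this corresponds roughly to the following (a sketch, not the repository's exact code):

# Sketch: load a Llama-based model with FlashAttention-2 enabled.
# Requires flash-attn to be installed and a GPU supported by FlashAttention-2.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-7b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)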

GPTQ Converter

python converter/gptq_converter.py \
    --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
    --dataset ptb \
    --output_dir results/llm-jp-13b-v1.0-gptq
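
The converter writes the quantized model to --output_dir. Assuming it produces a standard GPTQ checkpoint (an assumption; the converter's exact output format is not documented here), it can be loaded with transformers once optimum and auto-gptq are installed:

# Sketch: load the GPTQ-quantized checkpoint. Assumes a standard GPTQ layout
# readable by transformers with optimum and auto-gptq installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_dir = "results/llm-jp-13b-v1.0-gptq"
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")
model = AutoModelForCausalLM.from_pretrained(quantized_dir, device_map="auto")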
