
NLP2024 チュートリアル3: 作って学ぶ日本語大規模言語モデル - 環境構築手順と実験ソースコード
NLP2024 Tutorial 3: Practicing how to build a Japanese large-scale language model - Environment construction and experimental source codes

  • Tutorial Video: https://www.youtube.com/watch?v=eiP2KUOi570

Index

  • 環境構築手順 / Environment Construction
  • 実験ソースコード / Experimental Source Codes

環境構築手順

Environment Construction

For Ubuntu

前提条件 / Prerequisites

  • Hardwares
    • CPU Intel 64bit, RAM >=32GB (>=64GB recommended), Free Disk Space >=200GB
    • GPU RAM >=8GB (>=16GB recommended), Compute Capability >=7.0 (>=8.0 recommended)
      • Compute Capability 8.0未満ではbfloat16を使用することができない / Cannot use bfloat16 with Compute Capability below 8.0 (see the snippet after this list)
      • Compute CapabilityはHPCシステムズ社のこちらの一覧表を参照 / Compute Capability can be checked in this table.
  • Softwares
    • Ubuntu 22.04がクリーンインストールされた状態を想定 / Assuming a clean installation of Ubuntu 22.04
    • 環境構築を行うユーザにsudo権限が付与されていること / sudo privileges must be granted to the user who will be building the environment.
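PyTorchインストール後(後述)に下記のスニペットでCompute Capabilityとbfloat16対応を確認できます。 / Once PyTorch is installed (a later step in this guide), the following sketch can confirm both values; it is illustrative and not part of the original procedure.

# Minimal sketch: query the compute capability and bfloat16 support of GPU 0.
# Assumes a CUDA-enabled PyTorch install (see the PyTorch section below).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute Capability: {major}.{minor}")  # bfloat16 requires >= 8.0
print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")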

gcc-12 installation steps

sudo apt update
sudo apt upgrade
sudo apt install make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev git
sudo apt install gcc-12 g++-12
sudo ln -s -f /usr/bin/gcc-12 /usr/bin/gcc
sudo ln -s -f /usr/bin/g++-12 /usr/bin/g++

nvidia-driver-535 installation steps

nvidia-smiが実行できたら既にnvidia-driverがインストールされている。
If you can run nvidia-smi, nvidia-driver is already installed.

nvidia-smi

nvidia-driver-525未満がインストールされていたら下記で一旦削除。525以上がインストールされていたら以降はスキップしてCUDAのインストールに進む。
If the installed nvidia-driver version is lower than 525, remove it with the commands below. If version 525 or higher is installed, skip the rest and proceed to the CUDA installation.

sudo apt-get --purge remove nvidia-*
sudo apt-get --purge remove cuda-*

nvidia-driverをインストールして再起動。
Install nvidia-driver and reboot.

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-driver-535
sudo reboot

再起動したらログインしてnvidia-smiが動作するか確認。
After restarting, log in and check if nvidia-smi works.

nvidia-smi

nvidia-driverが自動更新されて動作しなくなることがあるので、nano等のエディタで設定ファイルのUnattended-Upgradeの値を"0"に変更しておく。
Since nvidia-driver may be updated automatically and stop working, change the value of Unattended-Upgrade in the configuration file to "0" using an editor such as nano.

sudo nano /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "0";

CUDA 12.1 installation steps

公式サイトにあるrunfileでのインストール手順を実行。
Execute the installation procedure using the runfile on the official website.

wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
sudo sh cuda_12.1.1_530.30.02_linux.run

既存のドライバを削除することを推奨されるがContinueを選択。
Although it is recommended to remove the existing driver, select Continue.

│ Existing package manager installation of the driver found. It is strongly    │
│ recommended that you remove this before continuing.                          │
│ Abort                                                                        │
│ Continue                                                                     │

End User License Agreementについて確認したらacceptを入力。
After confirming the End User License Agreement, enter accept.

Do you accept the above EULA? (accept/decline/quit):
accept

セットアップオプションを次のように設定してInstallを実行。
Set the setup options as follows and run Install.

│ CUDA Installer                                                               │
│ - [ ] Driver                                                                 │
│      [ ] 530.30.02                                                           │
│ + [X] CUDA Toolkit 12.1                                                      │
│   [ ] CUDA Demo Suite 12.1                                                   │
│   [ ] CUDA Documentation 12.1                                                │
│ - [ ] Kernel Objects                                                         │
│      [ ] nvidia-fs                                                           │
│   Options                                                                    │
│   Install                                                                    │

インストールが終わったらnvccを実行できるか確認。
Once the installation is complete, check if you can run nvcc.

/usr/local/cuda/bin/nvcc -V

For WSL2

前提条件 / Prerequisites

  • Hardwares
  • Softwares
    • Windows11 22H2 or later (Windows10 22H2でも動作可能 / can also operate on Windows10 22H2)
    • WSL2上でUbuntu 22.04がクリーンインストールされた状態を想定 / Assuming a clean installation of Ubuntu 22.04 on WSL2
    • 環境構築を行うユーザにAdministrator権限が付与されていること / The user who will be building the environment must be granted Administrator privileges

Windows側でNVIDIA Driverをインストール / Install NVIDIA Driver on Windows side

NVIDIAのドライバーダウンロードページから使用する製品とOSを選択し、ダウンロードタイプは製品ブランチ/Studioを指定して、探すを押下。
Select the product and OS you are using from the NVIDIA driver download page, specify Product Branch / Studio as the download type, and press Search.


[Screenshot: NVIDIA driver download settings (nvidia-driver-download-setting-en)]

ダウンロードしたファイルを実行してドライバをインストール。
Run the downloaded file to install the driver.

WSL2でUbuntu 22.04をインストール / Install Ubuntu 22.04 with WSL2

管理者権限でPowerShellを起動する / Start PowerShell with administrator privileges

  • Windowsボタンを右クリックしてターミナル(管理者)を選択するとPowerShellが起動する / Right-click the Windows button and select Terminal (Administrator) to start PowerShell

WSL2の更新 / Update WSL2

  • PowerShellで次を実行してWSL2を既定バージョンに設定し、WSLを更新する / Run the following in PowerShell to set WSL2 as the default version and update WSL
wsl --set-default-version 2
wsl --update

WSL2上でのUbuntu 22.04のインストール / Installing Ubuntu 22.04 on WSL2

下記を実行してユーザ設定を行います。 / Execute the following to configure the user settings.

wsl --install -d Ubuntu-22.04

引き続きUbuntu側でnvidia-smiの動作確認を行います。 / Continue to check the operation of nvidia-smi on the Ubuntu side.

nvidia-smi

Ubuntu側でのCUDAのインストール / Installing CUDA on Ubuntu side

WSL2上のUbuntuで、Ubuntu編のgcc等のインストール、および、CUDA 12.1のインストールを実施します。
On Ubuntu on WSL2, perform the steps described in the Ubuntu section for gcc-12 installation steps and CUDA 12.1 installation steps.

Windowsターミナルのインストール / Installing Windows Terminal

以降の作業と実験の作業性をよくするためWindowsターミナルの利用を推奨します。 / We recommend using Windows Terminal to make the subsequent setup work and experiments easier.
Microsoft Storeからインストールできます。 / Windows Terminal can be installed from the Microsoft Store.

For macOS

前提条件 / Prerequisites

  • Hardwares
    • CPU Apple M1 or later, RAM >=16GB (>=32GB recommended), Free Disk Space >=200GB
  • Softwares
    • macOS 13 or later

Installing Command Line Tools

Command Line Toolsをインストールしていない場合はコンソールアプリで下記を実行。 / If you do not have Command Line Tools installed, run the following in the console app.

xcode-select --install

Installing Python 3.10.11 and PATH setting

python.orgからpython 3.10.11 macOS 64-bit universal2 installerをダウンロードして実行。 / Download the python 3.10.11 macOS 64-bit universal2 installer from python.org and run it.

実験ソースコード

Experimental Source Codes

Software Installation

CUDAの動作確認 / Checking the operation of CUDA

  • Ubuntu / WSL2
/usr/local/cuda/bin/nvcc -V

環境変数LD_LIBRARY_PATHにCUDAのパスを追加 / Add CUDA path to environment variable LD_LIBRARY_PATH

  • Ubuntu / WSL2
echo 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.1/lib64"' >> ~/.bashrc
source ~/.bashrc

python3でvenvが使える状態かの確認 / Check if venv is usable in python3

python3 -V
python3 -m venv venv
source venv/bin/activate
deactivate
rm -r venv

pyenv環境の構築 / Building a pyenv environment

pyenv未導入の場合 / If pyenv is not installed

curl https://pyenv.run | bash

pyenv導入済みの場合 / If pyenv has been installed

cd ~/.pyenv/plugins/python-build/../.. && git pull && cd -

pyenvのパス追加 / Add pyenv to the PATH

  • Ubuntu / WSL2
    • ~/.bashrc(zshの場合は ~/.zshrc)に追加 / Add to ~/.bashrc (~/.zshrc for zsh)
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init --path)"' >> ~/.bashrc
source ~/.bashrc
  • macOS
    • ~/.bash_profile(zshの場合は ~/.zshrc)に追加 / Add to ~/.bash_profile (~/.zshrc for zsh)
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
echo 'eval "$(pyenv init --path)"' >> ~/.bash_profile
source ~/.bash_profile

pyenvでPython 3.10.13をインストール / Install Python 3.10.13 with pyenv

pyenv install 3.10.13

実験ディレクトリとvenv環境の作成・有効化・バージョン確認 / Creation, activation, and version confirmation of experiment directory and venv environment

mkdir my-llm
cd my-llm
pyenv local 3.10.13
python -m venv venv
source venv/bin/activate
which python
python -V
pip -V

PyTorchのインストールと動作確認 / Installing PyTorch and checking its operation

pip install torch
  • Ubuntu / WSL2
    • Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.get_device_name()
  • macOS
    • Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
torch.backends.mps.is_available()

BERTでTransformersの動作確認 / Check the operation of Transformers with BERT

pip install transformers fugashi unidic-lite
  • Ubuntu / WSL2
    • Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
model_name = "cl-tohoku/bert-large-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model = model.to("cuda:0")
mlm = pipeline("fill-mask", model=model, tokenizer=tokenizer, device="cuda:0")
mlm("語りえぬものについては、[MASK]しなければならない。")[:2]
  • macOS
    • Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
model_name = "cl-tohoku/bert-large-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model = model.to("mps")
mlm = pipeline("fill-mask", model=model, tokenizer=tokenizer, device="mps")
mlm("語りえぬものについては、[MASK]しなければならない。")[:2]

Inference and Evaluation

text-generation

pip install accelerate safetensors bitsandbytes

1.3B

  • FP32
    • Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name = "llm-jp/llm-jp-1.3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto", pad_token_id=tokenizer.pad_token_id)
print(pipe("語りえぬものについては、", max_length=128))
  • FP16
    • Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name = "llm-jp/llm-jp-1.3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto", pad_token_id=tokenizer.pad_token_id)
print(pipe("語りえぬものについては、", max_length=128))
  • BF16 - Ubuntu / WSL2
    • Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name = "llm-jp/llm-jp-1.3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto", pad_token_id=tokenizer.pad_token_id)
print(pipe("語りえぬものについては、", max_length=128))

13B

  • FP16
    • Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name = "llm-jp/llm-jp-13b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto", pad_token_id=tokenizer.pad_token_id)
print(pipe("語りえぬものについては、", max_length=128))
  • 4bit - Ubuntu / WSL2
    • Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
model_name = "llm-jp/llm-jp-13b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto", quantization_config=quantization_config)
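4bitの例はモデルのロードまでで終わっています。概算では13Bモデルの重みはFP16で約26GB(2バイト/パラメータ)、4bitでは約6.5GBになります。 / The 4-bit example above ends after loading the model. As rough arithmetic, the 13B weights take about 26 GB in FP16 (2 bytes per parameter) but only about 6.5 GB in 4-bit, which is why quantization matters on smaller GPUs. To actually generate text, the same pattern as the FP16 example can be continued in the same interactive session; a minimal sketch, not part of the original steps:

# Sketch: generation with the 4-bit model, mirroring the FP16 example above.
# model and tokenizer are the objects loaded in the preceding block.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, pad_token_id=tokenizer.pad_token_id)
print(pipe("語りえぬものについては、", max_length=128))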

llm-jp-eval

Installation

  • venv環境に入っている場合はいったん抜ける / If you are in a venv environment, exit it once.
deactivate
  • llm-jp-evalのcloneとvenv環境の作成・有効化 / After cloning llm-jp-eval, create and activate the venv.
git clone https://github.com/llm-jp/llm-jp-eval.git
cd llm-jp-eval
cp configs/config_template.yaml configs/config.yaml
python -m venv venv
source venv/bin/activate
pip install -e .
wandb disabled

jasterのビルドとディレクトリ構成の確認 / Building jaster and checking the directory structure

python scripts/preprocess_dataset.py --dataset-name all --output-dir jaster/
ls jaster/
ls jaster/1.2.0/
ls jaster/1.2.0/evaluation

dataset_dirの設定 / Setting dataset_dir

  • configs/config.yamlをエディタで開き、上で確認したdev/までのパスをdataset_dirの値として次のようにセットする / Open configs/config.yaml in an editor and set dataset_dir to the path down to dev/ confirmed above, as follows.
dataset_dir: "jaster/1.2.0/evaluation/dev"

精度評価 / Accuracy evaluation

JNLI devセット全件の評価 / Evaluation on the full JNLI dev set
  • FP32
python scripts/evaluate_llm.py torch_dtype=fp32 \
  target_dataset="[jnli]" \
  metainfo.max_num_samples=-1 \
  wandb.run_name=llm-jp-1.3b-v1.0_fp32_dev-jnli \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
  • FP16
python scripts/evaluate_llm.py torch_dtype=fp16 \
  target_dataset="[jnli]" \
  metainfo.max_num_samples=-1 \
  wandb.run_name=llm-jp-1.3b-v1.0_fp16_dev-jnli \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
  • BF16 - Ubuntu / WSL2
python scripts/evaluate_llm.py torch_dtype=bf16 \
  target_dataset="[jnli]" \
  metainfo.max_num_samples=-1 \
  wandb.run_name=llm-jp-1.3b-v1.0_bf16_dev-jnli \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
jaster全データセット先頭100件の評価 / Evaluate the first 100 samples of each dataset in jaster
  • FP32
python scripts/evaluate_llm.py torch_dtype=fp32 \
  wandb.run_name=llm-jp-1.3b-v1.0_fp32_dev-all \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
  • FP16
python scripts/evaluate_llm.py torch_dtype=fp16 \
  wandb.run_name=llm-jp-1.3b-v1.0_fp16_dev-all \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
  • BF16 - Ubuntu / WSL2
python scripts/evaluate_llm.py torch_dtype=bf16 \
  wandb.run_name=llm-jp-1.3b-v1.0_bf16_dev-all \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0

Supervised Fine-tuning

Installation

  • llm-jp-eval等のvenv環境に入っている場合はいったん抜ける / If you are in a venv environment such as llm-jp-eval, exit it once.
deactivate
cd ..
  • llm-jp-sftのcloneとvenv環境の作成・有効化 / After cloning llm-jp-sft, create and activate the venv.
git clone https://github.com/llm-jp/llm-jp-sft.git
cd llm-jp-sft
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
wandb disabled
  • macOSではpip uninstall bitsandbytes を行っておく / On macOS, run pip uninstall bitsandbytes beforehand

jasterの参照 / Referencing jaster

  • llm-jp-evalのjasterディレクトリへのsymbolic linkを作成しておく / Create a symbolic link to the jaster directory of llm-jp-eval
ln -s ../llm-jp-eval/jaster .

Ichikara-instruction公開データのプロンプト化 / Converting Ichikara-instruction public data to prompt format

  • 次の内容でconvert_ichikara.pyを作成 / Create convert_ichikara.py with the following content
import json
import random
import sys

if __name__ == "__main__":
    inst = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
    records = []
    for f in sys.argv[1:]:
        with open(f, "r", encoding="utf8") as fin:
            for r in json.load(fin):
                records.append({
                    "ID": r["ID"],
                    "text": f'{inst}\n\n### 指示:\n{r["text"]}\n\n### 応答:\n{r["output"]}',
                })
    random.shuffle(records)
    dev_len = len(records) // 10
    dev, train = records[:dev_len], records[dev_len:]
    json.dump(train, sys.stdout, indent=1, ensure_ascii=False)
    json.dump(dev, sys.stderr, indent=1, ensure_ascii=False)
  • 公開データの変換と出力の確認 / Convert published data and check output
python convert_ichikara.py Distribution20231115/*.json \
  > jaster/1.2.0/tuning/train/ichikara.json \
  2> jaster/1.2.0/tuning/dev/ichikara.json
head -n 5 jaster/1.2.0/tuning/dev/ichikara.json

LoRA SFT BF16 - Ubuntu / WSL2

python train.py \
  --num_train_epochs 1 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2e-5 \
  --warmup_ratio 0.1 \
  --lr_scheduler_type cosine \
  --bf16 \
  --max_seq_length 2048 \
  --gradient_checkpointing \
  --data_files `ls jaster/1.2.0/tuning/train/*.json` \
  --use_peft \
  --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
  --output_dir results/llm-jp-1.3b-lora-jaster_ichikara-v1.0

LoRA SFT 4bit - Ubuntu / WSL2

python train.py \
  --num_train_epochs 1 \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-5 \
  --warmup_ratio 0.1 \
  --lr_scheduler_type cosine \
  --bf16 \
  --load_in_4bit True \
  --max_seq_length 2048 \
  --gradient_checkpointing \
  --data_files `ls jaster/1.2.0/tuning/train/*.json` \
  --use_peft \
  --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
  --output_dir results/llm-jp-1.3b-lora-jaster_ichikara-v1.0

Full Parameter SFT

  • 8GPU構成向けのconfigs/accelerate_config_zero3.yamlの作成 / Create configs/accelerate_config_zero3.yaml for an 8-GPU configuration
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: none
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  • 13Bモデルの8GPUフルパラメータSFT / Full-parameter SFT of the 13B model on 8 GPUs
accelerate launch --config_file configs/accelerate_config_zero3.yaml \
  train.py \
  --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
  --tokenizer_name_or_path llm-jp/llm-jp-13b-v1.0 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 --gradient_accumulation_steps 32 \
  --learning_rate 1e-4 --warmup_ratio 0.1 --lr_scheduler_type cosine \
  --bf16 \
  --max_seq_length 2048 \
  --gradient_checkpointing \
  --data_files `ls jaster/1.2.0/tuning/train/*.json` \
  --output_dir results/llm-jp-13b-full-jaster_ichikara-v1.0

Direct Preference Optimization

Clone the repository

  • llm-jp-sft等の環境に入っている場合はいったん抜ける / If you are in an environment such as llm-jp-sft, exit it once.
deactivate
cd ..
  • Clone llm-jp-dpo
git clone https://github.com/llm-jp/llm-jp-dpo.git
cd llm-jp-dpo

Ubuntu / WSL2

  • Installing libraries with poetry
poetry install
poetry shell
wandb disabled
  • Creating accelerate_configs/single_gpu.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
  • AccelerateでDPO学習プロセスを起動 / Launch the DPO training process with Accelerate
accelerate launch --config_file accelerate_configs/single_gpu.yaml train.py --model llm-jp/llm-jp-1.3b-v1.0 --per-device-train-batch-size 4 --per-device-eval-batch-size 8

macOS (パフォーマンスに難があるため改良中です / being improved due to performance issues)

  • ライブラリのインストールはpoetryではなくpipで行う / Install libraries using pip instead of poetry
python -m venv venv
source venv/bin/activate
pip install torch==2.2.0 transformers==4.37.2 trl==0.7.10 peft==0.8.2 datasets==2.16.1 accelerate==0.26.1 wandb
wandb disabled
  • エディタでtrain.pyを開きmain()の先頭にtorch._dynamoのエラー対策を追加 / Open train.py in an editor and add the following workaround for torch._dynamo errors at the beginning of main()
def main():
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
  • 同様にAutoModelForCausalLM.from_pretrained()のtorch_dtypeをfloat16に変更 / Similarly, change the torch_dtype of AutoModelForCausalLM.from_pretrained() to float16
torch_dtype=torch.float16,  # bfloat16,
  • 同様にTrainingArguments()からbf16の指定をコメントアウト / Similarly, comment out the bf16 specification in TrainingArguments()
# bf16=True,
  • PythonでDPO学習プロセスを起動 / Launch the DPO training process with Python
python train.py --model llm-jp/llm-jp-1.3b-v1.0 --per-device-train-batch-size 4 --per-device-eval-batch-size 8

Pretraining

環境構築 / Environment construction

  • llm-jp-dpo等の環境に入っている場合はいったん抜ける / If you are in an environment such as llm-jp-dpo, exit it once.
deactivate
cd ..
  • CUDA 11.8向けにMegatron-DeepSpeed環境を構築する / Build a Megatron-DeepSpeed environment for CUDA 11.8
echo 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64"' >> ~/.bashrc
source ~/.bashrc
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
mkdir -p tmp
python3 -m venv venv
source venv/bin/activate
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118
pip install "pip>=23.1" "setuptools>=65.5.0" wheel pybind11 six regex nltk numpy deepspeed==0.12.2 einops tensorboard transformers sentencepiece "protobuf<3.21.0"

Installing apex

  • Installation - From Source - Linux の手順をpip>=23.1の前提で進める / Follow the steps in Installation - From Source - Linux, assuming pip>=23.1
  • エラーにハマりやすいので以下の点に注意して作業を行う / It is easy to get stuck on compilation errors, so pay attention to the following points.
    • 本来10分ほどかかるはずのビルド処理がすぐに終わる場合は*.soのコンパイルがスキップされている / If the build process, which should normally take about 10 minutes, finishes immediately, the *.so compilation is being skipped.
    • 次のファイルがなければビルド失敗 / The build has failed if the following file is missing
      • build/lib.linux-x86_64-cpython-310/apex_C.cpython-310-x86_64-linux-gnu.so
git clone https://github.com/NVIDIA/apex -b 23.08; cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
cd ..
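ビルドの簡易確認として、ネイティブ拡張がimportできるか試すとよい。 / As a quick, hedged check that the native extensions were actually compiled, try importing them; the module names below correspond to the *.so files mentioned above, and an ImportError means the compilation step was skipped.

# Sketch: verify apex's compiled extensions are importable.
import apex_C  # C++ extension (apex_C.cpython-310-x86_64-linux-gnu.so)
import amp_C   # CUDA extension built by --cuda_ext
print("apex extensions OK")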

Installing FlashAttention2

  • ninjaのバージョンが 1.11.1 か確認する / Check that the ninja version is 1.11.1
ninja --version
  • バージョン上限を指定してflash-attnをインストール / Install flash-attn with an upper version bound
pip install "flash-attn<2.4.0" --no-build-isolation
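インストール確認の例 / To confirm that the installed version satisfies the <2.4.0 bound, a minimal check:

# Sketch: check that flash-attn imports and its version is below 2.4.0.
import flash_attn
print(flash_attn.__version__)  # expect a 2.3.x release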

トークナイザの準備 / Preparing the tokenizer

  • llm-jp-tokenizer v2.1 SentencePieceモデルファイルのダウンロード / Download the llm-jp-tokenizer v2.1 SentencePiece model file
curl -O -L https://github.com/llm-jp/llm-jp-tokenizer/raw/main/models/ver2.1/code10k_en20k_ja30k.ver2.1.model
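ダウンロードしたモデルの動作確認の例 / As a quick sanity check of the downloaded model (assuming the sentencepiece package installed in the step above), the following sketch tokenizes a sample sentence:

# Sketch: load the SentencePiece model and tokenize a sample sentence.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="code10k_en20k_ja30k.ver2.1.model")
print(sp.encode("作って学ぶ日本語大規模言語モデル", out_type=str))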

事前学習データの準備 / Preparation of pre-training data

  • 次の内容でdownload_mc4_ja.pyを作成 / Create download_mc4_ja.py with the following content
import json
import sys

from datasets import load_dataset

if __name__ == "__main__":
    dataset = load_dataset('mc4', 'ja', split='train', streaming=True)
    limit = int(sys.argv[1])
    count = 0
    for doc in dataset:
        json.dump(doc, sys.stdout, ensure_ascii=False)
        print()
        count += 1
        if count == limit:
            break
  • download_mc4_ja.pyを実行してmC4の日本語パートから先頭1万件をmc4-ja-10k.jsonlに保存 / Execute download_mc4_ja.py and save the first 10,000 documents of the Japanese part of mC4 to mc4-ja-10k.jsonl
python download_mc4_ja.py 10000 > mc4-ja-10k.jsonl
  • データセットをビルドして作成されたファイルを確認 / Build the dataset and check the created files
python tools/preprocess_data.py \
  --input ./mc4-ja-10k.jsonl \
  --output-prefix dataset/mc4-ja-10k \
  --tokenizer-model ./code10k_en20k_ja30k.ver2.1.model \
  --append-eod \
  --tokenizer-type SentencePieceTokenizer \
  --dataset-impl mmap --workers 8
ls -l dataset/mc4-ja-10k*

事前学習スクリプトの準備 / Preparing the pretraining script

  • サンプルのpretrain_llama2_distributed.shをコピー / Copy the sample pretrain_llama2_distributed.sh
cp examples_deepspeed/pretrain_llama2_distributed.sh .
chmod +x pretrain_llama2_distributed.sh
  • ./pretrain_llama2_distributed.shを編集して次の行を変更 / Edit ./pretrain_llama2_distributed.sh and change the following lines
    • <の行を>の行の内容に置き換える / Replace each < line with the contents of the corresponding > line
< DATASET_1="./tmp/data/bookcorpus_train_1m_text_sentence"
> DATASET_1="./dataset/mc4-ja-10k_text_document"
< TOKENIZER_PATH=./tmp/tokenizer.model # official llama tokenizer.model
> TOKENIZER_PATH=./code10k_en20k_ja30k.ver2.1.model
> export NCCL_IB_GID_INDEX=3
> export NCCL_IB_TC=106
<        --tokenizer-type GPTSentencePieceTokenizer \
>        --tokenizer-type SentencePieceTokenizer \
  • pretrain_gpt.pyの最後から2行目のデフォルトトークナイザの指定をコメントアウト / Comment out the default tokenizer specification on the second-to-last line of pretrain_gpt.py
# args_defaults={'tokenizer_type': 'GPT2BPETokenizer'},

事前学習の実行 / Execute pretraining

./pretrain_llama2_distributed.sh

以上 / That's all.
