NLP2024 チュートリアル3: 作って学ぶ日本語大規模言語モデル - 環境構築手順と実験ソースコード
NLP2024 Tutorial 3: Practicing how to build a Japanese large-scale language model - Environment construction and experimental source codes
- Tutorial Video
https://www.youtube.com/watch?v=eiP2KUOi570
- Tutorial Slide
Environment Construction
Ubuntu
- Hardware
- CPU Intel 64bit, RAM >=32GB (>=64GB recommended), Free Disk Space >=200GB
- GPU RAM >=8GB (>=16GB recommended), Compute Capability >=7.0 (>=8.0 recommended)
- Compute Capability 8.0未満ではbfloat16を使用することができない / bfloat16 cannot be used with Compute Capability below 8.0
- Compute CapabilityはHPCシステムズ社のこちらの一覧表を参照 / Compute Capability can be checked in this table.
- Software
- Ubuntu 22.04がクリーンインストールされた状態を想定 / Assuming a clean installation of Ubuntu 22.04
- 環境構築を行うユーザにsudo権限が付与されていること / sudo privileges must be granted to the user who will build the environment.
sudo apt update
sudo apt upgrade
sudo apt install make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev git
sudo apt install gcc-12 g++-12
sudo ln -s -f /usr/bin/gcc-12 /usr/bin/gcc
sudo ln -s -f /usr/bin/g++-12 /usr/bin/g++
nvidia-smi が実行できたら既にnvidia-driverがインストールされている。 / If you can run nvidia-smi, nvidia-driver is already installed.
nvidia-smi
nvidia-driver-525未満がインストールされていたら下記で一旦削除。525以上がインストールされていたら以降はスキップしてCUDAのインストールに進む。
If the installed nvidia-driver version is lower than 525, remove it with the steps below. If version 525 or higher is installed, skip the rest and proceed to install CUDA.
sudo apt-get --purge remove nvidia-*
sudo apt-get --purge remove cuda-*
nvidia-driverをインストールして再起動。
Install nvidia-driver and reboot.
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-driver-535
sudo reboot
再起動したらログインしてnvidia-smiが動作するか確認。 / After restarting, log in and check that nvidia-smi works.
nvidia-smi
nvidia-driverが自動更新されて動作しなくなることがあるので、nano等のエディタで設定ファイルのUnattended-Upgradeの値を"0"に変更しておく。
Since nvidia-driver may be updated automatically and stop working, change the value of Unattended-Upgrade in the configuration file to "0" using an editor such as nano.
sudo nano /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "0";
公式サイトにあるrunfileでのインストール手順を実行。
Execute the installation procedure using the runfile on the official website.
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
sudo sh cuda_12.1.1_530.30.02_linux.run
既存のドライバを削除することを推奨されるがContinueを選択。
Although it is recommended to remove the existing driver, select Continue.
│ Existing package manager installation of the driver found. It is strongly │
│ recommended that you remove this before continuing. │
│ Abort │
│ Continue │
End User License Agreementについて確認したらacceptを入力。
After confirming the End User License Agreement, enter accept.
Do you accept the above EULA? (accept/decline/quit): accept
セットアップオプションを次のように設定してInstallを実行。
Set the setup options as follows and run Install.
│ CUDA Installer │
│ - [ ] Driver │
│      [ ] 530.30.02 │
│ + [X] CUDA Toolkit 12.1 │
│   [ ] CUDA Demo Suite 12.1 │
│   [ ] CUDA Documentation 12.1 │
│ - [ ] Kernel Objects │
│      [ ] nvidia-fs │
│   Options │
│   Install │
インストールが終わったらnvccを実行できるか確認。
Once the installation is complete, check if you can run nvcc.
/usr/local/cuda/bin/nvcc -V
Windows (WSL2)
- Hardware
- Ubuntuの前提条件に準じる / See the Prerequisites section for Ubuntu
- Software
- Windows11 22H2 or later (Windows10 22H2でも動作可能 / can also operate on Windows10 22H2)
- WSL2上でUbuntu 22.04がクリーンインストールされた状態を想定 / Assuming a clean installation of Ubuntu 22.04 on WSL2
- 環境構築を行うユーザにAdministrator権限が付与されていること / The user who will be building the environment must be granted Administrator privileges
NVIDIAのドライバーダウンロードページから使用する製品とOSを選択し、ダウンロードタイプは製品ブランチ/Studioを指定して、探すを押下。
Select the product and OS you are using on the NVIDIA driver download page, specify Product Branch / Studio as the download type, and press Search.


ダウンロードしたファイルを実行してドライバをインストール。
Run the downloaded file to install the driver.
- Windowsボタンを右クリックしてターミナル(管理者)を選択するとPowerShellが起動する / Right-click the Windows button and select Terminal (Administrator) to start PowerShell
- PowerShellで次を実行してWSL 2を既定のバージョンに設定し、WSLを更新 / Run the following in PowerShell to set WSL 2 as the default version and update WSL
wsl --set-default-version 2
wsl --update
下記を実行してUbuntu 22.04をインストールし、ユーザ設定を行います。 / Run the following to install Ubuntu 22.04 and perform the initial user setup.
wsl --install -d Ubuntu-22.04
引き続きUbuntu側でnvidia-smiの動作確認を行います。 / Continue to check the operation of nvidia-smi on the Ubuntu side.
nvidia-smi
WSL2上のUbuntuで、Ubuntu編のgcc等のインストール、および、CUDA 12.1のインストールを実施します。
On Ubuntu on WSL2, perform the gcc-12 installation steps and the CUDA 12.1 installation steps described in the Ubuntu section.
以降の作業と実験の作業性をよくするためWindowsターミナルの利用を推奨します。 / We recommend using Windows Terminal for smoother work in the subsequent steps and experiments.
Microsoft Storeからインストールできます。 / Windows Terminal can be installed from the Microsoft Store.
macOS
- Hardware
- CPU Apple M1 or later, RAM >=16GB (>=32GB recommended), Free Disk Space >=200GB
- Software
- macOS 13 or later
Command Line Toolsをインストールしていない場合はコンソールアプリで下記を実行。 / If you do not have Command Line Tools installed, run the following in the console app.
xcode-select --install
python.orgからpython 3.10.11 macOS 64-bit universal2 installerをダウンロードして実行。 / Download the python 3.10.11 macOS 64-bit universal2 installer from python.org and run it.
Experimental Source Codes
- Ubuntu / WSL2
/usr/local/cuda/bin/nvcc -V
- Ubuntu / WSL2
echo 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.1/lib64"' >> ~/.bashrc
source ~/.bashrc
python3 -V
python3 -m venv venv
source venv/bin/activate
deactivate
rm -r venv
curl https://pyenv.run | bash
cd ~/.pyenv/plugins/python-build/../.. && git pull && cd -
- Ubuntu / WSL2
- ~/.bashrc(zshの場合は ~/.zshrc)に追加 / Add to ~/.bashrc (~/.zshrc for zsh)
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init --path)"' >> ~/.bashrc
source ~/.bashrc
- macOS
- ~/.bash_profile(zshの場合は ~/.zshrc)に追加 / Add to ~/.bash_profile (~/.zshrc for zsh)
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
echo 'eval "$(pyenv init --path)"' >> ~/.bash_profile
source ~/.bash_profile
pyenv install 3.10.13
実験ディレクトリとvenv環境の作成・有効化・バージョン確認 / Creation, activation, and version confirmation of experiment directory and venv environment
mkdir my-llm
cd my-llm
pyenv local 3.10.13
python -m venv venv
source venv/bin/activate
which python
python -V
pip -V
pip install torch
- Ubuntu / WSL2
- Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.get_device_name()
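As noted under the hardware prerequisites, bfloat16 requires Compute Capability 8.0 or higher. The following optional check is a minimal sketch using only standard PyTorch APIs to report the detected GPU's compute capability and bf16 support:
import torch

# Optional sketch: bfloat16 needs Compute Capability >= 8.0 (see the hardware prerequisites).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute Capability: {major}.{minor}")
    print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")
else:
    print("CUDA is not available; check the nvidia-driver and CUDA installation.")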
- macOS
- Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
torch.backends.mps.is_available()
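The examples below are written separately for CUDA ("cuda:0") and MPS ("mps"). As a convenience, a small helper like the following can choose the device automatically; this is an illustrative sketch and not part of the tutorial code:
import torch

# Illustrative helper (not part of the tutorial code): pick the best available device.
def pick_device() -> str:
    if torch.cuda.is_available():            # Ubuntu / WSL2 with an NVIDIA GPU
        return "cuda:0"
    if torch.backends.mps.is_available():    # macOS on Apple Silicon
        return "mps"
    return "cpu"

print(pick_device())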
pip install transformers fugashi unidic-lite
- Ubuntu / WSL2
- Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
model_name = "cl-tohoku/bert-large-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model = model.to("cuda:0")
mlm = pipeline("fill-mask", model=model, tokenizer=tokenizer, device="cuda:0")
mlm("語りえぬものについては、[MASK]しなければならない。")[:2]
- macOS
- Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
model_name = "cl-tohoku/bert-large-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model = model.to("mps")
mlm = pipeline("fill-mask", model=model, tokenizer=tokenizer, device="mps")
mlm("語りえぬものについては、[MASK]しなければならない。")[:2]
pip install accelerate safetensors bitsandbytes
- FP32
- Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name = "llm-jp/llm-jp-1.3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto", pad_token_id=tokenizer.pad_token_id)
print(pipe("語りえぬものについては、", max_length=128))
- FP16
- Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name = "llm-jp/llm-jp-1.3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto", pad_token_id=tokenizer.pad_token_id)
print(pipe("語りえぬものについては、", max_length=128))
- BF16 - Ubuntu / WSL2
- Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name = "llm-jp/llm-jp-1.3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto", pad_token_id=tokenizer.pad_token_id)
print(pipe("語りえぬものについては、", max_length=128))
- FP16
- Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name = "llm-jp/llm-jp-13b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto", pad_token_id=tokenizer.pad_token_id)
print(pipe("語りえぬものについては、", max_length=128))
- 4bit - Ubuntu / WSL2
- Pythonの対話モードで下記を実行 / Run the following in Python interactive mode
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
model_name = "llm-jp/llm-jp-13b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto", quantization_config=quantization_config)
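The 4bit example above ends after loading the quantized model. Continuing the same interactive session, generation can be run in the same way as the earlier examples; this is a sketch mirroring them (the quantized model is already placed on the GPU by device_map="auto", so no device argument is passed to the pipeline):
# Sketch continuing the 4bit example above (same interactive session; pipeline is already imported).
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, pad_token_id=tokenizer.pad_token_id)
print(pipe("語りえぬものについては、", max_length=128))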
- venv環境に入っている場合はいったん抜ける / If you are in a venv environment, exit it once.
deactivate
- llm-jp-evalのcloneとvenv環境の作成・有効化 / After cloning llm-jp-eval, create and enable venv.
git clone https://github.com/llm-jp/llm-jp-eval.git
cd llm-jp-eval
cp configs/config_template.yaml configs/config.yaml
python -m venv venv
source venv/bin/activate
pip install -e .
wandb disabled
python scripts/preprocess_dataset.py --dataset-name all --output-dir jaster/
ls jaster/
ls jaster/1.2.0/
ls jaster/1.2.0/evaluation
configs/config.yaml をエディタで開き、上で確認した dev/ までのパスを dataset_dir の値として次のようにセットする / Open configs/config.yaml in an editor and set dataset_dir to the path down to the dev/ directory confirmed above, as follows.
dataset_dir: "jaster/1.2.0/evaluation/dev"
- FP32
python scripts/evaluate_llm.py torch_dtype=fp32 \
  target_dataset="[jnli]" \
  metainfo.max_num_samples=-1 \
  wandb.run_name=llm-jp-1.3b-v1.0_fp32_dev-jnli \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
- FP16
python scripts/evaluate_llm.py torch_dtype=fp16 \
  target_dataset="[jnli]" \
  metainfo.max_num_samples=-1 \
  wandb.run_name=llm-jp-1.3b-v1.0_fp16_dev-jnli \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
- BF16 - Ubuntu / WSL2
python scripts/evaluate_llm.py torch_dtype=bf16 \
  target_dataset="[jnli]" \
  metainfo.max_num_samples=-1 \
  wandb.run_name=llm-jp-1.3b-v1.0_bf16_dev-jnli \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
- FP32
python scripts/evaluate_llm.py torch_dtype=fp32 \
  wandb.run_name=llm-jp-1.3b-v1.0_fp32_dev-all \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
- FP16
python scripts/evaluate_llm.py torch_dtype=fp16 \
  wandb.run_name=llm-jp-1.3b-v1.0_fp16_dev-all \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
- BF16 - Ubuntu / WSL2
python scripts/evaluate_llm.py torch_dtype=bf16 \
  wandb.run_name=llm-jp-1.3b-v1.0_bf16_dev-all \
  model.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0 \
  tokenizer.pretrained_model_name_or_path=llm-jp/llm-jp-1.3b-v1.0
- llm-jp-eval等のvenv環境に入っている場合はいったん抜ける / If you are in a venv environment such as llm-jp-eval, exit it once.
deactivate
cd ..
- llm-jp-sftのcloneとvenv環境の作成・有効化 / After cloning llm-jp-sft, create and enable venv.
git clone https://github.com/llm-jp/llm-jp-sft.git
cd llm-jp-sft
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
wandb disabled
- macOSでは pip uninstall bitsandbytes を行っておく / On macOS, run pip uninstall bitsandbytes beforehand
- llm-jp-evalのjasterディレクトリへのsymbolic linkを作成しておく / Create a symbolic link to the jaster directory of llm-jp-eval
ln -s ../llm-jp-eval/jaster .
- 次のページから利用許諾を確認した上で公開データを入手する / Review the terms of use on the following page and obtain the publicly released data.
convert_ichikara.py の作成 / Creating convert_ichikara.py
import json
import random
import sys

if __name__ == "__main__":
    inst = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
    records = []
    for f in sys.argv[1:]:
        with open(f, "r", encoding="utf8") as fin:
            for r in json.load(fin):
                records.append({
                    "ID": r["ID"],
                    "text": f'{inst}\n\n### 指示:\n{r["text"]}\n\n### 応答:\n{r["output"]}',
                })
    random.shuffle(records)
    dev_len = len(records) // 10
    dev, train = records[:dev_len], records[dev_len:]
    json.dump(train, sys.stdout, indent=1, ensure_ascii=False)
    json.dump(dev, sys.stderr, indent=1, ensure_ascii=False)
- 公開データの変換と出力の確認 / Convert published data and check output
python convert_ichikara.py Distribution20231115/*.json \
  > jaster/1.2.0/tuning/train/ichikara.json \
  2> jaster/1.2.0/tuning/dev/ichikara.json
head -n 5 jaster/1.2.0/tuning/dev/ichikara.json
python train.py \
  --num_train_epochs 1 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2e-5 \
  --warmup_ratio 0.1 \
  --lr_scheduler_type cosine \
  --bf16 \
  --max_seq_length 2048 \
  --gradient_checkpointing \
  --data_files `ls jaster/1.2.0/tuning/train/*.json` \
  --use_peft \
  --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
  --output_dir results/llm-jp-1.3b-lora-jaster_ichikara-v1.0
python train.py \
  --num_train_epochs 1 \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-5 \
  --warmup_ratio 0.1 \
  --lr_scheduler_type cosine \
  --bf16 \
  --load_in_4bit True \
  --max_seq_length 2048 \
  --gradient_checkpointing \
  --data_files `ls jaster/1.2.0/tuning/train/*.json` \
  --use_peft \
  --model_name_or_path llm-jp/llm-jp-1.3b-v1.0 \
  --output_dir results/llm-jp-1.3b-lora-jaster_ichikara-v1.0
- 8GPU構成向けの configs/accelerate_config_zero3.yaml の作成 / Create configs/accelerate_config_zero3.yaml for an 8-GPU configuration
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: none
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
- 13Bモデルの8GPUフルパラメータSFT / Full-parameter SFT of the 13B model on 8 GPUs
accelerate launch --config_file configs/accelerate_config_zero3.yaml \
  train.py \
  --model_name_or_path llm-jp/llm-jp-13b-v1.0 \
  --tokenizer_name_or_path llm-jp/llm-jp-13b-v1.0 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 --gradient_accumulation_steps 32 \
  --learning_rate 1e-4 --warmup_ratio 0.1 --lr_scheduler_type cosine \
  --bf16 \
  --max_seq_length 2048 \
  --gradient_checkpointing \
  --data_files `ls jaster/1.2.0/tuning/train/*.json` \
  --output_dir results/llm-jp-13b-full-jaster_ichikara-v1.0
- llm-jp-sft等の環境に入っている場合はいったん抜ける / If you are in an environment such as llm-jp-sft, exit it once.
deactivate
cd ..
- Clone llm-jp-dpo
git clone https://github.com/llm-jp/llm-jp-dpo.git
cd llm-jp-dpo
- Installing libraries with poetry
poetry install
poetry shell
wandb disabled
- Creating accelerate_configs/single_gpu.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
- AccelerateでDPO学習プロセスを起動 / Launch the DPO training process with Accelerate
accelerate launch --config_file accelerate_configs/single_gpu.yaml train.py --model llm-jp/llm-jp-1.3b-v1.0 --per-device-train-batch-size 4 --per-device-eval-batch-size 8
- ライブラリのインストールはpoetryではなくpipで行う / Install libraries using pip instead of poetry
python -m venv venv
source venv/bin/activate
pip install torch==2.2.0 transformers==4.37.2 trl==0.7.10 peft==0.8.2 datasets==2.16.1 accelerate==0.26.1 wandb
wandb disabled
- エディタで train.py を開き、main() の先頭に torch._dynamo のエラー対策を追加 / Open train.py in an editor and add a workaround for torch._dynamo errors at the beginning of main()
def main():
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
- 同様に AutoModelForCausalLM.from_pretrained() の torch_dtype を float16 に変更 / Similarly, change torch_dtype of AutoModelForCausalLM.from_pretrained() to float16
torch_dtype=torch.float16,  # bfloat16,
- 同様に TrainingArguments() から bf16 の指定をコメントアウト / Similarly, comment out the bf16 specification in TrainingArguments()
# bf16=True,
- PythonでDPO学習プロセスを起動 / Launch the DPO training process with Python
python train.py --model llm-jp/llm-jp-1.3b-v1.0 --per-device-train-batch-size 4 --per-device-eval-batch-size 8
- llm-jp-dpo等の環境に入っている場合はいったん抜ける / If you are in an environment such as llm-jp-dpo, exit it once.
deactivate
cd ..
- CUDA 11.8向けにMegatron-DeepSpeed環境を構築する / Build a Megatron-DeepSpeed environment for CUDA 11.8
echo 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64"' >> ~/.bashrc
source ~/.bashrc
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
mkdir -p tmp
python3 -m venv venv
source venv/bin/activate
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118
pip install "pip>=23.1" "setuptools>=65.5.0" wheel pybind11 six regex nltk numpy deepspeed==0.12.2 einops tensorboard transformers sentencepiece "protobuf<3.21.0"
- Installation - From Source - Linux の手順を pip>=23.1 の前提で進める / Follow the Installation - From Source - Linux instructions (for apex), assuming pip>=23.1
- エラーにハマりやすいので以下の点に注意して作業を行う / It is easy to get stuck on compilation errors, so pay attention to the following points.
- 本来10分ほどかかるはずのビルド処理がすぐに終わる場合は*.soのコンパイルがスキップされている / If the build process, which should normally take about 10 minutes, finishes immediately, the *.so compilation was skipped.
- 次のファイルがなければビルド失敗 / The build has failed if the following file is missing
- build/lib.linux-x86_64-cpython-310/apex_C.cpython-310-x86_64-linux-gnu.so
git clone https://github.com/NVIDIA/apex -b 23.08; cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
cd ..
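Once the apex build has finished, an import check like the following can confirm that the C++ extension was actually built into the venv; this is an optional sanity check, assuming the build produced the apex_C extension module referenced above:
# Optional sanity check (assumption: the apex build installed the apex_C extension module).
import apex_C
print("apex C++ extension is importable")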
- ninjaのバージョンが 1.11.1 か確認する / Check if ninja version is 1.11.1
ninja --version
- バージョン上限を指定してflash-attnをインストール / Install flash-attn with an upper version bound specified
pip install"flash-attn<2.4.0" --no-build-isolation
- llm-jp-tokenizer v2.1 SentencePieceモデルファイルのダウンロード / Download the llm-jp-tokenizer v2.1 SentencePiece model file
curl -O -L https://github.com/llm-jp/llm-jp-tokenizer/raw/main/models/ver2.1/code10k_en20k_ja30k.ver2.1.model
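To confirm that the downloaded SentencePiece model loads, it can be opened with the sentencepiece package installed above; this is an optional sketch, and the sample sentence is my own choice:
# Optional sketch: load the downloaded llm-jp-tokenizer model and tokenize a sample sentence.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="code10k_en20k_ja30k.ver2.1.model")
print(sp.encode("語りえぬものについては、沈黙しなければならない。", out_type=str))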
- 次の内容で download_mc4_ja.py を作成 / Create download_mc4_ja.py with the following content
import json
import sys

from datasets import load_dataset

if __name__ == "__main__":
    dataset = load_dataset('mc4', 'ja', split='train', streaming=True)
    limit = int(sys.argv[1])
    count = 0
    for doc in dataset:
        json.dump(doc, sys.stdout, ensure_ascii=False)
        print()
        count += 1
        if count == limit:
            break
download_mc4_ja.py を実行してmC4の日本語パートから先頭1万件を mc4-ja-10k.jsonl に保存 / Run download_mc4_ja.py and save the first 10,000 records from the Japanese part of mC4 to mc4-ja-10k.jsonl
python download_mc4_ja.py 10000 > mc4-ja-10k.jsonl
- データセットをビルドして作成されたファイルを確認 / Build the dataset and check the created files
python tools/preprocess_data.py \
  --input ./mc4-ja-10k.jsonl \
  --output-prefix dataset/mc4-ja-10k \
  --tokenizer-model ./code10k_en20k_ja30k.ver2.1.model \
  --append-eod \
  --tokenizer-type SentencePieceTokenizer \
  --dataset-impl mmap --workers 8
ls -l dataset/mc4-ja-10k*
- サンプルの pretrain_llama2_distributed.sh をコピー / Copy the sample pretrain_llama2_distributed.sh
cp examples_deepspeed/pretrain_llama2_distributed.sh .
chmod +x pretrain_llama2_distributed.sh
./pretrain_llama2_distributed.sh を編集して次の行を変更（< の行を > の行の内容に置き換える） / Edit ./pretrain_llama2_distributed.sh and change the following lines, replacing each < line with the contents of the corresponding > line.
< DATASET_1="./tmp/data/bookcorpus_train_1m_text_sentence"
> DATASET_1="./dataset/mc4-ja-10k_text_document"
< TOKENIZER_PATH=./tmp/tokenizer.model # official llama tokenizer.model
> TOKENIZER_PATH=./code10k_en20k_ja30k.ver2.1.model
> export NCCL_IB_GID_INDEX=3
> export NCCL_IB_TC=106
< --tokenizer-type GPTSentencePieceTokenizer \
> --tokenizer-type SentencePieceTokenizer \
pretrain_gpt.py の最後から2行目のデフォルトトークナイザの指定をコメントアウト / Comment out the default tokenizer specification on the second-to-last line of pretrain_gpt.py
# args_defaults={'tokenizer_type': 'GPT2BPETokenizer'},
./pretrain_llama2_distributed.sh
以上 / That's all.