epfl-dlab/llm-baselines-zip2zip (Public)

forked from epfml/llm-baselines

nanoGPT-like codebase for LLM training

License

MIT license

LLM-baselines

A modular codebase to experiment with transformers, inspired by NanoGPT.

Quickstart

Install dependencies:

pip install -r requirements.txt

Run a simple training on the SlimPajama dataset (6B subset, 24 GB decompressed, takes a few minutes to download):

python ./src/main.py --config_format base

The above command trains a 123.59M-parameter model. It trains for 25k iterations with a batch size of 128 = 32 x 4 (a batch size of 32 with 4 gradient accumulation steps), using a cosine schedule with a maximum learning rate of 1e-3 that is reduced to 1e-4 at the end of training. The model is saved in the ./exps folder.
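The 123.59M figure and the batch arithmetic can be checked by hand. The sketch below is not code from this repository: the function names are hypothetical, and it assumes a nanoGPT-style architecture (tied input/output embeddings, a 4x MLP expansion, no bias terms, and position embeddings excluded from the reported count). Under those assumptions, the defaults from config/base.py reproduce the number quoted above.

```python
def count_params(n_layer=12, n_embd=768, vocab_size=50304, sequence_length=512):
    """Estimate the reported parameter count (assumed nanoGPT-style model, bias=False)."""
    token_emb = vocab_size * n_embd        # assumed shared with the output head (weight tying)
    pos_emb = sequence_length * n_embd     # learned position embeddings
    attn = n_embd * 3 * n_embd + n_embd * n_embd   # QKV projection + output projection
    mlp = 2 * (n_embd * 4 * n_embd)        # up-projection and down-projection
    norms = 2 * n_embd                     # two layer-norm weight vectors per block
    per_block = attn + mlp + norms
    total = token_emb + pos_emb + n_layer * per_block + n_embd  # + final layer norm
    return total - pos_emb                 # assumed: position embeddings are not counted


def tokens_per_iteration(batch_size=32, acc_steps=4, sequence_length=512):
    """Tokens consumed per optimizer step: effective batch size times sequence length."""
    return batch_size * acc_steps * sequence_length


print(f"{count_params() / 1e6:.2f}M parameters")          # 123.59M parameters
print(f"{tokens_per_iteration()} tokens per iteration")   # 65536 tokens per iteration
```

With these defaults, 25k iterations correspond to roughly 1.64B training tokens.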

This training takes roughly 3 hours on a single A100 (80GB) GPU. The plot of the training and validation loss should look roughly like this:

You can check out the wandb run for yourself here.

Less quick start

Here are the possible parameters you can use (copy-pasted from config/base.py):

# General training params
parser.add_argument('--batch_size', default=32, type=int)
parser.add_argument('--acc_steps', default=4, type=int)
parser.add_argument('--seed', default=0, type=int)  # random seed for the parameters
parser.add_argument('--data_seed', default=1337, type=int)  # random seed defining the data ordering
parser.add_argument('--device', default='cuda:0', type=str)  # see below to run on multiple GPUs
parser.add_argument('--iterations', default=25000, type=int)  # total number of training iterations
parser.add_argument('--lr', default=1e-3, type=float)
parser.add_argument('--warmup_percent', default=0.05, type=float)  # the total number of warmup steps is iterations * warmup_percent
parser.add_argument('--weight_decay', default=0.1, type=float)  # I recommend you keep this value, else instabilities might arise
parser.add_argument('--beta1', default=0.9, type=float)  # adam parameter
parser.add_argument('--beta2', default=0.95, type=float)  # adam parameter
parser.add_argument('--scheduler', default='cos', choices=['linear', 'cos', 'none'])
parser.add_argument('--opt', default='adamw', choices=['adamw', 'sgd'])
parser.add_argument('--eval_freq', default=200, type=int)  # in iterations
parser.add_argument('--results_base_folder', default="./exps", type=str)  # where the checkpoints will be saved
parser.add_argument('--grad_clip', default=0.0, type=float)  # default value is 1.0 in NanoGPT
# Dataset params
parser.add_argument('--dataset', default='slimpajama', choices=['slimpajama', 'wikitext', "shakespeare-char", 'arxiv', "arxiv2000", "arxiv+wiki", 'openwebtext2'])
parser.add_argument('--vocab_size', default=50304, type=int)
parser.add_argument('--data_in_ram', action='store_true')  # force the data to RAM, you most likely do not need this
# Model params
parser.add_argument('--model', default='base', choices=['base', 'llama2'])
parser.add_argument('--use_pretrained', default="none", type=str)  # 'none', 'gpt-2' or a path to the pretraind model
parser.add_argument('--dropout', default=0.0, type=float)  # keep to 0 unless in low data regime (e.g. wikitext)
parser.add_argument('--n_head', default=12, type=int)
parser.add_argument('--n_layer', default=12, type=int)  # depth in (att + ff) blocks
parser.add_argument('--n_embd', default=768, type=int)  # hidden size ...
parser.add_argument('--sequence_length', default=512, type=int)
parser.add_argument('--dtype', default=torch.bfloat16, type=torch.dtype)
parser.add_argument('--bias', default=False, type=bool)
parser.add_argument('--compile', action='store_true')  # if true then model is compiled
parser.add_argument('--rmsnorm_eps', default=1e-5, type=float)  # used by the llama model
parser.add_argument('--multiple_of', default=256, type=int)  # used by the llama model make SwiGLU hidden layer size multiple of large power of 2
# logging params (WandB)
parser.add_argument('--wandb', action='store_true')  # whether to use wandb or not
parser.add_argument('--wandb_project', default="my-project", type=str)
parser.add_argument('--wandb_run_prefix', default="none", type=str)  # is added before the autogenerated experiment name
parser.add_argument('--eval_seq_prefix', default="Once upon a time", type=str)  # prefix used to generate sequences
# Distributed args
parser.add_argument('--distributed_backend', default=None, type=str, required=False, choices=distributed.registered_backends())  # distributed backend type (e.g. nccl)
parser.add_argument('--save_checkpoint_freq', default=None, type=int, required=False)
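To make the interaction between --iterations, --lr, and --warmup_percent concrete, here is a minimal sketch of a warmup-plus-cosine schedule using the default values listed above. It is an illustration only: the function name lr_at is hypothetical, the linear shape of the warmup is an assumption, and the scheduler actually used in src/optim may shape the curve differently. The final learning rate of 1e-4 is taken from the Quickstart description.

```python
import math


def lr_at(step, iterations=25000, max_lr=1e-3, final_lr=1e-4, warmup_percent=0.05):
    """Learning rate at a given step (assumed: linear warmup, then cosine decay)."""
    # The number of warmup steps is iterations * warmup_percent (1250 with the defaults).
    warmup_steps = int(iterations * warmup_percent)
    if step < warmup_steps:
        # Assumption: ramp up linearly to max_lr during warmup.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, iterations - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (max_lr - final_lr) * cosine


for step in (0, 1249, 13125, 25000):
    print(step, f"{lr_at(step):.2e}")
```

The learning rate peaks at the end of warmup and reaches the final value at the last iteration; halfway through the decay phase it sits midway between the two.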

Using WandB

You need to provide your wandb authorization key in order to send data to your wandb account. If you start jobs on a server without access to an interactive prompt, you can set the WANDB_API_KEY variable within your script:

# this is a script that could be executed on a server
pip install -r requirements.txt # install req.
export WANDB_API_KEY="put your authorize key here, to find it: https://wandb.ai/authorize"
python ./src/main.py --config_format base --wandb --wandb_project "my awesome project" --n_layer 7 --model base --seed 123

How to add your own transformer architecture?

The structure of the project is the following:

src/
    main.py # pick the right data, model, and training function
    config/
        __init__.py # contains CONFIG_FORMAT_TO_MODULE_MAP mapping the name given to the --config_format flag with a python conf file
        base.py # config for the base model
    data/
        utils.py # contains the get_dataset function
        wikitext.py # load/process wikitext
        arxiv.py # load/process arxiv
        shakespeare.py # load/process the Shakespeare dataset
        slimpajama.py
        ...
    models/
        utils.py # contains the get_model function
        base.py # contains the standard transformer base architecture
        llama.py # llama architecture
    optim/
        utils.py # contains eval and get_batch functions
        base.py # training function for the base and llama models
    distributed/ # code to enable simple distributed training

Given the above structure, to add your own model, you can copy the ./src/models/base.py file and make your modifications. If you need a custom training loop or evaluation, also copy ./src/optim/base.py. You also need to copy the ./src/config/base.py file to add your own parameters, which implies adding your new config to the CONFIG_FORMAT_TO_MODULE_MAP mapping in ./src/config/__init__.py. To add a new dataset, create a new file in the data folder; check wikitext.py for the expected format.
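The registration step can be pictured as follows. This is a self-contained sketch, not the repository's code: only the name CONFIG_FORMAT_TO_MODULE_MAP comes from the project, while its shape (a dict from format name to an object exposing a parse_args function), the my_model format, the --my_new_param flag, and the helper parse_with_format are assumptions made for illustration. Check ./src/config/__init__.py for the real layout.

```python
import argparse
from types import SimpleNamespace


def parse_base_args(parser, args):
    # Stand-in for config/base.py: only one flag is reproduced here.
    parser.add_argument('--n_layer', default=12, type=int)
    return parser.parse_args(args)


def parse_my_model_args(parser, args):
    # Stand-in for a hypothetical config/my_model.py: the base flags plus a new one.
    parser.add_argument('--n_layer', default=12, type=int)
    parser.add_argument('--my_new_param', default=0.5, type=float)  # hypothetical flag
    return parser.parse_args(args)


# Assumed shape: maps the value of --config_format to a module-like object.
CONFIG_FORMAT_TO_MODULE_MAP = {
    "base": SimpleNamespace(parse_args=parse_base_args),
    "my_model": SimpleNamespace(parse_args=parse_my_model_args),  # the new entry
}


def parse_with_format(config_format, args):
    """Look up the config module for the requested format and let it parse the flags."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    return CONFIG_FORMAT_TO_MODULE_MAP[config_format].parse_args(parser, args)


config = parse_with_format("my_model", ["--n_layer", "7", "--my_new_param", "0.1"])
print(config.n_layer, config.my_new_param)
```

Keeping each config in its own module means a new architecture can add flags without touching the base configuration.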

Multi-GPU training

Given a multi-GPU machine with e.g. 4 GPUs, one can distribute the training using data-parallelism:

torchrun --nproc_per_node=4 ./src/main.py --config_format base --distributed_backend nccl --dataset slimpajama --model base

When using multiple GPUs, the data is distributed among the GPUs by dividing the number of accumulation steps by the number of GPUs. For instance, if we train with a batch size of 32 and 4 accumulation steps on 4 GPUs, then each GPU processes batches of 32 elements and does 1 accumulation step. For this reason we require acc_steps to be a multiple of the number of GPUs.
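The rule above amounts to a small piece of arithmetic, sketched here. The function names are hypothetical and this is not the code in src/distributed; it only illustrates why acc_steps must be divisible by the number of GPUs and why the effective batch size stays the same.

```python
def per_gpu_acc_steps(acc_steps, num_gpus):
    """Split the gradient accumulation steps evenly across GPUs."""
    if acc_steps % num_gpus != 0:
        raise ValueError(
            f"acc_steps ({acc_steps}) must be a multiple of the number of GPUs ({num_gpus})"
        )
    return acc_steps // num_gpus


def effective_batch_size(batch_size, acc_steps, num_gpus):
    """Total sequences per optimizer step, summed over all GPUs."""
    return batch_size * per_gpu_acc_steps(acc_steps, num_gpus) * num_gpus


print(per_gpu_acc_steps(4, 4))         # 1 accumulation step per GPU
print(effective_batch_size(32, 4, 4))  # 128, same as on a single GPU
print(effective_batch_size(32, 4, 1))  # 128
```

With 4 accumulation steps, valid GPU counts under this rule are 1, 2, and 4.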

Experimenting locally on your device with CPU

If you do not have access to a GPU or just want to try the code locally on your device, you can try the Shakespeare dataset with character-level tokens:

python ./src/main.py --n_layer=2 --n_head=4 --n_embd=128 --sequence_length=256 --dataset=shakespeare-char --device=cpu --vocab_size=96
