OmniLearn


(Figure: model architecture overview)

This is the official repository for the paper OmniLearn: A Method to Simultaneously Facilitate All Jet Physics Tasks. If you find this repository useful for your own work, please cite the paper:

@article{Mikuni:2024qsr,
    author = "Mikuni, Vinicius and Nachman, Benjamin",
    title = "{OmniLearn: A Method to Simultaneously Facilitate All Jet Physics Tasks}",
    eprint = "2404.16091",
    archivePrefix = "arXiv",
    primaryClass = "hep-ph",
    month = "4",
    year = "2024"
}

Installation

The list of packages needed to train/evaluate the model can be found in the requirements.txt file. Alternatively, you can use the Docker container found in this link.

Our recommendation is to use the Docker container.

Data

Use the --folder flag with the path to the downloaded data to preprocess each file. See below for dataset-specific instructions.

JetClass

OmniLearn is trained using the JetClass dataset. All files can be downloaded directly from Zenodo or with the dataloader scripts provided by the authors in the original repository.

cd preprocessing
python preprocess_jetclass.py --sample [train/test/val] --folder FOLDER

This command will create the new files in the folder specified by the --folder flag.

Note that these files can be several GB, so save them in a folder where you have plenty of storage space.

Top Tagging Dataset

Files can be downloaded from Zenodo and further preprocessed using the script:

cd preprocessing
python preprocess_top.py --sample [train.h5/test.h5/val.h5] --folder FOLDER

Quark and Gluon Dataset

Files can either be downloaded from Zenodo or obtained directly through the EnergyFlow package with the command:

cd preprocessing
python preprocess_qg.py --folder FOLDER

CMS Open Data Simulations

The 2011A release of the CMS Open Data can also be loaded through the EnergyFlow package as part of the MOD dataset using the following script:

cd preprocessing
python preprocess_cms.py --folder FOLDER

ATLAS Top Tagging Dataset

The ATLAS Top Tagging Dataset can be downloaded from the following link and preprocessed using the following script:

cd preprocessing
python preprocess_atlas.py --folder FOLDER --sample [train.h5/test.h5]

DIS Dataset

The DIS dataset uses simulations created by the H1 Collaboration, which are currently proprietary. If you are interested in this application, send me a message and we can work it out!

JetNet Dataset

The JetNet datasets can be downloaded from Zenodo in the 30-particle and 150-particle versions and similarly preprocessed with the script:

cd preprocessing
python preprocess_jetnet.py --folder FOLDER --label [30/150]

OmniFold Dataset

OmniFold files can be downloaded directly from Zenodo or obtained as part of the EnergyFlow package. The processing script is:

cd preprocessing
python preprocess_omnifold.py --folder FOLDER --sample [pythia/herwig]

LHCO R&D Dataset

The dataset can be downloaded from Zenodo.

To perform the clustering and selection of the two leading jets, Jupyter notebooks are provided that save the files in the format needed for the OmniLearn training. Additional preprocessing then splits the files into training/test/validation sets in the signal region and sidebands:

cd preprocessing
python preprocess_lhco.py --folder FOLDER

Training OmniLearn

You can either train OmniLearn from scratch or take advantage of the pre-trained checkpoint and skip this step. For the training, run:

cd scripts
python train.py --dataset jetclass --lr 3e-5 --layer_scale --local --mode all

Since the training dataset is large, this step may take hours or days depending on the amount of resources available.

Adapting OmniLearn to new datasets

We provide the scripts used in the paper to adapt OmniLearn to each individual task presented there, together with the evaluation scripts used to derive the results shown in the paper. Note that any model can also be trained from scratch by omitting the --fine_tune flag. The trained OmniLearn checkpoints are provided in the checkpoint folder. Copy the checkpoint folder to the dataset folder before running the fine-tuning step.

Top Tagging, Quark and Gluon, CMS Open Data, and DIS Classification Datasets

These datasets can all be fine-tuned using the same script used to train OmniLearn, with the following commands:

cd scripts
python train.py --dataset top --layer_scale --local --mode classifier --warm_epoch 3 --epoch 10 --stop_epoch 3 --batch 256 --wd 0.1 --fine_tune
python train.py --dataset cms --lr 3e-5 --layer_scale --local --mode classifier --warm_epoch 3 --epoch 40 --stop_epoch 3 --batch 256 --wd 0.0001 --fine_tune
python train.py --dataset qg --lr 3e-5 --layer_scale --local --mode classifier --warm_epoch 3 --epoch 20 --stop_epoch 3 --batch 256 --wd 0.1 --fine_tune
python train.py --dataset h1 --lr 3e-5 --layer_scale --local --mode classifier --warm_epoch 3 --epoch 10 --stop_epoch 3 --batch 256 --lr_factor 2 --fine_tune

ATLAS Top Tagging

For the ATLAS Top Tagging dataset, we need to modify the loss function to include the event weights, requiring a different script:

cd scripts
python train_atlas.py --dataset [atlas/atlas_small] --layer_scale --local --fine_tune

The evaluation for all basic classifiers is carried out using the evaluation script:

python evaluate_classifiers.py --batch 1000 --local --layer_scale --dataset [top,qg,cms,atlas,h1] [--load] --fine_tune

The outputs of the evaluation code are the metrics described in the paper: AUC, accuracy, and signal efficiency at fixed background efficiency values. The evaluation script will also run the network and save the predictions to an .npy file. If you need to evaluate the same model again, you can simply load the .npy file and skip the full evaluation by providing the --load flag.
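
As an illustration of these metrics, here is a short sketch assuming you have the true labels and the saved classifier scores as NumPy arrays; the file names and the exact layout of the saved .npy file are placeholders, not the format used by the evaluation script:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, accuracy_score

# Hypothetical files: 0/1 labels and classifier scores for the test set
y_true = np.load("labels.npy")
y_score = np.load("predictions.npy")

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))
print("accuracy:", accuracy_score(y_true, y_score > 0.5))

# Signal efficiency at fixed background efficiency (e.g. 1% and 0.1%)
for eps_b in (1e-2, 1e-3):
    eps_s = np.interp(eps_b, fpr, tpr)  # fpr is monotonically increasing
    print(f"signal efficiency at eps_B={eps_b}: {eps_s:.3f}")
```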

OmniFold

You can run the full iterative OmniFold training using the command:

python train_omnifold.py  --layer_scale --local --num_iter 5 --fine_tune

The number of iterations is determined by the --num_iter flag. The evaluation is performed with the script:

python evaluate_omnifold.py --local --layer_scale [--reco] --num_iter 5

The --reco flag is used to load step 1, iteration 0 of the algorithm, equivalent to the event reweighting results shown in the paper.
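
Conceptually, each OmniFold step turns a classifier output into per-event weights through the likelihood-ratio trick; a minimal sketch of this reweighting step (for illustration only, not the code used in the repository) is:

```python
import numpy as np

def reweight(scores, eps=1e-6):
    """Turn classifier scores f(x) in (0,1) into likelihood-ratio weights f/(1-f)."""
    scores = np.clip(scores, eps, 1 - eps)
    return scores / (1.0 - scores)

# Example: weights applied to simulated events after training a
# data-vs-simulation classifier at detector level (step 1).
# weights = reweight(model.predict(sim_features))
```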

JetNet

To adapt OmniLearn to the JetNet data, you can use the script:

python train_jetnet.py --local --layer_scale --dataset [jetnet30/jetnet150] --fine_tune

The JetNet dataset evaluation is performed in two steps. In the first step we only generate samples. In the second step we load the generated data and run it through the scripts from the JetNet official repo to determine the performance. Clone the original repo first if you are interested in the metrics. The generation is done with the following command:

python evaluate_jetnet.py --dataset [jetnet30/jetnet150] --layer_scale   --local  --fine_tune --sample

LHCO

The LHCO training is performed in two steps. First we adapt the generative model using background events present in the sidebands and generate background events in the signal region; then we adapt the classifier to separate data in the signal region from the generated background events. The generative model is trained with the following commands:

cd scripts
python train_lhco.py --local --layer_scale --fine_tune

After training, background events are generated with:

python evaluate_lhco.py  --layer_scale   --local  [--SR] --sample --fine_tune

The same script can create plots of the generated distributions when called without the --sample flag. You can also generate predictions for the sidebands by omitting the --SR flag.

From the generated samples, you can train the classifier with the commands:

python classify_lhco.py --local --layer_scale [--ideal] [--SR] --nsig NSIG --fine_tune

The --ideal flag trains a classifier using true background events, used in the weakly-supervised results displayed in the paper. The --nsig flag determines the number of injected signal events to consider during the training. The evaluation of the classifier is done with:

python evaluate_classifiers_lhco.py --batch 100 --local --layer_scale [--ideal] --nsig NSIG --fine_tune

Multi-GPU Training and Inference

All training scripts, evaluation scripts, and generation scripts (for generative models) can be run in parallel using multiple GPUs. This is accomplished using Horovod. Horovod can be installed separately or as part of the requirements file, and it already comes with the Docker container linked at the top of this README. On the Perlmutter supercomputer, one can use the SLURM submission system to run the scripts on multiple GPUs using the command

srun --mpi=pmi2 shifter python [train.py]

for example.
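
For orientation, a typical Horovod setup for TensorFlow/Keras looks like the sketch below; the scripts in this repository already contain their own Horovod initialization, so this is only meant to illustrate the idea:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU

# Pin each process to a single GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Scale the learning rate with the number of workers and wrap the optimizer
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(3e-5 * hvd.size()))

# Broadcast initial weights from rank 0 so all workers start identically;
# pass these callbacks to model.fit(...)
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
```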

Adapting your own dataset

Dataset creation

You can easily include new datasets and benefit from the OmniLearn training! For any task, the first step is to prepare the data in the same format as the data used in this work. This means creating an .h5 file that contains three groups of data (a short example of writing such a file is sketched right after this list):

  1. data: Contains the point clouds used as inputs. The expected shape is (N,P,F), where N is the total number of events, P is the maximum number of particles to be used, and F is the number of features per particle. For best results, use the same set of features used to train OmniLearn. These features can be found in any of the scripts in the preprocessing folder. Notice that OmniLearn does not enforce a fixed number of particles and can adapt to any dataset.
  2. jet: Contains the kinematic information for the jets. The expected shape is (N,J), with J the number of features per jet. For best results, we use J = 4 with features (jet pT, jet eta, jet mass, particle multiplicity), but other combinations may still work well.
  3. pid: Contains the class labels for the classification task. For best results, we one-hot encode the labels. This can be done directly during preprocessing or later inside the Dataloader class.
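
As a reference, a minimal sketch of writing such a file with h5py is shown below; the shapes, random values, and file name are placeholders, and the feature definitions are up to you:

```python
import h5py
import numpy as np

N, P, F, J, C = 1000, 100, 4, 4, 2  # events, max particles, particle features, jet features, classes

# Placeholder arrays: replace with your own particle-level and jet-level features.
data = np.random.normal(size=(N, P, F)).astype(np.float32)   # point clouds, zero-padded to P particles
jet = np.random.normal(size=(N, J)).astype(np.float32)       # e.g. (jet pT, jet eta, jet mass, multiplicity)
labels = np.random.randint(0, C, size=N)
pid = np.eye(C, dtype=np.float32)[labels]                     # one-hot encoded class labels

with h5py.File("my_dataset_train.h5", "w") as fh:
    fh.create_dataset("data", data=data)
    fh.create_dataset("jet", data=jet)
    fh.create_dataset("pid", data=pid)
```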

With the dataset created, the next step is to add a subclassed dataset object to the utils.py file. For simple applications, look at the TopDataLoader subclass for an example. Notice that we use a preprocessing that shifts the mean of the data to 0 and sets the standard deviation to 1, based on the JetClass data. If your dataset has similar ranges for each feature, keeping the JetClass preprocessing will give better results. You can also override the preprocessing parameters in case your data has different features or very different ranges; look at the H1DataLoader for an example. You can also plot the input distributions to verify that each feature is consistent by running:

python plotter.py --dataset yourdata

after adding your new dataloader object to the plotter.py script. Notice that the created plots will already be normalized.
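
If you are unsure whether the JetClass preprocessing parameters are appropriate for your data, a quick check is to inspect the per-feature mean and standard deviation of your inputs. The sketch below assumes the .h5 layout described above, zero-padded particles, and a hypothetical file name:

```python
import h5py
import numpy as np

with h5py.File("my_dataset_train.h5", "r") as fh:
    data = fh["data"][:]                  # shape (N, P, F)

mask = np.any(data != 0, axis=-1)         # drop zero-padded particles
flat = data[mask]                         # shape (n_particles, F)
print("per-feature mean:", flat.mean(axis=0))
print("per-feature std: ", flat.std(axis=0))
# If these ranges differ strongly from the JetClass features, consider
# overriding the preprocessing parameters in your dataloader subclass.
```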

Training

The minimal addition needed to train your own OmniLearn-powered classifier is to add your dataloader to the train.py script and start the training with the command:

python train.py --dataset YOURDATA --layer_scale --local --mode classifier --warm_epoch 3  --epoch 10 --stop_epoch 3 --batch 256 --wd 0.1 --fine_tune

Tune the number of epochs and early stopping based on the size of your dataset (smaller datasets start overtraining quickly, so not many epochs are needed). If you need to include event weights in the training procedure, look at the ATLAS Top Tagging training script train_atlas.py, where we modify the classifier loss function to handle the event weights.
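
For reference, a weighted classification loss along those lines could look like the sketch below in TensorFlow/Keras; this illustrates the idea and is not the exact implementation in train_atlas.py:

```python
import tensorflow as tf

def weighted_bce(y_true, y_pred, sample_weights):
    """Binary cross-entropy where each event contributes with its own weight."""
    per_event = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return tf.reduce_sum(sample_weights * per_event) / tf.reduce_sum(sample_weights)
```

When no custom training loop is needed, passing sample_weight directly to model.fit achieves a similar effect.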

For generative models, the same script can be used if you only need point cloud generation. This can be accomplished by calling:

python train.py --dataset YOURDATA --layer_scale --local --mode generator --warm_epoch 3  --epoch 100 --stop_epoch 30 --batch 256 --wd 0.1 --fine_tune

Similarly, change the number of epochs and early stopping based on the size of your dataset. In case you need the jet generation to be performed simultaneously during training (as is the case for both the LHCO and JetNet datasets), you can create a new subclassed model that loads OmniLearn as part of the model. In this case, take a look at the train_jetnet.py script, where we load a subclassed model named PET_jetnet stored in the PET_jetnet.py file.

Evaluation

The evaluation of the trained classifier can be carried out directly from the script

python evaluate_classifiers.py --batch 1000 --local --layer_scale --dataset YOURDATA [--load] --fine_tune

after including your dataloader in the script. For data generation, you can create your own scripts based on the JetNet evaluation scripts to perform the sampling. Additional plotting functionality can be used to compare generated samples with data; look at the evaluation scripts for the LHCO application for more details.
