# datamodels_tutorial

Part of the MIT Center for Brains, Minds + Machines computational tutorial series.
Understanding how training data influences model predictions ("data attribution") is an active area of machine learning research. In this tutorial, we will introduce a data attribution method ([datamodels](https://gradientscience.org/datamodels-1/)) and explore how it can be applied in the life sciences to identify meaningful subgroups in biomedical datasets, such as disease subtypes. We will begin with a simple example from image classification (CIFAR10), offering a step-by-step guide to demonstrate how the data attribution method works in practice. Since the approach involves training thousands of lightweight classifiers, we will focus on strategies for fast and efficient model training. Next, we will explore its applications in biomedical science, with a focus on single-cell and genetic datasets, highlighting the biological insights gained from applying this computational approach. The tutorial will conclude with an interactive, hands-on session using Google Colab, where participants can apply the techniques themselves and explore the approach further. This session is designed to be accessible to participants of all coding and machine learning experience levels, whether you're new to machine learning or curious about its intersection with biomedical applications.
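To make the core idea concrete, here is a minimal sketch of the datamodels regression on synthetic data: each "datamodel" is a sparse linear model that predicts a model output (e.g., a margin) for one target example from a 0/1 mask recording which training points were in the training subset. The repository uses `fast_l1` for this step; scikit-learn's `Lasso` stands in here, and all numbers are simulated.

```python
# Minimal sketch of the datamodels regression (synthetic stand-in data).
# In the real pipeline, `masks` records which training points were in each
# random subset and `margins` holds trained models' outputs on one target
# example; here both are simulated.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_train, n_models, alpha = 1000, 5000, 0.5   # alpha = subsampling fraction

# Each row: a random alpha-fraction subset of the training set (0/1 mask).
masks = (rng.random((n_models, n_train)) < alpha).astype(np.float32)

# Simulate margins: a few training points truly influence the target example.
true_weights = np.zeros(n_train)
true_weights[:10] = 1.0
margins = masks @ true_weights + rng.normal(0.0, 0.1, n_models)

# The datamodel: sparse linear regression from subset masks to margins.
datamodel = Lasso(alpha=0.01).fit(masks, margins)
influential = np.argsort(-np.abs(datamodel.coef_))[:10]
print("estimated most influential training points:", sorted(influential))
```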
This repository contains code to reproduce the example experiment given in the tutorial (i.e., `datamodels.pt`). For the purposes of this tutorial, this repository adapts code by the Madry Lab (here) and relies on theory presented in Ilyas et al. (here).
## Set Up the Environment

```bash
conda env create -f environment.yml --name ffcv
conda activate ffcv
pip install tqdm ffcv pyyaml fastargs ray torchvision fast_l1 notebook matplotlib
pip install "ray[tune]"
# install fast_l1 from https://github.com/MadryLab/fast_l1
# optionally install ipykernel to use the notebook interface
conda install ipykernel
python -m ipykernel install --user --name=ffcv
```

## Write the Datasets

```bash
conda activate ffcv
python write_datasets.py --data.train_dataset ./CIFAR10/cifar10_train_subset_binaryLabels.beton \
    --data.val_dataset ./CIFAR10/cifar10_val_subset_binaryLabels.beton \
    --data.binary_labels True \
    --data.subset_indices 25000  # subset the training set to 25k samples
```

*Optional:* if you plan on using the same dataset and alpha as in the example, move on to step 3.
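For orientation, the dataset-writing step boils down to serializing (image, label) pairs into ffcv's `.beton` format. The sketch below is an assumption about what `write_datasets.py` does internally; the field names and the class-binarization rule (`label >= 5`) are illustrative, not taken from the repo:

```python
# Sketch: writing a CIFAR-10 subset with binary labels to an ffcv .beton file.
import numpy as np
import torchvision
from ffcv.writer import DatasetWriter
from ffcv.fields import IntField, RGBImageField

cifar = torchvision.datasets.CIFAR10('./CIFAR10', train=True, download=True)

# Keep the first 25k examples and collapse the 10 classes to 2 (assumed rule).
subset = [(np.asarray(img), int(label >= 5)) for img, label in
          (cifar[i] for i in range(25000))]

writer = DatasetWriter('./CIFAR10/cifar10_train_subset_binaryLabels.beton',
                       {'image': RGBImageField(), 'label': IntField()})
writer.from_indexed_dataset(subset)
```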
## Inspect the Dataloader (Optional)
Before starting training, you can inspect the dataloader by running the following notebook:
`inspect_dataloader.ipynb`
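If you want to poke at the data outside the notebook, a loader over the `.beton` file can be built roughly as follows; the decoder/transform pipelines here are a minimal guess, not necessarily the ones the repo uses:

```python
# Sketch: inspecting a few batches from the .beton file written above.
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import IntDecoder, SimpleRGBImageDecoder
from ffcv.transforms import ToTensor

loader = Loader('./CIFAR10/cifar10_train_subset_binaryLabels.beton',
                batch_size=256,
                num_workers=4,
                order=OrderOption.SEQUENTIAL,
                pipelines={'image': [SimpleRGBImageDecoder(), ToTensor()],
                           'label': [IntDecoder(), ToTensor()]})

images, labels = next(iter(loader))
print(images.shape, images.dtype)   # batches come back as HWC uint8 tensors here
print(labels[:10].flatten())        # binary labels
```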
## Verify Training

Ensure that the model training is functioning correctly by running the training notebook:
`train_a_good_model.ipynb`
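The notebook walks through this in full; for orientation, a compressed training loop over the ffcv loader might look like the sketch below. The architecture and hyperparameters are placeholders, not the tutorial's:

```python
# Sketch: a lightweight training loop over the ffcv loader (placeholder
# architecture and hyperparameters; the notebook defines the real ones).
import torch
from torch import nn
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import IntDecoder, SimpleRGBImageDecoder
from ffcv.transforms import ToTensor

device = 'cuda' if torch.cuda.is_available() else 'cpu'
loader = Loader('./CIFAR10/cifar10_train_subset_binaryLabels.beton',
                batch_size=512, num_workers=4, order=OrderOption.RANDOM,
                pipelines={'image': [SimpleRGBImageDecoder(), ToTensor()],
                           'label': [IntDecoder(), ToTensor()]})

# Tiny placeholder network for the binary task.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),
).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for images, labels in loader:
        x = images.permute(0, 3, 1, 2).float().to(device) / 255.0  # HWC -> CHW
        y = labels.flatten().long().to(device)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```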
## Parameter Tuning for Alpha

To fine-tune your model parameters for a specific alpha value, use the notebook:
`train_a_better_model.ipynb`

Make sure you have the wandb library installed (`pip install wandb`).
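Under the hood, the search over training hyperparameters at a fixed alpha can be driven by Ray Tune. The sketch below is illustrative: the search space, metric name, and resource requests are assumptions, and the wandb logging is omitted:

```python
# Sketch: a minimal Ray Tune search for training hyperparameters at a fixed
# alpha. Search space and metric names are illustrative assumptions.
from ray import train, tune

def train_one_config(config):
    # ... build the loader and model, train with config['lr'], evaluate ...
    val_acc = 0.0  # placeholder for the real validation accuracy
    train.report({'val_acc': val_acc})

tuner = tune.Tuner(
    # Fractional GPU requests let several lightweight trials share one device.
    tune.with_resources(train_one_config, {'cpu': 2, 'gpu': 0.25}),
    param_space={'lr': tune.loguniform(1e-3, 1e0),
                 'weight_decay': tune.loguniform(1e-6, 1e-3)},
    tune_config=tune.TuneConfig(metric='val_acc', mode='max', num_samples=20),
)
results = tuner.fit()
print(results.get_best_result().config)
```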
## Train the CIFAR Models

```bash
conda activate ffcv
sbatch launch_headnode.sh
# once the headnode is running, update the address in train_cifar_with_ray.sh
# before submitting the training jobs (the address is found in the .out file
# for the launch_headnode job)
sbatch train_cifar_with_ray.sh
```

## Train the Datamodels

```bash
conda activate ffcv
sbatch train_datamodels.sh
```

## Tested Package Versions

- tqdm: 4.66.5
- ffcv: 1.0.2
- pyyaml: 6.0.2
- fastargs: 1.2.0
- ray: 2.37.0
- torchvision: 0.19.0+cu118
## Hardware

- GPU: NVIDIA Tesla V100-PCIE-32GB
- Memory: 32 GB
- CUDA Capability: Required for running GPU-accelerated tasks.
- NVIDIA Driver Version: 535.183.01
- CUDA Version: 12.2
## Tips

- Update the sbatch files according to your resource availability
- You can adjust the number of simultaneous Ray trials by modifying the `cpus_per_trial` and `gpus_per_trial` parameters in the `config` file (see the sketch after this list)
- Profile your GPU and CPU usage by running `nvidia-smi -l` or `htop`, respectively, on your compute node
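Since the repository configures runs with fastargs, the resource parameters mentioned above plausibly flow into Ray Tune along these lines; the section and parameter names below are assumptions, so check the actual config file for the real ones:

```python
# Sketch: how cpus_per_trial / gpus_per_trial from the config file might be
# wired into Ray Tune via fastargs. Section and parameter names are assumed.
from fastargs import Param, Section, get_current_config
from fastargs.decorators import param
from ray import tune

# Hypothetical section; the repo's config file defines the real names.
Section('resources', 'per-trial resource reservations').params(
    cpus_per_trial=Param(float, 'CPUs reserved per Ray trial', default=2.0),
    gpus_per_trial=Param(float, 'GPU fraction reserved per Ray trial', default=0.25),
)

@param('resources.cpus_per_trial')
@param('resources.gpus_per_trial')
def make_trainable(train_fn, cpus_per_trial, gpus_per_trial):
    # Fractional GPU requests let several lightweight trials share one device.
    return tune.with_resources(train_fn, {'cpu': cpus_per_trial,
                                          'gpu': gpus_per_trial})

config = get_current_config()
config.collect_config_file('config.yaml')  # hypothetical config file name
config.validate()
```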