Tools for Munging Folding@Home datasets
A tool for automated processing of Folding@home data to produce `mdtraj`-compatible trajectory sets.
Maintainers:
- Kyle A. Beauchamp
- John D. Chodera
- Steven Albanese
- Rafal Wiewiora

Supported FAH cores:
- 0x17 [OpenMM]
- 0x18 [OpenMM]
- 0x21 [OpenMM 6.3]
The easiest way to install `fahmunge` and its dependencies is via `conda` (preferably `miniconda`):
conda install --yes -c omnia fahmunge
Basic usage simply specifies a project CSV file and an output path for the munged data:
munge-fah-data --projects projects.csv --outpath /data/choderalab/fah/munged3 --nprocesses 16 --validate
The metadata for FAH is a CSV file located here on the `choderalab` FAH servers:
/data/choderalab/fah/Software/FAHMunge/projects.csv
This file specifies the project number, the location of the FAH data, a reference PDB file (or files) to be used for munging, and the MDTraj DSL topology selection to be used for extracting solute coordinates of interest.
For example:
```
project,location,pdb,topology_selection
"10491","/home/server.140.163.4.245/server2/data/SVR2359493877/PROJ10491/","/home/server.140.163.4.245/server2/projects/GPU/p10491/topol-renumbered-explicit.pdb","not water"
"10492","/home/server.140.163.4.245/server2/data/SVR2359493877/PROJ10492/","/home/server.140.163.4.245/server2/projects/GPU/p10492/topol-renumbered-explicit.pdb","not (water or resname NA or resname CL)"
"10495","/home/server.140.163.4.245/server2/data/SVR2359493877/PROJ10492/","/home/server.140.163.4.245/server2/projects/GPU/p10495/MTOR_HUMAN_D0/RUN%(run)d/system.pdb","not (water or resname NA or resname CL)"
```
The `pdb` field points the pipeline toward a PDB file to look at for numbering atoms in the munged data. The top two lines are examples of using a single PDB for all RUNs in the project. The third line shows how to use a different PDB for each RUN: `%(run)d` is substituted by the run number via `filename % vars()` in Python, which allows run numbers or other local Python variables to be substituted. Substitution is only performed on a per-run basis, not per-clone.
The projects CSV file will undergo minimal validation automatically to make sure all data and file paths can be found.
More advanced usage allows additional arguments to be specified:
- `--nprocesses <NPROCESSES>` will parallelize munging by RUN using `multiprocessing` if `NPROCESSES > 1` is specified. By default, `NPROCESSES = 1`.
- `--time <TIME_LIMIT>` specifies that munging should move on to another phase or project after the given time limit (in seconds) is reached, once it is safe to move on. This is useful for ensuring that some munging occurs on all projects of interest every day.
- `--verbose` will produce verbose output.
- `--maxits <MAXITS>` will cause the munging pipeline to run for the specified number of iterations and then exit. This can be useful for debugging. Without this option, munging will run indefinitely.
- `--sleeptime <SLEEPTIME>` will cause munging to sleep for the specified number of seconds if no work was done in this iteration (default: 3600).
- `--validate` will check that the `topology_selection` MDTraj DSL selection queries are valid; note that this may take a significant amount of time, so it is optional behavior (see the sketch after this list).
- `--compress-xml` will compress `.xml` files after unpacking them from old-WS-style result packages to save space.
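Conceptually, validating a `topology_selection` amounts to checking that MDTraj can parse each selection and apply it to the project's reference PDB. A minimal sketch of such a check (an illustration only, not the actual `fahmunge` implementation; it assumes `mdtraj` is installed and `projects.csv` is in the working directory):

```python
# Illustration only: a rough check of what --validate has to verify for each
# project in projects.csv; this is not the actual fahmunge implementation.
import csv
import mdtraj as md

with open("projects.csv") as f:  # assumed to be in the working directory
    for row in csv.DictReader(f):
        pdb_path = row["pdb"]
        if "%(run)d" in pdb_path:
            pdb_path = pdb_path % {"run": 0}  # resolve the per-run template for a spot check
        topology = md.load(pdb_path).topology
        atom_indices = topology.select(row["topology_selection"])
        print(row["project"], "selects", len(atom_indices), "atoms")
```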
To run on the choderalab work server:
- Log in to the work server using the usual FAH login.
- Check if the script is already running (`screen -r -d`). If it is, stop here.
- Start a `screen` session.
- Run with: `munge-fah-data --projects /data/choderalab/fah/projects.csv --outpath /data/choderalab/fah/munged-data --time 600 --nprocesses 16`
- To stop, press Ctrl-C when the script is in the "sleep" phase.
Overall Pipeline (Core17/18):
- Extract XTC data from `bzip`s
- Append all-atom coordinates and filenames to an HDF5 file
- Extract protein coordinates and filenames from the all-atom HDF5 file into a second HDF5 file
The rate-limiting step appears to be `bunzip`. If we can avoid having the trajectories be double-`bzip`ped by the client, this will speed things up immensely.
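A minimal sketch of these three steps using `mdtraj` (an illustration under assumed file names and selection string, not the actual `fahmunge` implementation):

```python
# Illustration of the three munging steps above using mdtraj;
# file names and the "not water" selection are assumptions, not fahmunge itself.
import bz2
import shutil
import mdtraj as md

# 1. Extract the XTC data from its bzip2 container.
with bz2.open("frame0.xtc.bz2", "rb") as src, open("frame0.xtc", "wb") as dst:
    shutil.copyfileobj(src, dst)

# 2. Append the all-atom coordinates to the per-RUN/CLONE HDF5 file
#    (shown here as a single write; fahmunge appends successive work units).
reference = md.load("system.pdb")
chunk = md.load("frame0.xtc", top=reference.topology)
chunk.save_hdf5("all-atoms.h5")

# 3. Extract the solute into a second, much smaller HDF5 file.
solute = chunk.atom_slice(reference.topology.select("not water"))
solute.save_hdf5("no-solvent.h5")
```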
Munged `no-solvent` data is `rsync`ed nightly from `plfah1` and `plfah2` to `hal.cbio.mskcc.org` via the `choderalab` robot user account to:
/cbio/jclab/projects/fah/fah-data/munged
This is done via a `crontab`:
```
# kill any rsyncs already in progress
42 00 * * * skill rsync
# munged3
04 01 * * * rsync -av --append-verify --bwlimit=1000 --chmod=g-w,g+r,o-w,o+r server@plfah1.mskcc.org:/data/choderalab/fah/munged2/no-solvent /cbio/jclab/projects/fah/fah-data/munged3 >> $HOME/plfah1-rsync3-no-solvent.log 2>&1
38 02 * * * rsync -av --append-verify --bwlimit=1000 --chmod=g-w,g+r,o-w,o+r server@plfah2.mskcc.org:/data/choderalab/fah/munged2/no-solvent /cbio/jclab/projects/fah/fah-data/munged3 >> $HOME/plfah2-rsync3-no-solvent.log 2>&1
34 03 * * * rsync -av --append-verify --bwlimit=1000 --chmod=g-w,g+r,o-w,o+r server@plfah1.mskcc.org:/data/choderalab/fah/munged2/all-atoms /cbio/jclab/projects/fah/fah-data/munged3 >> $HOME/plfah1-rsync3-all-atoms.log 2>&1
50 03 * * * rsync -av --append-verify --bwlimit=1000 --chmod=g-w,g+r,o-w,o+r server@plfah2.mskcc.org:/data/choderalab/fah/munged2/all-atoms /cbio/jclab/projects/fah/fah-data/munged3 >> $HOME/plfah2-rsync3-all-atoms.log 2>&1
```
To install this `crontab` as the `choderalab` user:
crontab ~/crontab
To list the active `crontab`:
crontab -l
Transfers are logged in the `choderalab` account:
```
plfah1-rsync3-all-atoms.log
plfah1-rsync3-no-solvent.log
plfah2-rsync3-all-atoms.log
plfah2-rsync3-no-solvent.log
```
Project skeleton based on the Computational Chemistry Python Cookiecutter.