Integration Proposal
- Implement a new module that handles creating the `volumemanage` for VCK.
- Insert logic to provision the `volumemanage` resource and monitor it for completion before executing the training job workload.
- To make this more elastic, we need an algorithm that decides how many data replicas each job needs. We can then create labels/tags that allow users to reuse the same dataset volume.
- We need to choose a shared file storage for all the learner pods (required for many distributed learning methods) and a way to store the model results for our users.
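
As a rough illustration of the provisioning and elasticity items above, here is a minimal Python sketch. Everything in it is an assumption made for discussion: the function names, the `READY`/`FAILED` state strings, the learners-per-replica heuristic, and the hash-based label are hypothetical and are not taken from VCK or FfDL. The status lookup is injected as a callable, so no particular Kubernetes client API is implied.

```python
import hashlib
import math
import time
from typing import Callable


def wait_for_volume(get_state: Callable[[], str],
                    timeout_s: float = 600.0,
                    poll_s: float = 5.0,
                    ready_state: str = "READY",      # hypothetical state name
                    failed_state: str = "FAILED",    # hypothetical state name
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll get_state() until the volume resource is ready.

    Raises RuntimeError if provisioning fails and TimeoutError if the
    resource is still not ready once timeout_s seconds have elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == ready_state:
            return
        if state == failed_state:
            raise RuntimeError("volume provisioning failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("volume not ready after %.0f s" % timeout_s)
        sleep(poll_s)


def replicas_for_job(num_learners: int,
                     learners_per_replica: int,
                     max_replicas: int) -> int:
    """Hypothetical heuristic: one data replica per group of learners,
    capped at max_replicas (for example, the number of usable nodes)."""
    wanted = math.ceil(num_learners / learners_per_replica)
    return max(1, min(max_replicas, wanted))


def dataset_label(source_uri: str) -> str:
    """Derive a stable label value from the dataset location so that jobs
    reading the same data can find and reuse an existing volume.
    16 hex characters fit within Kubernetes' 63-character label limit."""
    return hashlib.sha256(source_uri.encode("utf-8")).hexdigest()[:16]
```

A training job launcher could call `wait_for_volume` with a function that reads the resource status from the cluster, and only start the learner pods after it returns. Injecting `get_state` and `sleep` keeps the wait loop testable without a cluster.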
For more details, please refer to https://github.com/IBM/FfDL/blob/vck-patch/etc/examples/vck-integration.md