
Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters

  • Regular Paper
  • Published in: CCF Transactions on High Performance Computing

Abstract

Distributed deep neural network (DDNN) training becomes increasingly compelling as DNN models grow complex and datasets grow large. Through an in-depth analysis of the latest Microsoft GPU cluster trace, we show that the co-located Parameter Server (PS) configuration is not uncommon in production DDNN training clusters, which inevitably causes intense network resource contention among the co-located PS and worker tasks. Our motivation experiments on Amazon EC2 further show that such network resource contention brings severe performance variation to DDNN training jobs. While existing works largely mitigate the inter-job network resource contention, the intra-job (i.e., task-level) network resource contention among the co-located PS and worker tasks has received comparably little attention. To tackle such performance issues, in this paper, we design and implement Nebula, a network bandwidth resource allocation strategy for DDNN training tasks, in order to mitigate the network resource contention and alleviate the performance variation of DDNN training jobs. Nebula monitors the weights of the co-located PS and workers and rations the network bandwidth resources between the two tasks by comparing the corresponding task weights. We implement a prototype of Nebula and conduct extensive prototype experiments with representative DNN models trained on Amazon EC2. Our experiment results demonstrate that Nebula can reduce the iteration time of a DDNN training job by up to 25% and improve the cluster resource utilization by up to 30% in comparison to MXNet, yet with practically acceptable runtime overhead.
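The abstract describes Nebula's core policy only at a high level: it compares the weights of the co-located PS and worker tasks and rations the host's bandwidth between them accordingly. As a minimal illustrative sketch of that idea, a weight-proportional split could look like the following (the function name, weight semantics, and proportional rule are our assumptions, not the paper's published implementation):

```python
def ration_bandwidth(link_capacity_mbps: float,
                     ps_weight: float,
                     worker_weight: float) -> tuple[float, float]:
    """Split a host NIC's capacity between the co-located PS and worker
    tasks in proportion to their weights.

    Hypothetical helper: the paper does not publish Nebula's exact
    allocation formula here, so a simple proportional rule is assumed.
    """
    total = ps_weight + worker_weight
    if total <= 0:
        # No weight information yet: fall back to an even split.
        half = link_capacity_mbps / 2
        return half, half
    ps_share = link_capacity_mbps * ps_weight / total
    return ps_share, link_capacity_mbps - ps_share
```

For example, on a 10 Gbps NIC with a 3:1 PS-to-worker weight ratio, the PS task would be capped at 7.5 Gbps and the worker at 2.5 Gbps; in practice such per-task caps could be enforced with a host traffic shaper such as Linux `tc`.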




Notes

  1. We consider the iteration time as the difference between the end times of the pull operations of two adjacent iterations (Zhang et al. 2017).
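The footnote's definition can be stated directly in code. A small sketch (the function and variable names are ours, not from the paper):

```python
def iteration_times(pull_end_times: list[float]) -> list[float]:
    """Per-iteration times under the footnote's definition: iteration i's
    time is the difference between the end times of the pull operations
    of iterations i and i+1 (after Zhang et al. 2017).
    """
    # Pair each pull end time with the next one and take the difference.
    return [later - earlier
            for earlier, later in zip(pull_end_times, pull_end_times[1:])]
```

Note that n recorded pull end times yield n-1 iteration times, so the first iteration after a measurement starts is not counted.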

References

  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. In: Proc. USENIX OSDI, pp. 265–283 (2016)

  • Berral, J.L., Wang, C., Youssef, A.: AI4DL: mining behaviors of deep learning workloads for resource management. In: Proc. USENIX HotCloud (2020)

  • Chen, C., Wang, W., Li, B.: Round-robin synchronization: mitigating communication bottlenecks in parameter servers. In: Proc. IEEE INFOCOM, pp. 532–540 (2019)

  • Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)

  • Gu, J., Chowdhury, M., Shin, K.G., Zhu, Y., Jeon, M., Qian, J., Liu, H., Guo, C.: Tiresias: a GPU cluster manager for distributed deep learning. In: Proc. USENIX NSDI, pp. 485–500 (2019)

  • Guo, J., Liu, F., Lui, J.C.S., Jin, H.: Fair network bandwidth allocation in IaaS datacenters via a cooperative game approach. IEEE/ACM Trans. Netw. 24, 873–886 (2015)

  • Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision. In: Proc. ICML, pp. 1737–1746 (2015)

  • Huang, X.S., Chen, A., Ng, T.: Green, yellow, yield: end-host traffic scheduling for distributed deep learning with TensorLights. In: Proc. IEEE IPDPSW, pp. 430–437 (2019)

  • Jayarajan, A., Wei, J., Gibson, G., Fedorova, A., Pekhimenko, G.: Priority-based parameter propagation for distributed DNN training. In: Talwalkar, A., Smith, V., Zaharia, M. (eds.) Proceedings of Machine Learning and Systems 2019, vol. 3, pp. 132–145 (2019)

  • Jeon, M., Venkataraman, S., Phanishayee, A., Qian, J., Xiao, W., Yang, F.: Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In: Proc. USENIX ATC, pp. 947–960 (2019)

  • Jiang, Y., Zhu, Y., Lan, C., Yi, B., Cui, Y., Guo, C.: A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In: Proc. USENIX OSDI, pp. 463–479 (2020)

  • Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proc. NIPS, pp. 1097–1105 (2012)

  • Mahajan, K., Balasubramanian, A., Singhvi, A., Venkataraman, S., Akella, A., Phanishayee, A., Chawla, S.: Themis: fair and efficient GPU cluster scheduling. In: Proc. USENIX NSDI, pp. 289–304 (2020)

  • Lin, Y., Han, S., Mao, H., Wang, Y., Dally, W.J.: Deep gradient compression: reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887 (2017)

  • Luo, L., Nelson, J., Ceze, L., Phanishayee, A., Krishnamurthy, A.: Parameter Hub: a rack-scale parameter server for distributed deep neural network training. In: Proc. ACM SoCC, pp. 41–54 (2018)

  • Luo, L., West, P., Krishnamurthy, A., Ceze, L., Nelson, J.: PLink: discovering and exploiting datacenter network locality for efficient cloud-based distributed training. In: Proc. MLSys (2020)

  • Mai, L., Hong, C., Costa, P.: Optimizing network performance in distributed machine learning. In: Proc. USENIX HotCloud (2015)

  • Mayer, R., Jacobsen, H.A.: Scalable deep learning on distributed infrastructures: challenges, techniques, and tools. ACM Comput. Surv. 53, 1–37 (2020)

  • Mirhoseini, A., Pham, H., Le, Q.V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., Dean, J.: Device placement optimization with reinforcement learning. In: Proc. ICML, pp. 2430–2439 (2017)

  • Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N.R., Ganger, G.R., Gibbons, P.B., Zaharia, M.: PipeDream: generalized pipeline parallelism for DNN training. In: Proc. ACM SOSP, pp. 1–15 (2019)

  • Panayiotou, T., Manousakis, K., Chatzis, S.P., Ellinas, G.: A data-driven bandwidth allocation framework with QoS considerations for EONs. J. Lightwave Technol. 37, 1853–1864 (2019)

  • Peng, Y., Bao, Y., Chen, Y., Wu, C., Guo, C.: Optimus: an efficient dynamic resource scheduler for deep learning clusters. In: Proc. EuroSys, pp. 1–14 (2018)

  • Peng, Y., Zhu, Y., Chen, Y., Bao, Y., Yi, B., Lan, C., Wu, C., Guo, C.: A generic communication scheduler for distributed DNN training acceleration. In: Proc. ACM SOSP, pp. 16–29 (2019)

  • Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall, New York (2020)

  • Shen, D., Luo, J., Dong, F., Jin, J., Zhang, J., Shen, J.: Facilitating application-aware bandwidth allocation in the cloud with one-step-ahead traffic information. IEEE Trans. Serv. Comput. 13, 381–394 (2020)

  • Shi, S., Chu, X., Li, B.: MG-WFBP: efficient data communication for distributed synchronous SGD algorithms. In: Proc. IEEE INFOCOM, pp. 172–180 (2019)

  • Shi, S., Wang, Q., Chu, X., Li, B., Qin, Y., Liu, R., Zhao, X.: Communication-efficient distributed deep learning with merged gradient sparsification on GPUs. In: Proc. IEEE INFOCOM (2020)

  • Ukidave, Y., Li, X., Kaeli, D.: Mystic: predictive scheduling for GPU based cloud servers using machine learning. In: Proc. IEEE IPDPS, pp. 353–362 (2016)

  • Wang, C., Zhang, S., Chen, Y., Qian, Z., Wu, J., Xiao, M.: Joint configuration adaptation and bandwidth allocation for edge-based real-time video analytics. In: Proc. IEEE INFOCOM, pp. 1–10 (2020)

  • Wang, Q., Shi, S., Wang, C., Chu, X.: Communication contention aware scheduling of multiple deep learning training jobs. arXiv preprint arXiv:2002.10105 (2020)

  • Wang, S., Li, D., Geng, J.: Geryon: accelerating distributed CNN training by network-level flow scheduling. In: Proc. IEEE INFOCOM, pp. 1678–1687 (2020)

  • Xu, F., Ye, W., Liu, Y., Zhang, W.: UFalloc: towards utility max-min fairness of bandwidth allocation for applications in datacenter networks. Mobile Netw. Appl. 22, 161–173 (2017)

  • Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu, Z., Wei, J., Xie, P., Xing, E.P.: Poseidon: an efficient communication architecture for distributed deep learning on GPU clusters. In: Proc. USENIX ATC, pp. 181–193 (2017)


Acknowledgements

This work was supported in part by the NSFC under grant No. 61972158, in part by the Science and Technology Commission of Shanghai Municipality under grants No. 20511102802 and No. 18DZ2270800, and in part by the Tencent Corporation. Li Chen's work was supported by a grant from BoRSF-RCS under the contract LEQSF(2019-22)-RD-A-21. Zhi Zhou's work was supported in part by the NSFC under grant No. 61802449.

Author information

Authors and Affiliations

  1. Shanghai Key Laboratory of Multidimensional Information Processing, School of Computer Science and Technology, East China Normal University, Shanghai, China

    Qiang Qi & Fei Xu

  2. School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, USA

    Li Chen

  3. Guangdong Key Laboratory of Big Data Analysis and Processing, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China

    Zhi Zhou

Authors
  1. Qiang Qi
  2. Fei Xu
  3. Li Chen
  4. Zhi Zhou

Corresponding author

Correspondence to Fei Xu.

About this article

Cite this article

Qi, Q., Xu, F., Chen, L. et al.: Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters. CCF Trans. HPC 3, 171–185 (2021). https://doi.org/10.1007/s42514-021-00064-x
