Better schedules for low precision training of deep neural networks

Published in: Machine Learning (2024)

Abstract

Low precision training can significantly reduce the computational overhead of training deep neural networks (DNNs). Though many such techniques exist, cyclic precision training (CPT), which dynamically adjusts precision throughout training according to a cyclic schedule, achieves particularly impressive improvements in training efficiency, while actually improving DNN performance. Existing CPT implementations take common learning rate schedules (e.g., cyclical cosine schedules) and use them for low precision training without adequate comparisons to alternative scheduling options. We define a diverse suite of CPT schedules and analyze their performance across a variety of DNN training regimes, some of which are unexplored in the low precision training literature (e.g., node classification with graph neural networks). From these experiments, we discover alternative CPT schedules that offer further improvements in training efficiency and model performance, as well as derive a set of best practices for choosing CPT schedules. Going further, we find that a correlation exists between model performance and training cost, and that changing the underlying CPT schedule can control the tradeoff between these two variables. To explain the direct correlation between model performance and training cost, we draw a connection between quantized training and critical learning periods, suggesting that aggressive quantization is a form of learning impairment that can permanently damage model performance.
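
To make the schedule concrete, the sketch below implements a cosine-shaped cyclic precision schedule that ramps the bit-width between a minimum and maximum value within each cycle and rounds it to the nearest integer. It is a minimal illustration of the general CPT idea; the function name, cycle length, and bit-width range are illustrative choices rather than the paper's exact configuration.

import math

def cyclic_precision(step, cycle_length, min_bits=3, max_bits=8):
    """Return the integer bit-width to use at a given training step.

    A cosine-shaped profile ramps the precision from min_bits up to max_bits
    over each cycle; the result is rounded to the nearest integer bit-width.
    (Illustrative values, not the settings used in the paper.)
    """
    t = (step % cycle_length) / cycle_length    # position within the cycle, in [0, 1)
    ramp = 0.5 * (1.0 - math.cos(math.pi * t))  # cosine ramp from 0 to 1
    return round(min_bits + (max_bits - min_bits) * ramp)

# Example: bit-widths sampled across one 100-step cycle.
print([cyclic_precision(s, cycle_length=100) for s in range(0, 100, 10)])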


Data Availability

All data used in this work is open-source and accessible online.

Code Availability

Notes

  1. Related work (Chen et al., 2022b) considers a sampling rate for each profile, which controls how frequently the hyperparameter is updated. This sampling rate is less pertinent to precision schedules because precision is always rounded to the nearest integer, which already makes updates less frequent.

  2. Cosine and linear function profiles are symmetric, so their horizontal and vertical reflections are identical (see the sketch after these notes).

  3. Prior work (Fu et al., 2021) shows that SBM outperforms other static techniques for quantized training.

  4. The lesser impact of Q-Agg on OGBN-Products is due to the use of neighborhood sampling. The aggregation process computes a sum over all neighboring features, which can generate large, numerically unstable components unless the sum is truncated to a smaller, fixed number of neighboring features.

  5. The manner in which learning rate decay is performed could impact the final result of critical learning period experiments (Achille et al., 2018). We test multiple learning rate decay strategies and find that they perform similarly. As such, we adopt a simple schedule that decays the learning rate normally throughout training.

  6. Here, each number represents an epoch. The window [100, 600] means that low precision training was performed between epochs 100 and 600.
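
As a concrete illustration of note 2, the sketch below treats a schedule profile as a function mapping the normalized cycle position t in [0, 1] to a value in [0, 1]; the horizontal reflection evaluates p(1 - t), the vertical reflection evaluates 1 - p(t), and for a symmetric profile such as the cosine the two coincide. The function names are illustrative and are not taken from the paper's code.

import math

def cosine_profile(t):
    """Symmetric cosine profile mapping t in [0, 1] to a value in [0, 1]."""
    return 0.5 * (1.0 - math.cos(math.pi * t))

def horizontal_reflection(profile, t):
    """Reflect the profile in time: evaluate p(1 - t)."""
    return profile(1.0 - t)

def vertical_reflection(profile, t):
    """Reflect the profile in value: evaluate 1 - p(t)."""
    return 1.0 - profile(t)

# For a symmetric profile, both reflections yield the same schedule.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert abs(horizontal_reflection(cosine_profile, t)
               - vertical_reflection(cosine_profile, t)) < 1e-12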

References

  • Achille, A., Rovere, M., & Soatto, S. (2018). Critical learning periods in deep networks. In International conference on learning representations.

  • Ash, J., & Adams, R. P. (2020). On warm-starting neural network training. Advances in Neural Information Processing Systems, 33, 3884–3894.

  • Banner, R., Hubara, I., Hoffer, E., & Soudry, D. (2018). Scalable methods for 8-bit training of neural networks. Advances in Neural Information Processing Systems, 31.

  • Bhalgat, Y., Lee, J., Nagel, M., Blankevoort, T., & Kwak, N. (2020). Lsq+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 696–697).

  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

  • Chen, J., Wolfe, C., Li, Z., & Kyrillidis, A. (2022a). Demon: Improved neural network training with momentum decay. In ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 3958–3962). IEEE.

  • Chen, J., Wolfe, C., & Kyrillidis, A. (2022b). Rex: Revisiting budgeted training with an improved schedule. Proceedings of Machine Learning and Systems, 4, 64–76.

  • Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S. R., Schwenk, H., & Stoyanov, V. (2018). Xnli: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics.

  • Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv:1602.02830

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

  • Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., & Modha, D. S. (2019). Learned step size quantization. arXiv:1902.08153

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

  • Fast.ai. (2020). Training a state-of-the-art model. GitHub.

  • Feng, B., Wang, Y., Li, X., Yang, S., Peng, X., & Ding, Y. (2020). Sgquant: Squeezing the last bit on graph neural networks with specialized quantization. In 2020 IEEE 32nd international conference on tools with artificial intelligence (ICTAI) (pp. 1044–1052). IEEE.

  • Fu, Y., Guo, H., Li, M., Yang, X., Ding, Y., Chandra, V., & Lin, Y. (2021). Cpt: Efficient deep neural network training via cyclic precision. arXiv:2101.09868

  • Fu, Y., You, H., Zhao, Y., Wang, Y., Li, C., Gopalakrishnan, K., Wang, Z., & Lin, Y. (2020). Fractrain: Fractionally squeezing bit savings both temporally and spatially for efficient DNN training. Advances in Neural Information Processing Systems, 33, 12127–12139.

  • Golatkar, A. S., Achille, A., & Soatto, S. (2019). Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. Advances in Neural Information Processing Systems, 32.

  • Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., & He, K. (2017). Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv:1706.02677

  • Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015). Deep learning with limited numerical precision. In International conference on machine learning (pp. 1737–1746). PMLR.

  • Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., & Leskovec, J. (2020). Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33, 22118–22133.

  • HuggingFace. (2023). Text Classification Examples. GitHub.

  • Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., & Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2704–2713).

  • Jung, S., Son, C., Lee, S., Son, J., Han, J.-J., Kwak, Y., Hwang, S. J., & Choi, C. (2019). Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4350–4359).

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980

  • Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv:1609.02907

  • Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv:1608.08710

  • Li, F., Zhang, B., & Liu, B. (2016). Ternary weight networks. arXiv:1605.04711

  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).

  • Loshchilov, I., & Hutter, F. (2016). Sgdr: Stochastic gradient descent with warm restarts. arXiv:1608.03983

  • Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. (2017). Mixed precision training. arXiv:1710.03740

  • Park, E., & Yoo, S. (2020). Profit: A novel training method for sub-4-bit mobilenet models. In European conference on computer vision (pp. 430–446). Springer.

  • Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in pytorch.

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510–4520).

  • Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. (2022). Bloom: A 176b-parameter open-access multilingual language model. arXiv:2211.05100

  • Smith, L. N. (2017). Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV) (pp. 464–472). IEEE.

  • Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay. arXiv:1803.09820

  • Smith, L. N. (2022). General cyclical training of neural networks. arXiv:2202.08835

  • Tailor, S. A., Fernandez-Marques, J., & Lane, N. D. (2020). Degree-quant: Quantization-aware training for graph neural networks. arXiv:2008.05000

  • Wan, C., Li, Y., Wolfe, C. R., Kyrillidis, A., Kim, N. S., & Lin, Y. (2022). Pipegcn: Efficient full-graph training of graph convolutional networks with pipelined feature communication. arXiv:2203.10428

  • Wang, K., Liu, Z., Lin, Y., Lin, J., & Han, S. (2019). HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8612–8620).

  • Wang, M. Y. (2019). Deep graph library: Towards efficient and scalable deep learning on graphs. In ICLR workshop on representation learning on graphs and manifolds.

  • Wang, N., Choi, J., Brand, D., Chen, C.-Y., & Gopalakrishnan, K. (2018). Training deep neural networks with 8-bit floating point numbers. Advances in Neural Information Processing Systems, 31.

  • Wang, X., Yu, F., Dou, Z.-Y., Darrell, T., & Gonzalez, J. E. (2018). Skipnet: Learning dynamic routing in convolutional networks. In Proceedings of the European conference on computer vision (ECCV) (pp. 409–424).

  • Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations (pp. 38–45).

  • Wu, C.-Y., Girshick, R., He, K., Feichtenhofer, C., & Krahenbuhl, P. (2020). A multigrid method for efficiently training video models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 153–162).

  • Xu, Y., Zhang, S., Qi, Y., Guo, J., Lin, W., & Xiong, H. (2018). DNQ: Dynamic network quantization. arXiv:1812.02375

  • Yang, Y., Deng, L., Wu, S., Yan, T., Xie, Y., & Li, G. (2020). Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks, 125, 70–82.

  • Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent neural network regularization. arXiv:1409.2329

  • Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., & Zou, Y. (2016). Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160

Funding

This work was supported by NSF FET: Small no. 1907936, NSF MLWiNS CNS no. 2003137, and NSF CAREER award no. 2145629.

Author information

Authors and Affiliations

  1. Department of Computer Science, Rice University, 6100 Main Street, Houston, TX, 77005, USA

    Cameron R. Wolfe & Anastasios Kyrillidis


Contributions

CRW was the primary researcher on this project with AK acting as the advisor.

Corresponding author

Correspondence to Cameron R. Wolfe.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editors: Vu Nguyen, Dani Yogatama.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wolfe, C. R., Kyrillidis, A. Better schedules for low precision training of deep neural networks. Mach Learn 113, 3569–3587 (2024). https://doi.org/10.1007/s10994-023-06480-0
