Better schedules for low precision training of deep neural networks

Published in: Machine Learning (2024)

Abstract

Low precision training can significantly reduce the computational overhead of training deep neural networks (DNNs). Though many such techniques exist, cyclic precision training (CPT), which dynamically adjusts precision throughout training according to a cyclic schedule, achieves particularly impressive improvements in training efficiency, while actually improving DNN performance. Existing CPT implementations take common learning rate schedules (e.g., cyclical cosine schedules) and use them for low precision training without adequate comparisons to alternative scheduling options. We define a diverse suite of CPT schedules and analyze their performance across a variety of DNN training regimes, some of which are unexplored in the low precision training literature (e.g., node classification with graph neural networks). From these experiments, we discover alternative CPT schedules that offer further improvements in training efficiency and model performance, as well as derive a set of best practices for choosing CPT schedules. Going further, we find that a correlation exists between model performance and training cost, and that changing the underlying CPT schedule can control the tradeoff between these two variables. To explain the direct correlation between model performance and training cost, we draw a connection between quantized training and critical learning periods, suggesting that aggressive quantization is a form of learning impairment that can permanently damage model performance.
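
To make the schedule concrete, the sketch below implements a cosine-shaped cyclic precision schedule that ramps the bit-width between a minimum and maximum value within each cycle and rounds it to the nearest integer. It is a minimal illustration of the general CPT idea; the function name, cycle length, and bit-width range are illustrative choices rather than the paper's exact configuration.

import math

def cyclic_precision(step, cycle_length, min_bits=3, max_bits=8):
    """Return the integer bit-width to use at a given training step.

    A cosine-shaped profile ramps the precision from min_bits up to max_bits
    over each cycle; the result is rounded to the nearest integer bit-width.
    (Illustrative values, not the settings used in the paper.)
    """
    t = (step % cycle_length) / cycle_length    # position within the cycle, in [0, 1)
    ramp = 0.5 * (1.0 - math.cos(math.pi * t))  # cosine ramp from 0 to 1
    return round(min_bits + (max_bits - min_bits) * ramp)

# Example: bit-widths sampled across one 100-step cycle.
print([cyclic_precision(s, cycle_length=100) for s in range(0, 100, 10)])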


Data Availability

All data used in this work is open-source and accessible online.

Code Availability

Notes

  1. Related work (Chen et al., 2022b) considers a sampling rate for each profile, which controls how frequently the hyperparameter is updated. This sampling rate is less pertinent to precision schedules because precision is always rounded to the nearest integer, which already makes updates less frequent.

  2. Cosine and linear function profiles are symmetric, so their horizontal and vertical reflections are identical (see the sketch after these notes).

  3. Prior work (Fu et al., 2021) shows that SBM outperforms other static techniques for quantized training.

  4. The lesser impact of Q-Agg on OGBN-Products is due to the use of neighborhood sampling. The aggregation process computes a sum over all neighboring features, which can generate large, numerically unstable components unless the sum is truncated to a smaller, fixed number of neighboring features.

  5. The manner in which learning rate decay is performed could impact the final result of critical learning period experiments (Achille et al., 2018). We test multiple learning rate decay strategies and find that they perform similarly. As such, we adopt a simple schedule that decays the learning rate normally throughout training.

  6. Here, each number represents an epoch. The window [100, 600] means that low precision training was performed between epochs 100 and 600.
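
As a concrete illustration of note 2, the sketch below treats a schedule profile as a function mapping the normalized cycle position t in [0, 1] to a value in [0, 1]; the horizontal reflection evaluates p(1 - t), the vertical reflection evaluates 1 - p(t), and for a symmetric profile such as the cosine the two coincide. The function names are illustrative and are not taken from the paper's code.

import math

def cosine_profile(t):
    """Symmetric cosine profile mapping t in [0, 1] to a value in [0, 1]."""
    return 0.5 * (1.0 - math.cos(math.pi * t))

def horizontal_reflection(profile, t):
    """Reflect the profile in time: evaluate p(1 - t)."""
    return profile(1.0 - t)

def vertical_reflection(profile, t):
    """Reflect the profile in value: evaluate 1 - p(t)."""
    return 1.0 - profile(t)

# For a symmetric profile, both reflections yield the same schedule.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert abs(horizontal_reflection(cosine_profile, t)
               - vertical_reflection(cosine_profile, t)) < 1e-12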

References

  • Achille, A., Rovere, M., & Soatto, S. (2018). Critical learning periods in deep networks. In International conference on learning representations.

  • Ash, J., & Adams, R. P. (2020). On warm-starting neural network training. Advances in Neural Information Processing Systems, 33, 3884–3894.

  • Banner, R., Hubara, I., Hoffer, E., & Soudry, D. (2018). Scalable methods for 8-bit training of neural networks. Advances in Neural Information Processing Systems, 31.

  • Bhalgat, Y., Lee, J., Nagel, M., Blankevoort, T., & Kwak, N. (2020). Lsq+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 696–697).

  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

  • Chen, J., Wolfe, C., Li, Z., & Kyrillidis, A. (2022a). Demon: Improved neural network training with momentum decay. In ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 3958–3962). IEEE.

  • Chen, J., Wolfe, C., & Kyrillidis, A. (2022b). Rex: Revisiting budgeted training with an improved schedule. Proceedings of Machine Learning and Systems, 4, 64–76.

  • Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S. R., Schwenk, H., & Stoyanov, V. (2018). Xnli: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics.

  • Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv:1602.02830

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

  • Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., & Modha, D. S. (2019). Learned step size quantization. arXiv:1902.08153

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

  • Fast.ai. (2020). Training a state-of-the-art model. GitHub.

  • Feng, B., Wang, Y., Li, X., Yang, S., Peng, X., & Ding, Y. (2020). Sgquant: Squeezing the last bit on graph neural networks with specialized quantization. In 2020 IEEE 32nd international conference on tools with artificial intelligence (ICTAI) (pp. 1044–1052). IEEE.

  • Fu, Y., Guo, H., Li, M., Yang, X., Ding, Y., Chandra, V., & Lin, Y. (2021). Cpt: Efficient deep neural network training via cyclic precision. arXiv:2101.09868

  • Fu, Y., You, H., Zhao, Y., Wang, Y., Li, C., Gopalakrishnan, K., Wang, Z., & Lin, Y. (2020). Fractrain: Fractionally squeezing bit savings both temporally and spatially for efficient DNN training. Advances in Neural Information Processing Systems, 33, 12127–12139.

  • Golatkar, A. S., Achille, A., & Soatto, S. (2019). Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. Advances in Neural Information Processing Systems, 32.

  • Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., & He, K. (2017). Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv:1706.02677

  • Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015). Deep learning with limited numerical precision. In International conference on machine learning (pp. 1737–1746). PMLR.

  • Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., & Leskovec, J. (2020). Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33, 22118–22133.

  • HuggingFace. (2023). Text Classification Examples. GitHub.

  • Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., & Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2704–2713).

  • Jung, S., Son, C., Lee, S., Son, J., Han, J.-J., Kwak, Y., Hwang, S. J., & Choi, C. (2019). Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4350–4359).

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980

  • Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv:1609.02907

  • Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv:1608.08710

  • Li, F., Zhang, B., & Liu, B. (2016). Ternary weight networks. arXiv:1605.04711

  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).

  • Loshchilov, I., & Hutter, F. (2016). Sgdr: Stochastic gradient descent with warm restarts. arXiv:1608.03983

  • Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. (2017). Mixed precision training. arXiv:1710.03740

  • Park, E., & Yoo, S. (2020). Profit: A novel training method for sub-4-bit mobilenet models. In European conference on computer vision (pp. 430–446). Springer.

  • Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in pytorch.

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510–4520).

  • Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. (2022). Bloom: A 176b-parameter open-access multilingual language model. arXiv:2211.05100

  • Smith, L. N. (2017). Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV) (pp. 464–472). IEEE.

  • Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay. arXiv:1803.09820

  • Smith, L. N. (2022). General cyclical training of neural networks. arXiv:2202.08835

  • Tailor, S. A., Fernandez-Marques, J., & Lane, N. D. (2020). Degree-quant: Quantization-aware training for graph neural networks. arXiv:2008.05000

  • Wan, C., Li, Y., Wolfe, C. R., Kyrillidis, A., Kim, N. S., & Lin, Y. (2022). Pipegcn: Efficient full-graph training of graph convolutional networks with pipelined feature communication. arXiv:2203.10428

  • Wang, K., Liu, Z., Lin, Y., Lin, J., & Han, S. (2019). HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8612–8620).

  • Wang, M. Y. (2019). Deep graph library: Towards efficient and scalable deep learning on graphs. In ICLR workshop on representation learning on graphs and manifolds.

  • Wang, N., Choi, J., Brand, D., Chen, C.-Y., & Gopalakrishnan, K. (2018). Training deep neural networks with 8-bit floating point numbers. Advances in Neural Information Processing Systems, 31.

  • Wang, X., Yu, F., Dou, Z.-Y., Darrell, T., & Gonzalez, J. E. (2018). Skipnet: Learning dynamic routing in convolutional networks. In Proceedings of the European conference on computer vision (ECCV) (pp. 409–424).

  • Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations (pp. 38–45).

  • Wu, C.-Y., Girshick, R., He, K., Feichtenhofer, C., & Krahenbuhl, P. (2020). A multigrid method for efficiently training video models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 153–162).

  • Xu, Y., Zhang, S., Qi, Y., Guo, J., Lin, W., & Xiong, H. (2018). DNQ: Dynamic network quantization. arXiv:1812.02375

  • Yang, Y., Deng, L., Wu, S., Yan, T., Xie, Y., & Li, G. (2020). Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks, 125, 70–82.

  • Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent neural network regularization. arXiv:1409.2329

  • Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., & Zou, Y. (2016). Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160

Funding

This work was supported by NSF FET: Small no. 1907936, NSF MLWiNS CNS no. 2003137, and NSF CAREER award no. 2145629.

Author information

Authors and Affiliations

  1. Department of Computer Science, Rice University, 6100 Main Street, Houston, TX, 77005, USA

    Cameron R. Wolfe & Anastasios Kyrillidis


Contributions

CRW was the primary researcher on this project with AK acting as the advisor.

Corresponding author

Correspondence to Cameron R. Wolfe.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editors: Vu Nguyen, Dani Yogatama.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wolfe, C. R., Kyrillidis, A. Better schedules for low precision training of deep neural networks. Mach Learn 113, 3569–3587 (2024). https://doi.org/10.1007/s10994-023-06480-0
