Abstract
Low precision training can significantly reduce the computational overhead of training deep neural networks (DNNs). Though many such techniques exist, cyclic precision training (CPT), which dynamically adjusts precision throughout training according to a cyclic schedule, achieves particularly impressive improvements in training efficiency, while actually improving DNN performance. Existing CPT implementations take common learning rate schedules (e.g., cyclical cosine schedules) and use them for low precision training without adequate comparisons to alternative scheduling options. We define a diverse suite of CPT schedules and analyze their performance across a variety of DNN training regimes, some of which are unexplored in the low precision training literature (e.g., node classification with graph neural networks). From these experiments, we discover alternative CPT schedules that offer further improvements in training efficiency and model performance, and we derive a set of best practices for choosing CPT schedules. Going further, we find that a correlation exists between model performance and training cost, and that changing the underlying CPT schedule can control the tradeoff between these two variables. To explain this direct correlation, we draw a connection between quantized training and critical learning periods, suggesting that aggressive quantization is a form of learning impairment that can permanently damage model performance.
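To make the idea of a cyclic precision schedule concrete, the sketch below maps each training step to an integer bit-width via a cyclical cosine profile and applies straight-through fake quantization to a toy model. It is a minimal sketch under stated assumptions: the helper names (cyclic_precision, fake_quantize), the 3-to-8 bit range, and the cycle length are illustrative, not the exact implementation evaluated in the paper.

```python
# Minimal sketch of cyclic precision training (CPT), assuming a cyclical
# cosine profile between 3 and 8 bits; helper names and hyperparameters are
# illustrative assumptions, not the paper's exact implementation.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def cyclic_precision(step, steps_per_cycle, min_bits=3, max_bits=8):
    """Map a training step to an integer bit-width via a cosine cycle."""
    phase = (step % steps_per_cycle) / steps_per_cycle   # in [0, 1)
    ramp = 0.5 * (1.0 - math.cos(math.pi * phase))       # ramps 0 -> 1
    return int(round(min_bits + (max_bits - min_bits) * ramp))


def fake_quantize(x, bits):
    """Uniform symmetric fake quantization of a tensor to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / levels
    return torch.round(x / scale).clamp(-levels, levels) * scale


model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(200):
    bits = cyclic_precision(step, steps_per_cycle=50)
    x, y = torch.randn(32, 16), torch.randn(32, 4)
    # Straight-through estimator: quantized weights in the forward pass,
    # full-precision gradients in the backward pass.
    w = model.weight
    w_q = w + (fake_quantize(w, bits) - w).detach()
    loss = F.mse_loss(F.linear(x, w_q, model.bias), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Under this profile, the bit-width ramps from the lower to the upper bound within every 50-step cycle, mirroring how CPT varies precision cyclically in the same spirit as cyclical learning rate schedules.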
Data Availability
All data used in this work is open-source and accessible online.
Code Availability
Notes
Related work (Chen et al., 2022b) considers a sampling rate for each profile, which controls the frequency of hyperparameter updates. This sampling rate is less pertinent to precision schedules because the precision is always rounded to the nearest integer, which makes updates less frequent (see the sketch after these notes).
Cosine and linear function profiles are symmetric, which causes horizontal and vertical reflections to be identical.
Prior work (Fu et al., 2021) shows that SBM outperforms other static techniques for quantized training.
The lesser impact of \(\texttt{Q-Agg}\) on OGBN-Products is due to the use of neighborhood sampling. The aggregation process computes a sum over all neighboring features, which can generate large, numerically unstable components unless the sum is truncated to a smaller, fixed number of neighboring features.
The manner in which learning rate decay is performed could impact the final result of critical learning period experiments (Achille et al., 2018). We test multiple learning rate decay strategies and find that they perform similarly. As such, we adopt a simple schedule that decays the learning rate normally throughout training.
Here, each number represents an epoch. The window [100, 600] means that low precision training was performed between epochs 100 and 600.
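The following sketch illustrates the rounding argument from the first note above: a continuously sampled cosine profile changes value at nearly every step, whereas the same profile rounded to an integer bit-width changes only a few times per cycle, so a per-step sampling rate adds little for precision schedules. The profile function, bit range, and cycle length are illustrative assumptions.

```python
# Illustrative sketch (assumed helper names): count how often a continuous
# cosine profile changes versus its integer-rounded precision counterpart.
import math


def cosine_profile(step, steps_per_cycle, low, high):
    phase = (step % steps_per_cycle) / steps_per_cycle
    return low + (high - low) * 0.5 * (1.0 - math.cos(math.pi * phase))


steps_per_cycle = 1000
continuous_changes, integer_changes = 0, 0
prev_value, prev_bits = None, None

for step in range(steps_per_cycle):
    value = cosine_profile(step, steps_per_cycle, low=3, high=8)
    bits = round(value)  # precision is rounded to the nearest integer
    if prev_value is not None and value != prev_value:
        continuous_changes += 1
    if prev_bits is not None and bits != prev_bits:
        integer_changes += 1
    prev_value, prev_bits = value, bits

# The continuous profile changes at (nearly) every step, while the rounded
# bit-width changes only a handful of times per cycle (here, 5 times).
print(continuous_changes, integer_changes)
```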
References
Achille, A., Rovere, M., & Soatto, S. (2018). Critical learning periods in deep networks. In International conference on learning representations.
Ash, J., & Adams, R. P. (2020). On warm-starting neural network training. Advances in Neural Information Processing Systems, 33, 3884–3894.
Banner, R., Hubara, I., Hoffer, E., & Soudry, D. (2018). Scalable methods for 8-bit training of neural networks. Advances in Neural Information Processing Systems, 31.
Bhalgat, Y., Lee, J., Nagel, M., Blankevoort, T., & Kwak, N. (2020). LSQ+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 696–697).
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Chen, J., Wolfe, C., Li, Z., & Kyrillidis, A. (2022a). Demon: Improved neural network training with momentum decay. In ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 3958–3962). IEEE.
Chen, J., Wolfe, C., & Kyrillidis, A. (2022b). REX: Revisiting budgeted training with an improved schedule. Proceedings of Machine Learning and Systems, 4, 64–76.
Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S. R., Schwenk, H., & Stoyanov, V. (2018). XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv:1602.02830
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., & Modha, D. S. (2019). Learned step size quantization. arXiv:1902.08153
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
Fast.ai. (2020). Training a state-of-the-art model. GitHub.
Feng, B., Wang, Y., Li, X., Yang, S., Peng, X., & Ding, Y. (2020). SGQuant: Squeezing the last bit on graph neural networks with specialized quantization. In 2020 IEEE 32nd international conference on tools with artificial intelligence (ICTAI) (pp. 1044–1052). IEEE.
Fu, Y., Guo, H., Li, M., Yang, X., Ding, Y., Chandra, V., & Lin, Y. (2021). CPT: Efficient deep neural network training via cyclic precision. arXiv:2101.09868
Fu, Y., You, H., Zhao, Y., Wang, Y., Li, C., Gopalakrishnan, K., Wang, Z., & Lin, Y. (2020). FracTrain: Fractionally squeezing bit savings both temporally and spatially for efficient DNN training. Advances in Neural Information Processing Systems, 33, 12127–12139.
Golatkar, A. S., Achille, A., & Soatto, S. (2019). Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. Advances in Neural Information Processing Systems, 32.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., & He, K. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677
Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015). Deep learning with limited numerical precision. In International conference on machine learning (pp. 1737–1746). PMLR.
Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., & Leskovec, J. (2020). Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33, 22118–22133.
HuggingFace. (2023). Text classification examples. GitHub.
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., & Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2704–2713).
Jung, S., Son, C., Lee, S., Son, J., Han, J.-J., Kwak, Y., Hwang, S. J., & Choi, C. (2019). Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4350–4359).
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv:1609.02907
Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv:1608.08710
Li, F., Zhang, B., & Liu, B. (2016). Ternary weight networks. arXiv:1605.04711
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).
Loshchilov, I., & Hutter, F. (2016). SGDR: Stochastic gradient descent with warm restarts. arXiv:1608.03983
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. (2017). Mixed precision training. arXiv:1710.03740
Park, E., & Yoo, S. (2020). PROFIT: A novel training method for sub-4-bit MobileNet models. In European conference on computer vision (pp. 430–446). Springer.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in PyTorch.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510–4520).
Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. (2022). BLOOM: A 176B-parameter open-access multilingual language model. arXiv:2211.05100
Smith, L. N. (2017). Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV) (pp. 464–472). IEEE.
Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay. arXiv:1803.09820
Smith, L. N. (2022). General cyclical training of neural networks. arXiv:2202.08835
Tailor, S. A., Fernandez-Marques, J., & Lane, N. D. (2020). Degree-Quant: Quantization-aware training for graph neural networks. arXiv:2008.05000
Wan, C., Li, Y., Wolfe, C. R., Kyrillidis, A., Kim, N. S., & Lin, Y. (2022). PipeGCN: Efficient full-graph training of graph convolutional networks with pipelined feature communication. arXiv:2203.10428
Wang, K., Liu, Z., Lin, Y., Lin, J., & Han, S. (2019). HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8612–8620).
Wang, M. Y. (2019). Deep graph library: Towards efficient and scalable deep learning on graphs. In ICLR workshop on representation learning on graphs and manifolds.
Wang, N., Choi, J., Brand, D., Chen, C.-Y., & Gopalakrishnan, K. (2018). Training deep neural networks with 8-bit floating point numbers. Advances in Neural Information Processing Systems, 31.
Wang, X., Yu, F., Dou, Z.-Y., Darrell, T., & Gonzalez, J. E. (2018). SkipNet: Learning dynamic routing in convolutional networks. In Proceedings of the European conference on computer vision (ECCV) (pp. 409–424).
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations (pp. 38–45).
Wu, C.-Y., Girshick, R., He, K., Feichtenhofer, C., & Krahenbuhl, P. (2020). A multigrid method for efficiently training video models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 153–162).
Xu, Y., Zhang, S., Qi, Y., Guo, J., Lin, W., & Xiong, H. (2018). DNQ: Dynamic network quantization. arXiv:1812.02375
Yang, Y., Deng, L., Wu, S., Yan, T., Xie, Y., & Li, G. (2020). Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks, 125, 70–82.
Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent neural network regularization. arXiv:1409.2329
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., & Zou, Y. (2016). DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160
Funding
NSF FET: Small no. 1907936, NSF MLWiNS CNS no. 2003137, and NSF CAREER award no. 2145629.
Author information
Authors and Affiliations
Department of Computer Science, Rice University, 6100 Main Street, Houston, TX, 77005, USA
Cameron R. Wolfe & Anastasios Kyrillidis
Contributions
CRW was the primary researcher on this project with AK acting as the advisor.
Corresponding author
Correspondence to Cameron R. Wolfe.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editors: Vu Nguyen, Dani Yogatama.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wolfe, C.R., Kyrillidis, A. Better schedules for low precision training of deep neural networks. Mach Learn 113, 3569–3587 (2024). https://doi.org/10.1007/s10994-023-06480-0