TY - GEN
T1 - Taming the noisy gradient
T2 - 28th International Joint Conference on Artificial Intelligence, IJCAI 2019
AU - Zhang, Yikai
AU - Qu, Hui
AU - Chen, Chao
AU - Metaxas, Dimitris
N1 - Publisher Copyright:
© 2019 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2019
Y1 - 2019
N2 - Deep learning architectures are usually proposed with millions of parameters, resulting in a memory issue when training deep neural networks with stochastic gradient descent type methods using large batch sizes. However, training with small batch sizes tends to produce low quality solution due to the large variance of stochastic gradients. In this paper, we tackle this problem by proposing a new framework for training deep neural network with small batches/noisy gradient. During optimization, our method iteratively applies a proximal type regularizer to make loss function strongly convex. Such regularizer stablizes the gradient, leading to better training performance. We prove that our algorithm achieves comparable convergence rate as vanilla SGD even with small batch size. Our framework is simple to implement and can be potentially combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when batch size is small. Our implementation is available at https://github.com/huiqu18/TRAlgorithm.
AB - Deep learning architectures are usually proposed with millions of parameters, resulting in a memory issue when training deep neural networks with stochastic gradient descent type methods using large batch sizes. However, training with small batch sizes tends to produce low quality solution due to the large variance of stochastic gradients. In this paper, we tackle this problem by proposing a new framework for training deep neural network with small batches/noisy gradient. During optimization, our method iteratively applies a proximal type regularizer to make loss function strongly convex. Such regularizer stablizes the gradient, leading to better training performance. We prove that our algorithm achieves comparable convergence rate as vanilla SGD even with small batch size. Our framework is simple to implement and can be potentially combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when batch size is small. Our implementation is available at https://github.com/huiqu18/TRAlgorithm.
UR - https://www.scopus.com/pages/publications/85074909689
U2 - 10.24963/ijcai.2019/604
DO - 10.24963/ijcai.2019/604
M3 - Conference contribution
AN - SCOPUS:85074909689
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 4348
EP - 4354
BT - Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019
A2 - Kraus, Sarit
PB - International Joint Conferences on Artificial Intelligence
Y2 - 10 August 2019 through 16 August 2019
ER -