📄 Optimal Condition for Initialization Variance in Deep Neural Networks: An SGD Dynamics Perspective

Published on arXiv, 2025

☛ Download paper.

Abstract summary:

We analyze Stochastic Gradient Descent (SGD) through its continuous-time approximation as a Fokker-Planck equation. By studying the Kullback-Leibler divergence between the resulting quasi-stationary distribution and the initial weight distribution, we derive an explicit, mathematically grounded bound on the expected loss of Deep Neural Networks (DNNs). Minimizing this bound yields an optimal initialization variance, which we validate experimentally: it outperforms conventional schemes such as He-normal initialization on standard datasets, giving a theoretical foundation to this key hyperparameter.
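To make the comparison concrete, here is a minimal sketch (not taken from the paper) of initializing a layer with an explicitly chosen weight variance alongside a He-normal baseline in PyTorch. The value `optimal_var` is a hypothetical placeholder standing in for the variance derived in the paper, whose formula is not reproduced here.

```python
# Sketch: custom-variance initialization vs. He-normal baseline.
# `optimal_var` below is a hypothetical placeholder, NOT the paper's formula.
import torch
import torch.nn as nn

def init_with_variance(layer: nn.Linear, variance: float) -> None:
    """Draw weights i.i.d. from N(0, variance) and zero the bias."""
    nn.init.normal_(layer.weight, mean=0.0, std=variance ** 0.5)
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)

fan_in, fan_out = 784, 256
layer_custom = nn.Linear(fan_in, fan_out)
layer_he = nn.Linear(fan_in, fan_out)

# Placeholder value for illustration only.
optimal_var = 1.5 / fan_in
init_with_variance(layer_custom, optimal_var)

# He-normal baseline: Var(W) = 2 / fan_in for ReLU networks.
nn.init.kaiming_normal_(layer_he.weight, mode="fan_in", nonlinearity="relu")

print("custom variance:", layer_custom.weight.var().item())
print("He-normal variance:", layer_he.weight.var().item())
```

In an experiment like the one described, one would train otherwise identical networks under both initializations and compare their loss curves; only the variance of the initial weight distribution changes.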