Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets.

ICLR (2019)

Abstract
Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for standard back-propagation or the chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of the loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modified chain rule as the coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available for the training), and its negation is a descent direction for minimizing the population loss. We further show that the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.
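To make the idea of a coarse gradient concrete, the following is a minimal Python sketch, not the paper's model or code. It assumes a single neuron with squared loss 0.5 * (v * sigma(w.x) - y)^2, where sigma is a 0/1 step activation, rather than the two-linear-layer network studied in the paper. All function names are hypothetical, and the three surrogate derivatives shown are common illustrative choices; the abstract does not say which ones count as "properly chosen".

```python
def binary_act(x):
    """Forward pass: binarized activation, 1 if x > 0 else 0."""
    return 1.0 if x > 0 else 0.0


# Candidate STEs (assumed for illustration): each stands in for the
# derivative of binary_act, which is 0 almost everywhere.
def ste_identity(x):
    """Derivative of the identity function."""
    return 1.0


def ste_relu(x):
    """Derivative of ReLU, max(x, 0)."""
    return 1.0 if x > 0 else 0.0


def ste_clipped_relu(x):
    """Derivative of clipped ReLU, min(max(x, 0), 1)."""
    return 1.0 if 0 < x < 1 else 0.0


def coarse_grad_w(w, v, x, y, ste):
    """Coarse gradient w.r.t. w of 0.5 * (v * binary_act(w.x) - y)**2.

    The chain rule is applied as usual, except that the derivative of
    binary_act is replaced by the surrogate `ste` in the backward pass.
    """
    pre = sum(wi * xi for wi, xi in zip(w, x))  # pre-activation w.x
    residual = v * binary_act(pre) - y          # forward uses the true step
    scale = residual * v * ste(pre)             # backward uses the surrogate
    return [scale * xi for xi in x]


if __name__ == "__main__":
    w, v, x, y = [-0.5, 0.2], 2.0, [1.0, 1.0], 1.0
    for ste in (ste_identity, ste_relu, ste_clipped_relu):
        print(ste.__name__, coarse_grad_w(w, v, x, y, ste))
```

With the true derivative of the step function, the gradient with respect to w would be zero for almost every input. The surrogate is what produces a usable update direction, and different surrogates can give different directions for the same sample.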