RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference
NeurIPS 2020 (2020): 20473-20484
Pooling operators are key components in most Convolutional Neural Networks (CNNs) as they serve to downsample images, aggregate feature information, and increase receptive field. However, standard pooling operators reduce the feature size gradually to avoid significant loss in information via gross aggregation. Consequently, CNN architectures…
- Pooling operators generate aggregate representations of features corresponding to a spatial region and are commonly used in CNNs to down-sample activation maps.
- While MobileNetV2 (Sandler et al, 2018) and EfficientNet (Tan & Le, 2019) do not have explicit pooling layers, they use strided convolutions to down-sample the image, which can be viewed as a weighted average pooling.
- Typical pooling operators use computationally efficient but gross aggregation methods, such as the average or maximum of the inputs, which restricts them to a small receptive field.
- DenseNet121 uses 41 layers to reduce the size of the image from 112×112 to 14×14.
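As an illustration of the aggregation methods above, here is a minimal NumPy sketch (function names and weights are illustrative, not from the paper) of 2×2 average pooling, max pooling, and a strided convolution; with uniform weights the strided convolution reduces exactly to average pooling:

```python
import numpy as np

def pool2x2(x, mode="avg"):
    """Down-sample a (H, W) activation map by 2x with a 2x2 window."""
    H, W = x.shape
    blocks = x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    if mode == "avg":
        return blocks.mean(axis=(1, 3))   # gross aggregation: average
    return blocks.max(axis=(1, 3))        # gross aggregation: maximum

def strided_conv2x2(x, w):
    """2x2 convolution with stride 2: a learned, weighted average pooling."""
    H, W = x.shape
    out = np.empty((H // 2, W // 2))
    for i in range(H // 2):
        for j in range(W // 2):
            out[i, j] = (x[2*i:2*i+2, 2*j:2*j+2] * w).sum()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
avg = pool2x2(x, "avg")
mx = pool2x2(x, "max")
uniform = np.full((2, 2), 0.25)
# with uniform weights, strided convolution equals average pooling
assert np.allclose(strided_conv2x2(x, uniform), avg)
```

This is why strided convolutions in MobileNetV2 and EfficientNet can be viewed as weighted average pooling: they generalize the uniform 0.25 weights to learned ones.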
- We present empirical evidence that the RNNPool operator is compatible with popular CNN architectures for vision tasks and can push the envelope of the compute and memory usage vs. accuracy curve.
- We show that RNNPool combined with a MobileNet-style architecture generates accurate models for Visual Wake Words and face detection problems that can meet the compute and memory budget requirements of ARM Cortex-M4 devices.
- We proposed RNNPool, an efficient non-linear pooling operator based on Recurrent Neural Networks (RNNs).
- The RNNPool operator was used to create the RNNPoolLayer, an effective alternative to the RAM-intensive components in modern architectures.
- Extensive experimentation on Face Detection and Visual Wake Word problems shows that RNNPool based architectures can enable real-time solutions on resource-constrained tiny microcontrollers
- Replacing blocks with an RNNPool Block plus a last-layer RNNPool improves peak memory from 2.29 MB to 0.24 MB (if re-computation is disallowed and memory optimization similar to (Chowdhery et al, 2019) is used), and the number of computations improves slightly from 300 MFLOPs to 226 MFLOPs, while the accuracy on ImageNet-10 is retained: 94.4% for the new model vs. 94.2% for the base model.
- These results extend to other networks like EfficientNet, ResNet and GoogLeNet (Szegedy et al, 2015), where residual connection based functional blocks in the initial parts can be effectively replaced with the RNNPoolLayer with improvements in working memory and compute, while retaining comparable accuracy.
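The two-pass structure of the RNNPool operator described above can be sketched in NumPy. This sketch uses plain tanh RNN cells with random weights as stand-ins for the trained FastGRNN cells (Kusupati et al, 2018) used in the paper, so it shows only the operator's shape and data flow, not its learned behavior:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cell(in_dim, hid_dim):
    """A plain tanh RNN cell with random weights (stand-in for FastGRNN)."""
    Wx = rng.normal(0, 0.1, (hid_dim, in_dim))
    Wh = rng.normal(0, 0.1, (hid_dim, hid_dim))
    b = np.zeros(hid_dim)
    def step(seq):
        h = np.zeros(hid_dim)
        for x in seq:
            h = np.tanh(Wx @ x + Wh @ h + b)
        return h  # final hidden state summarizes the whole sequence
    return step

def rnnpool(patch, h1=8, h2=8):
    """Summarize an (r, c, k) patch into a single 4*h2 vector.

    Pass 1: an RNN sweeps every row and every column of the patch.
    Pass 2: a second RNN sweeps the row summaries and the column
    summaries in both directions; the four final states are concatenated.
    """
    r, c, k = patch.shape
    rnn1 = make_cell(k, h1)
    rnn2 = make_cell(h1, h2)
    row_sum = np.stack([rnn1(patch[i]) for i in range(r)])     # (r, h1)
    col_sum = np.stack([rnn1(patch[:, j]) for j in range(c)])  # (c, h1)
    parts = [rnn2(row_sum), rnn2(row_sum[::-1]),
             rnn2(col_sum), rnn2(col_sum[::-1])]
    return np.concatenate(parts)                               # (4*h2,)

out = rnnpool(rng.normal(size=(6, 6, 4)))
```

The key property is that an arbitrarily large r×c receptive field is collapsed to a single 4·h2 vector in one pass, so only one row/column of RNN state is live at a time, unlike a stack of convolutional layers that keeps whole intermediate activation maps in RAM.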
- Evaluation of RNNPool: the authors present empirical evidence that the RNNPool operator is compatible with popular CNN architectures for vision tasks and can push the envelope of the compute and memory usage vs. accuracy curve.
- The authors show that RNNPool combined with a MobileNet-style architecture generates accurate models for Visual Wake Words and face detection problems that can meet the compute and memory budget requirements of ARM Cortex-M4 devices.
- Hyperparameters: models are trained in PyTorch (Paszke et al, 2019) using SGD with momentum (Sutskever et al, 2013), with weight decay 4 × 10⁻⁵ and momentum 0.9.
- The authors do data-parallel training with 4 NVIDIA P40 GPUs and use a batch size of 256 for classification and 32 for face detection.
- More details about hyperparameters can be found in Appendix D.3.
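The quoted optimizer settings correspond to the classical SGD-with-momentum update with L2 weight decay. A hypothetical single-step sketch (one parameter vector, arbitrary learning rate; not code from the paper):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.05,
                      momentum=0.9, weight_decay=4e-5):
    """One SGD-with-momentum update (Sutskever et al, 2013 style).

    Weight decay is folded into the gradient as an L2 penalty,
    matching the common PyTorch SGD convention.
    """
    g = grad + weight_decay * w          # L2 regularization term
    velocity = momentum * velocity + g   # accumulate momentum
    w = w - lr * velocity
    return w, velocity

w = np.ones(3)
v = np.zeros(3)
w, v = sgd_momentum_step(w, np.array([1.0, 0.0, -1.0]), v)
```

With velocity initialized to zero, the first step is ordinary SGD scaled by the gradient plus the tiny weight-decay term; momentum only starts to matter on subsequent steps.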
- The authors proposed RNNPool, an efficient RNN-based non-linear pooling operator.
- The RNNPool operator was used to create the RNNPoolLayer, which is an effective alternative to the RAM-intensive components in modern architectures.
- Extensive experimentation on Face Detection and Visual Wake Word problems shows that RNNPool based architectures can enable real-time solutions on resource-constrained tiny microcontrollers.
- Going forward, real-world deployment of RNNPool-based solutions for wake-word and similar problems would be of great interest.
- RNNPool can potentially be combined with more efficient computation graphs over the same architectures.
- The authors leave further investigation into optimizing these specific computation graphs as a topic for future research.
- Table 1: ImageNet-10 classification and Visual Wake Words tasks with a few base layers replaced by different pooling strategies (Rows 2-5 reinforce the same finding on bigger image classification datasets). Row 5 refers to the RNNPool Block utilized as shown in Table 2 for MobileNetV2 and Figure 4 for DenseNet121. Average Pooling, Max Pooling and Strided Convolution replace the blocks in the base network at the same position where RNNPool is used in Row 5. Last layer RNNPool (Row 6) replaces the last Average Pooling layer in the models. The last row of the table refers to replacing both the blocks as in Row 5 and the last Average Pooling layer in the base network with RNNPoolLayer.
- Table 2: MobileNetV2-RNNPool: RNNPoolLayer(Rin = 112, Cin = 112, S = 4, r = 6, c = 6, k = 32, h1 = 16, h2 = 16) is used. The rest of the layers are defined as in MobileNetV2 (Sandler et al, 2018). Each line denotes a sequence of layers, repeated n times. The first layer of each bottleneck sequence has stride s and the rest use stride 1. Expansion factor t is multiplied with the input channels to change the width.
- Table 3: The effect of replacing functional blocks in the baseline models with RNNPoolLayer for ImageNet-10 image classification (memory-optimized calculations).
- Table 4: Comparison of resources and accuracy for ImageNet-1K.
- Table 5: Comparison of memory requirement, number of parameters and validation mAP obtained by different methods for face detection on the WIDER FACE dataset. RNNPool-Face-C achieves higher accuracy than the baselines despite using 3× less RAM and 4.5× fewer FLOPs. RNNPool-Face-Quant enables deployment on Cortex-M4 class devices with 6-7% accuracy gains over the cheapest baselines.
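Given the RNNPoolLayer parameters quoted in the Table 2 caption, the layer's output dimensions follow from the patch stride S and the 4·h2-dimensional summary produced per patch. A small helper (a hypothetical convenience function, assuming stride-S tiling with padding as needed) illustrates the arithmetic:

```python
def rnnpool_layer_output_shape(Rin, Cin, S, r, c, h2):
    """Output dims of an RNNPoolLayer: each r x c patch, taken with
    stride S over the (Rin, Cin) input, is collapsed into a single
    'pixel' with 4*h2 channels."""
    Rout = (Rin + S - 1) // S   # ceil(Rin / S)
    Cout = (Cin + S - 1) // S   # ceil(Cin / S)
    return Rout, Cout, 4 * h2

# Table 2 configuration: 112x112 input, stride 4, h2 = 16
print(rnnpool_layer_output_shape(112, 112, 4, 6, 6, 16))  # (28, 28, 64)
```

So a single RNNPoolLayer takes the 112×112 map down to 28×28 in one shot, the reduction that DenseNet121 needs 41 layers (and their intermediate activation maps) to achieve.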
- Pooling: max-pooling, average-pooling and strided convolution layers (LeCun et al, 2015) are standard techniques for feature aggregation and for reducing spatial resolution in DNNs. Existing literature on rethinking pooling (Zhao et al, 2018; He et al, 2015; Gong et al, 2014) focuses mainly on increasing accuracy and does not take compute/memory efficiency into consideration, which is one of the primary focuses of this paper and the RNNPool operator.
- Efficient CNN architectures: Most existing research on the design of efficient CNN models aims at reducing inference cost (FLOPs) and model size. The methods include designing new model architectures such as DenseNet (Huang et al, 2017) and MobileNets (Howard et al, 2017; Sandler et al, 2018), or searching for them (e.g. ProxylessNAS (Cai et al, 2018), EfficientNets (Tan & Le, 2019)). The aforementioned models do not optimize the peak working memory (RAM) of the model, which can be a critical constraint on tiny devices like microcontrollers. Previous work on memory (RAM)-optimized inference manipulates the existing convolution operator by reordering computations (Cho & Brand, 2017; Lai et al, 2018) or performing them in-place (Gural & Murmann, 2019) to save storage. However, most of these methods provide relatively small memory savings and typically apply to small images like those in CIFAR-10 (Krizhevsky et al, 2009). In contrast, RNNPool reduces the memory requirement significantly while maintaining accuracy on various real-world vision tasks and benchmarks.
- In Section 5.2, we show that an RNNPool-based MobileNetV2 architecture can enable solutions with accuracy comparable to the prior art but with about 8× less RAM and 40% lower compute cost (FLOPs).
- We propose an RNNPool-based architecture that can potentially be deployed on Cortex-M4 class devices while still ensuring 5-10% higher accuracy than EagleEye.
- We can increase the accuracy of the base model by more than 1% by replacing the last average pooling layer with an RNNPool Block.
- Even for the lowest resolution image, the baselines' peak RAM is about 40 KB, while our model requires only 34 KB RAM despite using the highest resolution image and ensuring ≈ 4% higher accuracy.
- Acuna, D., Ling, H., Kar, A., and Fidler, S. Efficient interactive annotation of segmentation datasets with polygonrnn++. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 859–868, 2018.
- Bell, S., Lawrence Zitnick, C., Bala, K., and Girshick, R. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
- Chowdhery, A., Warden, P., Shlens, J., Howard, A., and Rhodes, R. Visual wake words dataset. arXiv preprint arXiv:1906.05721, 2019.
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
- Dennis, D. K., Gaurkar, Y., Gopinath, S., Gupta, C., Jain, M., Kumar, A., Kusupati, A., Lovett, C., Patil, S. G., and Simhadri, H. V. EdgeML: Machine Learning for resource-constrained edge devices. URL https://github.com/Microsoft/EdgeML.
- Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
- Gural, A. and Murmann, B. Memory-optimal direct convolutions for maximizing classification accuracy in embedded applications. In International Conference on Machine Learning, pp. 2515–2524, 2019.
- He, K., Zhang, X., Ren, S., and Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904–1916, 2015.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- He, Y., Xu, D., Wu, L., Jian, M., Xiang, S., and Pan, C. LFFD: A light and fast face detector for edge devices. arXiv preprint arXiv:1904.10633, 2019.
- Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
- Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
- Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
- Kusupati, A., Singh, M., Bhatia, K., Kumar, A., Jain, P., and Varma, M. FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems, pp. 9017–9028, 2018.
- Lai, L., Suda, N., and Chandra, V. Cmsis-nn: Efficient neural network kernels for arm cortex-m cpus. arXiv preprint arXiv:1801.06601, 2018.
- LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436–444, 2015.
- Liu, N., Han, J., and Yang, M.-H. Picanet: Learning pixelwise contextual attention for saliency detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.
- Pleiss, G., Chen, D., Huang, G., Li, T., van der Maaten, L., and Weinberger, K. Q. Memory-efficient implementation of densenets. arXiv preprint arXiv:1707.06990, 2017.
- Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510– 4520, 2018.
- Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147, 2013.
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.
- Tan, M. and Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114, 2019.
- Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828, 2019.
- Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., and Bengio, Y. ReNet: A recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393, 2015.
- Wang, J., Yang, Y., Mao, J., Huang, Z., Huang, C., and Xu, W. CNN-RNN: A unified framework for multi-label image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2285–2294, 2016.
- Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. Haq: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8612–8620, 2019.
- Xie, W., Noble, A., and Zisserman, A. Layer recurrent neural networks. 2016.
- Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.K., and Woo, W.-c. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pp. 802–810, 2015.
- Yang, S., Luo, P., Loy, C.-C., and Tang, X. Wider face: A face detection benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5525–5533, 2016.
- Yoo, Y., Han, D., and Yun, S. EXTD: Extremely tiny face detector via iterative filter reuse. arXiv preprint arXiv:1906.06579, 2019.
- Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., and Li, S. Z. Faceboxes: A CPU real-time face detector with high accuracy. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 1–9. IEEE, 2017a.
- Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., and Li, S. Z. S3fd: Single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 192–201, 2017b.
- Zhao, Q., Lyu, S., Zhang, B., and Feng, W. Multiactivation pooling method in convolutional neural networks for image recognition. Wireless Communications and Mobile Computing, 2018, 2018.
- Zhao, X., Liang, X., Zhao, C., Tang, M., and Wang, J. Real-time multi-scale face detector on embedded devices. Sensors, 19(9):2158, 2019.
- We created ImageNet-10 by taking images from the ILSVRC 2012 ImageNet-1K dataset of 1000 classes. All images corresponding to the 10 classes from CIFAR-10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck) are sampled from the full dataset. The corresponding classes chosen from ImageNet-1K are:
  1. n02690373: 'airliner'
  2. n04285008: 'sports car'
  3. n01560419: 'bulbul'
  4. n02124075: 'Egyptian cat'
  5. n02430045: 'deer'
  6. n02099601: 'golden retriever'
  7. n01641577: 'bullfrog'
  8. n03538406: 'horse cart'
  9. n03673027: 'ocean liner'
  10. n04467665: 'trailer truck'
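The class mapping above can be captured as a small lookup table (a convenience sketch; the wnids and names are exactly those listed above, the dictionary name is hypothetical), which a sampling script could use to filter ImageNet-1K:

```python
# CIFAR-10 class -> (ImageNet-1K wnid, synset name) chosen for ImageNet-10
IMAGENET10_CLASSES = {
    "airplane":   ("n02690373", "airliner"),
    "automobile": ("n04285008", "sports car"),
    "bird":       ("n01560419", "bulbul"),
    "cat":        ("n02124075", "Egyptian cat"),
    "deer":       ("n02430045", "deer"),
    "dog":        ("n02099601", "golden retriever"),
    "frog":       ("n01641577", "bullfrog"),
    "horse":      ("n03538406", "horse cart"),
    "ship":       ("n03673027", "ocean liner"),
    "truck":      ("n04467665", "trailer truck"),
}

# A sampling script would keep exactly the ImageNet-1K images whose
# directory name (wnid) appears in this table.
wnids = {wnid for wnid, _ in IMAGENET10_CLASSES.values()}
```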
- 2. Residual block: There is a residual connection, so the input activation map has to be stored until the output is computed and the two are added together. For a residual block without stride, the peak RAM usage is the sum of the input and output sizes. Since there is no expansion layer, the peak RAM usage of a residual block with stride is 2× the output size, as the input can be downsampled and stored before being added to the output.
- 3. Dense block: At any point in a dense block, the activation maps that must be stored are the input to the dense block and the outputs of all previous dense layers, since the last layer needs all of these activation maps concatenated as its input. The stored activation maps therefore peak just after the last dense layer, so the peak RAM usage equals the output of the dense block.
- 4. Inception block: The peak RAM usage for this case is explained in detail in Section 5. Since no inception layer is strided, we do not need a separate case as in the residual block.
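The per-block rules above can be condensed into a small peak-activation calculator (sizes counted in elements; the helper names are hypothetical, but the formulas follow the appendix's reasoning directly):

```python
def residual_block_peak(in_size, out_size, strided):
    """Peak activations for a residual block (rule 2 above): the input
    must stay live until it is added to the output."""
    if strided:
        # input is down-sampled and stored before the addition
        return 2 * out_size
    return in_size + out_size

def dense_block_peak(in_size, layer_out_sizes):
    """Peak activations for a dense block (rule 3 above): the block
    input and every layer's output stay live until the final concat."""
    return in_size + sum(layer_out_sizes)

# e.g. a non-strided residual block with equal input/output maps needs
# twice the map size; a strided one needs twice the (smaller) output.
assert residual_block_peak(100, 100, strided=False) == 200
assert residual_block_peak(100, 25, strided=True) == 50
assert dense_block_peak(64, [32, 32, 32]) == 160
```

Counting peak activations this way (rather than FLOPs or parameters) is exactly the budget that determines whether a model fits in a microcontroller's RAM, which is the constraint RNNPool targets.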