Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15), pp. 161–170, 2015


Abstract

Convolutional neural network (CNN) has been widely employed for image recognition because it can achieve high accuracy by emulating behavior of optic nerves in living creatures. Recently, rapid growth of modern applications based on deep learning algorithms has further improved research and implementations. Especially, various accelerators…

Introduction
  • Convolutional neural network (CNN), a well-known deep learning architecture extended from artificial neural networks, has been extensively adopted in various applications, including video surveillance, mobile robot vision, image search engines in data centers, etc. [6][7][8][10][14].
  • If an accelerator structure is not carefully designed, its computing throughput cannot match the memory bandwidth provided by an FPGA platform.
  • The authors quantitatively analyze the computing throughput and required memory bandwidth of any potential solution of a CNN design on an FPGA platform (a minimal roofline-style check is sketched below).
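
This compute-versus-bandwidth balance is the essence of a roofline-style analysis: attainable throughput is the minimum of the computational roof and the bandwidth roof (the design's computation-to-communication ratio times the platform's memory bandwidth). Below is a minimal sketch in C; the constants PEAK_GFLOPS and BANDWIDTH_GBS are illustrative assumptions, not the paper's figures.

```c
#include <stdio.h>

/* Illustrative platform parameters -- assumptions, not the paper's numbers. */
#define PEAK_GFLOPS   100.0   /* computational roof of the FPGA design   */
#define BANDWIDTH_GBS 4.0     /* off-chip memory bandwidth in GB/s       */

/* Roofline-style bound: a design variant with a given computation-to-
 * communication (CTC) ratio, in FLOP per byte of external traffic, attains
 * at most the smaller of the compute roof and the bandwidth roof. */
double attainable_gflops(double ctc_ratio) {
    double bandwidth_roof = ctc_ratio * BANDWIDTH_GBS; /* GFLOP/s */
    return bandwidth_roof < PEAK_GFLOPS ? bandwidth_roof : PEAK_GFLOPS;
}

int main(void) {
    /* A design variant with little data reuse is bandwidth-bound ...      */
    printf("CTC = 10 -> %.1f GFLOPS\n", attainable_gflops(10.0)); /* 40.0  */
    /* ... while one with high reuse reaches the computational roof.       */
    printf("CTC = 50 -> %.1f GFLOPS\n", attainable_gflops(50.0)); /* 100.0 */
    return 0;
}
```
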
Highlights
  • Convolutional neural network (CNN), a well-known deep learning architecture extended from artificial neural networks, has been extensively adopted in various applications, including video surveillance, mobile robot vision, image search engines in data centers, etc. [6][7][8][10][14]
  • The rapid growth of modern applications based on deep learning algorithms has further spurred research on deep convolutional neural networks
  • Various accelerators based on FPGA, GPU, and even ASIC designs have been proposed recently to improve the performance of CNNs [3][4][9]
  • As a case study, we implement a CNN accelerator that achieves a performance of 61.62 GFLOPS
  • A CNN accelerator design on FPGA is composed of several major components, which are processing elements (PEs), on-chip buffer, external memory, and on-/off-chip interconnect
Results
  • The authors propose a CNN accelerator design with uniform loop unroll factors across different convolutional layers.
  • As shown in Figure 4, a CNN accelerator design on FPGA is composed of several major components, which are processing elements (PEs), on-chip buffer, external memory, and on-/off-chip interconnect.
  • The authors use a data reuse technique to reduce external memory traffic and formulate the computation-to-communication ratio in terms of tiling factors (Section 3.3).
  • The authors use standard polyhedral-based data dependence analysis [13] to derive a series of legal design variants, i.e., equivalent CNN implementations, through loop scheduling and loop tile size enumeration (a tiled loop nest in this style is sketched after this list).
  • In Section 3.2, the authors discussed how to derive design variants with different computational roofs, assuming all data accessed by the computation engine are already buffered on-chip; Section 3.3 then addresses memory access optimization.
  • Data reuse optimization reduces the total number of external memory accesses and thereby increases the computation-to-communication ratio.
  • The computation-to-communication ratio of the code shown in Figure 9 can be calculated by Equation (4), where α_in, α_out, α_wght and B_in, B_out, B_wght denote the trip counts and buffer sizes of memory accesses to the input/output feature maps and weights, respectively (the equation is written out after this list).
  • Although implementation A achieves the highest possible computational performance, the memory bandwidth it requires cannot be satisfied by the target platform.
  • Designing a hardware accelerator to support multiple convolutional layers with different unroll factors would be challenging.
  • A CNN accelerator with unified unroll factors is simple to design and implement, but may be sub-optimal for some layers.
  • For the best cross-layer design ((Tm, Tn) = (64, 7)), the computation engine is implemented as a tree-shaped poly structure with 7 inputs from the input feature maps, 7 inputs from the weights, and one input from the bias, which is stored in the buffers of the output feature maps (a behavioral sketch of one such multiply-add tree follows this list).
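
As referenced in the loop-scheduling bullet above, the design space is spanned by tiled versions of the convolutional layer's loop nest. A minimal C sketch follows, in the style of the paper's Figure 9; the concrete bounds (M, N, R, C, K, S) and tile factors (Tm, Tn, Tr, Tc) are illustrative placeholders, and in hardware the two innermost loops would be unrolled into parallel compute units.

```c
/* Tiled convolution loop nest: outer tile loops stream tiles between
 * external memory and on-chip buffers; the inner point loops form the
 * computation engine. Data reuse across the outer loops determines the
 * trip counts alpha_in, alpha_wght, alpha_out of Equation (4). */
#define M  64   /* output feature maps */
#define N  64   /* input feature maps  */
#define R  32   /* output rows         */
#define C  32   /* output columns      */
#define K   3   /* kernel size         */
#define S   1   /* stride              */
#define Tm 16
#define Tn  8
#define Tr  8
#define Tc  8
#define MIN(a, b) ((a) < (b) ? (a) : (b))

float in[N][R * S + K][C * S + K];
float wght[M][N][K][K];
float out[M][R][C];

void conv_tiled(void) {
    for (int ro = 0; ro < R; ro += Tr)
    for (int co = 0; co < C; co += Tc)
    for (int mo = 0; mo < M; mo += Tm)
    for (int no = 0; no < N; no += Tn) {
        /* In the accelerator, tiles of in/wght/out are loaded into
         * on-chip buffers at this point. */
        for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
        for (int rr = ro; rr < MIN(ro + Tr, R); rr++)
        for (int cc = co; cc < MIN(co + Tc, C); cc++)
        for (int tm = mo; tm < MIN(mo + Tm, M); tm++)   /* unrolled in HW */
        for (int tn = no; tn < MIN(no + Tn, N); tn++)   /* unrolled in HW */
            out[tm][rr][cc] +=
                wght[tm][tn][i][j] * in[tn][S * rr + i][S * cc + j];
    }
}

int main(void) { conv_tiled(); return 0; }
```

Written out with the notation from the computation-to-communication bullet above (M output and N input feature maps, R × C output size, K × K kernels), Equation (4) takes the following form, where the factor 2 counts the multiply and the add of each kernel operation:

\[
\text{CTC ratio} \;=\; \frac{\text{total computation}}{\text{total external data access}}
\;=\; \frac{2 \cdot R \cdot C \cdot M \cdot N \cdot K \cdot K}
           {\alpha_{in} B_{in} + \alpha_{wght} B_{wght} + \alpha_{out} B_{out}}
\]

Enlarging the tile sizes grows the buffer sizes B but shrinks the trip counts α, so data reuse raises the ratio until the on-chip buffer capacity is exhausted.

A behavioral C analogue of the tree-shaped computation is sketched below (a model, not the paper's HLS code): Tn = 7 input/weight products are reduced through a balanced adder tree, and the bias, i.e., the running partial sum held in the output feature-map buffers, is added at the root. In the cross-layer design, Tm = 64 copies of this element run in parallel, one per output feature map.

```c
#include <stdio.h>

#define TN 7  /* tree inputs, matching (Tm, Tn) = (64, 7) */

/* One processing element: multiply TN (input, weight) pairs, reduce the
 * products with a balanced adder tree, and accumulate onto the partial
 * sum (bias) read from the output feature-map buffer. */
float pe_compute(const float in[TN], const float wght[TN], float bias) {
    float prod[TN + 1];                /* pad to 8 leaves for a full tree */
    for (int i = 0; i < TN; i++)
        prod[i] = in[i] * wght[i];     /* multiplier stage */
    prod[TN] = 0.0f;

    /* log2(8) = 3 adder-tree levels */
    for (int stride = 1; stride < TN + 1; stride *= 2)
        for (int i = 0; i + stride < TN + 1; i += 2 * stride)
            prod[i] += prod[i + stride];

    return prod[0] + bias;             /* root: add the partial sum */
}

int main(void) {
    float in[TN]   = {1, 1, 1, 1, 1, 1, 1};
    float wght[TN] = {2, 2, 2, 2, 2, 2, 2};
    printf("%.1f\n", pe_compute(in, wght, 0.5f)); /* 7*2 + 0.5 = 14.5 */
    return 0;
}
```
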
Conclusion
  • The authors discuss different design methods with reference to previous work on FPGA-based CNN accelerator designs.
  • The earliest approach [6] built its CNN application mainly in software, using a hardware systolic-architecture accelerator to perform the convolution filtering.
  • The authors' implementation takes advantage of data reuse and balances the limitations of memory bandwidth and FPGA computational power.
Tables
  • Table 1: CNN configurations
  • Table 2: Data sharing relations of CNN code

        loop    input fm      weights       output fm
        trr     dependent     irrelevant    independent
        tcc     dependent     irrelevant    independent
        too     irrelevant    independent   independent
        tii     independent   independent   irrelevant
        i       dependent     independent   irrelevant
        j       dependent     independent   irrelevant
  • Table 3: Data sharing relations of communication part
  • Table 4: Layer-specific optimal solution and cross-layer optimization
  • Table 5: Comparison to previous implementations
  • Table 6: FPGA resource utilization
  • Table 7: Performance comparison to CPU
  • Table 8: Power consumption and energy
  • Table 9: Resource occupation comparison
Related work
  • In this section, we discuss different design methods with reference to other previous work on FPGA-based CNN accelerator designs.
Funding
  • This work was supported in part by NSF China 61202072, RFDP 20110001110099, National High Technology Research and Development Program of China 2012AA010902, and C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.
Reference
  • [1] D. Aysegul, J. Jonghoon, G. Vinayak, K. Bharadwaj, C. Alfredo, M. Berin, and C. Eugenio. Accelerating deep neural networks on mobile processor with embedded programmable logic. In NIPS 2013. IEEE, 2013.
  • [3] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi. A dynamically configurable coprocessor for convolutional neural networks. In ACM SIGARCH Computer Architecture News, volume 38, pages 247–257. ACM, 2010.
  • [4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS 2014. ACM, 2014.