PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2019), 715-731


Abstract

Memristor crossbars are circuits capable of performing analog matrix-vector multiplications, overcoming the fundamental energy efficiency limitations of digital logic. They have been shown to be effective in special-purpose accelerators for a limited set of neural network applications. We present the Programmable Ultra-efficient Memristor-based Accelerator (PUMA)…

Introduction
  • General-purpose computing systems have benefited from scaling for several decades, but are hitting an energy wall.
  • ML workloads tend to be data-intensive and perform a large number of Matrix Vector Multiplication (MVM) operations.
  • Their execution on digital CMOS hardware is typically characterized by high data movement costs relative to compute [49].
  • To overcome this limitation, memristor crossbars can store a matrix with high storage density and perform MVM operations with very low energy and latency [5, 13, 52, 87, 98, 116].
  • A memristor crossbar combines compute and storage in a single device, alleviating data movement and providing intrinsic suitability for data-intensive workloads [23, 95] (a minimal numerical sketch follows this list).
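As a concrete illustration of how a crossbar performs an MVM in the analog domain, here is a minimal numerical sketch: matrix entries are mapped to device conductances, inputs are applied as row voltages, and each column current is a dot product. The conductance range, bit precision, and noise level are assumptions made for illustration, not device parameters from the paper.

```python
import numpy as np

def crossbar_mvm(W, x, g_min=1e-6, g_max=1e-4, bits=6, noise_std=0.0):
    """Idealized memristor-crossbar MVM computing y = W @ x.

    Signed weights are mapped onto a differential pair of conductance arrays,
    inputs are applied as row voltages, and each column current is the analog
    sum of G[i, j] * V[i] (Ohm's law + Kirchhoff's current law). All device
    parameters here are illustrative assumptions, not values from the paper.
    """
    w = W.T                      # store row j of W in column j of the crossbar
    w_max = np.abs(w).max()
    span = g_max - g_min
    levels = 2 ** bits - 1

    def program(mag):
        # Map weight magnitudes to conductances, then quantize to the finite
        # number of programmable device levels.
        g = g_min + (mag / w_max) * span
        return g_min + np.round((g - g_min) / span * levels) / levels * span

    g_pos = program(np.clip(w, 0.0, None))
    g_neg = program(np.clip(-w, 0.0, None))

    # Column currents: i[j] = sum_i (g_pos - g_neg)[i, j] * x[i], plus read noise.
    i = (g_pos - g_neg).T @ x
    i = i * (1.0 + np.random.normal(0.0, noise_std, size=i.shape))

    # Rescale currents back to the weight domain (the job of the ADC + scaling logic).
    return i * w_max / span

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))
x = rng.standard_normal(128)
print("max abs error vs. exact MVM:", np.abs(crossbar_mvm(W, x) - W @ x).max())
```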
Highlights
  • General-purpose computing systems have benefited from scaling for several decades, but are hitting an energy wall
  • Memristor crossbars can store a matrix with high storage density and perform Matrix Vector Multiplication (MVM) operations with very low energy and latency [5, 13, 52, 87, 98, 116]
  • We propose a programmable architecture and Instruction Set Architecture (ISA) design that leverage memristor crossbars for accelerating Machine Learning (ML) workloads
  • Increasing the Matrix-Vector Multiplication Unit (MVMU) dimension increases the number of crossbar multiply-add operations quadratically but the number of peripherals only linearly, so peripheral overhead is better amortized
  • Our accelerator design comes with a complete compiler to transform high-level code to Programmable Ultra-efficient Memristor-based Accelerator (PUMA) ISA and a detailed simulator for estimating performance and energy consumption (a toy tiling sketch follows this list)
  • Our evaluations show that PUMA can achieve significant improvements compared to state-of-the-art CPUs, GPUs, and ASICs for ML acceleration
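To give a rough sense of what lowering a layer onto a crossbar-based ISA involves, the sketch below tiles a weight matrix across fixed-size MVMUs and emits a toy instruction stream. The opcodes (MVM, VADD, SEND), operand names, and tiling strategy are hypothetical illustrations, not the published PUMA ISA or compiler.

```python
import math
from dataclasses import dataclass

@dataclass
class Instr:
    op: str          # hypothetical opcode, not a real PUMA mnemonic
    operands: tuple

def lower_linear_layer(rows, cols, xbar_dim=128):
    """Tile y = W @ x (W is rows x cols) over xbar_dim x xbar_dim MVMUs and emit
    one toy instruction stream: an MVM per tile, a vector add to reduce partial
    sums, and a send of each finished output slice."""
    program = []
    row_tiles = math.ceil(rows / xbar_dim)
    col_tiles = math.ceil(cols / xbar_dim)
    for r in range(row_tiles):
        partials = []
        for c in range(col_tiles):
            mvmu = r * col_tiles + c
            program.append(Instr("MVM", (f"mvmu{mvmu}", f"x[{c * xbar_dim}:{(c + 1) * xbar_dim}]")))
            partials.append(f"acc{mvmu}")
        program.append(Instr("VADD", tuple(partials)))        # reduce on the vector unit
        program.append(Instr("SEND", (f"y[{r * xbar_dim}:{(r + 1) * xbar_dim}]",)))
    return program

for instr in lower_linear_layer(256, 384):
    print(instr)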
Methods
  • Design Space Exploration: Figure 12 shows a PUMA tile’s peak area and power efficiency swept across multiple design-space parameters.
  • All other parameters are kept at the sweet spot (the PUMA configuration with maximum efficiency).
  • Increasing the MVMU dimension increases the number of crossbar multiply-add operations quadratically but the number of peripherals only linearly, so peripheral overhead is better amortized (see the toy cost model after this list).
  • Increasing the number of MVMUs per core increases efficiency because of the high efficiency of memristor crossbars relative to CMOS digital components.
  • With too many MVMUs, the VFU becomes a bottleneck which degrades efficiency.
  • Increasing the VFU width degrades efficiency because of the low efficiency of CMOS relative to memristor crossbars.
  • Increasing the register file size results in lower efficiency, but a register file that is too small results in too many register spills.
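The quadratic-versus-linear amortization argument can be made concrete with a toy energy model: an N×N crossbar performs N² multiply-adds per MVM but needs only on the order of N peripheral circuits (drivers, DACs, ADCs). The per-cell and per-peripheral energies below are made-up placeholders, not the paper's numbers, and the model ignores effects such as ADC cost growing with precision.

```python
def ops_per_nanojoule(n, e_cell_pj=0.001, e_periph_pj=2.0):
    """Toy model: multiply-adds grow as n^2 while peripheral energy grows as n,
    so peripheral overhead is amortized as the crossbar dimension increases.
    Energy constants are illustrative assumptions only."""
    ops = n * n
    energy_pj = ops * e_cell_pj + n * e_periph_pj
    return ops / (energy_pj / 1000.0)   # pJ -> nJ

for n in (16, 32, 64, 128, 256):
    print(f"{n:3d} x {n:<3d} crossbar: {ops_per_nanojoule(n):10.1f} MACs/nJ")
```

Running it shows throughput per unit energy rising with crossbar dimension, consistent with the amortization trend described above; in the full design-space sweep other constraints eventually define the sweet spot.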
Results
  • Inference Energy (Section 7.1): CNNs show the least energy reduction over CMOS architectures (11.7×-13.0× over Pascal).
  • Because CNN weights are reused across many computations, CMOS architectures can amortize the DRAM accesses that fetch them (see the reuse sketch after this list).
  • For this reason, PUMA’s energy savings in CNNs come primarily from the use of crossbars for energy-efficient MVM computation.
  • In addition to efficient MVM computation, PUMA has the added advantage of eliminating weight data movement.
  • For this reason, the authors see much better energy reductions for MLPs (30.2×-80.1× over Pascal), Deep LSTMs (2,302×-2,446× over Pascal), and Wide LSTMs (758×-1336× over Pascal)
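The weight-reuse reasoning behind these numbers can be made concrete with a back-of-the-envelope count of how many multiply-accumulate operations each fetched weight participates in; the layer shapes below are illustrative, not the paper's benchmarks.

```python
def conv_weight_reuse(out_h, out_w, batch=1):
    """A convolution weight is applied once per output spatial position (and per
    batch element), so one DRAM fetch is amortized over many MACs."""
    return out_h * out_w * batch

def fc_weight_reuse(batch=1):
    """A fully connected / LSTM gate weight is used once per input vector, so at
    batch size 1 every MAC needs its own weight fetch on a CMOS architecture."""
    return batch

print("conv layer, 56x56 output map :", conv_weight_reuse(56, 56), "MACs per weight fetch")
print("fully connected, batch = 1   :", fc_weight_reuse(), "MAC per weight fetch")
```

Because crossbars keep weights in place, the weight-fetch cost disappears in both cases; this matters most where reuse is lowest (MLPs and LSTMs), matching the much larger savings reported for those workloads.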
Conclusion
  • PUMA is the first ISA-programmable accelerator for ML inference that uses hybrid CMOS-memristor technology.
  • It enhances memristor crossbars with general purpose execution units carefully designed to maintain crossbar area/energy efficiency and storage density.
  • The authors' accelerator design comes with a complete compiler to transform high-level code to PUMA ISA and a detailed simulator for estimating performance and energy consumption.
  • The authors' evaluations show that PUMA can achieve significant improvements compared to state-of-the-art CPUs, GPUs, and ASICs for ML acceleration
Tables
  • Table1: Workload Characterization
  • Table2: Instruction Set Architecture Overview
  • Table3: PUMA Hardware Characteristics
  • Table4: Benchmarking Platforms
  • Table5: Benchmarks
  • Table6: Comparison with ML Accelerators
  • Table7: Programmability Comparison with ISAAC
  • Table8: Evaluation of Optimizations
Related work
  • Sze et al [103] provide a thorough survey of deep learning accelerators. In the digital realm, accelerators can be classified as weight stationary spatial architectures [15, 17, 36, 43, 85, 93], output stationary spatial architectures [33, 47, 86], spatial architectures with no local reuse [19, 20, 117], and row stationary spatial architectures [21]. Many designs also support optimizations and features including weight pruning and exploiting sparsity [3, 4, 25, 32, 48, 84, 89, 119], reducing precision [8, 62], stochastic computing [65, 90], layer fusing [6], meeting QoS/QoR requirements [108], graph tuning [56], and reconfigurable interconnects [68]. Digital accelerators have varied in their degree of flexibility, ranging from custom accelerators specialized for a particular field [14, 79, 101, 114], to accelerators that are fully programmable via an ISA [61, 71, 72, 105, 119]. FPGAs have also been popular targets for building accelerators [37, 45, 74, 86, 96, 97, 110, 117]. All these works remain in the digital domain, while PUMA leverages hybrid digital-analog computing.
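As a generic illustration of the dataflow terms used in this taxonomy (not a description of any particular accelerator), the sketch below contrasts weight-stationary and output-stationary loop orderings for a small matrix multiply: the "stationary" operand is the one held in local storage while the others are streamed past it.

```python
import numpy as np

def weight_stationary_matmul(W, X):
    """Weight-stationary schedule: each W[i, j] is loaded once and held while
    every input column is streamed past it; partial sums are streamed out."""
    M, K = W.shape
    _, N = X.shape
    Y = np.zeros((M, N))
    for i in range(M):
        for j in range(K):
            w = W[i, j]                  # the resident ("stationary") operand
            for n in range(N):
                Y[i, n] += w * X[j, n]
    return Y

def output_stationary_matmul(W, X):
    """Output-stationary schedule: each Y[i, n] stays in a local accumulator
    until fully reduced; weights and inputs are streamed past it."""
    M, K = W.shape
    _, N = X.shape
    Y = np.zeros((M, N))
    for i in range(M):
        for n in range(N):
            acc = 0.0                    # the resident ("stationary") operand
            for j in range(K):
                acc += W[i, j] * X[j, n]
            Y[i, n] = acc
    return Y

W, X = np.random.randn(4, 8), np.random.randn(8, 3)
print(np.allclose(weight_stationary_matmul(W, X), W @ X),
      np.allclose(output_stationary_matmul(W, X), W @ X))
```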
Funding
  • This work is supported by Hewlett Packard Labs and the US Department of Energy (DOE) under Cooperative Agreement DE-SC0012199, the Blackcomb 2 Project
  • This work was also supported in part by the Center for Brain-inspired Computing (C-BRIC), one of six centers in JUMP, a DARPA sponsored Semiconductor Research Corporation (SRC) program
References
  • 2018. TSMC Annual Report 2017. TSMC (Mar 2018).
  • Alan Agresti. 2002. Logistic regression. Wiley Online Library.
  • Jorge Albericio, Alberto Delmás, Patrick Judd, Sayeh Sharify, Gerard O’Leary, Roman Genov, and Andreas Moshovos. 2017. Bit-pragmatic Deep Neural Network Computing. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 ’17). ACM, New York, NY, USA, 382–394.
  • Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: ineffectual-neuron-free deep neural network computing. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 1–13.
  • Fabien Alibart, Elham Zamanidoost, and Dmitri B Strukov. 2013. Pattern classification by memristive crossbar circuits using ex situ and in situ training. Nature communications 4 (2013).
  • Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.
  • Joao Ambrosi, Aayush Ankit, Rodrigo Antunes, Sai Rahul Chalamalasetti, Soumitra Chatterjee, Izzat El Hajj, Guilherme Fachini, Paolo Faraboschi, Martin Foltin, Sitao Huang, Wen mei Hwu, Gustavo Knuppe, Sunil Vishwanathpur Lakshminarasimha, Dejan Milojicic, Mohan Parthasarathy, Filipe Ribeiro, Lucas Rosa, Kaushik Roy, Plinio Silveira, and John Paul Strachan. 2018. Hardware-Software Co-Design for an Analog-Digital Accelerator for Machine Learning. In Rebooting Computing (ICRC), 2018 IEEE International Conference on. IEEE.
  • Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. 2017. YodaNN: An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2017).
  • Aayush Ankit, Abhronil Sengupta, Priyadarshini Panda, and Kaushik Roy. 2017. RESPARC: A Reconfigurable and Energy-Efficient Architecture with Memristive Crossbars for Deep Spiking Neural Networks. In Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 27.
  • Aayush Ankit, Abhronil Sengupta, and Kaushik Roy. 2017. Trannsformer: Neural network transformation for memristive crossbar based neuromorphic system design. In Proceedings of the 36th International Conference on Computer-Aided Design. IEEE Press, 533–540.
  • Kumud Bhandari, Dhruva R. Chakrabarti, and Hans-J. Boehm. 2016. Makalu: Fast Recoverable Allocation of Non-volatile Memory. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016). ACM, New York, NY, USA, 677–694.
  • Mahdi Nazm Bojnordi and Engin Ipek. 2016. Memristive boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning. In High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, 1–13.
  • G. W. Burr, R. M. Shelby, S. Sidler, C. di Nolfo, J. Jang, I. Boybat, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. N. Kurdi, and H. Hwang. 2015. Experimental Demonstration and Tolerancing of a Large-Scale Neural Network (165,000 Synapses) Using Phase-Change Memory as the Synaptic Weight Element. IEEE Transactions on Electron Devices 62, 11 (Nov 2015), 3498–3507.
  • Ruizhe Cai, Ao Ren, Ning Liu, Caiwen Ding, Luhao Wang, Xuehai Qian, Massoud Pedram, and Yanzhi Wang. 2018. VIBNN: Hardware Acceleration of Bayesian Neural Networks. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’18). ACM, New York, NY, USA, 476–488.
  • Lukas Cavigelli, David Gschwend, Christoph Mayer, Samuel Willi, Beat Muheim, and Luca Benini. 2015. Origami: A Convolutional Network Accelerator. In Proceedings of the 25th Edition on Great Lakes
  • Dhruva R. Chakrabarti, Hans-J. Boehm, and Kumud Bhandari. 2014. Atlas: Leveraging Locks for Non-volatile Memory Consistency. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA ’14). ACM, New York, NY, USA, 433–452.
  • Srimat Chakradhar, Murugan Sankaradas, Venkata Jakkula, and Srihari Cadambi. 2010. A Dynamically Configurable Coprocessor for Convolutional Neural Networks. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA ’10). ACM, New York, NY, USA, 247–257.
  • Guoyang Chen, Lei Zhang, Richa Budhiraja, Xipeng Shen, and Youfeng Wu. 2017. Efficient support of position independence on non-volatile memory. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 191–203.
  • Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’14). ACM, 269–284.
  • Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.
  • Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 367–379.
  • Ming Cheng, Lixue Xia, Zhenhua Zhu, Yi Cai, Yuan Xie, Yu Wang, and Huazhong Yang. 2017. TIME: A Training-in-memory Architecture for Memristor-based Deep Neural Networks. In Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 26.
  • Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 27–39.
  • Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, et al. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018).
  • Jaeyong Chung and Taehwan Shin. 2016. Simplifying deep neural networks for neuromorphic architectures. In Design Automation Conference (DAC), 2016 53rd ACM/EDAC/IEEE. IEEE, 1–6.
  • Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. 2010. Deep, big, simple neural nets for handwritten digit recognition. Neural computation 22, 12 (2010), 3207–3220.
  • Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. 2011. NV-Heaps: Making Persistent Objects Fast and Safe with Next-generation, Nonvolatile Memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVI). ACM, New York, NY, USA, 105–118.
  • Nachshon Cohen, David T. Aksun, and James R. Larus. 2018. Object-oriented Recovery for Non-volatile Memory. Proc. ACM Program. Lang. 2, OOPSLA, Article 153 (Oct. 2018), 22 pages.
  • Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. 2011. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop.
  • Jeremy Condit, Edmund B Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. 2009. Better I/O through byte-addressable, persistent memory. In Proceedings of the
  • Biplob Debnath, Alireza Haghdoost, Asim Kadav, Mohammed G Khatib, and Cristian Ungureanu. 2016. Revisiting hash table design for phase change memory. ACM SIGOPS Operating Systems Review 49, 2 (2016), 18–26.
  • Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, and Bo Yuan. 2017. CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-circulant Weight Matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 ’17). ACM, New York, NY, USA, 395–408.
  • Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture (ISCA ’15). ACM, New York, NY, USA, 92–104.
  • Izzat El Hajj, Thomas B. Jablin, Dejan Milojicic, and Wen-mei Hwu. 2017. SAVI Objects: Sharing and Virtuality Incorporated. Proc. ACM Program. Lang. 1, OOPSLA, Article 45 (2017), 24 pages.
  • Izzat El Hajj, Alexander Merritt, Gerd Zellweger, Dejan Milojicic, Reto Achermann, Paolo Faraboschi, Wen-mei Hwu, Timothy Roscoe, and Karsten Schwan. 2016. SpaceJMP: Programming with Multiple Virtual Address Spaces. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’16). ACM, New York, NY, USA, 353–368.
  • C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culurciello. 2010. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems. 257–260.
  • Clément Farabet, Cyril Poulet, Jefferson Y Han, and Yann LeCun. 2009. Cnp: An fpga-based processor for convolutional networks. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on. IEEE, 32–37.
  • Ben Feinberg, Shibo Wang, and Engin Ipek. 2018. Making memristive neural network accelerators reliable. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 52–65.
  • B. Feinberg, S. Wang, and E. Ipek. 2018. Making Memristive Neural Network Accelerators Reliable. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
  • Daichi Fujiki, Scott Mahlke, and Reetuparna Das. 2018. In-Memory Data Parallel Processor. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 1–14.
  • Terrence S Furey, Nello Cristianini, Nigel Duffy, David W Bednarski, Michel Schummer, and David Haussler. 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 10 (2000), 906–914.
  • Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. 2017. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 751–764.
  • Vinayak Gokhale, Jonghoon Jin, Aysegul Dundar, Berin Martini, and Eugenio Culurciello. 2014. A 240 g-ops/s mobile coprocessor for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 682–687.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
  • Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, and Huazhong Yang. 2017. A Survey of FPGA Based Neural Network Accelerator. arXiv preprint arXiv:1712.08934 (2017).
  • Xinjie Guo, F Merrikh Bayat, M Bavandpour, M Klachko, MR Mahmoodi, M Prezioso, KK Likharev, and DB Strukov. 2017. Fast, energy-efficient, robust, and reproducible mixed-signal neuromorphic classifier based on embedded NOR flash memory technology. In Electron Devices Meeting (IEDM), 2017 IEEE International. IEEE, 6–5.
  • Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 1737–1746.
  • Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. 2016. EIE: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 243–254.
  • Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems. 1135–1143.
  • Yukio Hayakawa, Atsushi Himeno, Ryutaro Yasuhara, W Boullart, E Vecchio, T Vandeweyer, T Witters, D Crotti, M Jurczak, S Fujii, et al. 2015. Highly reliable TaOx ReRAM with centralized filament for 28-nm embedded application. In VLSI Technology (VLSI Technology), 2015 Symposium on. IEEE, T14–T15.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Miao Hu, Catherine Graves, Can Li, Yunning Li, Ning Ge, Eric Montgomery, Noraica Davila, Hao Jiang, R. Stanley Williams, J. Joshua Yang, Qiangfei Xia, and John Paul Strachan. 2018. Memristor-based analog computation and neural network classification with a dot product engine. Advanced Materials (2018).
  • Miao Hu, Hai Li, Qing Wu, and Garrett S. Rose. 2012. Hardware Realization of BSB Recall Function Using Memristor Crossbar Arrays. In Proceedings of the 49th Annual Design Automation Conference (DAC ’12). ACM, New York, NY, USA, 498–503.
  • Miao Hu, John Paul Strachan, Zhiyong Li, Emmanuelle M Grafals, Noraica Davila, Catherine Graves, Sity Lam, Ning Ge, Jianhua Joshua Yang, and R Stanley Williams. 2016. Dot-product engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrix-vector multiplication. In Design Automation Conference (DAC), 2016 53rd ACM/EDAC/IEEE. IEEE, 1–6.
  • Joseph Izraelevitz, Hammurabi Mendes, and Michael L Scott. 2016. Linearizability of persistent memory objects under a full-system-crash failure model. In International Symposium on Distributed Computing. Springer, 313–327.
  • Yu Ji, Youhui Zhang, Wenguang Chen, and Yuan Xie. 2018. Bridge the Gap Between Neural Networks and Neuromorphic Hardware with a Neural Network Compiler. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’18). ACM, New York, NY, USA, 448–460.
  • Yu Ji, Youhui Zhang, Wenguang Chen, and Yuan Xie. 2018. Bridge the Gap between Neural Networks and Neuromorphic Hardware with a Neural Network Compiler. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 448–460.
  • Yu Ji, YouHui Zhang, ShuangChen Li, Ping Chi, CiHang Jiang, Peng Qu, Yuan Xie, and WenGuang Chen. 2016. NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–13.
  • Nan Jiang, George Michelogiannakis, Daniel Becker, Brian Towles, and William J. Dally. 2013. BookSim 2.0 User’s Guide.
  • Arpit Joshi, Vijay Nagarajan, Stratis Viglas, and Marcelo Cintra. 2017. ATOM: Atomic durability in non-volatile memory through hardware logging. In High Performance Computer Architecture (HPCA), 2017
  • [62] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M Aamodt, and Andreas Moshovos. 2016. Stripes: Bit-serial deep neural network computing. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.
  • [63] A. B. Kahng, B. Lin, and S. Nath. 2012. Comprehensive Modeling Methodologies for NoC Router Estimation. Technical Report. UCSD.
  • [64] Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. 2016. Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 380–392.
  • [65] Kyounghoon Kim, Jungki Kim, Joonsang Yu, Jungwoo Seo, Jongeun Lee, and Kiyoung Choi. 2016. Dynamic energy-accuracy trade-off using stochastic computing in deep neural networks. In Proceedings of the 53rd Annual Design Automation Conference. ACM, 124.
  • [66] Yongtae Kim, Yong Zhang, and Peng Li. 2015. A Reconfigurable Digital Neuromorphic Processor with Memristive Synaptic Crossbar for Cognitive Computing. J. Emerg. Technol. Comput. Syst. 11, 4, Article 38 (April 2015), 25 pages.
  • [67] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • [68] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’18). ACM, New York, NY, USA, 461–475.
  • [69] Dongsoo Lee and Kaushik Roy. 2013. Area efficient ROM-embedded SRAM cache. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21, 9 (2013), 1583–1595.
  • [70] Shuangchen Li, Dimin Niu, Krishna T Malladi, Hongzhong Zheng, Bob Brennan, and Yuan Xie. 2017. DRISA: A DRAM-based Reconfigurable In-Situ Accelerator. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 288–301.
  • [71] Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Teman, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. 2015. PuDianNao: A Polyvalent Machine Learning Accelerator. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’15). ACM, New York, NY, USA, 369–381.
  • [72] Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. 2016. Cambricon: An instruction set architecture for neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 393–405.
  • [73] Xiaoxiao Liu, Mengjie Mao, Beiye Liu, Hai Li, Yiran Chen, Boxun Li, Yu Wang, Hao Jiang, Mark Barnell, Qing Wu, et al. 2015. RENO: A high-efficient reconfigurable neuromorphic computing accelerator design. In Design Automation Conference (DAC), 2015 52nd ACM/EDAC/IEEE. IEEE, 1–6.
  • [74] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. 2016. Tabla: A unified template-based framework for accelerating statistical machine learning. In High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, 14–26.
  • [75] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
  • [76] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. 2009. CACTI 6.0: A Tool to Understand Large Caches. Technical Report. HP Labs, HPL-2009-85.
  • [77] Boris Murmann. 2011. ADC performance survey 1997-2011. http://www.stanford.edu/~murmann/adcsurvey.html (2011).
  • [78] Boris Murmann. 2015. The race for the extra decibel: a brief review of current ADC performance trajectories. IEEE Solid-State Circuits Magazine 7, 3 (2015), 58–66.
  • [79] Sean Murray, William Floyd-Jones, Ying Qi, George Konidaris, and Daniel J Sorin. 2016. The microarchitecture of a real-time robot motion planning accelerator. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.
  • [80] Sanketh Nalli, Swapnil Haria, Mark D Hill, Michael M Swift, Haris Volos, and Kimberly Keeton. 2017. An analysis of persistent memory use with WHISPER. In ACM SIGARCH Computer Architecture News, Vol. 45. ACM, 135–148.
  • [81] John Neter, Michael H Kutner, Christopher J Nachtsheim, and William Wasserman. 1996. Applied linear statistical models. Vol. 4. Irwin Chicago.
  • [82] Matheus Almeida Ogleari, Ethan L Miller, and Jishen Zhao. 2018. Steal but no force: Efficient hardware undo+ redo logging for persistent memory systems. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on. IEEE, 336–349.
  • [83] Ismail Oukid, Daniel Booss, Adrien Lespinasse, Wolfgang Lehner, Thomas Willhalm, and Grégoire Gomes. 2017. Memory management techniques for large-scale persistent-main-memory systems. Proceedings of the VLDB Endowment 10, 11 (2017), 1166–1177.
  • [84] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA ’17). ACM, New York, NY, USA, 27–40.
  • [85] Seongwook Park, Kyeongryeol Bong, Dongjoo Shin, Jinmook Lee, Sungpill Choi, and Hoi-Jun Yoo. 2015. 4.6 A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications. In Solid-State Circuits Conference (ISSCC), 2015 IEEE International. IEEE, 1–3.
  • [86] Maurice Peemen, Arnaud AA Setio, Bart Mesman, and Henk Corporaal. 2013. Memory-centric accelerator design for convolutional neural networks. In Computer Design (ICCD), 2013 IEEE 31st International Conference on. IEEE, 13–19.
  • [87] Mirko Prezioso, Farnood Merrikh-Bayat, BD Hoskins, GC Adam, Konstantin K Likharev, and Dmitri B Strukov. 2015. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521, 7550 (2015), 61–64.
  • [88] Shankar Ganesh Ramasubramanian, Rangharajan Venkatesan, Mrigank Sharad, Kaushik Roy, and Anand Raghunathan. 2014. SPINDLE: SPINtronic deep learning engine for large-scale neuromorphic computing. In Proceedings of the 2014 international symposium on Low power electronics and design. ACM, 15–20.
  • [89] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, GuYeon Wei, and David Brooks. 2016. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 267–278.
  • [90] Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. 2017. Sc-dcnn: Highly-scalable deep convolutional neural network using stochastic computing. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 405–418.
  • [91] Paul Resnick and Hal R Varian. 1997. Recommender systems. Commun. ACM 40, 3 (1997), 56–58.
  • [92] Ron M Roth. 2017. Fault-Tolerant Dot-Product Engines. arXiv preprint arXiv:1708.06892 (2017).
  • [93] Murugan Sankaradas, Venkata Jakkula, Srihari Cadambi, Srimat Chakradhar, Igor Durdanovic, Eric Cosatto, and Hans Peter Graf. 2009. A massively parallel coprocessor for convolutional neural networks. In Application-specific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on. IEEE, 53–60.
  • [94] Abhronil Sengupta, Yong Shim, and Kaushik Roy. 2016. Proposal for an all-spin artificial neural network: Emulating neural and synaptic functionalities through domain wall motion in ferromagnets. IEEE transactions on biomedical circuits and systems 10, 6 (2016), 1152–1160.
  • [95] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press.
  • [96] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.
  • [97] Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Maximizing CNN Accelerator Efficiency Through Resource Partitioning. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA ’17). ACM, New York, NY, USA, 535–547.
  • [98] Patrick M Sheridan, Fuxi Cai, Chao Du, Wen Ma, Zhengya Zhang, and Wei D Lu. 2017. Sparse coding with memristor networks. Nature nanotechnology (2017).
  • [99] Seunghee Shin, Satish Kumar Tirukkovalluri, James Tuck, and Yan Solihin. 2017. Proteus: A flexible and fast software supported hardware logging approach for NVM. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM.
  • [100] Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. 2017. PipeLayer: A pipelined ReRAM-based accelerator for deep learning. In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on. IEEE, 541–552.
  • [101] M. Song, J. Zhang, H. Chen, and T. Li. 2018. Towards Efficient Microarchitectural Design for Accelerating Unsupervised GAN-Based Deep Learning. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 66–77.
  • [102] Ilya Sutskever, Geoffrey E Hinton, and Graham W Taylor. 2009. The recurrent temporal restricted boltzmann machine. In Advances in Neural Information Processing Systems. 1601–1608.
  • [103] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint arXiv:1703.09039 (2017).
  • [104] Toshiyuki Tanaka. 1998. Mean-field theory of Boltzmann machine learning. Physical Review E 58, 2 (1998), 2302.
  • [105] Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, and Anand Raghunathan. 2017. ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA ’17). ACM, New York, NY, USA, 13–26.
  • [106] Haris Volos, Andres Jaan Tack, and Michael M. Swift. 2011. Mnemosyne: Lightweight Persistent Memory. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVI). ACM, New York, NY, USA, 91–104.
  • [107] Qian Wang, Yongtae Kim, and Peng Li. 2016. Neuromorphic processors with memristive synapses: Synaptic interface and architectural exploration. ACM Journal on Emerging Technologies in Computing Systems (JETC) 12, 4 (2016), 35.
  • [108] Ying Wang, Huawei Li, and Xiaowei Li. 2017. Real-Time Meets Approximate Computing: An Elastic CNN Inference Accelerator with Adaptive Trade-off between QoS and QoR. In Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 33.
  • [109] Yandan Wang, Wei Wen, Beiye Liu, Donald Chiarulli, and Hai Helen Li. 2017. Group Scissor: Scaling Neuromorphic Computing Design to Large Neural Networks. In Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 85.
  • [110] Ying Wang, Jie Xu, Yinhe Han, Huawei Li, and Xiaowei Li. 2016. DeepBurning: automatic generation of FPGA-based learning accelerators for the neural network family. In Design Automation Conference (DAC), 2016 53rd ACM/EDAC/IEEE. IEEE, 1–6.
  • [111] Zhuo Wang, Robert Schapire, and Naveen Verma. 2014. Error-adaptive classifier boosting (EACB): Exploiting data-driven training for highly fault-tolerant hardware. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 3884–3888.
  • [112] Rainer Waser, Regina Dittmann, Georgi Staikov, and Kristof Szot. 2009. Redox-based resistive switching memories–nanoionic mechanisms, prospects, and challenges. Advanced materials 21, 25-26 (2009), 2632–2663.
  • [113] Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. 2015. NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems.. In FAST, Vol. 15. 167–181.
  • [114] Reza Yazdani, Albert Segura, Jose-Maria Arnau, and Antonio Gonzalez. 2016. An ultra low-power hardware accelerator for automatic speech recognition. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.
  • [115] Richard Yu. 2017. Panasonic and UMC Partner for 40nm ReRAM Process Platform. UMC Press Release (Feb 2017).
  • [116] Shimeng Yu, Zhiwei Li, Pai-Yu Chen, Huaqiang Wu, Bin Gao, Deli Wang, Wei Wu, and He Qian. 2016. Binary neural network with 16 Mb RRAM macro chip for classification and online training. In Electron Devices Meeting (IEDM), 2016 IEEE International. IEEE, 16–2.
  • [117] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’15). ACM, New York, NY, USA, 161–170.
  • [118] Jintao Zhang, Zhuo Wang, and Naveen Verma. 2016. A machine-learning classifier implemented in a standard 6T SRAM array. In VLSI Circuits (VLSI-Circuits), 2016 IEEE Symposium on. IEEE, 1–2.
  • [119] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.