The Promise of Dataflow Architectures in the Design of Processing Systems for Autonomous Machines

arXiv (2021)

Abstract
The commercialization of autonomous machines is a thriving sector, and likely to be the next major computing demand driver after the PC, cloud computing, and mobile computing. Nevertheless, a suitable computer architecture for autonomous machines is missing, and many companies are forced to develop ad hoc computing solutions that are neither scalable nor extensible. In this article, we analyze the demands of autonomous machine computing, and argue for the promise of dataflow architectures in autonomous machines.

1 Rise of the Autonomous Machines

The commercialization of autonomous machines is a thriving sector, with a projected compound annual growth rate (CAGR) of 26%; by 2030 this sector will have a market size of $1 trillion [1]. Hence, it is likely to be the next major computing demand driver, after personal computers, cloud computing, and mobile computing. Autonomous machines exist in multiple forms, e.g., cars, aerial drones, service robots, and industrial robots, and different kinds of autonomous machines are quite diverse in size, shape, mission, goals, location, propulsion, etc. [2] Generally, completely independent teams have approached their design, resulting in a bevy of solutions with some replication but certainly no standardization. A better understanding of the underlying issues, a formalization of the common problems, and a certain unification of the solutions would all yield a more efficient design process. The core of an autonomous machine resides in its computing system, which encompasses both the hardware and the software level, entailing algorithms, systems software, compilers, and computer architectures.
Despite recent advancements in autonomous machine systems design by such major industrial organizations as Google [3], Tesla [4], Mobileye [5], and Nvidia [6], the architecture of autonomous machine systems still largely remains an open research question: existing solutions are often built on an ad hoc basis, and not only does the development process take a long time, but the resulting designs are neither scalable nor extensible [7]. In this article, we first review the advantages of dataflow architectures and their implementation difficulties. We then summarize our observations of autonomous machine computing, and delve into the details of why dataflow architectures may be extremely well adapted to it.

2 Dataflow Architectures

Before delving into the details of autonomous machine computing, let us first review the benefits of dataflow architectures and their implementation roadblocks.

(arXiv:2109.07047v1 [cs.AR], 15 Sep 2021)

Dataflow concepts originated in the 1970s and 1980s, with pioneering work by Jack Dennis, Arvind, and others [8, 9]. The central idea of dataflow architectures was to replace the classic control-flow, or von Neumann, architectures. In a von Neumann architecture, the processor follows an explicit control flow, executing instructions one after another. In a dataflow architecture, execution is event-driven: an instruction is ready to execute as soon as all its inputs, or "tokens," are available, rather than when the control flow reaches it. To bridge the gap between traditional architectures and the dataflow computing model, Gao et al. developed the codelet execution model, which incorporates the advantages of macro-dataflow and the von Neumann model [10, 11]. The codelet execution model can be used to describe programs for massively parallel systems.
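To make the codelet idea concrete, here is a minimal, illustrative sketch in Python. All names here are our own invention for exposition, not the actual codelet runtime API: each codelet runs sequentially once fired (von Neumann inside), while the codelets are scheduled among themselves in dataflow fashion, firing only when all of their inputs have arrived.

```python
from collections import deque

class Codelet:
    """A coarse-grain unit: sequential (von Neumann) inside, dataflow outside."""
    def __init__(self, name, n_inputs, body):
        self.name = name
        self.needed = n_inputs   # input tokens still missing before firing
        self.inputs = []         # received input values
        self.body = body         # sequential work to run when the codelet fires
        self.consumers = []      # downstream codelets fed by our result

def run(ready):
    """Event-driven scheduler: fire any codelet whose inputs are all present."""
    queue, results = deque(ready), {}
    while queue:
        c = queue.popleft()
        out = c.body(*c.inputs)          # sequential execution inside the codelet
        results[c.name] = out
        for consumer in c.consumers:     # deliver the result as a token
            consumer.inputs.append(out)
            consumer.needed -= 1
            if consumer.needed == 0:     # firing rule: all inputs have arrived
                queue.append(consumer)
    return results

# A tiny graph: a and b are independent and could fire concurrently;
# c consumes both results, so it fires last.
a = Codelet("a", 0, lambda: 2 + 3)
b = Codelet("b", 0, lambda: 4 * 5)
c = Codelet("c", 2, lambda x, y: x + y)
a.consumers.append(c)
b.consumers.append(c)
results = run([a, b])
print(results)   # c fires only after both a and b have produced tokens
```

Because `c` receives its inputs in whatever order its producers happen to fire, its body should be order-insensitive in this toy scheduler; a real runtime would tag tokens with their destination port.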
Specifically, our observations of autonomous machine computing, summarized in Section 3, reveal that when the programming model provides the appropriate level of abstraction, a hybrid dataflow and domain-specific accelerator (DSA) architecture is extremely well adapted to autonomous machine applications. A classic dataflow architecture, by representing a program as a dataflow graph, can naturally exploit the instruction-level parallelism in a program by firing instructions as soon as their operands are ready, hence minimizing control overheads. Despite the implementation difficulties of the pure dataflow concepts, such as the cost/benefit problem [12], the merit of dataflow graph representation and data-driven execution has led to the emergence of superscalar architectures (which use a restricted dataflow graph to exploit instruction-level parallelism) and of grid architectures (hybrids of control flow and dataflow) [13, 14], which map dataflow graphs onto grid processors and execute operations in a dataflow fashion to enable concurrent execution. Beyond general-purpose processors, the dataflow architecture's properties of decentralized control and data-driven execution have also been used successfully to synthesize power-efficient application-specific hardware [15].

In more detail, the original dataflow architecture proposals specify that dataflow machines are fine-grain parallel computers, where each process is about the size of a single instruction in a conventional computer. Instructions are known as nodes, and the data passed between nodes are called tokens. A producing node is connected to a consuming node by an arc, and the point where an arc enters a node is called an input port. The execution of an instruction is called the firing of a node; a node is enabled, and may fire, only when every one of its input ports contains a token.
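This fine-grain firing rule can be sketched as a small token-driven interpreter. The Python below is illustrative only; the graph encoding is a hypothetical one of our own, not any particular dataflow machine's ISA. Tokens are placed on input ports, and a node fires exactly when every one of its ports holds a token.

```python
import operator

# Graph for y = (a + b) * (a - b): two independent nodes feed a multiply.
# Each node has an operation, input ports (None = no token yet), and
# output arcs given as (destination node, destination port) pairs.
nodes = {
    "add": {"op": operator.add, "ports": [None, None], "out": [("mul", 0)]},
    "sub": {"op": operator.sub, "ports": [None, None], "out": [("mul", 1)]},
    "mul": {"op": operator.mul, "ports": [None, None], "out": []},
}

def send(name, port, token, enabled):
    """Place a token on an input port; enable the node once all ports are full."""
    node = nodes[name]
    node["ports"][port] = token
    if all(p is not None for p in node["ports"]):
        enabled.append(name)

def execute(initial_tokens):
    """Fire enabled nodes until none remain (data-driven execution)."""
    enabled, result = [], {}
    for name, port, token in initial_tokens:
        send(name, port, token, enabled)
    while enabled:
        name = enabled.pop()              # any enabled node may fire next
        node = nodes[name]
        value = node["op"](*node["ports"])
        result[name] = value
        for dst, port in node["out"]:     # forward the result token along each arc
            send(dst, port, value, enabled)
    return result

# Inject a = 6, b = 2 on the input arcs of both add and sub.
res = execute([("add", 0, 6), ("add", 1, 2), ("sub", 0, 6), ("sub", 1, 2)])
print(res)   # mul computes (6 + 2) * (6 - 2)
```

Note that `add` and `sub` are both enabled at once and may fire in either order; the result is the same, which is exactly the determinacy that the partial ordering of arcs provides.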
Under the dataflow execution model, there is no control flow at all, so the problem of synchronizing data with control flow disappears, making dataflow programs well suited for parallel processing. In a dataflow graph, the arcs between instructions directly reflect the partial ordering imposed by their data dependencies [16]. The dataflow execution model is remarkably powerful and, in recent years, has been widely used in cloud and distributed computing [17]. Hardware architectures based on the dataflow execution model, however, have not had an equal amount of success, primarily for four reasons:

• Insufficient Amount of Parallelism: At the instruction level, many conventional programs do not have enough intrinsic parallelism to utilize a realistic dataflow hardware, except when processing large arrays. The lack of ILP in conventional programs begs the question: is the instruction the right level of abstraction for dataflow architectures?

• Explosion of Parallelism: Conversely, the very lack of central control that enables parallelism may lead to an uncontrolled (and uncontrollable) demand for parallel resources, leading to deadlocks.

• Producer-Consumer Speed Mismatch: When the speeds of producer and consumer nodes are mismatched, either the producer (or consumer) will have to stall, leading to
Keywords
dataflow architectures, processing systems, autonomous machines