PPOAccel: A High-Throughput Acceleration Framework for Proximal Policy Optimization

IEEE Transactions on Parallel and Distributed Systems (2022)

Abstract
Reinforcement Learning (RL) is a major branch of AI that enables agents to learn optimal decision making through interaction with an environment. Proximal Policy Optimization (PPO) is a state-of-the-art policy-optimization-based RL algorithm that achieves superior overall performance on various benchmarks. A PPO agent iteratively optimizes its policy - a function, approximated by a DNN, that chooses optimal actions - with each iteration consisting of two computationally intensive phases: Sample Generation, where the agent performs inference on its policy and interacts with the environment to collect data, and Model Update, where the policy is trained using the collected data. In this paper, we develop the first high-throughput PPO accelerator on a CPU-FPGA heterogeneous platform. Our unified systolic-array-based design accelerates both the inference and the training of the deep neural network used in the RL algorithm, and it generalizes to various MLP and CNN models across a wide range of RL applications. We develop novel optimizations to simultaneously reduce data-access and computation latencies, specifically: (a) optimal data-flow mapping to the systolic array, (b) a novel memory-blocked data layout that enables streaming, stall-free data access in both forward and backward propagation, and (c) a systolic-array compute-sharing technique to mitigate load imbalance in the training of two networks. We evaluate our design on widely used robotics and gaming benchmarks, achieving 1.4×–26× and 1.3×–2.7× throughput improvements, respectively, compared with state-of-the-art CPU/CPU-GPU implementations.
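For readers unfamiliar with the two-phase structure the abstract refers to, the sketch below illustrates one PPO iteration in plain Python. It is a minimal illustration, not the paper's accelerator implementation; `env`, `policy`, and `ppo_update` are hypothetical placeholders standing in for an RL environment, the policy DNN, and the clipped-surrogate training step.

```python
# Minimal sketch of one PPO iteration as described in the abstract.
# Not the paper's CPU-FPGA design; env, policy, and ppo_update are
# hypothetical placeholders.

def ppo_iteration(env, policy, num_steps, num_epochs):
    # Phase 1: Sample Generation -- run policy inference while
    # interacting with the environment to collect a batch of data.
    batch = []
    obs = env.reset()
    for _ in range(num_steps):
        action, log_prob, value = policy.act(obs)   # DNN inference
        next_obs, reward, done = env.step(action)
        batch.append((obs, action, log_prob, value, reward, done))
        obs = env.reset() if done else next_obs

    # Phase 2: Model Update -- train the policy network for several
    # epochs on the collected batch (e.g., with the clipped PPO loss).
    for _ in range(num_epochs):
        ppo_update(policy, batch)                    # DNN training
```

In the paper's setting, both phases map onto the same unified systolic array: Phase 1 exercises forward propagation only, while Phase 2 requires forward and backward propagation, which is why stall-free data access in both directions matters.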
Keywords
Reinforcement learning, hardware accelerators, FPGA