Accelerating Recommendation Inference via GPU Streams

Yuean Niu, Zhizhen Xu, Chen Xu, Jiaqiang Wang

Database Systems for Advanced Applications (2023)

Abstract
Deep learning based recommendation is common in various recommendation services and widely used in industry. To predict user preferences accurately, state-of-the-art recommendation models contain an increasing number of features and various methods of feature interaction, both of which lengthen inference time. We observe that the embedding lookups and feature interactions of different features in a recommendation model are independent of each other. However, current deep learning frameworks (e.g., TensorFlow, PyTorch) are oblivious to this independence and schedule the operators to execute sequentially in a single computational stream. In this work, we exploit multiple CUDA streams to parallelize the execution of embedding lookup and feature interaction. To further overlap the processing of different sparse features and minimize synchronization overhead, we propose a topology-aware operator assignment algorithm that schedules operators to computational streams. We implement a prototype, namely StreamRec, based on TensorFlow XLA. Our experiments show that StreamRec reduces latency by up to 27.8% and increases throughput by up to 52% in comparison to the original TensorFlow XLA.
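As a rough illustration of the idea only (not the authors' StreamRec implementation, which is built on TensorFlow XLA), the following PyTorch sketch shows how independent embedding lookups for different sparse features can be dispatched on separate CUDA streams and synchronized before downstream feature interaction. The table sizes, stream count, and the helper name parallel_embedding_lookup are hypothetical.

import torch
import torch.nn as nn

# Hypothetical setup: one embedding table per sparse feature, each looked up
# on its own CUDA stream so the lookups can overlap on the GPU.
num_tables = 4
tables = nn.ModuleList(
    [nn.Embedding(10_000, 64).cuda() for _ in range(num_tables)]
)
streams = [torch.cuda.Stream() for _ in range(num_tables)]

def parallel_embedding_lookup(indices_per_table):
    """Run each table's lookup on a dedicated stream, then synchronize."""
    outputs = [None] * num_tables
    for i, (table, idx, stream) in enumerate(
        zip(tables, indices_per_table, streams)
    ):
        # Make the side stream wait for work already queued on the default
        # stream (e.g., the index tensors) before using it.
        stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(stream):
            outputs[i] = table(idx)
    # The default stream must wait for every lookup stream before the
    # feature-interaction operators consume the embeddings.
    for stream in streams:
        torch.cuda.current_stream().wait_stream(stream)
    # Concatenation stands in for the actual feature-interaction operators.
    return torch.cat(outputs, dim=-1)

# Example usage with random sparse feature indices.
batch = 256
indices = [
    torch.randint(0, 10_000, (batch,), device="cuda")
    for _ in range(num_tables)
]
embeddings = parallel_embedding_lookup(indices)

The paper's topology-aware operator assignment goes further than this sketch by deciding which operators share a stream so that cross-stream synchronization is minimized.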
Keywords
GPU streams, recommendation inference