Tetris: Memory-efficient Serverless Inference through Tensor Sharing

USENIX Annual Technical Conference (USENIX ATC), 2022

Cited by 22 | Viewed 32
Abstract
Executing complex, memory-intensive deep learning inference services poses a major challenge for serverless computing frameworks, which must densely deploy and maintain inference models at high throughput. We observe excessive memory consumption in serverless inference systems, caused by large model sizes and high data redundancy. We present TETRIS, a serverless platform tailored to inference services that achieves an order-of-magnitude lower memory footprint. TETRIS's design centers on extensive memory sharing of both runtimes and tensors: it minimizes runtime redundancy through a combined optimization of batching and concurrent execution, and it eliminates tensor redundancy across instances of the same or different functions with a lightweight, safe tensor-mapping mechanism. Our comprehensive evaluation demonstrates that TETRIS saves up to 93% of the memory footprint of inference services and increases function density by 30x without impairing latency.
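
The tensor-sharing idea can be illustrated with a minimal sketch (not TETRIS's actual implementation): instances that load identical weight tensors attach to a single shared-memory copy instead of each materializing their own. The registry `_segments` and the helper `share_tensor` below are hypothetical names introduced for illustration; the sketch assumes Python 3.8+ with NumPy and, for brevity, deduplicates by content hash within one process, whereas TETRIS maps tensors across function instances.

```python
import hashlib
import numpy as np
from multiprocessing import shared_memory

# Hypothetical registry: content hash -> shared-memory segment.
# A real system would make this cross-process; TETRIS uses its own
# lightweight, safe tensor-mapping mechanism.
_segments = {}

def share_tensor(tensor: np.ndarray) -> np.ndarray:
    """Return a read-only view of `tensor` backed by a deduplicated
    shared-memory segment; identical tensors share one physical copy."""
    key = hashlib.sha256(tensor.tobytes()).hexdigest()
    shm = _segments.get(key)
    if shm is None:
        # First time we see these bytes: create one shared segment.
        shm = shared_memory.SharedMemory(create=True, size=tensor.nbytes)
        shm.buf[:tensor.nbytes] = tensor.tobytes()
        _segments[key] = shm
    view = np.ndarray(tensor.shape, dtype=tensor.dtype, buffer=shm.buf)
    view.flags.writeable = False  # sharing is safe only if nobody writes
    return view

# Usage: two "instances" loading the same weights consume one copy.
w1 = share_tensor(np.ones((1024, 1024), dtype=np.float32))
w2 = share_tensor(np.ones((1024, 1024), dtype=np.float32))
assert np.shares_memory(w1, w2)
```

Keying segments by a content hash trades a one-time hashing cost at load for eliminating the N-1 duplicate copies across instances; marking the shared view read-only is what keeps the sharing safe.
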