Switch-Assistant Loss Recovery for RDMA Transport Control

IEEE/ACM Transactions on Networking (2023)

Abstract
RoCEv2 (RDMA over Converged Ethernet version 2) is the canonical method for deploying RDMA in Ethernet-based datacenters. Traditionally, RoCEv2 runs over a lossless network, which is achieved by enabling Priority Flow Control (PFC) within the network. However, as datacenters scale up, PFC's side effects, such as head-of-line blocking, congestion spreading, and pause frame storms, are amplified. Datacenter operators can no longer tolerate these problems and are therefore seeking PFC alternatives for RDMA networks. Rather than aiming for a lossless RDMA network, we instead handle packet loss effectively to support RDMA over Ethernet. In this paper, we propose Switch-assistant Loss Recovery (SLR), a switch building block that enhances RoCEv2's loss recovery. Specifically, SLR-enabled switches send loss notifications to request fast retransmissions. To cooperate with go-back-N retransmission, SLR generates loss notifications only when expected packets (i.e., the in-order packets expected by receivers) are dropped and then filters out unexpected packets, which avoids timeouts and prevents exacerbating congestion. Further, we adapt SLR to multi-bottleneck scenarios by inferring expected packets across multiple switch views. We implement SLR prototypes on commodity programmable switches. Evaluations show that SLR reduces the 99.9th-percentile FCT slowdown by up to 21.6x compared to PFC and other state-of-the-art schemes.
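The single-switch behavior described in the abstract (notify only when an expected packet is dropped, then filter unexpected packets) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: the class names, the per-flow state layout, the starting PSN of 0, and the `dropped` flag standing in for the switch's own drop decision are all hypothetical, and PSN wraparound and the multi-bottleneck case are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class FlowState:
    # Next in-order PSN the receiver is assumed to expect (assumption: starts at 0).
    expected_psn: int = 0


@dataclass
class SlrSwitchSketch:
    """Hypothetical single-switch model of the notify-then-filter idea."""
    flows: dict = field(default_factory=dict)
    notifications: list = field(default_factory=list)  # (flow_id, psn) sent to the sender

    def on_packet(self, flow_id, psn, dropped):
        """Return 'forward', 'drop', or 'filter' for one data packet.

        `dropped` models the switch's own decision to drop the packet
        (e.g. buffer overflow); it is an input here, not computed.
        """
        st = self.flows.setdefault(flow_id, FlowState())
        if psn > st.expected_psn:
            # Unexpected packet: a go-back-N receiver would discard it anyway,
            # so the switch filters it instead of spending bandwidth on it.
            return "filter"
        if psn < st.expected_psn:
            # Already-forwarded PSN (duplicate); passed through in this sketch.
            return "forward"
        if dropped:
            # The expected packet itself is lost: request a fast retransmission.
            self.notifications.append((flow_id, psn))
            return "drop"
        st.expected_psn += 1
        return "forward"
```

In this model, dropping PSN 1 yields one notification, the following PSN 2 is filtered, and forwarding resumes once the retransmitted PSN 1 passes through the switch.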
Keywords
Data center networks, transport protocol, RDMA, programmable switch