
Congestion Control for Large-Scale RDMA Deployments

Modern datacenter applications demand high throughput (40Gbps) and ultra-low latency (< 10 µs per hop) from the network, with low CPU overhead. Standard TCP/IP stacks cannot meet these requirements, but Remote Direct Memory Access (RDMA) can. On IP-routed datacenter networks, RDMA is deployed using the RoCEv2 protocol, which relies on Priority-based Flow Control (PFC) to enable a drop-free network. However, PFC can lead to poor application performance due to problems like head-of-line blocking and unfairness. To alleviate these problems, we introduce DCQCN, an end-to-end congestion control scheme for RoCEv2. To optimize DCQCN performance, we build a fluid model and provide guidelines for tuning switch buffer thresholds and other protocol parameters. Using a 3-tier Clos network testbed, we show that DCQCN dramatically improves the throughput and fairness of RoCEv2 RDMA traffic. DCQCN is implemented in Mellanox NICs, and is being deployed in Microsoft’s datacenters.

Introduction

Datacenter applications like cloud storage [16] need high bandwidth (40Gbps or more) to meet rising customer demand. Traditional TCP/IP stacks cannot be used at such speeds, since they have very high CPU overhead [29]. The brutal economics of the cloud services business dictates that CPU usage that cannot be monetized should be minimized: a core spent on supporting high TCP throughput is a core that cannot be sold as a VM. Other applications such as distributed memory caches [10, 30] and large-scale machine learning demand ultra-low latency (less than 10 µs per hop) message transfers. Traditional TCP/IP stacks have far higher latency [10].

We are deploying Remote Direct Memory Access (RDMA) technology in Microsoft’s datacenters to provide ultra-low latency and high throughput to applications, with very low CPU overhead. With RDMA, network interface cards (NICs) transfer data in and out of pre-registered memory buffers at both end hosts. The networking protocol is implemented entirely on the NICs, bypassing the host networking stack. The bypass significantly reduces CPU overhead and overall latency. To simplify design and implementation, the protocol assumes a lossless networking fabric.
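As a concrete illustration of the "pre-registered memory buffer" step, the sketch below registers a buffer with the NIC using the standard libibverbs API. It is a minimal sketch under stated assumptions, not our implementation: it assumes a protection domain has already been allocated, and device setup, queue-pair creation, error handling, and deregistration are omitted.

```c
/* Minimal sketch of RDMA buffer registration via libibverbs.
 * Assumes a protection domain (pd) was already allocated with
 * ibv_alloc_pd(); error handling and cleanup are omitted. */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (buf == NULL)
        return NULL;

    /* Pin the buffer and hand its address translation to the NIC,
     * so the NIC can DMA into and out of it without involving the
     * host networking stack. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```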

While the HPC community has long used RDMA in special-purpose clusters [11, 24, 26, 32, 38], deploying RDMA on a large scale in modern, IP-routed datacenter networks presents a number of challenges. One key challenge is the need for a congestion control protocol that can operate efficiently in a high-speed, lossless environment, and that can be implemented on the NIC.

We have developed a protocol, called Datacenter QCN (DCQCN) for this purpose. DCQCN builds upon the congestion control components defined in the RoCEv2 standard. DCQCN is implemented in Mellanox NICs, and is currently being deployed in Microsoft’s datacenters.

To understand the need for DCQCN, it is useful to point out that historically, RDMA was deployed using InfiniBand (IB) [19, 21] technology. IB uses a custom networking stack, and purpose-built hardware. The IB link layer (L2) uses hop-by-hop, credit-based flow control to prevent packet drops due to buffer overflow. The lossless L2 allows the IB transport protocol (L4) to be simple and highly efficient. Much of the IB protocol stack is implemented on the NIC. IB supports RDMA with so-called single-sided operations, in which a server registers a memory buffer with its NIC, and clients read (write) from (to) it, without further involvement of the server’s CPU.
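To make "single-sided" concrete, the sketch below shows a client posting an RDMA READ against a server's registered buffer using libibverbs; the server's CPU is never involved. It assumes an already-connected reliable (RC) queue pair and that the remote buffer's address and rkey were exchanged out of band; completion polling and error handling are omitted.

```c
/* Hedged sketch of a single-sided RDMA READ via libibverbs.
 * Assumes a connected RC queue pair (qp), a locally registered
 * buffer (mr/local_buf), and the remote address/rkey obtained out
 * of band. Completion polling and error handling are omitted. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                   size_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;  /* single-sided: no remote CPU involved */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* The NIC performs the read; completion is reported on the send CQ. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```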

However, the IB networking stack cannot be easily deployed in modern datacenters. Modern datacenters are built with IP and Ethernet technologies, and the IB stack is incompatible with these. DC operators are reluctant to deploy and manage two separate networks within the same datacenter. Thus, to enable RDMA over Ethernet and IP networks, the RDMA over Converged Ethernet (RoCE) [20] standard, and its successor RoCEv2 [22], have been defined. RoCEv2 retains the IB transport layer, but replaces the IB networking layer (L3) with IP and UDP encapsulation, and replaces the IB L2 with Ethernet. The IP header is needed for routing, while the UDP header is needed for ECMP [15].
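The resulting on-wire layering can be pictured as below. This is only an illustrative layout (fixed-size IPv4 header, no VLAN tag), not a parser: the point is that the IB transport headers ride unchanged inside UDP/IP/Ethernet, and varying the UDP source port gives switches the entropy needed for ECMP.

```c
/* Illustrative RoCEv2 encapsulation (simplified: IPv4, no VLAN tag).
 * The IB transport (BTH and above) is retained; IB L3/L2 are replaced
 * by IP/UDP and Ethernet. */
#include <stdint.h>

struct rocev2_frame {
    uint8_t eth_hdr[14];  /* Ethernet II: replaces the IB link layer (L2)               */
    uint8_t ip_hdr[20];   /* IPv4: replaces the IB network layer (L3), enables routing  */
    uint8_t udp_hdr[8];   /* UDP, dst port 4791; source port varied for ECMP            */
    uint8_t ib_bth[12];   /* IB Base Transport Header: IB transport (L4) retained       */
    /* ... payload, ICRC (4 bytes), and Ethernet FCS follow ... */
};
```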

To enable efficient operation, like IB, RoCEv2 must also be deployed over a lossless L2. To this end, RoCE is deployed using Priority-based Flow Control (PFC) [18]. PFC allows an Ethernet switch to avoid buffer overflow by forcing the immediate upstream entity (either another switch or a host NIC) to pause data transmission. However, PFC is a coarse-grained mechanism. It operates at port (or, port plus priority) level, and does not distinguish between flows. This can cause congestion-spreading, leading to poor performance [1, 37].
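The coarseness of PFC can be seen from the decision it makes. The sketch below is schematic only (the threshold names and pause/resume helpers are hypothetical, not a real switch API): pausing is keyed on (ingress port, priority) with no per-flow state, so a single congested flow pauses every flow sharing that port and priority, which is how congestion spreads.

```c
/* Schematic sketch of PFC's per-(port, priority) pause decision.
 * The XOFF/XON thresholds and pause/resume helpers are hypothetical
 * names for illustration, not a switch API. There is no per-flow
 * state: crossing the threshold pauses ALL traffic on this
 * port+priority. */
#include <stdio.h>

struct ingress_queue {
    int bytes_buffered;   /* bytes buffered for this (port, priority) */
    int paused;           /* 1 if PAUSE has been sent upstream        */
};

/* Stand-ins for generating IEEE 802.1Qbb pause/resume frames. */
static void send_pause_frame(int port, int prio)  { printf("PAUSE  port=%d prio=%d\n", port, prio); }
static void send_resume_frame(int port, int prio) { printf("RESUME port=%d prio=%d\n", port, prio); }

void pfc_check(struct ingress_queue *q, int port, int prio,
               int xoff_bytes, int xon_bytes)
{
    if (!q->paused && q->bytes_buffered >= xoff_bytes) {
        send_pause_frame(port, prio);   /* pause the immediate upstream entity */
        q->paused = 1;
    } else if (q->paused && q->bytes_buffered <= xon_bytes) {
        send_resume_frame(port, prio);  /* resume once the queue drains */
        q->paused = 0;
    }
}
```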

The fundamental solution to PFC’s limitations is a flow-level congestion control protocol. In our environment, the protocol must meet the following requirements: (i) function over lossless, L3 routed, datacenter networks, (ii) incur low CPU overhead on end hosts, and (iii) provide hyper-fast start in the common case of no congestion. Current proposals for congestion control in DC networks do not meet all our requirements. For example, QCN [17] does not support L3 networks. DCTCP [2] and iWarp [35] include a slow start phase, which can result in poor performance for bursty storage workloads. DCTCP and TCP-Bolt [37] are implemented in software, and can have high CPU overhead.

Since none of the current proposals meet all our requirements, we have designed DCQCN. DCQCN is an end-to-end congestion control protocol for RoCEv2, to enable deployment of RDMA in large, IP-routed datacenter networks. DCQCN requires only the standard RED [13] and ECN [34] support from the datacenter switches. The rest of the protocol functionality is implemented on the end host NICs. DCQCN provides fast convergence to fairness, achieves high link utilization, and ensures low queue buildup and low queue oscillation.
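For reference, the switch-side support DCQCN needs is just a RED-style ECN marking profile. The sketch below models the marking probability as a function of queue length, with Kmin, Kmax, and Pmax as the configurable thresholds; it is an illustrative model of the switch configuration, not switch firmware.

```c
/* Sketch of the RED-style ECN marking profile DCQCN relies on at the
 * switch: no marking below Kmin, marking probability rising linearly
 * to Pmax between Kmin and Kmax, and every packet marked above Kmax.
 * Illustrative model only. */
double ecn_mark_probability(double queue_bytes,
                            double kmin, double kmax, double pmax)
{
    if (queue_bytes <= kmin)
        return 0.0;
    if (queue_bytes >= kmax)
        return 1.0;
    return pmax * (queue_bytes - kmin) / (kmax - kmin);
}
```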

The paper is organized as follows. In §2 we present evidence to justify the need for DCQCN. The detailed design of DCQCN is presented in §3, along with a brief summary of hardware implementation. In §4 we show how to set the PFC and ECN buffer thresholds to ensure correct operation of DCQCN. In §5 we describe a fluid model of DCQCN, and use it to tune protocol parameters. In §6, we evaluate the performance of DCQCN using a 3-tier testbed and traces from our datacenters. Our evaluation shows that DCQCN dramatically improves throughput and fairness of RoCEv2 RDMA traffic. In some scenarios, it allows us to handle as much as 16x more user traffic. Finally, in §7, we discuss practical issues such as non-congestion packet losses.