HPCC: High Precision Congestion Control¶
Congestion control (CC) is the key to achieving ultra-low latency, high bandwidth and network stability in high-speed networks. From years of experience operating large-scale and high-speed RDMA networks, we find the existing high-speed CC schemes have inherent limitations for reaching these goals. In this paper, we present HPCC (High Precision Congestion Control), a new high-speed CC mechanism which achieves the three goals simultaneously. HPCC leverages in-network telemetry (INT) to obtain precise link load information and controls traffic precisely. By addressing challenges such as delayed INT information during congestion and overreaction to INT information, HPCC can quickly converge to utilize free bandwidth while avoiding congestion, and can maintain near-zero in-network queues for ultra-low latency. HPCC is also fair and easy to deploy in hardware. We implement HPCC with commodity programmable NICs and switches. In our evaluation, compared to DCQCN and TIMELY, HPCC shortens flow completion times by up to 95%, causing little congestion even under large-scale incasts.
Introduction¶
The link speed in data center networks has grown from 1Gbps to 100Gbps in the past decade, and this growth is continuing. Ultra-low latency and high bandwidth, which are demanded by more and more applications, are two critical requirements in today’s and future high-speed networks.
Specifically, as one of the largest cloud providers in the world, we observe two critical trends in our data centers that drive the demand for high-speed networks. The first trend is new data center architectures like resource disaggregation and heterogeneous computing. In resource disaggregation, CPUs need high-speed networking with remote resources like GPU, memory and disk. According to a recent study [17], resource disaggregation requires 3-5µs network latency and 40-100Gbps network bandwidth to maintain good application-level performance. In heterogeneous computing environments, different computing chips, e.g. CPU, FPGA, and GPU, also need high-speed interconnections, and the lower the latency, the better. The second trend is new applications like storage on high I/O speed media, e.g. NVMe (non-volatile memory express), and large-scale machine learning training on high computation speed devices, e.g. GPU and ASIC. These applications periodically transfer large volumes of data, and their performance bottleneck is usually in the network since their storage and computation speeds are very fast.
Given that traditional software-based network stacks in hosts can no longer sustain the critical latency and bandwidth requirements [43], offloading network stacks into hardware is an inevitable direction in high-speed networks. In recent years, we deployed large-scale networks with RDMA (remote direct memory access) over Converged Ethernet Version 2 (RoCEv2) in our data centers as our current hardware-offloading solution.
Unfortunately, after operating large-scale RoCEv2 networks for years, we find that RDMA networks face fundamental challenges to reconcile low latency, high bandwidth utilization, and high stability. This is because high speed implies that flows start at line rate and aggressively grab available network capacity, which can easily cause severe congestion in large-scale networks. In addition, high throughput usually results in deep packet queueing, which undermines the performance of latency-sensitive flows and the ability of the network to handle unexpected congestion. We highlight two representative cases among the many we encountered in practice to demonstrate the difficulty:
Case-1: PFC (priority flow control) storms. A cloud storage (test) cluster with RDMA once encountered a network-wide, large-amplitude traffic drop due to a long-lasting PFC storm. This was triggered by a large incast event together with a vendor bug which caused the switch to keep sending PFC pause frames indefinitely. Because incast events and congestion are the norm in this type of cluster, and we are not sure whether there will be other vendor bugs that create PFC storms, we decided to try our best to prevent any PFC pauses. Therefore, we tuned the CC algorithm to reduce rates quickly and increase rates conservatively to avoid triggering PFC pauses. We did get fewer PFC pauses (lower risk), but the average link utilization in the network was very low (higher cost).
Case-2: Surprisingly long latency. A machine learning (ML) application complained about >100µs average latency for short messages; its expectation was a tail latency of <50µs with RDMA. The reason for the long latency, which we finally dug out, was that the in-network queues were mostly occupied by a bandwidth-intensive cloud storage system in the same cluster. As a result, we had to separate the two applications by deploying the ML application to a new cluster. The new cluster had low utilization (higher cost), given that the ML application is not very bandwidth hungry.
To address the difficulty of reconciling latency, bandwidth/utilization, and stability, we believe a good design of CC is the key. This is because CC is the primary mechanism to avoid packet buffering or loss under high traffic loads. If CC fails frequently, backup methods like PFC or packet retransmissions either introduce stability concerns or suffer a large performance penalty. Unfortunately, we found that state-of-the-art CC mechanisms in RDMA networks, such as DCQCN [43] and TIMELY [31], have some essential limitations:
Slow convergence. With coarse-grained feedback signals, such as ECN or RTT, current CC schemes do not know exactly how much to increase or decrease sending rates. Therefore, they use heuristics to guess the rate updates and try to iteratively converge to a stable rate distribution. Such iterative methods are slow for handling large-scale congestion events [25], as we can see in Case-1.
Unavoidable packet queueing. A DCQCN sender leverages the one-bit ECN mark to judge the risk of congestion, and a TIMELY sender uses the increase of RTT to detect congestion. Therefore, the sender starts to reduce flow rates only after a queue builds up. These built-up queues can significantly increase the network latency, and this is exactly the issue met by the ML application at the beginning in Case-2.
Complicated parameter tuning. The heuristics used by current CC algorithms to adjust sending rates have many parameters to tune for a specific network environment. For instance, DCQCN has 15 knobs to set up. As a result, operators usually face a complex and time-consuming parameter tuning stage in daily RDMA network operations, which significantly increases the risk of incorrect settings that cause instability or poor performance.
The fundamental cause of the preceding three limitations is the lack of fine-grained network load information in legacy networks – ECN is the only feedback an end host can get from switches, and RTT is a pure end-to-end measurement without switches’ involvement. However, this situation has recently changed. With in-network telemetry (INT) features that have become available in new switching ASICs [2–4], obtaining fine-grained network load information and using it to improve CC has become possible in production networks.
In this paper, we propose a new CC mechanism, HPCC (High Precision Congestion Control), for large-scale, high-speed networks. The key idea behind HPCC is to leverage the precise link load information from INT to compute accurate flow rate updates. Unlike existing approaches that often require a large number of iterations to find the proper flow rates, HPCC requires only one rate update step in most cases. Using precise information from INT enables HPCC to address the three limitations in current CC schemes. First, HPCC senders can quickly ramp up flow rates for high utilization or ramp down flow rates for congestion avoidance. Second, HPCC senders can quickly adjust the flow rates to keep each link’s input rate slightly lower than the link’s capacity, preventing queues from being built-up as well as preserving high link utilization. Finally, since sending rates are computed precisely based on direct measurements at switches, HPCC requires merely 3 independent parameters that are used to tune fairness and efficiency.
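The "precise link load information" above can be made concrete with a small sketch: each INT snapshot from a switch reports the egress queue length, a transmitted-byte counter, a timestamp, and the link bandwidth, and the sender combines two consecutive snapshots into a normalized utilization estimate. The field names (`qlen`, `tx_bytes`, `ts`, `bandwidth`) are illustrative assumptions, not the real INT header layout.

```python
def link_utilization(prev, curr, tau):
    """Estimate normalized utilization U of one link from two consecutive
    INT snapshots of that link, where tau is roughly the base RTT.

    U = qlen / (tau * B) + txRate / B, with txRate derived from the
    difference of the tx-byte counters over the snapshot interval.
    """
    tx_rate = (curr["tx_bytes"] - prev["tx_bytes"]) / (curr["ts"] - prev["ts"])
    B = curr["bandwidth"]  # link capacity in bytes per second
    return curr["qlen"] / (tau * B) + tx_rate / B


def path_utilization(prev_hops, curr_hops, tau):
    """A sender reacts to the most loaded hop on the path."""
    return max(link_utilization(p, c, tau) for p, c in zip(prev_hops, curr_hops))
```

With U computed this way, a value above 1 means the link is receiving more than it can drain (a queue is growing), and a value below the target means there is free bandwidth to claim.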
On the flip side, leveraging INT information in CC is not straightforward. There are two main challenges in designing HPCC. First, INT information piggybacked on packets can be delayed by link congestion, which can defer the flow rate reduction needed to resolve the congestion. In HPCC, our CC algorithm aims to limit and control the total inflight bytes to busy links, preventing senders from sending extra traffic even if the feedback gets delayed. Second, although INT information is present in all the ACK packets, there can be destructive overreactions if a sender blindly reacts to all the information for fast reaction (§3.2). Our CC algorithm selectively uses INT information by combining per-ACK and per-RTT reactions, achieving fast reaction without overreaction.
HPCC meets our goals of achieving ultra-low latency, high bandwidth, and high stability simultaneously in large-scale high-speed networks. In addition, HPCC also has the following essential properties for being practical: (i) Deployment ready: It merely requires standard INT features (with a trivial and optional extension for efficiency) in switches and is friendly to implementation in NIC hardware. (ii) Fairness: It separates efficiency and fairness control. It uses multiplicative increase and decrease to converge quickly to the proper rate on each link, ensuring efficiency and stability, while it uses additive increase to move towards fairness for long flows.
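The efficiency/fairness split described above can be sketched as a single window-update rule: a multiplicative term pushes utilization toward a target eta (shrinking the window when the path is over-utilized, growing it when bandwidth is free), while a small additive term moves long flows toward fairness. The constant values here are illustrative assumptions, not the paper's tuned parameters.

```python
ETA = 0.95    # target utilization; one of HPCC's few knobs (assumed value)
W_AI = 1500   # additive-increase step in bytes, for fairness (assumed value)


def update_window(w_ref, U):
    """Compute a new sending window from a reference window and the
    measured path utilization U.

    - U > ETA: multiplicative decrease (congestion building up)
    - U < ETA: multiplicative increase (free bandwidth available)
    - W_AI:    small additive term that drives long flows toward fairness
    """
    return w_ref / (U / ETA) + W_AI
```

Because the adjustment is computed directly from measured utilization rather than guessed iteratively, a single step lands close to the proper rate in most cases.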
HPCC’s stability and fairness are guaranteed in theory (Appendix A). We implement HPCC on commodity NICs with FPGA programmability and commodity switching ASICs with P4 programmability. With testbed experiments and large-scale simulations, we show that compared with DCQCN, TIMELY, and other alternatives, HPCC reacts faster to available bandwidth and congestion and maintains close-to-zero queues. In our 32-server testbed, even under 50% traffic load, HPCC keeps the queue size at zero at the median and 22.9KB (only 7.3µs queueing delay) at the 99th percentile, which results in a 95% reduction in the 99th-percentile latency compared to DCQCN without sacrificing throughput. In our 320-server simulation, even under incast events where PFC storms happen frequently with DCQCN and TIMELY, PFC pauses are not triggered with HPCC.
Note that although HPCC was designed from our experience with RDMA networks, we believe its insights and designs are also suitable for other high-speed networking solutions in general.
Core feature: how In-Network Telemetry (INT) is implemented¶
At the core of HPCC (High Precision Congestion Control) is the use of In-Network Telemetry (INT) to obtain precise network load information and perform precise congestion control. The implementation is a complete closed loop involving the sender, the switches, and the receiver:
- Step 1: switches embed load information in data packets
- As a data packet travels from the sender, every switch it traverses uses its INT feature to insert a small piece of metadata into the packet header
- This metadata precisely reports the current load of the packet's egress port on that switch
- Step 2: the receiver relays the information back to the sender
- When the receiver gets a data packet carrying INT metadata, it does not process the information itself
- Instead, it copies the metadata recorded by all the switches along the path, unmodified, into the ACK packet it is about to send back to the sender
- Step 3: the sender adjusts its sending behavior based on the precise information
- For every ACK it receives, the sender reads the precise per-hop load information it carries
- With this information (much like a detailed "traffic report" for the path), the sender can precisely compute how to adjust its sending rate or window size
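The three steps above can be sketched as a toy closed loop: each switch appends its egress-port load to the packet, the receiver echoes the telemetry verbatim in the ACK, and the sender reads the per-hop report. The data structures are illustrative, not the real INT header format.

```python
def switch_forward(packet, port_state):
    """Step 1: the switch embeds its egress-port load metadata."""
    packet["int"].append(dict(port_state))
    return packet


def receiver_ack(packet):
    """Step 2: the receiver copies the INT metadata, unmodified, into the ACK."""
    return {"ack_for": packet["seq"], "int": list(packet["int"])}


# One trip across a two-switch path (illustrative numbers):
pkt = {"seq": 1, "int": []}
path = [
    {"hop": "sw1", "qlen": 0, "tx_bytes": 1_000_000, "bandwidth": 12.5e9},
    {"hop": "sw2", "qlen": 30_000, "tx_bytes": 2_000_000, "bandwidth": 12.5e9},
]
for state in path:
    pkt = switch_forward(pkt, state)
ack = receiver_ack(pkt)
# Step 3: the sender now sees the load of every hop on the path in `ack`.
```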
How does HPCC ensure timeliness? Won't the feedback be delayed, and is delayed INT information still useful as a reference?
INT information can indeed be delayed, because the data packets (and their ACKs) that carry it can themselves be delayed. Delayed INT information is still valuable, provided the control algorithm handles the delay properly.
Mechanism 1: tolerating delay by controlling inflight bytes
HPCC directly controls the inflight bytes, i.e., the amount of data that has been sent but not yet acknowledged
The total amount of data a sender can transmit is bounded by its current sending window. Once the window is exhausted, the sender must stop sending, no matter how long the feedback (ACKs) is delayed
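A minimal sketch of this window limit, with assumed names for illustration: once inflight bytes reach the window, sending stops regardless of how late the ACKs arrive, so a sender can never push extra traffic into a congested link while waiting for delayed feedback.

```python
class WindowedSender:
    """Caps inflight bytes (sent but unacknowledged data) at the window."""

    def __init__(self, window_bytes):
        self.window = window_bytes
        self.inflight = 0

    def can_send(self, pkt_bytes):
        # Sending is allowed only while the window is not exhausted.
        return self.inflight + pkt_bytes <= self.window

    def send(self, pkt_bytes):
        assert self.can_send(pkt_bytes)
        self.inflight += pkt_bytes

    def on_ack(self, acked_bytes):
        # Only ACKs (however delayed) free up window space.
        self.inflight -= acked_bytes
```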
Mechanism 2: a hybrid of per-ACK and per-RTT reactions, for a fast response without overreaction
- Per-ACK only: react immediately to every ACK received. This is fast, but prone to overreaction, because several ACKs arriving in quick succession may describe the same congestion event
- Per-RTT only: adjust only once per full RTT. This is stable and avoids overreaction, but reacts too slowly to handle sudden congestion such as incast
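One way to combine the two: every ACK recomputes the window immediately (fast reaction), but always relative to a reference window that is committed only once per RTT, so a burst of ACKs describing the same congestion event does not shrink the window repeatedly. This is a simplified sketch of the idea, not the exact HPCC algorithm; the constants and sequence-number bookkeeping are assumptions.

```python
ETA = 0.95    # target utilization (assumed value)
W_AI = 1500   # additive-increase step in bytes (assumed value)


class HybridCC:
    def __init__(self, init_window):
        self.W = init_window           # current sending window
        self.Wc = init_window          # reference window, refreshed once per RTT
        self.last_update_seq = 0       # marks the end of the current RTT round

    def on_ack(self, ack_seq, U, snd_nxt):
        # Per-ACK: react immediately, but always relative to Wc, so N ACKs
        # within one RTT do not compound the same adjustment N times.
        self.W = self.Wc / (U / ETA) + W_AI
        # Per-RTT: commit the reference only when a full round trip has
        # passed since the last commit.
        if ack_seq > self.last_update_seq:
            self.Wc = self.W
            self.last_update_seq = snd_nxt
```

The per-ACK path gives the speed of the first strategy; the per-RTT reference commit gives the stability of the second.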