PowerTCP: Pushing the Performance Limits of Datacenter Networks

Increasingly stringent throughput and latency requirements in datacenter networks demand fast and accurate congestion control. We observe that the reaction time and accuracy of existing datacenter congestion control schemes are inherently limited. They either rely only on explicit feedback about the network state (e.g., queue lengths in DCTCP) or only on variations of state (e.g., RTT gradient in TIMELY). To overcome these limitations, we propose a novel congestion control algorithm, PowerTCP, which achieves much more fine-grained congestion control by adapting to the bandwidth-window product (henceforth called power). PowerTCP leverages in-band network telemetry to react to changes in the network instantaneously without loss of throughput and while keeping queues short. Due to its fast reaction time, our algorithm is particularly well-suited for dynamic network environments and bursty traffic patterns. We show analytically and empirically that PowerTCP can significantly outperform the state-of-the-art in both traditional datacenter topologies and emerging reconfigurable datacenters where frequent bandwidth changes make congestion control challenging. In traditional datacenter networks, PowerTCP reduces tail flow completion times of short flows by 80% compared to DCQCN and TIMELY, and by 33% compared to HPCC even at 60% network load. In reconfigurable datacenters, PowerTCP achieves 85% circuit utilization without incurring additional latency and cuts tail latency by at least 2x compared to existing approaches.

Introduction

The performance of more and more cloud-based applications critically depends on the underlying network, requiring datacenter networks (DCNs) to provide extremely low latency and high bandwidth. For example, in distributed machine learning applications that periodically require large data transfers, the network is increasingly becoming a bottleneck [36]. Similarly, stringent performance requirements are introduced by today’s trend of resource disaggregation in datacenters where fast access to remote resources (e.g., GPUs or memory) is pivotal for the overall system performance [36]. Building systems with strict performance requirements is especially challenging under bursty traffic patterns as they are commonly observed in datacenter networks [12,16,47,53,55].

These requirements introduce the need for fast and accurate network resource management algorithms that optimally utilize the available bandwidth while minimizing packet latencies and flow completion times. Congestion control (CC) plays an important role in this context, being “a key enabler (or limiter) of system performance in the datacenter” [34]. In fact, fast-reacting congestion control is not only essential to efficiently adapt to bursty traffic [29,48], but is also becoming increasingly important in the context of emerging reconfigurable datacenter networks (RDCNs) [13,14,20,33,38,39,50]. In these networks, a congestion control algorithm must be able to quickly ramp up its sending rate when high-bandwidth circuits become available [43].

Traditional congestion control in datacenters revolves around a bottleneck link model: the control action is related to the state, i.e., the queue length, at the bottleneck link. A common goal is to efficiently control queue buildup while achieving high throughput. Existing algorithms can be broadly classified into two types based on the feedback that they react to. In the following, we use an analogy to electrical circuits to describe these two types. The first category of algorithms reacts to the absolute network state, such as the queue length or the RTT: a function of network “effort”, or voltage, defined as the sum of the bandwidth-delay product and in-network queuing. The second category of algorithms instead reacts to variations, such as the change of RTT. Since these changes are related to the network “flow”, we say that these approaches depend on the current, defined as the total transmission rate. We tabulate our analogy and corresponding network quantities in Table 1. According to this classification, we call congestion control protocols such as CUBIC [21], DCTCP [7], or Vegas [15] voltage-based CC algorithms, as they react to absolute properties such as the bottleneck queue length, delay, Explicit Congestion Notification (ECN), or loss. Recent proposals such as TIMELY [41] are current-based CC algorithms, as they react to variations such as the RTT gradient. In conclusion, we find that existing congestion control algorithms are fundamentally limited to one of the two dimensions (voltage or current) in the way they update the congestion window.

We argue that the input to a congestion control algorithm should rather be a function of the two-dimensional state of the network (i.e., both voltage and current) to allow for more informed and accurate reaction, improving performance and stability. In our work, we show that there exists an accurate relationship between the optimal adjustment of the congestion window, the network voltage and the network current. We analytically show that the optimal window adjustment depends on the product of network voltage and network current. We call this product network power: current × voltage, a function of both queue lengths and queue dynamics.
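As a worked illustration of this definition, the three quantities can be written out symbolically. The symbols are notation chosen for this sketch and do not appear in the text above: $b$ is the bottleneck bandwidth, $\tau$ the base RTT, $q$ the queue length, $\dot{q}$ its rate of change, and $\mu$ the rate at which the link transmits. Writing the total transmission rate as $\dot{q} + \mu$ is an assumption of the sketch.

```latex
\begin{align*}
\nu     &= b\,\tau + q      && \text{(voltage: bandwidth-delay product plus in-network queuing)} \\
\lambda &= \dot{q} + \mu    && \text{(current: total transmission rate, assumed form)} \\
\Gamma  &= \lambda \cdot \nu = (\dot{q} + \mu)\,(b\,\tau + q) && \text{(power: current} \times \text{voltage)}
\end{align*}
```

Under these assumptions, a fully utilized link with an empty, stable queue ($q = 0$, $\dot{q} = 0$, $\mu = b$) has power $b^2\tau$. Either a standing queue or a growing queue pushes the product above that reference value, which is why the product depends on both queue lengths and queue dynamics.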

[Table 1: the electrical-circuit analogy and the corresponding network quantities]

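[Figure 1: classification of existing congestion control protocols along the voltage and current dimensions]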

Figure 1 illustrates our classification. Existing protocols depend on a single dimension, voltage or current. This can result in imprecise congestion control as the protocol is unable to distinguish between fundamentally different scenarios, and, as a result, either reacts too slowly or overreacts, both impeding performance. Accounting for both voltage and current, i.e., power, balances accurate inflight control and fast reaction, effectively providing the best of both worlds.
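To make the ambiguity concrete, the following minimal sketch compares three hypothetical bottleneck samples. It is not the authors' implementation. The `LinkSample` type, the constants, and the sample values are invented for illustration, and treating current as the queue growth rate plus the link's drain rate is an assumption.

```python
from dataclasses import dataclass

# Hypothetical bottleneck parameters, chosen only for illustration.
BANDWIDTH = 12.5e9   # bytes/s (a 100 Gbit/s link)
BASE_RTT = 20e-6     # seconds


@dataclass(frozen=True)
class LinkSample:
    """One hypothetical telemetry sample taken at the bottleneck link."""
    queue_bytes: float     # bytes currently queued
    queue_gradient: float  # bytes/s, positive while the queue grows
    tx_rate: float         # bytes/s leaving the link


def voltage(s: LinkSample) -> float:
    # Bandwidth-delay product plus in-network queuing (bytes).
    return BANDWIDTH * BASE_RTT + s.queue_bytes


def current(s: LinkSample) -> float:
    # Total transmission rate, assumed here to be the queue growth
    # rate plus the rate at which the link drains (bytes/s).
    return s.queue_gradient + s.tx_rate


def power(s: LinkSample) -> float:
    # Network power: current x voltage.
    return current(s) * voltage(s)


def normalized_power(s: LinkSample) -> float:
    # Power relative to a fully utilized link with an empty, stable queue.
    return power(s) / (BANDWIDTH * BANDWIDTH * BASE_RTT)


# Same queue length, opposite queue dynamics.
GROWING = LinkSample(100_000, +6.25e9, BANDWIDTH)
DRAINING = LinkSample(100_000, -6.25e9, BANDWIDTH)
# Same dynamics as GROWING, but a much shorter queue.
SHALLOW = LinkSample(10_000, +6.25e9, BANDWIDTH)


def main() -> None:
    samples = (("growing", GROWING), ("draining", DRAINING), ("shallow", SHALLOW))
    for name, s in samples:
        print(
            f"{name:9s} voltage={voltage(s):,.0f} B  "
            f"current={current(s):.3e} B/s  "
            f"normalized power={normalized_power(s):.2f}"
        )


if __name__ == "__main__":
    main()
```

In this sketch the growing and draining samples have identical voltage, so a purely voltage-based rule would treat them alike even though one queue is building and the other is emptying. The growing and shallow samples have identical current, so a purely current-based rule could not tell the deep queue from the short one. The product separates all three cases.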

In this paper we present PowerTCP, a novel power-based congestion control algorithm that accurately captures both voltage and current dimensions for every control action using measurements taken within the network and propagated through in-band network telemetry (INT). PowerTCP is able to utilize available bandwidth within one or two RTTs while being stable, maintaining low queue lengths, and resolving congestion rapidly. Furthermore, we show that PowerTCP is Lyapunov-stable as well as asymptotically stable, and has a convergence time as low as five update intervals (Appendix A). This makes PowerTCP highly suitable for today’s datacenter networks and dynamic network environments such as in reconfigurable datacenters.

PowerTCP leverages in-network measurements at programmable switches to accurately obtain the bottleneck link state. Our switch component is lightweight and the required INT header fields are standard in the literature [36]. We also discuss an approximation of PowerTCP for use with nonprogrammable, legacy switches.

To evaluate PowerTCP, we focus on a deployment scenario in the context of RDMA networks where the CC algorithm is implemented on a NIC. Our results from large-scale simulations show that PowerTCP reduces the 99.9th-percentile short flow completion times by 80% compared to DCQCN [56] and by 33% compared to the state-of-the-art low-latency protocol HPCC [36]. We show that PowerTCP maintains near-zero queue lengths without affecting throughput or incurring long flow completion times even at 80% load. As a case study, we explore the benefits of PowerTCP in reconfigurable datacenter networks where it achieves 80-85% circuit utilization and reduces tail latency by at least 2× compared to the state-of-the-art [43]. Finally, as a proof-of-concept, we implemented PowerTCP in the Linux kernel and the telemetry component on an Intel Tofino programmable line-rate switch using P4 [18].

In summary, our key contributions in this paper are:

• We reveal the shortcomings of existing congestion control approaches, which react either only to the current state or only to the dynamics of the network, and introduce the notion of power to account for both.

• PowerTCP, a power-based approach to congestion control at the end-host which reacts faster to changes in the network, such as the arrival of a burst or fluctuations in available bandwidth.

• An evaluation of the benefits of PowerTCP in traditional DCNs and RDCNs.

• As a contribution to the research community and to facilitate future work, all our artefacts have been made publicly available at: https://powertcp.self-adjusting.net.
