TIMELY: RTT-based Congestion Control for the Datacenter
Datacenter transports aim to deliver low latency messaging together with high throughput. We show that simple packet delay, measured as round-trip times at hosts, is an effective congestion signal without the need for switch feedback. First, we show that advances in NIC hardware have made RTT measurement possible with microsecond accuracy, and that these RTTs are sufficient to estimate switch queueing. Then we describe how TIMELY can adjust transmission rates using RTT gradients to keep packet latency low while delivering high bandwidth. We implement our design in host software running over NICs with OS-bypass capabilities. We show using experiments with up to hundreds of machines on a Clos network topology that it provides excellent performance: turning on TIMELY for OS-bypass messaging over a fabric with PFC lowers 99th-percentile tail latency by 9X while maintaining near line-rate throughput. Our system also outperforms DCTCP running in an optimized kernel, reducing tail latency by 13X. To the best of our knowledge, TIMELY is the first delay-based congestion control protocol for use in the datacenter, and it achieves its results despite having an order of magnitude fewer RTT signals (due to NIC offload) than earlier delay-based schemes such as Vegas.
Introduction
Datacenter networks run tightly-coupled computing tasks that must be responsive to users, e.g., thousands of backend computers may exchange information to serve a user request, and all of the transfers must complete quickly enough for the complete response to be satisfied within 100 ms [24]. To meet these requirements, datacenter transports must simultaneously deliver high bandwidth (≫ Gbps) and utilization at low latency (≪ msec), even though these aspects of performance are at odds. Consistently low latency matters because even a small fraction of late operations can cause a ripple effect that degrades application performance [21]. As a result, datacenter transports must strictly bound latency and packet loss.
Since traditional loss-based transports do not meet these strict requirements, new datacenter transports [10,18,30,35,37,47] take advantage of network support to signal the onset of congestion (e.g., DCTCP [35] and its successors use ECN), introduce flow abstractions to minimize completion latency, cede scheduling to a central controller, and more. However, in this work we take a step back in search of a simpler, immediately deployable design.
The crux of our search is the congestion signal. An ideal signal would have several properties. It would be fine-grained and timely to quickly inform senders about the extent of congestion. It would be discriminative enough to work in complex environments with multiple traffic classes. And, it would be easy to deploy.
Surprisingly, we find that a well-known signal, properly adapted, can meet all of our goals: delay in the form of RTT measurements. RTT is a fine-grained measure of congestion that comes with every acknowledgment. It effectively supports multiple traffic classes by providing an inflated measure for lower-priority transfers that wait behind higher-priority ones. Further, it requires no support from network switches.
How should we understand the claim that RTT "effectively supports multiple traffic classes by providing an inflated measure for lower-priority transfers that wait behind higher-priority ones"?
- What high-priority traffic (flow H) experiences: because it has high priority, it rarely queues at all, or only waits briefly behind other high-priority packets, so its measured RTT stays low.
- What low-priority traffic (flow L) experiences: the RTT measured by flow L's sender becomes very large.
The "inflated" measure is exactly this very large RTT observed by the low-priority flow L. Because it includes the time spent waiting behind the high-priority flow H, it is much larger than the RTT flow L "should" have seen, as if the measurement had been stretched or inflated.
- For flow L (low priority): its sender (e.g., one running TIMELY) sees the RTT rise sharply, concludes that the network is congested, and automatically reduces its sending rate.
- For flow H (high priority): its sender sees a consistently small and stable RTT, concludes that the network is idle, and keeps sending at a high rate.
The net effect:
The system automatically and gracefully achieves exactly the behavior we want: high-priority traffic proceeds unimpeded, while low-priority traffic yields; as soon as it senses the presence of high-priority traffic (through the inflated RTT), it slows down and makes way.
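As a rough worked example (the numbers here are illustrative, not from the paper): on a 10 Gbps link, 500 KB of higher-priority backlog queued ahead of a low-priority packet adds roughly 500 KB × 8 / 10 Gbps = 400 µs of queueing delay to that packet's RTT, which dwarfs a typical datacenter base RTT of a few tens of microseconds. The low-priority sender therefore sees a clearly inflated RTT and slows down, while the high-priority sender, whose packets do not wait behind that backlog, sees no such inflation.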
Delay has been explored in the wide-area Internet since at least TCP Vegas [16], and some modern TCP variants use delay estimates [44, 46]. But this use of delay has not been without problems. Delay-based schemes tend to compete poorly with more aggressive, loss-based schemes, and delay estimates may be wildly inaccurate due to host and network issues, e.g., delayed ACKs and different paths. For these reasons, delay is typically used in hybrid schemes with other indicators such as loss.
Moreover, delay has not been used as a congestion signal in the datacenter because datacenter RTTs are difficult to measure at microsecond granularity. This level of precision is easily overwhelmed by host delays such as interrupt processing for acknowledgments. DCTCP eschews a delay-based scheme saying "the accurate measurement of such small increases in queueing delay is a daunting task." [35]
Our insight is that recent NIC advances do allow datacenter RTTs to be measured with sufficient precision, while the wide-area pitfalls of using delay as a congestion signal do not apply. Recent NICs provide hardware support for high-quality timestamping of packet events [1,3,5,8,9], plus hardware-generated ACKs that remove unpredictable host response delays. Meanwhile, datacenter host software can be controlled to avoid competition with other transports, and multiple paths have similar, small propagation delays.
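A minimal sketch of how such an RTT sample can be derived from NIC timestamps (the struct, field names, and line-rate constant below are illustrative assumptions, not a specific NIC API): the sender records when a segment is handed to the NIC, the hardware-generated ACK yields a completion timestamp, and subtracting the segment's own serialization time leaves the propagation-plus-queueing delay that serves as the congestion signal.

```cpp
#include <cstdint>

// Illustrative sketch: one RTT sample per (possibly multi-packet) segment.
struct SegmentTimestamps {
    uint64_t t_send_ns;        // when the segment was posted to the NIC
    uint64_t t_completion_ns;  // when the hardware-generated ACK completed
    uint32_t segment_bytes;    // size of the segment on the wire
};

// NIC line rate, expressed in bytes per nanosecond (10 Gbps assumed here).
constexpr double kLineRateBytesPerNs = 10e9 / 8.0 / 1e9;  // 1.25 B/ns

// RTT sample = completion time - send time - serialization delay of the
// segment itself; what remains is propagation plus switch queueing.
inline double RttSampleNs(const SegmentTimestamps& ts) {
    double serialization_ns = ts.segment_bytes / kLineRateBytesPerNs;
    return static_cast<double>(ts.t_completion_ns - ts.t_send_ns) - serialization_ns;
}
```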
In this paper, we show that delay-based congestion control provides excellent performance in the datacenter. Our key contributions include:
- We experimentally demonstrate how multi-bit RTT signals measured with NIC hardware are strongly correlated with network queueing.
- We present Transport Informed by MEasurement of LatencY (TIMELY): an RTT-based congestion control scheme. TIMELY uses rate control and is designed to work with NIC offload of multi-packet segments for high performance. Unlike earlier schemes [16,46], we do not build the queue to a fixed RTT threshold. Instead, we use the rate of RTT variation, or the gradient, to predict the onset of congestion and hence keep the delays low while delivering high throughput (see the sketch after this list).
- We evaluate TIMELY with an OS-bypass messaging implementation using hundreds of machines on a Clos network topology. Turning on TIMELY for RDMA transfers on a fabric with PFC (Priority Flow Control) lowers 99th-percentile tail latency by 9X. This tail latency is 13X lower than that of DCTCP running in an optimized kernel.
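A minimal sketch of the gradient-based rate update, to make the idea in the second bullet concrete (the constants, struct, and function names are illustrative assumptions, not the paper's code; the full algorithm also uses low/high RTT thresholds and a hyperactive-increase mode that are omitted here):

```cpp
#include <algorithm>

// Illustrative sketch of a gradient-based rate update in the spirit of TIMELY.
struct TimelyState {
    double rate_bps;      // current sending rate
    double prev_rtt_us;   // previous RTT sample
    double rtt_diff_us;   // smoothed (EWMA) RTT difference
};

constexpr double kAlpha = 0.5;        // EWMA weight for RTT differences (assumed)
constexpr double kBeta = 0.8;         // multiplicative decrease factor (assumed)
constexpr double kDeltaBps = 10e6;    // additive increase step (assumed)
constexpr double kMinRttUs = 20.0;    // base propagation RTT for normalization (assumed)
constexpr double kLineRateBps = 10e9; // NIC line rate cap

void OnRttSample(TimelyState& s, double new_rtt_us) {
    // Smooth per-sample RTT differences, then normalize by the base RTT
    // so the gradient is dimensionless.
    double new_diff = new_rtt_us - s.prev_rtt_us;
    s.prev_rtt_us = new_rtt_us;
    s.rtt_diff_us = (1.0 - kAlpha) * s.rtt_diff_us + kAlpha * new_diff;
    double gradient = s.rtt_diff_us / kMinRttUs;

    if (gradient <= 0.0) {
        // RTT flat or falling: queues are draining, probe additively for bandwidth.
        s.rate_bps = std::min(s.rate_bps + kDeltaBps, kLineRateBps);
    } else {
        // RTT rising: queues are building, back off in proportion to the gradient
        // (clamped so the rate stays positive).
        s.rate_bps *= (1.0 - kBeta * std::min(gradient, 1.0));
    }
}
```

The point the sketch tries to capture is the design choice named above: the controller reacts to how fast the RTT is changing rather than to its absolute value, so it can back off as queues start to build instead of waiting for them to reach a fixed delay threshold.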