Data Center TCP (DCTCP)
ABSTRACT
Cloud data centers host diverse applications, mixing workloads that require small predictable latency with others requiring large sustained throughput. In this environment, today’s state-of-the-art TCP protocol falls short. We present measurements of a 6000-server production cluster and reveal impairments that lead to high application latencies, rooted in TCP’s demands on the limited buffer space available in data center switches. For example, bandwidth-hungry “background” flows build up queues at the switches, and thus impact the performance of latency-sensitive “foreground” traffic.
To address these problems, we propose DCTCP, a TCP-like protocol for data center networks. DCTCP leverages Explicit Congestion Notification (ECN) in the network to provide multi-bit feedback to the end hosts. We evaluate DCTCP at 1 and 10Gbps speeds using commodity, shallow-buffered switches. We find DCTCP delivers the same or better throughput than TCP, while using 90% less buffer space. Unlike TCP, DCTCP also provides high burst tolerance and low latency for short flows. In handling workloads derived from operational measurements, we find DCTCP enables the applications to handle 10X the current background traffic, without impacting foreground traffic. Further, a 10X increase in foreground traffic does not cause any timeouts, thus largely eliminating incast problems.
INTRODUCTION
In recent years, data centers have transformed computing, with large scale consolidation of enterprise IT into data center hubs, and with the emergence of cloud computing service providers like Amazon, Microsoft and Google. A consistent theme in data center design has been to build highly available, highly performant computing and storage infrastructure using low cost, commodity components [16]. A corresponding trend has also emerged in data center networks. In particular, low-cost switches are common at the top of the rack, providing up to 48 ports at 1Gbps at a price point under $2000, roughly the price of one data center server. Several recent research proposals envision creating economical, easy-to-manage data centers using novel architectures built atop these commodity switches [2, 12, 15].
Is this vision realistic? The answer depends in large part on how well the commodity switches handle the traffic of real data center applications. In this paper, we focus on soft real-time applications, supporting web search, retail, advertising, and recommendation systems that have driven much data center construction. These applications generate a diverse mix of short and long flows, and require three things from the data center network: low latency for short flows, high burst tolerance, and high utilization for long flows.
The first two requirements stem from the Partition/Aggregate (described in § 2.1) workflow pattern that many of these applications use. The near real-time deadlines for end results translate into latency targets for the individual tasks in the workflow. These targets vary from ∼10ms to ∼100ms, and tasks not completed before their deadline are cancelled, affecting the final result. Thus, application requirements for low latency directly impact the quality of the result returned and thus revenue. Reducing network latency allows application developers to invest more cycles in the algorithms that improve relevance and end user experience.
The third requirement, high utilization for large flows, stems from the need to continuously update internal data structures of these applications, as the freshness of the data also affects the quality of the results. Thus, high throughput for these long flows is as essential as low latency and burst tolerance.
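The deadline-driven cancellation in this pattern is easy to picture in code. Below is a minimal sketch of one Partition/Aggregate stage, assuming a hypothetical `do_work()` worker and a 50ms deadline chosen from the 10-100ms range above; tasks that miss the deadline are dropped, so the answer is degraded rather than delayed.

```python
# Minimal sketch of a Partition/Aggregate stage. The worker body,
# partition count, and 50 ms deadline are illustrative assumptions.
import concurrent.futures as cf
import random
import time

DEADLINE = 0.05  # 50 ms per-task deadline, within the 10-100 ms range above

def do_work(part_id: int) -> str:
    """Hypothetical worker answering its partition of a query."""
    time.sleep(random.uniform(0.01, 0.08))  # some workers will miss the deadline
    return f"partial-{part_id}"

def aggregate(num_partitions: int = 40) -> list:
    pool = cf.ThreadPoolExecutor(max_workers=num_partitions)
    futures = [pool.submit(do_work, i) for i in range(num_partitions)]
    # Wait only until the deadline; late tasks are cancelled, so the
    # result is computed from fewer partial answers instead of arriving late.
    done, not_done = cf.wait(futures, timeout=DEADLINE)
    pool.shutdown(wait=False, cancel_futures=True)  # Python 3.9+
    return [f.result() for f in done]

if __name__ == "__main__":
    partials = aggregate()
    print(f"aggregated {len(partials)} of 40 partial results in time")
```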
In this paper, we make two major contributions. First, we measure and analyze production traffic (>150TB of compressed data), collected over the course of a month from ∼6000 servers (§ 2), extracting application patterns and needs (in particular, low latency needs) from data centers whose network is composed of commodity switches. Impairments that hurt performance are identified, and linked to properties of the traffic and the switches.
Second, we propose Data Center TCP (DCTCP), which addresses these impairments to meet the needs of applications (§ 3). DCTCP uses Explicit Congestion Notification (ECN), a feature already available in modern commodity switches. We evaluate DCTCP at 1 and 10Gbps speeds on ECN-capable commodity switches (§ 4). We find DCTCP successfully supports 10X increases in application foreground and background traffic in our benchmark studies.
The measurements reveal that 99.91% of traffic in our data center is TCP traffic. The traffic consists of query traffic (2KB to 20KB in size), delay sensitive short messages (100KB to 1MB), and throughput sensitive long flows (1MB to 100MB). The query traffic experiences the incast impairment, discussed in [32, 13] in the context of storage networks. However, the data also reveal new impairments unrelated to incast. Query and delay-sensitive short messages experience long latencies due to long flows consuming some or all of the available buffer in the switches. Our key learning from these measurements is that to meet the requirements of such a diverse mix of short and long flows, switch buffer occupancies need to be persistently low, while maintaining high throughput for the long flows. DCTCP is designed to do exactly this.
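For readers who want to reproduce this mix in a testbed, a sketch of a synthetic flow-size generator follows. The size ranges come from the measurements above; the class weights are our own illustrative assumption, since the paper's flow-count split is not restated here.

```python
# Hedged sketch of a synthetic workload mirroring the measured mix:
# queries (2-20 KB), delay-sensitive short messages (100 KB-1 MB), and
# throughput-sensitive long flows (1-100 MB). The weights are assumptions.
import random

FLOW_CLASSES = {
    # class name: (min bytes, max bytes), from the measurements above
    "query": (2_000, 20_000),
    "short": (100_000, 1_000_000),
    "long":  (1_000_000, 100_000_000),
}

def sample_flow(rng: random.Random) -> tuple:
    # Assumed class weights; tune to match a target flow-count split.
    name = rng.choices(["query", "short", "long"], weights=[0.5, 0.3, 0.2])[0]
    lo, hi = FLOW_CLASSES[name]
    return name, rng.randint(lo, hi)

rng = random.Random(42)
print([sample_flow(rng) for _ in range(5)])
```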
DCTCP combines Explicit Congestion Notification (ECN) with a novel control scheme at the sources. It extracts multi-bit feedback on congestion in the network from the single-bit stream of ECN marks. Sources estimate the fraction of marked packets, and use that estimate as a signal for the extent of congestion. This allows DCTCP to operate with very low buffer occupancies while still achieving high throughput. Figure 1 illustrates the effectiveness of DCTCP in achieving full throughput while taking up a very small footprint in the switch packet buffer, as compared to TCP.
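A minimal sketch of this source-side scheme follows, assuming per-ACK feedback and packet-counted windows for simplicity. Per the DCTCP paper, the source keeps an EWMA estimate α of the marked fraction and cuts its window in proportion to α rather than by half; the gain g and initial values here are illustrative.

```python
# Sketch of DCTCP's source-side control law: recover a multi-bit
# congestion signal from single-bit ECN marks, then cut the window
# in proportion to the estimated extent of congestion.
G = 1.0 / 16  # EWMA gain g for the marked-fraction estimate (illustrative)

class DctcpSender:
    def __init__(self, cwnd: float = 10.0):
        self.cwnd = cwnd   # congestion window, in packets
        self.alpha = 0.0   # running estimate of congestion extent
        self.acked = 0     # packets acked in the current window
        self.marked = 0    # of those, how many carried ECN marks

    def on_ack(self, ecn_marked: bool) -> None:
        self.acked += 1
        if ecn_marked:
            self.marked += 1
        if self.acked >= self.cwnd:  # roughly one window of feedback
            frac = self.marked / self.acked
            # Multi-bit signal recovered from single-bit marks:
            self.alpha = (1 - G) * self.alpha + G * frac
            if self.marked:
                # Cut in proportion to congestion extent, not by half:
                self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
            else:
                self.cwnd += 1  # standard additive increase
            self.acked = self.marked = 0
```

When few packets are marked, α stays near zero and the window barely shrinks; when every packet is marked, α approaches 1 and the sender halves its window, just as TCP would.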
While designing DCTCP, a key requirement was that it be implementable with mechanisms in existing hardware — meaning our evaluation can be conducted on physical hardware, and the solution can be deployed to our data centers. Thus, we did not consider solutions such as RCP [6], which are not implemented in any commercially-available switches.
We stress that DCTCP is designed for the data center environment. In this paper, we make no claims about suitability of DCTCP for wide area networks. The data center environment [19] is significantly different from wide area networks. For example, round trip times (RTTs) can be less than 250µs, in the absence of queuing. Applications simultaneously need extremely high bandwidths and very low latencies. Often, there is little statistical multiplexing: a single flow can dominate a particular path. At the same time, the data center environment offers certain luxuries. The network is largely homogeneous and under a single administrative control. Thus, backward compatibility, incremental deployment and fairness to legacy protocols are not major concerns. Connectivity to the external Internet is typically managed through load balancers and application proxies that effectively separate internal traffic from external, so issues of fairness with conventional TCP are irrelevant.
We do not address the question of how to apportion data center bandwidth between internal and external (at least one end point outside the data center) flows. The simplest class of solutions involve using Ethernet priorities (Class of Service) to keep internal and external flows separate at the switches, with ECN marking in the data center carried out strictly for internal flows.
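One way this separation can look from the host side is sketched below: internal connections get a distinct DSCP so the switches can map them to their own class-of-service queue, where ECN marking is enabled. The internal prefix and the DSCP value are illustrative assumptions, not from the paper.

```python
# Hedged sketch: tag sockets of internal (intra-data-center) flows with
# a distinct DSCP so switches can place them in a separate CoS queue
# with ECN marking enabled. Prefix and DSCP value are assumptions.
import ipaddress
import socket

INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")  # assumed internal prefix
INTERNAL_DSCP = 0x08                               # assumed class, e.g. CS1

def connect_with_cos(host: str, port: int) -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    addr = ipaddress.ip_address(socket.gethostbyname(host))
    if addr in INTERNAL_NET:
        # DSCP occupies the upper six bits of the IP TOS byte.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, INTERNAL_DSCP << 2)
    sock.connect((host, port))
    return sock
```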
The TCP literature is vast, and there are two large families of congestion control protocols that attempt to control queue lengths: (i) Delay-based protocols use increases in RTT measurements as a sign of growing queueing delay, and hence of congestion. These protocols rely heavily on accurate RTT measurement, which is susceptible to noise in the very low latency environment of data centers. Small noisy fluctuations of latency become indistinguishable from congestion and the algorithm can over-react. (ii) Active Queue Management (AQM) approaches use explicit feedback from congested switches. The algorithm we propose is in this family.
Having measured and analyzed the traffic in the cluster and associated impairments in depth, we find that DCTCP provides all the benefits we seek. DCTCP requires only 30 lines of code change to TCP, and the setting of a single parameter on the switches.
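The single switch parameter referenced above is, per the paper, the ECN marking threshold. A sketch of that marking rule follows; the threshold K and the packet/queue representation are illustrative assumptions. Note that marking depends on the instantaneous queue length, which is what lets congestion feedback reach sources quickly.

```python
# Sketch of the switch-side behavior the "single parameter" configures:
# set the ECN Congestion Experienced bit on any arriving packet when the
# instantaneous queue exceeds a threshold K. K and Packet are assumptions.
from collections import deque
from dataclasses import dataclass, field

K = 20  # marking threshold in packets (assumed; tuned per link speed)

@dataclass
class Packet:
    payload: bytes
    ecn_ce: bool = False  # Congestion Experienced codepoint

@dataclass
class EcnQueue:
    queue: deque = field(default_factory=deque)

    def enqueue(self, pkt: Packet) -> None:
        # Instantaneous (not averaged) queue length drives marking,
        # so sources learn about congestion with minimal delay.
        if len(self.queue) >= K:
            pkt.ecn_ce = True
        self.queue.append(pkt)

    def dequeue(self) -> Packet:
        return self.queue.popleft()
```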
TL;DR
- Core problem: Data centers are widely built from cheap commodity switches with very small buffers. When traffic mixes latency-sensitive short flows (e.g., web requests) with throughput-hungry long flows (e.g., data synchronization), conventional TCP lets the long flows fill the switch buffers, inflicting severe delays on the short flows and hurting application performance.
- Solution and results: Building on measurements of real data center traffic, the paper proposes a new protocol, DCTCP. DCTCP exploits the ECN (Explicit Congestion Notification) capability that switches already have to gauge the extent of congestion more precisely, cutting switch buffer occupancy by 90% while preserving high throughput. Experiments show DCTCP effectively resolves the latency problem, supports a 10X traffic increase, and is easy to deploy on existing hardware.