Communication in Data Centers¶
To understand the challenges facing data center transport protocols, we first describe a common application structure, Partition/Aggregate, that motivates why latency is a critical metric in data centers. We then measure the synchronized and bursty traffic patterns that result from this application structure, and identify three performance impairments these patterns cause.
Partition/Aggregate¶
The Partition/Aggregate design pattern shown in Figure 2 is the foundation of many large scale web applications. Requests from higher layers of the application are broken into pieces and farmed out to workers in lower layers. The responses of these workers are aggregated to produce a result. Web search, social network content composition, and advertisement selection are all based around this application design pattern. For interactive, soft-real-time applications like these, latency is the key metric, with total permissible latency being determined by factors including customer impact studies [21]. After subtracting typical Internet and rendering delays, the “backend” part of the application is typically allocated between 230-300ms. This limit is called an all-up SLA.
Many applications have a multi-layer partition/aggregate pattern workflow, with lags at one layer delaying the initiation of others. Further, answering a request may require iteratively invoking the pattern, with an aggregator making serial requests to the workers below it to prepare a response (1 to 4 iterations are typical, though as many as 20 may occur). For example, in web search, a query might be sent to many aggregators and workers, each responsible for a different part of the index. Based on the replies, an aggregator might refine the query and send it out again to improve the relevance of the result. Lagging instances of partition/aggregate can thus add up to threaten the all-up SLAs for queries. Indeed, we found that latencies run close to SLA targets, as developers exploit all of the available time budget to compute the best result possible.
To prevent the all-up SLA from being violated, worker nodes are assigned tight deadlines, usually on the order of 10-100ms. When a node misses its deadline, the computation continues without that response, lowering the quality of the result. Further, high percentiles for worker latencies matter. For example, high latencies at the 99.9th percentile mean lower quality results or long lags (or both) for at least 1 in 1000 responses, potentially impacting large numbers of users who then may not come back. Therefore, latencies are typically tracked to 99.9th percentiles, and deadlines are associated with high percentiles. Figure 8 shows a screen shot from a production monitoring tool, tracking high percentiles.
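To make the deadline mechanics concrete, here is a minimal sketch of deadline-bounded aggregation: the aggregator fans a query out to its workers, waits at most the deadline, and aggregates whatever has arrived, abandoning late responses. It is an illustration only; the function names, the 50ms deadline, and the simulated latency tail are our assumptions, not the production design.

```python
import asyncio
import random

async def query_worker(worker_id: int, query: str) -> str:
    """Stand-in for a worker; real workers search their shard of the index."""
    # Simulated compute + network time; occasionally slow (the latency tail).
    await asyncio.sleep(random.choice([0.005] * 99 + [0.4]))
    return f"result-from-{worker_id}"

async def aggregate(query: str, n_workers: int = 43, deadline_s: float = 0.050) -> list[str]:
    """Fan the query out and keep only the responses that beat the deadline."""
    tasks = [asyncio.create_task(query_worker(w, query)) for w in range(n_workers)]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for task in pending:      # late workers are abandoned rather than waited for,
        task.cancel()         # lowering result quality but protecting the all-up SLA
    return [task.result() for task in done]

if __name__ == "__main__":
    results = asyncio.run(aggregate("example query"))
    print(f"aggregated {len(results)}/43 responses before the deadline")
```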
With such tight deadlines, network delays within the data center play a significant role in application design. Many applications find it difficult to meet these deadlines using state-of-the-art TCP, so developers often resort to complex, ad-hoc solutions. For example, our application carefully controls the amount of data each worker sends and adds jitter. Facebook, reportedly, has gone to the extent of developing their own UDP-based congestion control [29].
Workload Characterization¶
We next measure the attributes of workloads in three production clusters related to web search and other services. The measurements serve to illuminate the nature of data center traffic, and they provide the basis for understanding why TCP behaves poorly and for the creation of benchmarks for evaluating DCTCP.
We instrumented a total of over 6000 servers in over 150 racks. The three clusters support soft real-time query traffic, integrated with urgent short message traffic that coordinates the activities in the cluster and continuous background traffic that ingests and organizes the massive data needed to sustain the quality of the query responses. We use these terms for ease of explanation and analysis; the developers do not separate flows into simple classes. The instrumentation passively collects socket-level logs, selected packet-level logs, and application-level logs describing latencies – a total of about 150TB of compressed data over the course of a month.
Each rack in the clusters holds 44 servers. Each server connects to a Top of Rack switch (ToR) via 1Gbps Ethernet. The ToRs are shallow buffered, shared-memory switches; each with 4MB of buffer shared among 48 1Gbps ports and two 10Gbps ports.
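As a rough back-of-the-envelope check (our arithmetic, not a figure from the paper), statically splitting that shared buffer across the 1Gbps ports would leave each port with roughly as much memory as a single synchronized query burst occupies, which is why the dynamic buffer sharing described below matters:

```python
# Back-of-envelope comparison (our arithmetic, not the paper's measurements).
shared_buffer_kb = 4 * 1024                 # 4MB of shared packet buffer per ToR
per_port_share_kb = shared_buffer_kb / 48   # ~85KB if split evenly across 1Gbps ports
query_burst_kb = 43 * 2                     # 43 workers x ~2KB responses = ~86KB burst
print(per_port_share_kb, query_burst_kb)    # a single burst is comparable to a static "fair share"
```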
Query Traffic. Query traffic in the clusters follows the Partition/Aggregate pattern. The query traffic consists of very short, latency-critical flows, with the following pattern. A high-level aggregator (HLA) partitions queries to a large number of mid-level aggregators (MLAs) that in turn partition each query over the 43 other servers in the same rack as the MLA. Servers act as both MLAs and workers, so each server will be acting as an aggregator for some queries at the same time it is acting as a worker for other queries. Figure 3(a) shows the CDF of time between arrivals of queries at mid-level aggregators. The size of the query flows is extremely regular, with queries from MLAs to workers being 1.6KB and responses from workers to MLAs being 1.6 to 2KB.
Background Traffic. Concurrent with the query traffic is a complex mix of background traffic, consisting of both large and small flows. Figure 4 presents the PDF of background flow size, illustrating how most background flows are small, but most of the bytes in background traffic are part of large flows. Key among background flows are large, 1MB to 50MB, update flows that copy fresh data to the workers and time-sensitive short message flows, 50KB to 1MB in size, that update control state on the workers. Figure 3(b) shows the time between arrival of new background flows. The inter-arrival time between background flows reflects the superposition and diversity of the many different services supporting the application:
(1) the variance in interarrival time is very high, with a very heavy tail;
(2) embedded spikes occur, for example the 0ms inter-arrivals that explain the CDF hugging the y-axis up to the 50th percentile;
(3) relatively large numbers of outgoing flows occur periodically, resulting from workers periodically polling a number of peers looking for updated files.
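Because these measurements later drive the benchmarks used to evaluate DCTCP, it may help to see how such a characterization can be turned into a synthetic workload. The sketch below is ours: the log-normal and Pareto stand-ins, their parameters, and the 50% spike probability are illustrative assumptions, whereas a faithful generator would sample the measured CDFs of Figures 3 and 4 directly.

```python
import random

def query_response_size() -> int:
    """Query responses are extremely regular: 1.6KB to 2KB per worker."""
    return random.randint(1600, 2000)

def background_flow_size() -> int:
    """Mostly small flows, but most bytes in large 1MB-50MB update flows
    (a log-normal stand-in for the measured PDF in Figure 4)."""
    size = int(random.lognormvariate(mu=9.0, sigma=2.5))
    return min(max(size, 1_000), 50_000_000)

def background_interarrival_s() -> float:
    """Heavy-tailed inter-arrivals with occasional 0ms spikes (Figure 3(b))."""
    if random.random() < 0.5:
        return 0.0                           # embedded spikes: batched flow starts
    return random.paretovariate(1.2) * 1e-3  # heavy tail of inter-arrival times

if __name__ == "__main__":
    sizes = [background_flow_size() for _ in range(100_000)]
    print("median background flow:", sorted(sizes)[len(sizes) // 2], "bytes")
```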
Flow Concurrency and Size. Figure 5 presents the CDF of the number of flows a MLA or worker node participates in concurrently (defined as the number of flows active during a 50ms window). When all flows are considered, the median number of concurrent flows is 36, which results from the breadth of the Partition/Aggregate traffic pattern in which each server talks to 43 other servers. The 99.99th percentile is over 1,600, and there is one server with a median of 1,200 connections.
When only large flows (> 1MB) are considered, the degree of statistical multiplexing is very low — the median number of concurrent large flows is 1, and the 75th percentile is 2. Yet, these flows are large enough that they last several RTTs, and can consume significant buffer space by causing queue buildup.
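The concurrency metric is easy to misread, so here is a small sketch of how it can be computed from per-node flow records: a flow counts toward a 50ms window if its active interval overlaps that window. The record format and helper name are ours, not the instrumentation's.

```python
from typing import Iterable

def concurrent_flows(flows: Iterable[tuple[float, float]],
                     t: float, window: float = 0.050) -> int:
    """Number of flows active at any point during the 50ms window [t, t+window).

    Each flow is a (start_time, end_time) pair in seconds, e.g. taken from the
    socket-level logs of one MLA/worker node.
    """
    lo, hi = t, t + window
    return sum(1 for start, end in flows if start < hi and end >= lo)

# Example: three flows, of which two overlap the window starting at t = 1.00s.
flows = [(0.90, 1.02), (1.01, 1.20), (1.06, 1.08)]
print(concurrent_flows(flows, t=1.00))  # -> 2
```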
In summary, throughput-sensitive large flows, delay sensitive short flows and bursty query traffic, co-exist in a data center network. In the next section, we will see how TCP fails to satisfy the performance requirements of these flows.
Understanding Performance Impairments¶
We found that to explain the performance issues seen in the production cluster, we needed to study the interaction between the long and short flows in the cluster and the ways flows interact with the switches that carry the traffic.
Switches¶
Like most commodity switches, the switches in these clusters are shared memory switches that aim to exploit statistical multiplexing gain through use of logically common packet buffers available to all switch ports. Packets arriving on an interface are stored into a high speed multi-ported memory shared by all the interfaces. Memory from the shared pool is dynamically allocated to a packet by an MMU. The MMU attempts to give each interface as much memory as it needs while preventing unfairness [1] by dynamically adjusting the maximum amount of memory any one interface can take. If a packet must be queued for an outgoing interface, but the interface has hit its maximum memory allocation or the shared pool itself is depleted, then the packet is dropped. Building large multi-ported memories is very expensive, so most cheap switches are shallow buffered, with packet buffer being the scarcest resource. The shallow packet buffers cause three specific performance impairments, which we discuss next.
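A toy model of this buffering scheme is useful background for the three impairments that follow. The sketch below is a simplification under our own assumptions: one shared pool, and a dynamic per-port cap set to a fraction of the currently free memory (real MMU policies differ in their details).

```python
class SharedBufferSwitch:
    """Toy model of a shallow-buffered, shared-memory ToR switch.

    All egress queues draw from one pool; the MMU caps any one port at a
    fraction (alpha) of the memory that is currently unused, so a busy port
    cannot starve the others.  Policy details are assumed, not measured.
    """
    def __init__(self, pool_bytes: int = 4 * 1024 * 1024, alpha: float = 0.5):
        self.pool = pool_bytes
        self.alpha = alpha
        self.used = 0
        self.port_used: dict[int, int] = {}

    def enqueue(self, port: int, pkt_bytes: int) -> bool:
        """Return True if the packet is buffered, False if it is dropped."""
        free = self.pool - self.used
        port_cap = self.alpha * free          # dynamic per-port limit
        if pkt_bytes > free or self.port_used.get(port, 0) + pkt_bytes > port_cap:
            return False                      # port over its cap, or pool exhausted
        self.port_used[port] = self.port_used.get(port, 0) + pkt_bytes
        self.used += pkt_bytes
        return True

    def dequeue(self, port: int, pkt_bytes: int) -> None:
        self.port_used[port] -= pkt_bytes
        self.used -= pkt_bytes
```

The same model is reused at the end of this section to illustrate buffer pressure.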
Incast¶
As illustrated in Figure 6(a), if many flows converge on the same interface of a switch over a short period of time, the packets may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses. This can occur even if the flow sizes are small. This traffic pattern arises naturally from use of the Partition/Aggregate design pattern, as the request for data synchronizes the workers’ responses and creates incast [32] at the queue of the switch port connected to the aggregator.
The incast research published to date [32, 13] involves carefully constructed test lab scenarios. We find that incast-like problems do happen in production environments and they matter — degrading both performance and, more importantly, user experience. The problem is that a response that incurs incast will almost certainly miss the aggregator deadline and be left out of the final results.
We capture incast instances via packet-level monitoring. Figure 7 shows the timeline of an observed instance. Since the size of each individual response in this application is only 2KB (2 packets), loss of a packet almost invariably results in a TCP timeout. In our network stack, RTOmin is set to 300ms. Thus, whenever a timeout occurs, that response almost always misses the aggregator's deadline.
Developers have made two major changes to the application code to avoid timeouts on worker responses. First, they deliberately limited the size of the response to 2KB to improve the odds that all the responses will fit in the memory of the switch. Second, the developers added application-level jittering [11] to desynchronize the responses by deliberately delaying them by a random amount of time (typically with a mean value of 10ms). The problem with jittering is that it reduces the response time at higher percentiles (by avoiding timeouts) at the cost of increasing the median response time (due to added delay). This is vividly illustrated in Figure 8.
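For concreteness, application-level jittering amounts to something like the sketch below, run by each worker just before it sends its response. The exponential distribution and the helper name are our choices for illustration; the paper says only that the delay is random with a mean of roughly 10ms.

```python
import random
import time

def send_with_jitter(send_response, mean_delay_s: float = 0.010) -> None:
    """Desynchronize worker responses by sleeping a random amount before sending.

    Trades a worse median (every response now carries extra delay) for a better
    tail (fewer synchronized bursts, hence fewer incast timeouts).
    """
    time.sleep(random.expovariate(1.0 / mean_delay_s))  # random delay, mean ~10ms
    send_response()
```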
Proposals to decrease RTOmin reduce the impact of timeouts [32], but, as we show next, these proposals do not address other important sources of latency.
RTO
RTO: Retransmission TimeOut
Meaning: in TCP, the sender starts a timer when it transmits a packet; the RTO is the longest it will wait for the receiver's acknowledgment (ACK)
- If the ACK arrives within that time, everything is fine and the timer is canceled
- If the timer expires (a "timeout") before an ACK arrives, the sender assumes the packet was lost in the network and retransmits it
Queue buildup¶
Long-lived, greedy TCP flows will cause the length of the bottleneck queue to grow until packets are dropped, resulting in the familiar sawtooth pattern (Figure 1). When long and short flows traverse the same queue, as shown in Figure 6(b), two impairments occur. First, packet loss on the short flows can cause incast problems as described above. Second, there is a queue buildup impairment: even when no packets are lost, the short flows experience increased latency as they are in queue behind packets from the large flows. Since every worker in the cluster handles both query traffic and background traffic (large flows needed to update the data structures on the workers), this traffic pattern occurs very frequently.
A closer look at Figure 7 shows that arrivals of the responses are distributed over ∼12ms. Since the total size of all responses is only 43 × 2KB = 86KB — roughly 1ms of transfer time at 1Gbps — it is surprising that there would be any incast losses in such transfers. However, the key issue is the occupancy of the queue caused by other flows - the background traffic - with losses occurring when the long flows and short flows coincide.
To establish that long flows impact the latency of query responses, we measured the RTT between the worker and the aggregator: this is the time between the worker sending its response and receiving a TCP ACK from the aggregator, labeled as “RTT+Queue” in Figure 7. We measured the intra-rack RTT to be approximately 100µs in the absence of queuing, while inter-rack RTTs are under 250µs. This means “RTT+Queue” is a good measure of the length of the packet queue headed to the aggregator during the times at which the aggregator is collecting responses. The CDF in Figure 9 is the distribution of queue length for 19K measurements. It shows that 90% of the time a response packet sees < 1ms of queueing, and 10% of the time it sees between 1 and 14ms of queuing (14ms is the maximum amount of dynamic buffer). This indicates that query flows are indeed experiencing queuing delays. Further, note that answering a request can require multiple iterations, which magnifies the impact of this delay.
Note that this delay is unrelated to incast. No packets are being lost, so reducing RTOmin will not help. Further, there need not even be many synchronized short flows. Since the latency is caused by queueing, the only solution is to reduce the size of the queues.
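A small sketch of how queueing delay can be backed out of the “RTT+Queue” samples: subtract the unloaded intra-rack RTT (about 100µs) from each measurement and inspect the percentiles. The sample values and helper names below are made up for illustration; the real distribution is the 19K-measurement CDF of Figure 9.

```python
def queueing_delay_ms(rtt_samples_ms: list[float], base_rtt_ms: float = 0.1) -> list[float]:
    """Estimate per-packet queueing delay as measured RTT minus the unloaded RTT."""
    return [max(rtt - base_rtt_ms, 0.0) for rtt in rtt_samples_ms]

def percentile(values: list[float], p: float) -> float:
    s = sorted(values)
    return s[min(int(p * len(s)), len(s) - 1)]

# Toy data: most responses see sub-millisecond queues, a minority see 1-14ms.
samples = [0.15] * 90 + [5.0, 8.0, 11.0, 14.0] * 2 + [2.0, 3.0]
queues = queueing_delay_ms(samples)
print(f"median queue: {percentile(queues, 0.50):.2f}ms, "
      f"90th percentile: {percentile(queues, 0.90):.2f}ms")
```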
Why does the text say that packet loss on short flows can cause the incast problems described above? Can packet loss on long flows cause incast?
The essence of the incast problem is that, under a many-to-one communication pattern, multiple synchronized response flows (short flows) fail to complete within the application's very tight deadline, so the upper-layer application declares the task failed or accepts a lower-quality result
Packet loss on a short flow triggers exactly this failure path:
- The incast scenario is a short-flow scenario: in the paper, incast arises in the query responses of the Partition/Aggregate pattern. The aggregator sends a request to 43 workers, and the 43 workers return their responses almost simultaneously. Each response flow is only about 2KB, a textbook short flow
- Short flows are fragile: a 2KB flow may contain only 2 packets. If either packet is lost, the flow cannot complete and must wait for a retransmission
- The catastrophic RTO timeout: after a loss, the TCP stack falls back on the retransmission timeout (RTO) timer. According to the paper, RTOmin in their data center is set to 300ms
- Yet how long is the application-level deadline for each response? The paper says 10-100ms
- The core mismatch: RTO (300ms) >> deadline (10-100ms)
- This means that as soon as a short flow loses a single packet, it must wait out a long TCP timeout (300ms) before retransmitting, by which point it has long since missed its application deadline (say, 50ms)
- To the aggregator, that response is therefore effectively lost for good. It will not keep waiting; it produces a lower-quality result from the responses it has already received
“Packet loss on short flows” is thus the direct trigger of the incast problem. The loss itself is not the disaster; the disaster is that the loss triggers a TCP timeout far longer than the application deadline, which all but guarantees failure at the application level
To recap: what is the incast problem?
Many short flows converge on one receiver, the upper-layer application cannot afford to wait, and the quality of service it delivers drops noticeably
What is the situation for long flows?
Long flows are among the culprits behind short-flow losses (through queue buildup and buffer pressure), but they are not themselves victims of incast
- Different goal: a long flow (e.g., a 1MB-50MB background update) aims for high throughput, not completion within tens of milliseconds. Its job is “move a lot of data in a reasonable time”, not “answer yes or no within 50ms”
- A single loss barely matters: a 50MB long flow contains thousands of packets. Losing one or two of them is a minor hiccup for the overall transfer
- Recovery is more efficient: for losses within a long flow, TCP usually does not rely on the long RTO. It has faster recovery mechanisms, notably fast retransmit
- When the sender receives 3 duplicate ACKs, it immediately infers that a packet was lost and retransmits it right away, without waiting for the 300ms RTO
- This lets long flows recover from losses relatively smoothly, even though throughput dips temporarily
- No deadline: a long flow has no hard “must finish within 50ms” constraint
- So even if some loss does trigger an RTO, it merely adds 300ms to the flow's total completion time
- For a flow that may take seconds or even tens of seconds to finish, that hardly matters
Buffer pressure¶
Given the mix of long and short flows in our data center, it is very common for short flows on one port to be impacted by activity on any of the many other ports, as depicted in Figure 6(c). Indeed, the loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports. The explanation is that activity on the different ports is coupled by the shared memory pool.
The long, greedy TCP flows build up queues on their interfaces. Since buffer space is a shared resource, the queue build up reduces the amount of buffer space available to absorb bursts of traffic from Partition/Aggregate traffic. We term this impairment buffer pressure. The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.
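Using the toy switch model sketched in the Switches subsection, buffer pressure is easy to reproduce: queue long-flow backlog on other ports, then offer a synchronized 43 × 2KB query burst to an otherwise idle port and count drops. The numbers are illustrative; the point is the mechanism, namely that backlog elsewhere shrinks the shared-pool headroom available to absorb a burst on an uninvolved port.

```python
# Assumes the SharedBufferSwitch sketch defined earlier in this section.
def burst_drops(switch: SharedBufferSwitch, port: int,
                n_responses: int = 43, pkt_bytes: int = 2000) -> int:
    """Offer one synchronized Partition/Aggregate burst and count lost responses."""
    return sum(0 if switch.enqueue(port, pkt_bytes) else 1
               for _ in range(n_responses))

idle = SharedBufferSwitch()
print("drops with idle switch:      ", burst_drops(idle, port=0))

pressured = SharedBufferSwitch()
for other_port in range(1, 10):                  # long-flow backlog on other ports
    while pressured.enqueue(other_port, 1500):   # fill each port up to its dynamic cap
        pass
print("drops under buffer pressure: ", burst_drops(pressured, port=0))
```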