Swift: Delay is Simple and Effective for Congestion Control in the Datacenter

ztex, Tony, Liu
3 min read · Mar 22, 2024


Background

Trends in storage workloads, host networking stacks, and datacenter switches drove Swift.

Storage Workloads:

  • Storage is the dominant workload for our datacenter networks.
  • Latency has become critical as cluster-wide storage systems have moved to faster media such as NVMe.
  • Tight network tail latency matters because a single storage operation touches multiple devices, and its overall latency is dictated by the slowest network operation, so low tail latency is required.

Host Networking Stacks:

  • Traditional congestion control runs as part of the host operating system, so it is limited by the OS's APIs and expensive to innovate on.
  • Newer stacks such as RDMA and NVMe are designed from the ground up for low-latency storage operations. To avoid operating system overheads, they are typically implemented in OS bypass stacks such as Snap [36] or offloaded to the NIC
  • Swift = a delay-based congestion control scheme that controls the issue rate of RDMA operations based on precise timestamp measurements, combined with OS bypass.

Datacenter Switches:

  • Heterogeneity is inevitable
  • DCTCP relies on switches to mark packets with ECN when the queue size crosses a threshold. Selecting an appropriate threshold and maintaining the configurations as line speeds and buffer sizes vary is challenging at scale.

Idea

A rising RTT implies more packets are sitting in queues, so we should ask senders to lower their sending rate by adjusting the window size.

Challenges:

  1. How do we measure delay accurately?
  2. How do we adjust cwnd in response?

Approach

  1. Measure end-to-end RTT with precise timestamps.
  2. Decompose RTT into fabric (network) delay and end-host delay, each governed by its own window.
  3. cwnd = min(fcwnd, ecwnd)
  4. Use AIMD to adjust both window sizes (a sketch follows this list).
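
Putting steps 1–4 together, here is a minimal Python sketch of the per-ACK update, assuming a simple AIMD rule. The names (aimd_update, on_ack) and all constants are illustrative guesses, not the exact algorithm or values from the paper.

```python
# Minimal sketch of the per-ACK update described above. Names and
# constants are illustrative assumptions, not the paper's exact values.

def aimd_update(cwnd, delay, target, ai=1.0, beta=0.8, max_mdf=0.5):
    """Additive increase below the target delay, multiplicative decrease above it."""
    if delay < target:
        # Additive increase: roughly +ai per RTT when cwnd >= 1.
        return cwnd + ai / max(cwnd, 1.0)
    # Multiplicative decrease, scaled by how far delay overshoots the target,
    # but never cutting the window by more than max_mdf in one step.
    return cwnd * max(1.0 - beta * (delay - target) / delay, 1.0 - max_mdf)

def on_ack(state, fabric_delay, endhost_delay, fabric_target, endhost_target):
    """Run AIMD independently on the fabric and end-host windows, then take the min."""
    state["fcwnd"] = aimd_update(state["fcwnd"], fabric_delay, fabric_target)
    state["ecwnd"] = aimd_update(state["ecwnd"], endhost_delay, endhost_target)
    state["cwnd"] = min(state["fcwnd"], state["ecwnd"])
    return state["cwnd"]

# Example: fabric delay above its target, end-host delay below its target.
state = {"fcwnd": 10.0, "ecwnd": 10.0, "cwnd": 10.0}
print(on_ack(state, fabric_delay=80e-6, endhost_delay=10e-6,
             fabric_target=50e-6, endhost_target=20e-6))  # -> 7.0 (fcwnd limits)
```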

Delay-based Congestion Control: AIMD

Because Swift runs inside Google's own datacenters, where the topology and hardware are known, the target delay can be derived rather than guessed.

Components of delay:
• propagation
• transmission (serialization)
• queueing
Target delay depends on
• path length
• number of concurrent flows

d_target = d_base + hops * d_perhop
• d_base: base delay (constant)
• d_perhop: per-hop delay (constant)
• hops: obtained from TTL

Queue sizes
• can increase with the number of flows
• How do we account for this in the delay target?
With N flows
• queue size increases as sqrt(N)
• but the sender cannot observe the queue size directly (it fluctuates rapidly in real time)
Scale the flow-dependent part of the target by 1/sqrt(cwnd)

• since cwnd is inversely proportional to N, 1/sqrt(cwnd) grows like sqrt(N), tracking the expected queue growth (see the sketch below)
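
A minimal sketch of the resulting target-delay computation, assuming placeholder constants and a clamped 1/sqrt(cwnd) flow-scaling term (the real parameters would be tuned per deployment):

```python
# Illustrative target-delay computation: a constant base, a per-hop term
# scaled by the hop count (recovered from TTL), and a flow-scaling term
# proportional to 1/sqrt(cwnd). All constants are placeholder assumptions.
import math

D_BASE = 25e-6     # base delay in seconds (assumed)
D_PERHOP = 5e-6    # per-hop delay in seconds (assumed)
FS_ALPHA = 50e-6   # flow-scaling coefficient in seconds (assumed)
FS_MAX = 100e-6    # cap on the flow-scaling term in seconds (assumed)

def target_delay(hops, cwnd):
    # With N flows sharing a queue, each flow's cwnd ~ 1/N, so
    # 1/sqrt(cwnd) ~ sqrt(N), matching the sqrt(N) queue growth.
    flow_scaling = min(FS_ALPHA / math.sqrt(max(cwnd, 1e-3)), FS_MAX)
    return D_BASE + hops * D_PERHOP + flow_scaling

print(target_delay(hops=5, cwnd=0.25))  # smaller cwnd -> larger target
```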

Incast

Question: Is cwnd = 1 small enough?

Consider the case below:

There are 1000 workers sending packets to an aggregator whose switch queue holds 500 packets.

Even if cwnd = 1, up to 1000 packets can arrive within a single RTT against only 500 slots of buffering, so there will still be packet drops.

DCTCP: cwnd = 1 is the best we can do

Swift: how about we allow cwnd < 1 (using pacing and a timer)? See the sketch after the list below.

  1. Pacing: track the time the last packet was sent and arm a timer for the next allowed send.
  2. When the timer goes off, send the next packet.
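
A minimal sketch of that pacing logic, assuming a hypothetical Pacer helper built on a monotonic clock:

```python
# Sketch of how cwnd < 1 can be honored with a pacing timer. The Pacer
# class and its timing details are assumptions for illustration; a real
# implementation would hook into the transport's send path.
import time

class Pacer:
    def __init__(self):
        self.last_send = float("-inf")

    def can_send(self, cwnd, rtt):
        if cwnd >= 1.0:
            return True            # normal window-based sending
        # With a fractional window, space sends rtt / cwnd apart, e.g.
        # cwnd = 0.5 and rtt = 50us -> one packet every 100us.
        gap = rtt / cwnd
        return time.monotonic() - self.last_send >= gap

    def record_send(self):
        self.last_send = time.monotonic()

# Usage: check the pacer before sending; otherwise re-arm a timer for the gap.
pacer = Pacer()
if pacer.can_send(cwnd=0.5, rtt=50e-6):
    pacer.record_send()            # send the packet here
```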

References:

  1. Snap: a Microkernel Approach to Host Networking, SOSP 2019.
