Swift: Delay is Simple and Effective for Congestion Control in the Datacenter

ztex, Tony, Liu
3 min read · Mar 22, 2024


Background

Trends in storage workloads, host networking stacks, and datacenter switches drove Swift.

Storage Workloads:

  • Storage is the dominant workload for our datacenter networks.
  • Latency has become critical as cluster-wide storage systems have moved to faster media such as NVMe.
  • Tight network tail latency matters because a single storage operation touches multiple devices, and its overall latency is dictated by the slowest network operation, so low tail latency is required.

Host Networking Stacks:

  • Traditional congestion control runs as part of the host operating system, so it is limited by the OS's APIs and expensive to innovate on.
  • Newer stacks such as RDMA and NVMe are designed from the ground up for low-latency storage operations. To avoid operating system overheads, they are typically implemented in OS bypass stacks such as Snap [36] or offloaded to the NIC
  • Swift = a delay-based congestion control scheme that controls the issue rate of RDMA operations based on precise timestamp measurements, combined with OS bypass.

Datacenter Switches:

  • Heterogeneity is inevitable
  • DCTCP relies on switches to mark packets with ECN when the queue size crosses a threshold. Selecting an appropriate threshold and maintaining the configurations as line speeds and buffer sizes vary is challenging at scale.

Idea

A rising RTT implies more packets are sitting in queues, so we should ask senders to lower their sending rate by adjusting the window size.

Challenges:

  1. How do we measure delay accurately?
  2. How do we adjust cwnd in response?

Approach

  1. Measure end-to-end RTT with precise timestamps.
  2. Decompose RTT into fabric (network) delay and end-host delay, each governed by its own window.
  3. cwnd = min(fcwnd, ecwnd)
  4. Use AIMD to adjust both window sizes (a sketch follows this list).
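
Putting steps 1–4 together, here is a minimal Python sketch of the per-ACK update, assuming a simple AIMD rule. The names (aimd_update, on_ack) and all constants are illustrative guesses, not the exact algorithm or values from the paper.

```python
# Minimal sketch of the per-ACK update described above. Names and
# constants are illustrative assumptions, not the paper's exact values.

def aimd_update(cwnd, delay, target, ai=1.0, beta=0.8, max_mdf=0.5):
    """Additive increase below the target delay, multiplicative decrease above it."""
    if delay < target:
        # Additive increase: roughly +ai per RTT when cwnd >= 1.
        return cwnd + ai / max(cwnd, 1.0)
    # Multiplicative decrease, scaled by how far delay overshoots the target,
    # but never cutting the window by more than max_mdf in one step.
    return cwnd * max(1.0 - beta * (delay - target) / delay, 1.0 - max_mdf)

def on_ack(state, fabric_delay, endhost_delay, fabric_target, endhost_target):
    """Run AIMD independently on the fabric and end-host windows, then take the min."""
    state["fcwnd"] = aimd_update(state["fcwnd"], fabric_delay, fabric_target)
    state["ecwnd"] = aimd_update(state["ecwnd"], endhost_delay, endhost_target)
    state["cwnd"] = min(state["fcwnd"], state["ecwnd"])
    return state["cwnd"]

# Example: fabric delay above its target, end-host delay below its target.
state = {"fcwnd": 10.0, "ecwnd": 10.0, "cwnd": 10.0}
print(on_ack(state, fabric_delay=80e-6, endhost_delay=10e-6,
             fabric_target=50e-6, endhost_target=20e-6))  # -> 7.0 (fcwnd limits)
```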

Delay-based Congestion Control: AIMD

Because Swift runs inside Google's own datacenters, where the topology and hardware are known, the target delay can be derived rather than guessed.

Components of delay:
• propagation
• transmission (serialization)
• queueing
Target delay depends on
• path length
• number of concurrent flows

d_target = d_base + hops * d_perhop
• d_base: base delay (constant)
• d_perhop: per-hop delay (constant)
• hops: obtained from TTL

Queue sizes
• can increase with the number of flows
• How do we account for this in the delay target?
With N flows
• queue size increases as sqrt(N)
• but the sender cannot observe the queue size directly (it fluctuates rapidly in real time)
Scale the flow-dependent part of the target by 1/sqrt(cwnd)

• since cwnd is inversely proportional to N, 1/sqrt(cwnd) grows like sqrt(N), tracking the expected queue growth (see the sketch below)
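
A minimal sketch of the resulting target-delay computation, assuming placeholder constants and a clamped 1/sqrt(cwnd) flow-scaling term (the real parameters would be tuned per deployment):

```python
# Illustrative target-delay computation: a constant base, a per-hop term
# scaled by the hop count (recovered from TTL), and a flow-scaling term
# proportional to 1/sqrt(cwnd). All constants are placeholder assumptions.
import math

D_BASE = 25e-6     # base delay in seconds (assumed)
D_PERHOP = 5e-6    # per-hop delay in seconds (assumed)
FS_ALPHA = 50e-6   # flow-scaling coefficient in seconds (assumed)
FS_MAX = 100e-6    # cap on the flow-scaling term in seconds (assumed)

def target_delay(hops, cwnd):
    # With N flows sharing a queue, each flow's cwnd ~ 1/N, so
    # 1/sqrt(cwnd) ~ sqrt(N), matching the sqrt(N) queue growth.
    flow_scaling = min(FS_ALPHA / math.sqrt(max(cwnd, 1e-3)), FS_MAX)
    return D_BASE + hops * D_PERHOP + flow_scaling

print(target_delay(hops=5, cwnd=0.25))  # smaller cwnd -> larger target
```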

Incast

Question: Is cwnd = 1 small enough?

Consider the case below:

There are 1000 workers sending packets to an aggregator whose switch queue holds 500 packets.

Even if cwnd = 1, up to 1000 packets can arrive within a single RTT against only 500 slots of buffering, so there will still be packet drops.

DCTCP: cwnd = 1 is the best we can do

Swift: how about we allow cwnd < 1 (using pacing and a timer)? See the sketch after the list below.

  1. Pacing: track the time the last packet was sent and arm a timer for the next allowed send.
  2. When the timer goes off, send the next packet.
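
A minimal sketch of that pacing logic, assuming a hypothetical Pacer helper built on a monotonic clock:

```python
# Sketch of how cwnd < 1 can be honored with a pacing timer. The Pacer
# class and its timing details are assumptions for illustration; a real
# implementation would hook into the transport's send path.
import time

class Pacer:
    def __init__(self):
        self.last_send = float("-inf")

    def can_send(self, cwnd, rtt):
        if cwnd >= 1.0:
            return True            # normal window-based sending
        # With a fractional window, space sends rtt / cwnd apart, e.g.
        # cwnd = 0.5 and rtt = 50us -> one packet every 100us.
        gap = rtt / cwnd
        return time.monotonic() - self.last_send >= gap

    def record_send(self):
        self.last_send = time.monotonic()

# Usage: check the pacer before sending; otherwise re-arm a timer for the gap.
pacer = Pacer()
if pacer.can_send(cwnd=0.5, rtt=50e-6):
    pacer.record_send()            # send the packet here
```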

References:

  1. Snap: a Microkernel Approach to Host Networking, SOSP 2019.
