DCTCP: Datacenter TCP

ztex, Tony, Liu
Mar 22, 2024


Background

  1. Merchant Silicon
  2. ECN: Explicit Congestion Notification (when the queue fills, the switch sets a bit in the IP header)
  3. Partition-Aggregate computations (need: low latency for interactive small transfers, high throughput for large transfers, and high burst tolerance)
  4. IETF standard ECN: (1) uses bits in the IP header; (2) a router can mark packets when it experiences congestion; (3) the sender cuts cwnd in half on a mark
  5. Incast

Problems

Shallow-buffered switches

  • Most cheap (merchant silicon) switches are shallow-buffered, with the packet buffer being the scarcest resource
  • Memory is shared across all ports and allocated dynamically depending on traffic
  • A large flow can grab memory, so packets of small flows might be dropped
  • When many flows converge on the same switch interface over a short period of time, their packets may exhaust the shared switch memory

Incast

Partition-Aggregate

The switch can run out of memory:

  • Responses may be dropped
  • Dropped responses are retransmitted only after RTO_min
  • The aggregator can miss its deadline

DCTCP Components

Marking at switch

if the instantaneous queue size is larger than K, then set the packet’s CE bit
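This marking rule can be sketched in a few lines (the threshold value is illustrative; the real switch applies this per arriving packet):

```python
K = 20  # marking threshold in packets (an assumed, illustrative value)

def should_mark_ce(instant_queue_len: int, k: int = K) -> bool:
    """DCTCP-style switch marking: set the packet's CE bit iff the
    *instantaneous* queue size exceeds the single threshold K."""
    return instant_queue_len > k
```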

Different from the RED algorithm:

  • RED tracks the average queue size
  • avg < lower bound: do nothing
  • lower bound ≤ avg ≤ upper bound: randomly mark
  • avg > upper bound: mark every packet
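For contrast, RED's average-queue marking described above can be sketched as follows (the thresholds and maximum marking probability are illustrative values, not from the paper):

```python
import random

MIN_TH, MAX_TH, MAX_P = 5.0, 15.0, 0.1  # lower/upper bounds, max mark probability

def red_mark(avg_queue: float) -> bool:
    """Classic RED marking decision based on the *average* queue size."""
    if avg_queue < MIN_TH:
        return False          # below lower bound: do nothing
    if avg_queue > MAX_TH:
        return True           # above upper bound: mark every packet
    # between bounds: mark with probability rising linearly toward MAX_P
    p = MAX_P * (avg_queue - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p
```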

why instantaneous instead of average:

  1. avoid packet loss at all costs by reacting to queue buildup immediately
  2. datacenter traffic has little statistical multiplexing, so an average-based signal lags behind the actual queue

Conveying CE Signals at the Receiver

  • Send CE bit with every ACK

The simplest way to do this is to ACK every packet, setting the ECN-Echo flag if and only if the packet has a marked CE codepoint.

However, using delayed ACKs is important for various reasons, including reducing the load on the data sender. To use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a trivial two-state state machine (Figure 10 in the paper) to determine whether to set the ECN-Echo bit.

Takeaway: We don’t want to ACK every packet (that would load the sender), so we send one cumulative ACK for every m consecutively received packets.
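A hypothetical sketch of that two-state receiver logic (the class name, return convention, and the default m are illustrative): the state remembers whether the last packet was CE-marked, an ACK fires immediately when the CE state flips (so the echo stays accurate), and otherwise one cumulative ACK goes out per m packets.

```python
class DctcpReceiver:
    def __init__(self, m: int = 2):
        self.m = m
        self.ce_state = False   # CE bit of the most recently seen packet
        self.unacked = 0        # packets received since the last ACK

    def on_packet(self, ce: bool):
        """Return (send_ack, echo_ce) for an arriving packet."""
        if ce != self.ce_state:
            # CE state flipped: immediately ACK the packets received so far,
            # echoing the *old* state, then switch states
            prev = self.ce_state
            self.ce_state = ce
            if self.unacked > 0:
                self.unacked = 1  # the current packet is still pending
                return True, prev
        self.unacked += 1
        if self.unacked >= self.m:
            self.unacked = 0
            return True, self.ce_state
        return False, self.ce_state
```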

Estimating Congestion Levels at Sender

Use an EWMA (exponentially weighted moving average) of the fraction of CE-marked packets to estimate the congestion level α.
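Concretely, once per window of data the sender updates α ← (1 − g)·α + g·F, where F is the fraction of packets CE-marked in that window and g is a small gain (the paper uses values like 1/16). A minimal sketch:

```python
def update_alpha(alpha: float, marked: int, total: int, g: float = 1.0 / 16) -> float:
    """EWMA update of the congestion estimate alpha.

    marked/total is F, the fraction of CE-marked packets in the last
    window of data; g is the EWMA gain.
    """
    f = marked / total
    return (1 - g) * alpha + g * f
```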

Reducing Congestion Window at the Sender
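Per the DCTCP paper, instead of always halving on congestion like standard TCP, the sender scales its window cut by the estimated congestion level: cwnd ← cwnd × (1 − α/2). With α = 1 (every packet marked) this reduces to TCP's halving; with small α the cut is gentle. A minimal sketch:

```python
def reduce_cwnd(cwnd: float, alpha: float) -> float:
    """DCTCP window reduction on receiving ECN-Echo feedback:
    scale the cut by the congestion estimate alpha."""
    return cwnd * (1 - alpha / 2)
```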

Benefits from DCTCP

Queue buildup

When the queue length on an interface exceeds K, DCTCP senders start reacting.

  • Reduces queueing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows.
  • More buffer space is available as headroom to absorb transient micro-bursts, greatly mitigating costly packet losses that can lead to timeouts.

Buffer pressure

DCTCP also solves the buffer pressure problem because a congested port’s queue length does not grow exceedingly large. Therefore, in shared memory switches, a few congested ports will not exhaust the buffer resources, harming flows passing through other ports.

Incast

  • A large number of synchronized small flows hitting the same queue is the most difficult scenario to handle
  • There isn’t much DCTCP — or any congestion control scheme that does not attempt to schedule traffic — can do to avoid packet drops

Takeaway: DCTCP alone cannot solve incast.
