DCTCP: Datacenter TCP

ztex, Tony, Liu
Mar 22, 2024


Background

  1. Merchant Silicon
  2. ECN: Explicit Congestion Notification (when the queue fills, the switch sets a bit in the IP header)
  3. Partition-Aggregate computations (need: low latency for interactive small transfers, high throughput for large transfers, and high burst tolerance)
  4. IETF standard ECN: (1) uses bits in the IP header; (2) a router can mark packets when it experiences congestion; (3) the sender cuts cwnd in half on a mark
  5. Incast

Problems

Shallow-buffered switches

  • Most cheap (merchant silicon) switches are shallow-buffered, with the packet buffer being the scarcest resource
  • Memory is shared across all ports and allocated dynamically depending on traffic
  • A large flow can grab memory, so packets of small flows might be dropped
  • When many flows converge on the same switch interface over a short period of time, their packets may exhaust the shared switch memory

Incast

Partition-Aggregate

The switch can run out of memory:

  • Responses may be dropped
  • Dropped responses are retransmitted only after RTO_min
  • The aggregator can miss its deadline

DCTCP Components

Marking at switch

if the instantaneous queue size is larger than K, then set the packet’s CE bit
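This marking rule can be sketched in a few lines (the threshold value is illustrative; the real switch applies this per arriving packet):

```python
K = 20  # marking threshold in packets (an assumed, illustrative value)

def should_mark_ce(instant_queue_len: int, k: int = K) -> bool:
    """DCTCP-style switch marking: set the packet's CE bit iff the
    *instantaneous* queue size exceeds the single threshold K."""
    return instant_queue_len > k
```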

Different from the RED algorithm:

  • RED tracks the average queue size
  • avg < lower bound: do nothing
  • lower bound ≤ avg ≤ upper bound: randomly mark
  • avg > upper bound: mark every packet
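For contrast, RED's average-queue marking described above can be sketched as follows (the thresholds and maximum marking probability are illustrative values, not from the paper):

```python
import random

MIN_TH, MAX_TH, MAX_P = 5.0, 15.0, 0.1  # lower/upper bounds, max mark probability

def red_mark(avg_queue: float) -> bool:
    """Classic RED marking decision based on the *average* queue size."""
    if avg_queue < MIN_TH:
        return False          # below lower bound: do nothing
    if avg_queue > MAX_TH:
        return True           # above upper bound: mark every packet
    # between bounds: mark with probability rising linearly toward MAX_P
    p = MAX_P * (avg_queue - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p
```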

why instantaneous instead of average:

  1. avoid packet loss at all costs by reacting to queue buildup immediately
  2. datacenter traffic has little statistical multiplexing, so an average-based signal lags behind the actual queue

Conveying CE Signals at the Receiver

  • Send CE bit with every ACK

The simplest way to do this is to ACK every packet, setting the ECN-Echo flag if and only if the packet has a marked CE codepoint.

However, using delayed ACKs is important for various reasons, including reducing the load on the data sender. To use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a trivial two-state state machine (Figure 10 in the paper) to determine whether to set the ECN-Echo bit.

Takeaway: We don’t want to ACK every packet (that would load the sender), so we send one cumulative ACK for every m consecutively received packets.
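A hypothetical sketch of that two-state receiver logic (the class name, return convention, and the default m are illustrative): the state remembers whether the last packet was CE-marked, an ACK fires immediately when the CE state flips (so the echo stays accurate), and otherwise one cumulative ACK goes out per m packets.

```python
class DctcpReceiver:
    def __init__(self, m: int = 2):
        self.m = m
        self.ce_state = False   # CE bit of the most recently seen packet
        self.unacked = 0        # packets received since the last ACK

    def on_packet(self, ce: bool):
        """Return (send_ack, echo_ce) for an arriving packet."""
        if ce != self.ce_state:
            # CE state flipped: immediately ACK the packets received so far,
            # echoing the *old* state, then switch states
            prev = self.ce_state
            self.ce_state = ce
            if self.unacked > 0:
                self.unacked = 1  # the current packet is still pending
                return True, prev
        self.unacked += 1
        if self.unacked >= self.m:
            self.unacked = 0
            return True, self.ce_state
        return False, self.ce_state
```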

Estimating Congestion Levels at Sender

Use an EWMA (exponentially weighted moving average) of the fraction of CE-marked packets to estimate the congestion level α.
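Concretely, once per window of data the sender updates α ← (1 − g)·α + g·F, where F is the fraction of packets CE-marked in that window and g is a small gain (the paper uses values like 1/16). A minimal sketch:

```python
def update_alpha(alpha: float, marked: int, total: int, g: float = 1.0 / 16) -> float:
    """EWMA update of the congestion estimate alpha.

    marked/total is F, the fraction of CE-marked packets in the last
    window of data; g is the EWMA gain.
    """
    f = marked / total
    return (1 - g) * alpha + g * f
```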

Reducing Congestion Window at the Sender
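Per the DCTCP paper, instead of always halving on congestion like standard TCP, the sender scales its window cut by the estimated congestion level: cwnd ← cwnd × (1 − α/2). With α = 1 (every packet marked) this reduces to TCP's halving; with small α the cut is gentle. A minimal sketch:

```python
def reduce_cwnd(cwnd: float, alpha: float) -> float:
    """DCTCP window reduction on receiving ECN-Echo feedback:
    scale the cut by the congestion estimate alpha."""
    return cwnd * (1 - alpha / 2)
```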

Benefits from DCTCP

Queue buildup

When the queue length on an interface exceeds K, DCTCP senders start reacting.

  • Reduces queueing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows.
  • More buffer space is available as headroom to absorb transient micro-bursts, greatly mitigating costly packet losses that can lead to timeouts.

Buffer pressure

DCTCP also solves the buffer pressure problem because a congested port’s queue length does not grow exceedingly large. Therefore, in shared memory switches, a few congested ports will not exhaust the buffer resources, harming flows passing through other ports.

Incast

  • A large number of synchronized small flows hitting the same queue is the most difficult scenario to handle
  • There isn’t much DCTCP — or any congestion control scheme that does not attempt to schedule traffic — can do to avoid packet drops

Takeaway: DCTCP alone cannot solve incast.
