DCTCP: Datacenter TCP
Background
- Merchant Silicon
- ECN: Explicit Congestion Notification (when its queue builds up, the switch sets a bit in the IP header instead of dropping the packet)
- Partition/Aggregate computations need low latency for short interactive transfers, high burst tolerance, and high throughput for large transfers
- IETF standard ECN: (1. uses bits in the IP header 2. a router can mark packets when it experiences congestion 3. the sender cuts cwnd in half on a marked ACK; see the sketch below)
- Incast
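For contrast with DCTCP's proportional reaction later on, a minimal sketch of the standard (RFC 3168-style) ECN sender reaction; the class and method names are illustrative, not from any spec:

```python
# Sketch: classic ECN reaction at the sender. On an ACK carrying
# ECN-Echo, cwnd is halved at most once per window of data,
# exactly as if a packet had been lost.

class EcnSender:
    def __init__(self, cwnd: float = 10.0):
        self.cwnd = cwnd                  # congestion window, in packets
        self.reacted_this_window = False

    def on_ack(self, ece: bool) -> None:
        if ece and not self.reacted_this_window:
            self.cwnd = max(1.0, self.cwnd / 2)  # cut cwnd in half
            self.reacted_this_window = True

    def on_new_window(self) -> None:
        # Re-arm the reaction once per window of data.
        self.reacted_this_window = False
```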
Problems
Shallow buffers
- Most cheap switches are shallow-buffered
- Packet buffer is the scarcest resource on the switch.
- Memory is shared across all ports.
- Buffer is allocated to ports dynamically, depending on traffic.
- A large flow can grab much of the shared memory, so a packet of a small flow might be dropped.
- When many flows converge on the same switch interface over a short period, they can exhaust the shared switch memory.
Incast
In Partition/Aggregate workloads, many workers respond to the aggregator at once, and the switch can run out of memory:
- a response may be dropped
- it is retransmitted only after RTO_min expires
- the flow can miss its deadline
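Rough numbers, assuming RTO_min = 300 ms (a common default, and what the paper measures) and intra-datacenter RTTs of a few hundred microseconds: a single timeout stalls a flow for on the order of 1,000 RTTs, far beyond the ~10 ms latency budgets of Partition/Aggregate tiers.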
DCTCP Components
Marking at the Switch
If the instantaneous queue size is larger than a threshold K, set the packet's CE bit.
Different from the RED algorithm, which:
- tracks the average queue size
- avg < lower bound: do nothing
- lower bound ≤ avg ≤ upper bound: randomly mark
- avg > upper bound: mark every packet
Why instantaneous instead of average:
- avoid packet loss at all costs
- there is little statistical multiplexing in the datacenter, so an average queue reacts too slowly to bursts (see the sketch below)
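A minimal side-by-side sketch of the two marking rules (K = 65 follows the paper's 10 GbE guideline; the RED thresholds here are made-up illustrative values):

```python
import random

K = 65  # DCTCP marking threshold, in packets (the paper uses ~65 for 10 GbE)

def dctcp_mark(instant_queue_len: int) -> bool:
    # DCTCP: mark CE iff the *instantaneous* queue exceeds K.
    return instant_queue_len > K

MIN_TH, MAX_TH, MAX_P = 20, 80, 0.1  # illustrative RED parameters

def red_mark(avg_queue_len: float) -> bool:
    # RED: decide based on a smoothed *average* queue length.
    if avg_queue_len < MIN_TH:
        return False                 # below lower bound: do nothing
    if avg_queue_len > MAX_TH:
        return True                  # above upper bound: mark every packet
    # In between: mark with probability rising toward MAX_P.
    p = MAX_P * (avg_queue_len - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p
```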
Conveying CE Signals at the Receiver
- Send CE bit with every ACK
The simplest way to do this is to ACK every packet, setting the ECN-Echo flag if and only if the packet has a marked CE codepoint.
However, using delayed ACKs is important for various reasons, including reducing the load on the data sender. To use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a trivial two-state state machine (Figure 10 in the paper) to determine whether to set the ECN-Echo bit.
Takeaway: we don't want to ACK every packet, to reduce load on the sender -> send one cumulative ACK for every m consecutively received packets, with a small state machine keeping the CE signal accurate (see the sketch below).
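A minimal sketch of that two-state receiver logic as I read it from the paper (M, DctcpReceiver, and send_ack are illustrative names):

```python
# Sketch of the DCTCP receiver's two-state delayed-ACK machine.
# State = CE bit of the most recently received packet. While the CE
# value is unchanged, send one cumulative ACK per M packets, echoing
# the current state. When the CE value flips, immediately ACK the
# packets received so far with the *old* state, so the sender sees an
# accurate count of marked vs. unmarked packets.

M = 2  # delayed-ACK factor: one cumulative ACK per M packets

class DctcpReceiver:
    def __init__(self):
        self.ce_state = False   # CE bit of the last packet seen
        self.pending = 0        # packets received since the last ACK

    def on_packet(self, ce: bool, send_ack) -> None:
        if ce != self.ce_state:
            if self.pending:
                send_ack(ece=self.ce_state)  # flush with the old CE value
            self.ce_state = ce
            self.pending = 0
        self.pending += 1
        if self.pending >= M:
            send_ack(ece=self.ce_state)      # normal delayed ACK
            self.pending = 0
```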
Estimating Congestion Levels at Sender
Use an EWMA (exponentially weighted moving average) to estimate the fraction of marked packets, α: α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last window of data and g is the EWMA gain.
Reducing Congestion Window at the Sender
Cut cwnd in proportion to the estimate: cwnd ← cwnd × (1 − α/2). When α ≈ 0 the window barely shrinks; when α = 1 it is halved, as in standard TCP. See the sketch below.
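A minimal sketch of the sender side, combining the EWMA estimate with the window cut (g = 1/16 follows the paper's suggestion; the bookkeeping is simplified to one update per window of data):

```python
G = 1.0 / 16  # EWMA gain g (the paper suggests 1/16)

class DctcpSender:
    def __init__(self, cwnd: float = 10.0):
        self.cwnd = cwnd     # congestion window, in packets
        self.alpha = 0.0     # estimated fraction of marked packets
        self.acked = 0
        self.marked = 0

    def on_ack(self, ece: bool) -> None:
        # Count marked vs. total ACKs over one window of data.
        self.acked += 1
        if ece:
            self.marked += 1

    def on_window_end(self) -> None:
        # F = fraction of packets marked in the last window.
        f = self.marked / self.acked if self.acked else 0.0
        # EWMA update: alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - G) * self.alpha + G * f
        # Proportional reaction, at most once per window of data:
        # alpha ~ 0 barely shrinks cwnd; alpha = 1 halves it like TCP.
        if self.marked:
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        self.acked = self.marked = 0
```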
Benefits from DCTCP
Queue buildup
When the queue length on an interface exceeds K, DCTCP senders start reacting.
- Reduces queueing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows.
- More buffer space is available as headroom to absorb transient micro-bursts, greatly mitigating costly packet losses that can lead to timeouts.
Buffer pressure
DCTCP also solves the buffer pressure problem because a congested port’s queue length does not grow exceedingly large. Therefore, in shared memory switches, a few congested ports will not exhaust the buffer resources, harming flows passing through other ports.
Incast
- A large number of synchronized small flows hitting the same queue is the most difficult case to handle.
- If the burst is large enough to overflow the buffer on its own, there isn't much that DCTCP, or any congestion control scheme that does not attempt to schedule traffic, can do to avoid packet drops.
Takeaway: DCTCP cannot solve severe incast by itself.