Review: Understanding Lifecycle Management Complexity of Datacenter Topologies

ztex, Tony, Liu
4 min read · Mar 10, 2024


Introduction

When building large-scale datacenter networks, performance metrics such as latency and throughput get most of the attention, but complexity is a critical factor that is often overlooked. Complexity affects how a network is deployed, maintained, and scaled over its lifecycle. This post reviews the metrics the paper proposes for quantifying datacenter network complexity and introduces FatClique, a topology designed to reduce that complexity while maintaining performance.

[Figure: Clos topology]
[Figure: Expander graph topology]

Measuring Deployment Complexity

Deploying datacenter networks involves packaging switches, placing equipment, and bundling/connecting cabling. Three key metrics capture the deployment complexity:

  1. Number of Switches: more switches increase packaging and placement effort.
  2. Number of Patch Panels: patch panels make it possible to bundle cables, but the more inter-rack links a topology has, the more panels it needs.
  3. Number of Bundle Types: fewer unique cable bundle lengths/capacities simplify cabling.

[Figure: Bundle and patch panel]
[Figure: Comparison: Clos vs Jellyfish]
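
To make the deployment metrics concrete, here is a minimal sketch of how the patch panel and bundle type counts could be computed from a list of cables. The `Link` record, the 48-port panel size, and the rule that one panel port serves one link are assumptions made for illustration, not definitions taken from the paper.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One cable between two racks (hypothetical record for this sketch)."""
    src_rack: str
    dst_rack: str
    length_m: int                  # cable length, rounded to whole metres
    via_patch_panel: bool = False  # True if the cable is routed through a panel


def count_bundle_types(links):
    """Count distinct bundle types, where a type is (cables in bundle, length).

    Assumption: cables of the same length between the same pair of racks
    are grouped into one bundle; intra-rack cables are not bundled.
    """
    cables = Counter()
    for link in links:
        if link.src_rack == link.dst_rack:
            continue
        pair = tuple(sorted((link.src_rack, link.dst_rack)))
        cables[(pair, link.length_m)] += 1
    return len({(count, length) for (_pair, length), count in cables.items()})


def count_patch_panels(links, ports_per_panel=48):
    """Panels needed, assuming each panel-routed link uses one panel port."""
    ports = sum(1 for link in links if link.via_patch_panel)
    return math.ceil(ports / ports_per_panel)


links = [
    Link("r1", "r2", 10, via_patch_panel=True),
    Link("r1", "r2", 10, via_patch_panel=True),
    Link("r3", "r4", 10),
    Link("r3", "r4", 10),
    Link("r1", "r3", 20, via_patch_panel=True),
]
print(count_bundle_types(links), count_patch_panels(links))
```

In this toy input the two 2-cable, 10 m bundles share one type, so a regular wiring pattern yields few bundle types even when it has many bundles.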

Measuring Expansion Complexity

As demand grows, networks must expand gradually while minimizing disruption and maintaining capacity. Two metrics gauge the complexity:

  1. Number of Expansion Steps: each step requires link rewiring that cannot be fully parallelized. Fewer steps are better.
  2. Rewired Links per Patch Panel Rack per Step: with parallelism, this dominates the expansion time for each step.

[Figure: Re-wiring scenario]
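
The two expansion metrics combine naturally into a rough time estimate. The sketch below assumes steps run one after another, patch panel racks within a step are rewired in parallel, and every link takes the same time to rewire. The function name and the 60-second per-link figure are hypothetical placeholders.

```python
def estimate_expansion_time(steps, seconds_per_link=60.0):
    """Estimate total expansion time in seconds.

    `steps` is a list with one dict per expansion step, mapping a patch
    panel rack id to the number of links rewired at that rack in the step.
    Racks are assumed to be worked on in parallel, so each step costs as
    much as its busiest rack; steps are assumed to be strictly sequential.
    """
    total = 0.0
    for rewired_per_rack in steps:
        if rewired_per_rack:
            total += max(rewired_per_rack.values()) * seconds_per_link
    return total


plan = [
    {"rack-A": 10, "rack-B": 4},  # step 1: rack-A is the bottleneck
    {"rack-A": 3, "rack-B": 8},   # step 2: rack-B is the bottleneck
]
print(len(plan), estimate_expansion_time(plan))
```

Under these assumptions, spreading rewiring evenly across racks shortens each step, while reducing the number of steps shortens the whole expansion.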

Comparing Existing Topologies

The paper uses the proposed metrics to evaluate existing datacenter topologies: multi-layer Clos and expander graph designs (Jellyfish, Xpander). No class wins on every metric. Clos networks are highly structured and need few bundle types, but they use more switches and patch panels, while expanders are harder to bundle because their inter-rack links are spread far less regularly.

For expansion, networks with a higher north-to-south capacity ratio (more capacity on a switch's north-facing uplinks than on its south-facing, server-side links) can rewire more links per step while still keeping enough residual capacity. This “fat edge” property enables faster, less disruptive expansions.
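
A back-of-the-envelope model shows why a fat edge helps. It is a simplification for illustration, not the paper's analysis. Let r be the north-to-south capacity ratio and suppose the SLO requires the remaining north capacity to stay at or above a fraction `slo` of the south capacity. Draining a fraction f of the uplinks leaves (1 - f) * r of north capacity per unit of south capacity, so f can be at most 1 - slo / r. The function names are hypothetical, and exact fractions are used to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction


def max_drain_fraction(north_south_ratio, slo):
    """Largest fraction of uplinks that can be drained in one step.

    Simplified model: after draining fraction f, the north capacity
    (1 - f) * ratio must still be at least slo (south capacity = 1).
    """
    return max(Fraction(0), 1 - Fraction(slo) / Fraction(north_south_ratio))


def min_expansion_steps(fraction_to_rewire, north_south_ratio, slo):
    """Fewest steps needed to rewire the given fraction of uplinks."""
    per_step = max_drain_fraction(north_south_ratio, slo)
    if per_step == 0:
        raise ValueError("no link can be drained without violating the SLO")
    return math.ceil(Fraction(fraction_to_rewire) / per_step)


slo = Fraction(9, 10)     # keep at least 90% of capacity during expansion
rewire = Fraction(1, 2)   # half of the uplinks must be rewired
print(min_expansion_steps(rewire, Fraction(1), slo))     # ratio 1 (thin edge)
print(min_expansion_steps(rewire, Fraction(3, 2), slo))  # ratio 1.5 (fat edge)
```

With a ratio of 1 only a tenth of the uplinks can be drained per step under this model, while a ratio of 1.5 allows two fifths, so the same rewiring finishes in far fewer steps.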

Introducing FatClique

The FatClique topology incorporates hierarchy, structure, and fat edges to reduce complexity based on the defined metrics:

  • Three hierarchy levels: sub-blocks, blocks, and the overall FatClique
  • Clique interconnects at each level for high connectivity
  • Constrained to have a fat edge (north-to-south capacity ratio > 1) at every level
  • Structured design enables bundling and reduces patch panels

[Figure: Fat edge: north-to-south ratio > 1]
[Figure: FatClique topology]
[Figure: Fat edge leads to fewer expansion steps]
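
The three-level clique idea can be sketched as a toy construction. This is a heavy simplification, assumed for illustration only: it places exactly one link between every pair of switches in a sub-block, every pair of sub-blocks in a block, and every pair of blocks, and it ignores port budgets, link multiplicity, and the fat-edge constraint that the real design enforces. The function name and endpoint-selection rule are hypothetical.

```python
from itertools import combinations


def toy_fatclique(num_blocks, sub_blocks_per_block, switches_per_sub_block):
    """Build a toy three-level clique hierarchy.

    A switch is identified by (block, sub_block, index). Returns the list
    of switches and a set of links, each link a pair of switch ids.
    """
    B, S, W = num_blocks, sub_blocks_per_block, switches_per_sub_block
    switches = [(b, s, w) for b in range(B) for s in range(S) for w in range(W)]
    links = set()

    # Level 1: the switches inside each sub-block form a clique.
    for b in range(B):
        for s in range(S):
            for u, v in combinations(range(W), 2):
                links.add(((b, s, u), (b, s, v)))

    # Level 2: the sub-blocks inside each block form a clique (one link per
    # pair; endpoints are spread over switches with a simple modulo rule).
    for b in range(B):
        for s1, s2 in combinations(range(S), 2):
            links.add(((b, s1, s2 % W), (b, s2, s1 % W)))

    # Level 3: the blocks form a clique (one link per pair of blocks).
    for b1, b2 in combinations(range(B), 2):
        links.add(((b1, b2 % S, 0), (b2, b1 % S, 0)))

    return switches, links


switches, links = toy_fatclique(3, 2, 3)
print(len(switches), len(links))
```

Because every level repeats the same wiring pattern, links between the same pair of sub-blocks or blocks can be grouped into a small number of bundle types, which is the property the structured design relies on.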

Evaluation Results

Compared to Clos and expander topologies across different scales:

- FatClique uses 50% fewer switches and 33% fewer patch panels at large scale
- 23–36% lower cabling costs due to reduced switches/cables
- Enables faster expansions at higher capacity SLOs due to the fat edge design

The structured hierarchy of FatClique, with a fat edge at every level, maintains high performance while substantially reducing deployment and expansion complexity for large datacenter networks.

[Figure: Comparison: number of patch panels]
[Figure: Comparison: cabling cost]
[Figure: Comparison: number of steps at different SLOs]

Conclusion

While performance is crucial, this paper highlights the importance of also optimizing for lifecycle management complexity in massive datacenter networks. The proposed metrics quantify that complexity and expose the drawbacks of existing topologies. FatClique demonstrates that a carefully structured design can match their performance while being easier to deploy and expand. Designing for simplicity and manageability will become even more critical as networks continue to scale.

References

  1. YouTube video: Understanding Lifecycle Management Complexity of Datacenter Topologies
  2. YouTube video: NSDI ’19, Understanding Lifecycle Management Complexity of Datacenter Topologies
  3. Slides
  4. Understanding Lifecycle Management Complexity of Datacenter Topologies
