Reading Review: B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google’s Software-Defined WAN
Background
Growth across multiple dimensions: traffic, topology size, number of flow groups, number of tunnels
Higher availability requirements:
[Network perspective] Service-level objective (SLO): guarantees a level of service
With an availability SLO of 99.99%, Google guarantees that …
- packet loss < some threshold
- bandwidth > some threshold
- … for 99.99% of time over 30 days
- equivalently, at most ~4.3 minutes of downtime per month
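As a quick back-of-the-envelope check on that downtime budget (a minimal sketch, not from the paper):

```python
# Downtime budget implied by a 99.99% availability SLO over a 30-day window.
minutes_per_month = 30 * 24 * 60                  # 43,200 minutes
allowed_downtime = minutes_per_month * (1 - 0.9999)
print(f"{allowed_downtime:.2f} minutes")          # ~4.32 minutes per month
```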
Challenges
- Scale WAN capacity at a site
- Address capacity asymmetry
- Manage switch rule table size
Scaling B4 Sites: if more WAN bandwidth is needed, add another site at the same location.
What's the downside?
a) With more sites, the TE (traffic engineering) algorithm slows down, reducing availability
b) increases switch forwarding table sizes
c) complicates capacity planning
Solution: Re-design site to add capacity incrementally
Original Site Design:
Saturn-based design:
- 2-level Clos
- 4 Saturn chassis at the lower level
- 2 chassis (expandable to 4) at the upper level
If more capacity needed:
- add another site
Stargate Site Design:
- up to 4 supernodes
- There is no need for Freedomes
Supernode:
2-stage Clos using 32x40G chips
~8x the capacity of the original design
Notice sidelinks in site topology:
- helps address capacity asymmetry; important, why?
Sidelinks, Capacity Asymmetry:
Impact of abstraction
- TE treats the site as a single node
- does not know about supernode details
- must spread incoming traffic equally across a site's supernodes
At a site with multiple supernodes:
- Failure can cause capacity asymmetry
- Some supernodes have more capacity than others
Example:
- Site A has two supernodes A1 and A2
- A1 has capacity 10 to site B
- A2, because of a failure, has capacity 2
Need hierarchical TE:
- site-level, then supernode-level
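A small worked example of why the single-node abstraction hurts, using the numbers above (illustrative sketch, not from the paper):

```python
# Site A has supernodes A1 (capacity 10 toward site B) and A2 (capacity 2 after a failure).
capacities = {"A1": 10, "A2": 2}

# Site-level TE only sees the aggregate (12) and spreads traffic equally across supernodes,
# so usable throughput is bounded by the weakest supernode.
equal_split_usable = len(capacities) * min(capacities.values())   # 2 * 2 = 4

# Supernode-aware (hierarchical) splitting can load each supernode up to its capacity.
asymmetry_aware_usable = sum(capacities.values())                 # 10 + 2 = 12

print(equal_split_usable, asymmetry_aware_usable)                 # 4 vs 12
```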
Hierarchical TE challenges:
TE requires tunneling
- since packets may not follow shortest paths
- using IP-in-IP encapsulation
- original B4 used this
Hierarchical TE requires
- two levels of IP-in-IP encapsulation (site level + supernode level)
- hard to do load-balancing using hashing
Why?
- non-hierarchical (flat) TE doesn't scale (~200x slower)
- shortest-path routing can't use sidelinks (since they are longer paths)
Approach:
- Hierarchical TE without two-level encapsulation
- Using traffic splits within site
Switch Split Groups (SSGs):
- determine traffic splits inside the Stargate Clos
- two-stage load balancing over backend switches
- use shortest-path routing with capacity-based weighted cost multipathing (WCMP), as sketched below
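A minimal sketch of the capacity-weighted splitting (WCMP) idea; the switch names and capacities below are made up:

```python
# Weighted cost multipathing: split traffic over next hops in proportion to their capacity,
# instead of equally as plain ECMP would.
def wcmp_weights(link_capacities):
    total = sum(link_capacities.values())
    return {link: cap / total for link, cap in link_capacities.items()}

# Hypothetical uplinks from one edge switch to the backend switches of a supernode.
uplinks_gbps = {"backend1": 40, "backend2": 40, "backend3": 20}
print(wcmp_weights(uplinks_gbps))   # {'backend1': 0.4, 'backend2': 0.4, 'backend3': 0.2}
```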
Tunnel Split Groups (TSGs):
In original B4, a tunnel group (TG)
- describes how to split traffic in a flow group
- across tunnels
With hierarchical TE, a tunnel split group (TSG)
- describes how to split tunnel traffic
- at each supernode of a site
Takeaway:
- No encapsulation
- TSGs control how traffic is split at the supernode level
- use progressive waterfill to manage capacity asymmetry via sidelinks
- loop-free and blackhole-free route updates across supernodes
Generating TSGs
- Problem Formulation: The TSG calculation is treated as an independent network flow problem for each directed site-level link.
- Graph Generation:
- Vertices in the graph GTSG include supernodes in the source site (Si) and the destination site (D).
- Two types of links are added:
- Full mesh links among supernodes in the source site.
- Links between each supernode in the source site and the destination site.
- Aggregate capacities of supertrunks are associated with each link.
- Flow Generation:
- Infinite demand flows are generated from each supernode Si towards the destination site D.
- Two types of pathing groups (PGs) are used, with one-hop and two-hop paths.
- Algorithm:
- A greedy exhaustive waterfill algorithm is employed to iteratively allocate bottleneck capacity in a max-min fair manner.
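A simplified max-min fair waterfill in the spirit of this step (an illustrative sketch with a made-up topology, not the paper's exact algorithm, which also prefers one-hop pathing groups over two-hop ones):

```python
def waterfill(flows, link_capacity):
    """Max-min fair allocation: grow all unfrozen flows equally until a link saturates,
    then freeze the flows crossing that bottleneck, and repeat."""
    alloc = {f: 0.0 for f in flows}
    frozen = set()
    remaining = dict(link_capacity)
    while len(frozen) < len(flows):
        # How much can every unfrozen flow grow before some link saturates?
        headroom = []
        for link, cap in remaining.items():
            users = [f for f in flows if link in flows[f] and f not in frozen]
            if users:
                headroom.append((cap / len(users), link))
        if not headroom:
            break
        step, bottleneck = min(headroom)
        for f in flows:
            if f in frozen:
                continue
            alloc[f] += step
            for link in flows[f]:
                remaining[link] -= step
        # Freeze flows crossing the saturated bottleneck link.
        frozen |= {f for f in flows if bottleneck in flows[f]}
    return alloc

# Hypothetical: A1 has a direct supertrunk to B; after a failure A2's direct path is small,
# so a two-hop path (sidelink A2->A1, then A1's supertrunk) is also offered.
flows = {"A1->B": {"A1-B"}, "A2->B": {"A2-B"}, "A2->A1->B": {"A2-A1", "A1-B"}}
capacity = {"A1-B": 10, "A2-B": 2, "A2-A1": 8}
print(waterfill(flows, capacity))   # {'A1->B': 5.0, 'A2->B': 2.0, 'A2->A1->B': 5.0}
```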
Sequencing TSGs
- need to update switches at sites to avoid traffic drops
- use the fact that GTSG is a DAG
- TSG updates are applied to each supernode in reverse topological ordering.
- The algorithm is proven to avoid transient loops or blackholes during the transition.
- This dependency-based TSG sequencing is similar in spirit to ordered route updates used for loop-free convergence in Interior Gateway Protocols (IGPs).
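A minimal sketch of the ordering idea (the dependency map and controller call are hypothetical):

```python
# If supernode X will forward traffic to supernode Y under the new TSGs, Y must be updated
# before X. Applying updates in this order (reverse topological order of the traffic DAG)
# avoids transient loops and blackholes.
from graphlib import TopologicalSorter

# deps[X] = set of supernodes X will forward to, i.e. supernodes that must be updated before X.
deps = {"A2": {"A1"}, "A1": set()}          # e.g. A2 sends sidelink traffic to A1

update_order = list(TopologicalSorter(deps).static_order())
print(update_order)                          # ['A1', 'A2']: downstream supernode first
for supernode in update_order:
    pass  # push_tsg_update(supernode)       # hypothetical controller call
```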
Minimizing Table Sizes
Switches have different tables:
- ACL tables for flow matches
- LPM tables for longest-prefix matches
- ECMP tables for ECMP groups
Limitations:
- ACL tables use TCAM, which is expensive, so they are small
- LPM tables use SRAM, so much larger tables are possible
The design can exceed table sizes:
- FG-to-TG matching can exceed the ACL table
- WCMP splitting can exceed the ECMP table
Hierarchical FG Matching
- partition FG matching into two hierarchical stages (Figure 8b in the paper) so it fits in the switch tables
- hierarchical TE splits are likewise decoupled into two-stage hashing across the switches of the two-stage Clos supernode (next subsection)
- the decoupling keeps traffic splits fine-grained, avoiding the ~6% throughput loss that coarser splits would cause
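A back-of-the-envelope illustration of why splitting the match helps; all numbers are made up, and the prefix-in-LPM / class-in-ACL split is an assumed reading of the two-stage scheme:

```python
# Single-stage matching needs one TCAM (ACL) rule per flow group, i.e. per
# (destination prefix, service class) pair. Splitting the match lets the large
# SRAM-backed LPM table map prefixes to a destination site, so the ACL only has
# to match the much smaller (site, class) space.
num_prefixes = 16_000     # destination cluster prefixes (hypothetical)
num_sites = 30            # B4 sites (order of magnitude, hypothetical)
num_classes = 6           # service classes (hypothetical)

flat_acl_rules = num_prefixes * num_classes     # 96,000: far beyond a TCAM-backed ACL table
split_acl_rules = num_sites * num_classes       # 180: fits easily in the ACL table
split_lpm_rules = num_prefixes                  # prefixes live in the roomy LPM table

print(flat_acl_rules, split_acl_rules, split_lpm_rules)
```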
Two-stage Hashing
To work within switch table size limits, the traffic-splitting rules are decoupled across the two levels of the Clos fabric:
1. **Traffic Splitting Decoupling**:
- Edge switches determine the tunnel (TG split) and the target site, self or next site, for the traffic (TSG split, part 1).
- The decision is propagated to backend switches by encapsulating packets with an IP address for the selected tunnel and a special source MAC address representing the self-/next-site target.
2. **Backend Switch Decision Making**:
- Backend switches use the tunnel IP address and source MAC address to determine the peer supernode for forwarding (TSG split, part 2) and the egress edge switch with connectivity to the target supernode (SSG split).
3. **Efficient Splitting Rules**:
- To reduce the number of splitting rules on ingress switches, a link aggregation group (LAG) is configured from each edge switch towards viable backend switches.
- A backend switch is viable if it is active and has active connections to all edge switches in the supernode.
Decoupling the splitting rules between edge and backend switches keeps per-switch table usage small while still achieving fine-grained, capacity-aware splits within the fabric.
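A toy sketch of the two split stages described above (hash functions, field names, and table contents are all illustrative, not the actual switch pipeline):

```python
import hashlib

def pick(options, key):
    """Deterministically hash a flow key into one of the options (stand-in for switch hashing)."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return options[h % len(options)]

def edge_switch_stage(flow_key, tunnels):
    # TG split: choose a tunnel for the flow group.
    tunnel = pick(tunnels, flow_key + ":tg")
    # TSG split, part 1: self- vs next-site target; the choice rides in a special
    # source MAC, the selected tunnel in the encap IP.
    target = pick(["self", "next-site"], flow_key + ":tsg1")
    return {"encap_ip": tunnel, "src_mac": target}

def backend_switch_stage(flow_key, pkt, peers_for_target, egress_edges_for_peer):
    # TSG split, part 2: pick the concrete peer supernode for the encoded target.
    peer = pick(peers_for_target[pkt["src_mac"]], flow_key + ":tsg2")
    # SSG split: pick an egress edge switch with connectivity to that peer supernode.
    egress = pick(egress_edges_for_peer[peer], flow_key + ":ssg")
    return peer, egress

# Toy usage with hypothetical tunnels, supernodes, and edge switches.
pkt = edge_switch_stage("10.0.1.0/24:cls1", tunnels=["T1", "T2"])
peer, egress = backend_switch_stage(
    "10.0.1.0/24:cls1", pkt,
    peers_for_target={"self": ["A1"], "next-site": ["A2"]},
    egress_edges_for_peer={"A1": ["edge1", "edge2"], "A2": ["edge3", "edge4"]})
print(pkt, peer, egress)
```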
Results: Availability Improvements
For many service classes:
- an order-of-magnitude improvement in availability
Comes from:
- faster TE, so losses are repaired quickly (enabled by hierarchical TE plus the table-size optimizations: hierarchical FG matching and two-stage hashing)
- use of sidelinks, so less capacity is lost under asymmetry
Reference:
- C.-Y. Hong et al., "B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN," ACM SIGCOMM 2018.