Reading Review: B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google’s Software-Defined WAN

ztex, Tony, Liu
5 min read · Mar 14, 2024


Background

Growth across multiple dimensions: traffic, topology size, number of flow groups, number of tunnels

Higher availability requirements:

[Network perspective] A service-level objective (SLO) guarantees a level of service.

With an availability SLO of 99.99%, Google guarantees that:

  • packet loss < some threshold
  • bandwidth > some threshold
  • … for 99.99% of the time over 30 days
  • equivalently, a downtime budget of about 4 minutes per month
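As a quick sanity check on that last figure, the downtime budget follows directly from the SLO:

```python
# Downtime budget implied by an availability SLO over a 30-day window.
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    return days * 24 * 60 * (1 - slo)

print(downtime_budget_minutes(0.9999))  # ~4.32 minutes per month
```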

Challenges

  1. Scale WAN capacity at a site
  2. Address capacity asymmetry
  3. Manage switch rule table size

Scaling B4 Sites: if more WAN bandwidth is needed, add another site at the same location.

The downsides:

a) With more sites, the TE (traffic engineering) algorithm slows down, reducing availability

b) Switch forwarding tables grow

c) Capacity planning becomes more complicated

Solution: redesign the site so that capacity can be added incrementally

Original Site Design:

[Figure: Saturn site]

Saturn-based design:

  • 2-level Clos
  • 4 Saturn chassis at the lower level
  • 2 chassis at the upper level, expandable to 4

If more capacity needed:

  • add another site

Stargate Site Design:

  • up to 4 supernodes
  • There is no need for Freedomes

Supernode:

2-stage Clos using 32x40G chips

8x the capacity of the original Saturn-based design

Note the sidelinks in the site topology:

  • they help address capacity asymmetry, which is important; why?

Sidelinks, Capacity Asymmetry:

Impact of abstraction

  • TE treats the site as a single node
  • it does not know about supernode details
  • so it must spread incoming traffic equally across supernodes

At a site with multiple supernodes:

  • Failure can cause capacity asymmetry
  • Some supernodes have more capacity than others

Example:

  • Site A has two supernodes, A1 and A2
  • A1 has capacity 10 to site B
  • A2, because of a failure, has capacity 2
  • site-level TE sees total capacity 12 and splits it equally, overloading A2 (see the sketch below)
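A minimal numeric sketch, using the hypothetical capacities from this example, shows why the equal split loses traffic and how a sidelink recovers it:

```python
# Equal supernode split under capacity asymmetry, with and without a sidelink.
cap = {"A1": 10, "A2": 2}   # supernode -> capacity toward site B
demand = 12                 # site-level TE sees the aggregate 10 + 2 = 12

# Equal split: site-level TE is unaware of supernode details.
per_node = demand / 2       # 6 units land on each supernode
dropped = sum(max(0, per_node - c) for c in cap.values())
print(dropped)              # 4.0 units lost at A2

# With a sidelink, A2 forwards its excess (6 - 2 = 4) sideways to A1,
# which still has 10 - 6 = 4 units of headroom toward B: no loss.
excess = max(0, per_node - cap["A2"])
headroom = cap["A1"] - per_node
print(dropped - min(excess, headroom))  # 0.0 units lost
```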

Need hierarchical TE:

  • site-level, then supernode-level

Hierarchical TE challenges:

TE requires tunneling

  • since packets may not follow shortest paths
  • using IP-in-IP encapsulation
  • original B4 used this

Hierarchical TE requires

  • two levels of IP-in-IP encapsulation (site level + supernode level)
  • which makes hash-based load balancing hard

Why?

  • non-hierarchical (flat) TE doesn’t scale (200x slower)
  • shortest-path routing can’t use sidelinks (because they are longer paths)

Approach:

  • Hierarchical TE without two-level encapsulation
  • using traffic splits within a site

Switch Split Groups (SSGs)

SSGs

  • determine traffic splits inside the Stargate Clos
  • two-stage load balancing over backend switches
  • use shortest-path routing with capacity-based weighted cost multipathing (WCMP); see the sketch below
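The paper does not give code for WCMP, but the core idea is simple: replicate forwarding entries in proportion to capacity, so that uniform hashing over the entries produces capacity-weighted splits. A minimal sketch (port names and capacities invented):

```python
from functools import reduce
from math import gcd

# Capacity-based WCMP sketch: each next hop gets multipath-table entries in
# proportion to its capacity, so uniform hashing over the expanded entry
# list yields capacity-proportional traffic splits.
def wcmp_entries(capacities: dict[str, int]) -> dict[str, int]:
    g = reduce(gcd, capacities.values())
    return {port: c // g for port, c in capacities.items()}

# Hypothetical uplinks: two healthy 40G trunks and one degraded 20G trunk.
print(wcmp_entries({"up1": 40, "up2": 40, "up3": 20}))  # {'up1': 2, 'up2': 2, 'up3': 1}
```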

Tunnel Split Groups (TSGs)

In the original B4, a tunnel group (TG)

  • describes how to split the traffic of a flow group
  • across tunnels

With hierarchical TE, a tunnel split group (TSG)

  • describes how to split tunnel traffic
  • at each supernode of a site

Takeaway:

  • no encapsulation needed at the second level
  • TSGs control how traffic is split at the supernode level
  • a progressive waterfill manages capacity asymmetry via sidelinks
  • loop-free and blackhole-free route updates across supernodes

Generating TSGs

  1. Problem formulation: TSG calculation is treated as an independent network-flow problem for each directed site-level link.
  2. Graph generation:
  • vertices of the graph G_TSG are the supernodes of the source site (Si) and the destination site (D)
  • two types of links are added: a full mesh among the supernodes of the source site, and a link from each source supernode to the destination site
  • the aggregate capacity of the corresponding supertrunks is associated with each link
  3. Flow generation:
  • infinite-demand flows are generated from each supernode Si toward the destination site D
  • two pathing groups (PGs) are used, with one-hop and two-hop paths respectively
  4. Algorithm: a greedy exhaustive waterfill iteratively allocates bottleneck capacity in a max-min fair manner (see the sketch below).
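As a rough illustration only (not the paper's exact max-min formulation), a greedy waterfill over this graph might first fill each supernode's direct supertrunk, then spill leftover demand onto two-hop sidelink paths; names and numbers below are hypothetical:

```python
# A rough sketch in the spirit of the waterfill: fill each supernode's direct
# supertrunk first, then spill leftover demand onto two-hop paths through
# peer supernodes via sidelinks.
def waterfill(direct, sidelink, demand):
    """direct[s]: capacity of supernode s straight to destination site D.
    sidelink[s][t]: capacity of the sidelink s -> t.
    demand[s]: traffic entering at supernode s."""
    split = {s: {"direct": min(demand[s], direct[s])} for s in demand}
    residual = {s: direct[s] - split[s]["direct"] for s in direct}
    for s in demand:
        left = demand[s] - split[s]["direct"]
        for t, side_cap in sidelink.get(s, {}).items():
            if left <= 0:
                break
            x = min(left, side_cap, residual[t])   # bottleneck on s -> t -> D
            if x > 0:
                split[s][f"via {t}"] = x
                residual[t] -= x
                left -= x
    return split

# Reusing the earlier asymmetry example: A2 sends 2 direct and 4 via A1.
print(waterfill(direct={"A1": 10, "A2": 2},
                sidelink={"A2": {"A1": 8}},
                demand={"A1": 6, "A2": 6}))
```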

Sequencing TSGs

  • switches at the sites must be updated in an order that avoids traffic drops
  • exploit the fact that G_TSG is a DAG
  • TSG updates are applied to the supernodes in reverse topological order (see the sketch below)
  • the algorithm provably avoids transient loops and blackholes during the transition
  • this dependency-based TSG sequencing is similar in spirit to how updates are ordered in Interior Gateway Protocols (IGPs)
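A small sketch of that ordering using Python's standard graphlib (the edge map and update callback are placeholders): because G_TSG is a DAG, pushing each supernode's new TSG only after all of its next hops have been updated rules out transient loops and blackholes.

```python
from graphlib import TopologicalSorter

def sequence_tsg_updates(next_hops, apply_update):
    """next_hops: supernode -> set of next-hop supernodes in G_TSG.
    Feeding next hops to TopologicalSorter as 'dependencies' makes
    static_order() emit downstream nodes first, i.e. the reverse
    topological order required for loop-/blackhole-free updates."""
    for node in TopologicalSorter(next_hops).static_order():
        apply_update(node)   # placeholder for pushing the new TSG

# A2 is updated before A1, which forwards through it.
sequence_tsg_updates({"A1": {"A2"}, "A2": set()}, print)
```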

Minimizing Table Sizes

Switches have different tables:

  • ACL tables for flow matches
  • LPM tables for longest-prefix matches
  • ECMP tables for ECMP groups

Limitations:

  • ACL tables use TCAM, which is expensive, so they are small
  • LPM tables use SRAM, so much larger tables are possible

The design can exceed these table sizes:

  • FG-to-TG matching can exceed the ACL table
  • WCMP splitting can exceed the ECMP table

Hierarchical FG Matching

  • hierarchical TE splits are decoupled into two-stage hashing across the switches of the two-stage Clos supernode
  • this decoupling keeps traffic splits fine-grained, avoiding the ~6% throughput loss that coarser splits would cause
  • FG matching is partitioned into two hierarchical stages (Figure 8b in the paper; see the toy rule-count example below)
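The gain from partitioning is combinatorial: matching every (prefix, service class) pair in a single ACL stage needs one rule per pair, while two hierarchical stages need roughly one rule per prefix plus one per class. A toy illustration (numbers invented):

```python
# Toy rule-count comparison for flat vs. hierarchical FG matching.
prefixes, classes = 16_000, 6      # hypothetical numbers
flat = prefixes * classes          # one ACL rule per (prefix, class) pair
hierarchical = prefixes + classes  # LPM stage on prefixes + small ACL stage
print(flat, hierarchical)          # 96000 vs. 16006
```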

Two-stage Hashing

To overcome the limits on switch table size, traffic-splitting rules are decoupled across the two levels of the Clos fabric. The process involves the following steps:

1. **Traffic-splitting decoupling**: Edge switches determine the tunnel (TG split) and the destination site for incoming traffic (TSG split, part 1). The decision is propagated to backend switches by encapsulating packets with an IP address for the selected tunnel and a special source MAC address representing the self-/next-site target.

2. **Backend switch decision making**: Backend switches use the tunnel IP address and source MAC address to determine the peer supernode for forwarding (TSG split, part 2) and the egress edge switch with connectivity to the target supernode (SSG split).

3. **Efficient splitting rules**: To reduce the number of splitting rules on ingress switches, a link aggregation group (LAG) is configured on each edge switch toward the viable backend switches. A backend switch is viable if it is active and has active connections to all edge switches in the supernode.

By decoupling the splitting rules and propagating decisions in-band from edge to backend switches, B4 fits the required forwarding state into the available tables while keeping traffic splits fine-grained.
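A hedged sketch of this two-stage decision (the field names and the `pick` helper are invented for illustration; real switches hash packet headers against installed rules):

```python
import random

def pick(weighted):
    # Stand-in for switch hashing: choose proportionally to the split weights.
    choices, weights = zip(*weighted.items())
    return random.choices(choices, weights=weights)[0]

# Stage 1 (edge switch): choose the tunnel (TG split) and the self-/next-site
# target (TSG split, part 1); both choices travel to the backend switch in
# the encapsulated packet's tunnel IP and special source MAC.
def edge_switch(pkt, tg_split, site_target_split):
    pkt["tunnel_ip"] = pick(tg_split[pkt["flow_group"]])
    pkt["src_mac"] = pick(site_target_split)
    return pkt

# Stage 2 (backend switch): from the carried fields, pick the peer supernode
# (TSG split, part 2) and an egress edge switch that reaches it (SSG split).
def backend_switch(pkt, tsg_peer_split, ssg_split):
    peer = pick(tsg_peer_split[(pkt["tunnel_ip"], pkt["src_mac"])])
    egress = pick(ssg_split[peer])
    return peer, egress

pkt = edge_switch({"flow_group": "fg1"},
                  tg_split={"fg1": {"10.0.0.1": 3, "10.0.0.2": 1}},
                  site_target_split={"mac-self": 1, "mac-next": 1})
print(backend_switch(pkt,
                     tsg_peer_split={(pkt["tunnel_ip"], pkt["src_mac"]): {"A1": 5, "A2": 1}},
                     ssg_split={"A1": {"edge1": 1, "edge2": 1},
                                "A2": {"edge1": 1, "edge2": 1}}))
```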

Results: Availability Improvements

For many service classes:

  • an order-of-magnitude availability improvement

This comes from:

  • faster TE, so losses are repaired quickly (hierarchical TE + hierarchical FG matching + two-stage hashing + minimized table sizes)
  • use of sidelinks, so failures cause a smaller capacity drop

Reference:

  • Hong et al., “B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google’s Software-Defined WAN,” SIGCOMM 2018.
