Reading Review: B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google’s Software-Defined WAN

ztex, Tony, Liu
5 min read · Mar 14, 2024


Background

Growth across multiple dimensions: traffic, topology size, number of flow groups, number of tunnels

Higher availability requirements:

[Network perspective] A service-level objective (SLO) guarantees a level of service.

With an availability SLO of 99.99%, Google guarantees that:

  • packet loss < some threshold
  • bandwidth > some threshold
  • … for 99.99% of the time over 30 days
  • equivalently, a downtime budget of about 4 minutes per month
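As a quick sanity check on that last figure, the downtime budget follows directly from the SLO:

```python
# Downtime budget implied by an availability SLO over a 30-day window.
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    return days * 24 * 60 * (1 - slo)

print(downtime_budget_minutes(0.9999))  # ~4.32 minutes per month
```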

Challenges

  1. Scale WAN capacity at a site
  2. Address capacity asymmetry
  3. Manage switch rule table size

Scaling B4 Sites: if more WAN bandwidth is needed, add another site at the same location.

The downsides:

a) With more sites, the TE (traffic engineering) algorithm slows down, reducing availability

b) Switch forwarding tables grow

c) Capacity planning becomes more complicated

Solution: redesign the site so that capacity can be added incrementally

Original Site Design:

[Figure: Saturn site]

Saturn-based design:

  • 2-level Clos
  • 4 Saturn chassis at the lower level
  • 2 chassis at the upper level, expandable to 4

If more capacity needed:

  • add another site

Stargate Site Design:

  • up to 4 supernodes
  • There is no need for Freedomes

Supernode:

2-stage Clos using 32x40G chips

8x the capacity of the original Saturn-based design

Note the sidelinks in the site topology:

  • they help address capacity asymmetry, which is important; why?

Sidelinks, Capacity Asymmetry:

Impact of abstraction

  • TE treats the site as a single node
  • it does not know about supernode details
  • so it must spread incoming traffic equally across supernodes

At a site with multiple supernodes:

  • Failure can cause capacity asymmetry
  • Some supernodes have more capacity than others

Example:

  • Site A has two supernodes, A1 and A2
  • A1 has capacity 10 to site B
  • A2, because of a failure, has capacity 2
  • site-level TE sees total capacity 12 and splits it equally, overloading A2 (see the sketch below)
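A minimal numeric sketch, using the hypothetical capacities from this example, shows why the equal split loses traffic and how a sidelink recovers it:

```python
# Equal supernode split under capacity asymmetry, with and without a sidelink.
cap = {"A1": 10, "A2": 2}   # supernode -> capacity toward site B
demand = 12                 # site-level TE sees the aggregate 10 + 2 = 12

# Equal split: site-level TE is unaware of supernode details.
per_node = demand / 2       # 6 units land on each supernode
dropped = sum(max(0, per_node - c) for c in cap.values())
print(dropped)              # 4.0 units lost at A2

# With a sidelink, A2 forwards its excess (6 - 2 = 4) sideways to A1,
# which still has 10 - 6 = 4 units of headroom toward B: no loss.
excess = max(0, per_node - cap["A2"])
headroom = cap["A1"] - per_node
print(dropped - min(excess, headroom))  # 0.0 units lost
```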

Need hierarchical TE:

  • site-level, then supernode-level

Hierarchical TE challenges:

TE requires tunneling

  • since packets may not follow shortest paths
  • using IP-in-IP encapsulation
  • original B4 used this

Hierarchical TE requires

  • two levels of IP-in-IP encapsulation (site level + supernode level)
  • which makes hash-based load balancing hard

Why?

  • non-hierarchical (flat) TE doesn’t scale (200x slower)
  • shortest-path routing can’t use sidelinks (because they are longer paths)

Approach:

  • Hierarchical TE without two-level encapsulation
  • using traffic splits within a site

Switch Split Groups (SSGs)

SSGs

  • determine traffic splits inside the Stargate Clos
  • two-stage load balancing over backend switches
  • use shortest-path routing with capacity-based weighted cost multipathing (WCMP); see the sketch below
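The paper does not give code for WCMP, but the core idea is simple: replicate forwarding entries in proportion to capacity, so that uniform hashing over the entries produces capacity-weighted splits. A minimal sketch (port names and capacities invented):

```python
from functools import reduce
from math import gcd

# Capacity-based WCMP sketch: each next hop gets multipath-table entries in
# proportion to its capacity, so uniform hashing over the expanded entry
# list yields capacity-proportional traffic splits.
def wcmp_entries(capacities: dict[str, int]) -> dict[str, int]:
    g = reduce(gcd, capacities.values())
    return {port: c // g for port, c in capacities.items()}

# Hypothetical uplinks: two healthy 40G trunks and one degraded 20G trunk.
print(wcmp_entries({"up1": 40, "up2": 40, "up3": 20}))  # {'up1': 2, 'up2': 2, 'up3': 1}
```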

Tunnel Split Groups (TSGs)

In the original B4, a tunnel group (TG)

  • describes how to split the traffic of a flow group
  • across tunnels

With hierarchical TE, a tunnel split group (TSG)

  • describes how to split tunnel traffic
  • at each supernode of a site

Takeaway:

  • no encapsulation needed at the second level
  • TSGs control how traffic is split at the supernode level
  • a progressive waterfill manages capacity asymmetry via sidelinks
  • loop-free and blackhole-free route updates across supernodes

Generating TSGs

  1. Problem formulation: TSG calculation is treated as an independent network-flow problem for each directed site-level link.
  2. Graph generation:
  • vertices of the graph G_TSG are the supernodes of the source site (Si) and the destination site (D)
  • two types of links are added: a full mesh among the supernodes of the source site, and a link from each source supernode to the destination site
  • the aggregate capacity of the corresponding supertrunks is associated with each link
  3. Flow generation:
  • infinite-demand flows are generated from each supernode Si toward the destination site D
  • two pathing groups (PGs) are used, with one-hop and two-hop paths respectively
  4. Algorithm: a greedy exhaustive waterfill iteratively allocates bottleneck capacity in a max-min fair manner (see the sketch below).
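As a rough illustration only (not the paper's exact max-min formulation), a greedy waterfill over this graph might first fill each supernode's direct supertrunk, then spill leftover demand onto two-hop sidelink paths; names and numbers below are hypothetical:

```python
# A rough sketch in the spirit of the waterfill: fill each supernode's direct
# supertrunk first, then spill leftover demand onto two-hop paths through
# peer supernodes via sidelinks.
def waterfill(direct, sidelink, demand):
    """direct[s]: capacity of supernode s straight to destination site D.
    sidelink[s][t]: capacity of the sidelink s -> t.
    demand[s]: traffic entering at supernode s."""
    split = {s: {"direct": min(demand[s], direct[s])} for s in demand}
    residual = {s: direct[s] - split[s]["direct"] for s in direct}
    for s in demand:
        left = demand[s] - split[s]["direct"]
        for t, side_cap in sidelink.get(s, {}).items():
            if left <= 0:
                break
            x = min(left, side_cap, residual[t])   # bottleneck on s -> t -> D
            if x > 0:
                split[s][f"via {t}"] = x
                residual[t] -= x
                left -= x
    return split

# Reusing the earlier asymmetry example: A2 sends 2 direct and 4 via A1.
print(waterfill(direct={"A1": 10, "A2": 2},
                sidelink={"A2": {"A1": 8}},
                demand={"A1": 6, "A2": 6}))
```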

Sequencing TSGs

  • switches at the sites must be updated in an order that avoids traffic drops
  • exploit the fact that G_TSG is a DAG
  • TSG updates are applied to the supernodes in reverse topological order (see the sketch below)
  • the algorithm provably avoids transient loops and blackholes during the transition
  • this dependency-based TSG sequencing is similar in spirit to how updates are ordered in Interior Gateway Protocols (IGPs)
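A small sketch of that ordering using Python's standard graphlib (the edge map and update callback are placeholders): because G_TSG is a DAG, pushing each supernode's new TSG only after all of its next hops have been updated rules out transient loops and blackholes.

```python
from graphlib import TopologicalSorter

def sequence_tsg_updates(next_hops, apply_update):
    """next_hops: supernode -> set of next-hop supernodes in G_TSG.
    Feeding next hops to TopologicalSorter as 'dependencies' makes
    static_order() emit downstream nodes first, i.e. the reverse
    topological order required for loop-/blackhole-free updates."""
    for node in TopologicalSorter(next_hops).static_order():
        apply_update(node)   # placeholder for pushing the new TSG

# A2 is updated before A1, which forwards through it.
sequence_tsg_updates({"A1": {"A2"}, "A2": set()}, print)
```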

Minimizing Table Sizes

Switches have different tables:

  • ACL tables for flow matches
  • LPM tables for longest-prefix matches
  • ECMP tables for ECMP groups

Limitations:

  • ACL tables use TCAM, which is expensive, so they are small
  • LPM tables use SRAM, so much larger tables are possible

The design can exceed these table sizes:

  • FG-to-TG matching can exceed the ACL table
  • WCMP splitting can exceed the ECMP table

Hierarchical FG Matching

  • hierarchical TE splits are decoupled into two-stage hashing across the switches of the two-stage Clos supernode
  • this decoupling keeps traffic splits fine-grained, avoiding the ~6% throughput loss that coarser splits would cause
  • FG matching is partitioned into two hierarchical stages (Figure 8b in the paper; see the toy rule-count example below)
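The gain from partitioning is combinatorial: matching every (prefix, service class) pair in a single ACL stage needs one rule per pair, while two hierarchical stages need roughly one rule per prefix plus one per class. A toy illustration (numbers invented):

```python
# Toy rule-count comparison for flat vs. hierarchical FG matching.
prefixes, classes = 16_000, 6      # hypothetical numbers
flat = prefixes * classes          # one ACL rule per (prefix, class) pair
hierarchical = prefixes + classes  # LPM stage on prefixes + small ACL stage
print(flat, hierarchical)          # 96000 vs. 16006
```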

Two-stage Hashing

To overcome the limits on switch table size, traffic-splitting rules are decoupled across the two levels of the Clos fabric. The process involves the following steps:

1. **Traffic-splitting decoupling**: Edge switches determine the tunnel (TG split) and the destination site for incoming traffic (TSG split, part 1). The decision is propagated to backend switches by encapsulating packets with an IP address for the selected tunnel and a special source MAC address representing the self-/next-site target.

2. **Backend switch decision making**: Backend switches use the tunnel IP address and source MAC address to determine the peer supernode for forwarding (TSG split, part 2) and the egress edge switch with connectivity to the target supernode (SSG split).

3. **Efficient splitting rules**: To reduce the number of splitting rules on ingress switches, a link aggregation group (LAG) is configured on each edge switch toward the viable backend switches. A backend switch is viable if it is active and has active connections to all edge switches in the supernode.

By decoupling the splitting rules and propagating decisions in-band from edge to backend switches, B4 fits the required forwarding state into the available tables while keeping traffic splits fine-grained.
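A hedged sketch of this two-stage decision (the field names and the `pick` helper are invented for illustration; real switches hash packet headers against installed rules):

```python
import random

def pick(weighted):
    # Stand-in for switch hashing: choose proportionally to the split weights.
    choices, weights = zip(*weighted.items())
    return random.choices(choices, weights=weights)[0]

# Stage 1 (edge switch): choose the tunnel (TG split) and the self-/next-site
# target (TSG split, part 1); both choices travel to the backend switch in
# the encapsulated packet's tunnel IP and special source MAC.
def edge_switch(pkt, tg_split, site_target_split):
    pkt["tunnel_ip"] = pick(tg_split[pkt["flow_group"]])
    pkt["src_mac"] = pick(site_target_split)
    return pkt

# Stage 2 (backend switch): from the carried fields, pick the peer supernode
# (TSG split, part 2) and an egress edge switch that reaches it (SSG split).
def backend_switch(pkt, tsg_peer_split, ssg_split):
    peer = pick(tsg_peer_split[(pkt["tunnel_ip"], pkt["src_mac"])])
    egress = pick(ssg_split[peer])
    return peer, egress

pkt = edge_switch({"flow_group": "fg1"},
                  tg_split={"fg1": {"10.0.0.1": 3, "10.0.0.2": 1}},
                  site_target_split={"mac-self": 1, "mac-next": 1})
print(backend_switch(pkt,
                     tsg_peer_split={(pkt["tunnel_ip"], pkt["src_mac"]): {"A1": 5, "A2": 1}},
                     ssg_split={"A1": {"edge1": 1, "edge2": 1},
                                "A2": {"edge1": 1, "edge2": 1}}))
```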

Results: Availability Improvements

For many service classes:

  • an order-of-magnitude availability improvement

This comes from:

  • faster TE, so losses are repaired quickly (hierarchical TE + hierarchical FG matching + two-stage hashing + minimized table sizes)
  • use of sidelinks, so failures cause a smaller capacity drop

Reference:

  • Hong et al., “B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google’s Software-Defined WAN,” SIGCOMM 2018.
