B4: Google’s Software-Defined WAN

ztex, Tony, Liu
6 min read · Mar 14, 2024


Background & Problems

  1. High cost: Provisioning WAN links at high capacity to protect against failures leads to unsustainable cost projections. (e.g., average utilization of only 30–40%, so 60–70% of the capacity is wasted)
  2. Inefficient resource utilization: Treating all applications equally results in suboptimal utilization of network capacity.
  3. Lack of control: Limited ability to enforce relative application priorities and control bursts at the network edge.
  4. Scalability issues: Difficulty in scaling among multiple dimensions while maintaining cost efficiency and control.

Google WAN Design

  • Utilizes Software Defined Networking (SDN) with OpenFlow for control.
  • Operates two distinct WANs: user-facing network (B2) and B4 network.
  • Supports thousands of applications categorized into different traffic classes.
  • Uses relatively simple switches built from merchant silicon and drives links to near full utilization.
  • Splits application flows among multiple paths to balance capacity against application priority/demands.

In short: centralized control + distinct traffic classes + demand- and priority-aware routing

Inter Data Center Transfer:

Three types of traffic, all from first-party applications:

  1. Replicating user data across different campuses (low priority, very large volume, elastic traffic that is flexible and can tolerate delay)
  2. Applications accessing data in other campuses
  3. State synchronization between distributed applications

Takeaway:

  1. Since all traffic comes from first-party applications, Google controls both the applications and the network
  2. Lower-priority traffic is large in volume and elastic
  3. The lower the volume, the higher the priority

B4 Approach:

Elements of Design:

  • WAN routers built from merchant silicon
  • Hierarchical SDN
  • Centralized Traffic Engineering

WAN Router:

Uses a two-stage Clos topology:

  • to build a 128x10G router
  • from 16x10G merchant silicon
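
As a back-of-envelope check (my own arithmetic, not taken from the paper), assuming the standard non-blocking split of half the chip ports facing outward and half facing the spine:

```python
# Back-of-envelope chip count for a non-blocking two-stage Clos built from
# 16x10G merchant-silicon chips (illustrative arithmetic only).
CHIP_PORTS = 16         # 10G ports per merchant-silicon chip
TARGET_PORTS = 128      # external 10G ports the router must expose

edge_external = CHIP_PORTS // 2              # 8 external ports per edge chip
edge_chips = TARGET_PORTS // edge_external   # 16 edge chips
uplinks = edge_chips * (CHIP_PORTS // 2)     # 128 uplinks toward the spine
spine_chips = uplinks // CHIP_PORTS          # 8 spine chips

print(edge_chips, spine_chips, edge_chips + spine_chips)   # 16 8 24
```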

The router runs an OpenFlow Agent:

  • the agent exposes a single-switch abstraction to the controller
  • but it must program the individual switch chips underneath
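
A toy sketch of what that abstraction might look like (the `Chip` class, method names, and rule format here are hypothetical, not B4's actual agent API): one logical flow-mod from the controller fans out to every underlying chip.

```python
# Hypothetical sketch of the single-switch abstraction: the agent accepts one
# logical flow rule and programs it on every merchant-silicon chip underneath.
class Chip:
    def __init__(self, name):
        self.name = name

    def program_rule(self, match, actions):
        print(f"{self.name}: install {match} -> {actions}")

class OpenFlowAgent:
    """Exposes one logical switch to the OpenFlow controller."""
    def __init__(self, chips):
        self.chips = chips

    def install_flow(self, match, actions):
        # One flow-mod becomes a rule on every chip, so the rule holds
        # no matter which chip a packet enters the Clos on.
        for chip in self.chips:
            chip.program_rule(match, actions)

agent = OpenFlowAgent([Chip(f"chip{i}") for i in range(24)])
agent.install_flow({"ipv4_dst": "10.0.0.0/24"}, ["output:ecmp"])
```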

B4 Architecture:

  • SDN architecture consists of three layers: switch hardware layer, site controller layer, and network control servers.
  • Switch hardware layer primarily forwards traffic without complex control software.
  • Site controller layer includes Network Control Servers (NCS) hosting OpenFlow controllers and Network Control Applications (NCAs).
  • OpenFlow controllers maintain network state and instruct switches on forwarding table entries based on changing network conditions.
  • Per-site infrastructure ensures fault tolerance for individual servers and control processes.

B4 Topology Sites & Architecture:

  • Paxos elects one of multiple available software replicas (placed on different physical servers) as the primary instance.
  • The global layer (an SDN Gateway + the TE server) enables central control of the entire network via the site-level NCAs
  • The Gateway abstracts the details of OpenFlow and the switch hardware away from the central TE server: it acts like an assistant, gathering and arranging information for the TE server so that it can make decisions
  • Each cluster contains a set of BGP routers that peer with the B4 switches at its WAN site (eBGP)
  • BGP because of its isolation properties between domains and operator familiarity with the protocol. The SDN-based B4 then had to support existing distributed routing protocols, both for interoperability with our non-SDN WAN implementation, and to enable a gradual rollout, fallback.
  • To scale, TE abstracts each site into a single node with a single edge of given capacity to each remote site (see the sketch after this list)
  • Quagga (open-source BGP/IS-IS routing software; later replaced by Google's in-house BGP speaker, Raven)
  • Orion (the later-generation SDN controller) runs on the on-site controllers
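
A minimal illustration of that site-level abstraction (site names and capacities are invented): parallel trunks between two sites collapse into a single edge whose capacity is their sum.

```python
from collections import defaultdict

# Illustrative sketch: collapse per-trunk capacities into single site-to-site
# edges, the abstraction the TE server operates on. Numbers are made up.
trunks = [
    ("A", "B", 80), ("A", "B", 80),   # two parallel trunks between A and B
    ("A", "C", 160),
    ("B", "C", 80), ("B", "C", 40),
]

site_edges = defaultdict(int)
for a, b, cap in trunks:
    site_edges[tuple(sorted((a, b)))] += cap

print(dict(site_edges))  # {('A', 'B'): 160, ('A', 'C'): 160, ('B', 'C'): 120}
```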

Hierarchical SDN:

Two levels of SDN control:

  1. local SDN control at each site
  2. Global centralized traffic engineering

Site-level Control

The OpenFlow controller (later Orion):

  • controls the WAN routers at its site

Two apps:

  • one for BGP/IS-IS routing (Quagga / RAP)
  • one for traffic engineering (the TE Agent)

Controller replicated on three servers:

  • Uses Paxos for leader election

The Role of BGP/IS-IS

  • Determine the shortest paths
  • Fallback
  • Sites: (i) speak iBGP and IS-IS with other sites, (ii) eBGP with their clusters

Integrating Decentralized Routing

  • Challenge: B4 must integrate OpenFlow-based switch control with existing routing protocols (BGP/IS-IS, provided by Quagga running on the NCS).
  • A Routing Application Proxy (RAP) connects Quagga with the OpenFlow switches for route updates and interface changes.
  • RAP translates routing information from the Quagga RIB into OpenFlow table entries for ECMP hashing and next-hop identification (a sketch follows this list).
  • BGP and IS-IS sessions run across B4 hardware ports, with RAP proxying routing-protocol packets between Quagga and the hardware switches.
  • TE is layered on top of the routing protocols: TE entries take priority in the switch forwarding tables, and the routing protocols provide a fault-recovery fallback.
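
A hedged sketch of the RAP idea (the data structures and field names are simplified stand-ins, not Quagga's RIB format or real OpenFlow messages): RIB entries with multiple equal-cost next hops become flow entries that ECMP-hash across those next hops.

```python
# Illustrative sketch of the Routing Application Proxy (RAP) role: translate
# RIB entries (prefix -> next hops) into flow-table entries that ECMP-split
# traffic across the next hops. All structures here are made-up stand-ins.
rib = {
    "10.0.1.0/24": ["portA", "portB"],   # two equal-cost next hops
    "10.0.2.0/24": ["portC"],
}

def rib_to_flow_entries(rib):
    flow_table = []
    for prefix, next_hops in rib.items():
        flow_table.append({
            "match": {"ipv4_dst": prefix},
            # ECMP group: hardware hashes each flow onto one next hop
            "action": {"ecmp_group": next_hops},
        })
    return flow_table

for entry in rib_to_flow_entries(rib):
    print(entry)
```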

Traffic Engineering (TE)

Input:

  1. Network topology
  2. Node-to-node demand

Output:

  1. routing of demand over multiple paths
  2. may not be shortest paths
  3. routing satisfies global objective (e.g., maximize utilization)

Topology is site-to-site topology:

  1. Topology abstraction: helps scale the TE algorithm
  2. Each site is a node

Input demand: flow group (traffic abstraction)

  1. traffic across all apps
  2. between two sites

Output: tunnel groups

  1. site-to-site paths
  2. to split flow group on

How B4 TE Works:

  • The Network Topology graph represents sites as vertices and site-to-site connectivity as edges. The SDN Gateway consolidates topology events from multiple sites and individual switches and forwards them to TE. TE aggregates trunks to compute site-to-site edges. This abstraction significantly reduces the size of the graph input to the TE optimization algorithm.
  • Flow Group (FG): For scalability, TE cannot operate at the granularity of individual applications. Therefore, we aggregate applications to a Flow Group defined as {source site, dest site, QoS} tuple.
  • A Tunnel (T) represents a site-level path in the network, e.g., a sequence of sites (A B C). B4 implements tunnels using IP in IP encapsulation.
  • A Tunnel Group (TG) maps FGs to a set of tunnels and corresponding weights. The weight specifies the fraction of FG traffic to be forwarded along each tunnel.
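
These three abstractions map naturally onto small data types; here is a minimal sketch (field names are mine, not B4's):

```python
from dataclasses import dataclass

# Minimal sketch of the TE abstractions described above; the field names are
# illustrative, not taken from the B4 implementation.

@dataclass(frozen=True)
class FlowGroup:
    src_site: str
    dst_site: str
    qos: str              # traffic class

@dataclass(frozen=True)
class Tunnel:
    sites: tuple          # site-level path, e.g. ("A", "B", "C")

@dataclass
class TunnelGroup:
    fg: FlowGroup
    weights: dict         # Tunnel -> fraction of the FG's traffic

tg = TunnelGroup(
    fg=FlowGroup("A", "C", "BE"),
    weights={Tunnel(("A", "C")): 0.75, Tunnel(("A", "B", "C")): 0.25},
)
print(tg.weights)
```

The weights in a TunnelGroup are what ultimately gets programmed: they tell each site what fraction of the FG's traffic to place on each tunnel.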

Bandwidth Enforcer (BwE): summarizes site-to-site demand and reports it to the TE optimization algorithm as Flow Groups

Input: Flow group

  1. from Bandwidth Enforcer (BwE)
  2. BwE talks to apps

Input: Site-to-site Topology

  1. the Gateway builds it from site-controller events

TE server runs TE algorithm

  1. sends Tunnel groups to TE Database Manager
  2. programs groups across sites

Accounting for Application Priority:

Lower priority applications:

  • can get less bandwidth than their demand
  • can be routed on longer paths

Helps achieve high utilization:

  • longer paths make use of less-utilized links
  • on failure, lower-priority groups can be allocated less than their demand

Problem:

  • a single flow group can contain applications with different priorities
  • the allocation must still reflect their relative importance (solved by bandwidth functions, below)

Bandwidth Functions

Each application is assigned a weight $w_i$:

  • bandwidth allocated to application
  • … proportional to weight

Modeled as a per-application bandwidth function:

  • of fair share

Bandwidth function for Flow Group:

  • piecewise-linear sum of the per-application functions
  • capped at total demand
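
A minimal sketch of how this composition could look (weights, demands, and fair-share values below are invented): each application's bandwidth grows with fair share in proportion to its weight and flattens at its demand, and the flow group's function is the sum, capped at total demand.

```python
# Illustrative bandwidth functions (all numbers are made up).
def app_bandwidth(fair_share, weight, demand):
    # Grows linearly with fair share, slope = weight, flat once demand is met.
    return min(weight * fair_share, demand)

def fg_bandwidth(fair_share, apps):
    # Flow-group function = piecewise-linear sum of the per-app functions,
    # capped at the total demand of the group.
    total_demand = sum(d for _, d in apps)
    return min(sum(app_bandwidth(fair_share, w, d) for w, d in apps),
               total_demand)

apps = [(10, 5.0), (1, 10.0)]   # (weight, demand in Gbps)
for fs in (0.1, 0.5, 2.0, 20.0):
    print(fs, fg_bandwidth(fs, apps))
```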

TE algorithm:

Goal:

  • allocate as much of each flow group's demand as possible
  • without exceeding link capacities
  • keep each group's traffic on the shortest paths possible
  • give each group its fair share on links where groups compete

TE Algorithm: Tunnel group ordering

A worked example of TE allocation with two FGs follows.

For each flow group

  • order its candidate tunnels by increasing length (shortest, most preferred path first)

FG1(A → B)

  • A → B
  • A → C → B
  • A → D → C → B

FG2(A → C)

  • A → C
  • A → D → C
  • Shortest path first
  • At each step, competing FGs receive the same fair share; once the preferred tunnel saturates, traffic spills over to the next tunnel (sketched below)
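
A simplified sketch of this greedy fill on the example above (link capacities, demands, and the fixed fair-share increment are invented; the real algorithm drives allocation with the bandwidth functions described earlier):

```python
# Simplified greedy allocation sketch for FG1 and FG2 above. All numbers are
# made up; this only illustrates "shortest tunnel first, then spill over".
capacity = {("A", "B"): 10, ("A", "C"): 10, ("B", "C"): 10,
            ("A", "D"): 10, ("C", "D"): 10}   # Gbps per site-to-site edge

def links(path):
    """Site-level edges traversed by a tunnel, as sorted pairs."""
    return [tuple(sorted(pair)) for pair in zip(path, path[1:])]

# Tunnels per flow group, ordered shortest-first (most preferred first).
fgs = {
    "FG1 A->B": {"demand": 15, "tunnels": [("A", "B"),
                                           ("A", "C", "B"),
                                           ("A", "D", "C", "B")]},
    "FG2 A->C": {"demand": 15, "tunnels": [("A", "C"),
                                           ("A", "D", "C")]},
}

alloc = {name: 0.0 for name in fgs}
residual = dict(capacity)
STEP = 0.5   # equal increment granted to each FG per round (the "fair share")

progress = True
while progress:
    progress = False
    for name, fg in fgs.items():
        if alloc[name] >= fg["demand"]:
            continue
        # Fill the most preferred tunnel that still has room on every link;
        # once it saturates, spill over to the next tunnel in the ordering.
        for path in fg["tunnels"]:
            if all(residual[l] >= STEP for l in links(path)):
                for l in links(path):
                    residual[l] -= STEP
                alloc[name] += STEP
                progress = True
                break

print(alloc)      # bandwidth granted to each FG
print(residual)   # leftover capacity per edge
```

With these made-up numbers, FG1 fills A → B and FG2 fills A → C first; once those direct links saturate, FG1 spills onto A → D → C → B and FG2 onto A → D → C, and they share the A–D and C–D links equally.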

The Evolution of B4:

  1. Increasing Network Scale
  2. Higher availability requirements
