B4: Google’s Software-Defined WAN

ztex, Tony, Liu
6 min read · Mar 14, 2024


Background & Problems

  1. High cost: Provisioning WAN links at high capacity to protect against failures leads to unsustainable cost projections. (e.g., average utilization of only 30–40%, so 60–70% of the capacity is wasted)
  2. Inefficient resource utilization: Treating all applications equally results in suboptimal utilization of network capacity.
  3. Lack of control: Limited ability to enforce relative application priorities and control bursts at the network edge.
  4. Scalability issues: Difficulty in scaling among multiple dimensions while maintaining cost efficiency and control.

Google WAN Design

  • Utilizes Software Defined Networking (SDN) with OpenFlow for control.
  • Operates two distinct WANs: user-facing network (B2) and B4 network.
  • Supports thousands of applications categorized into different traffic classes.
  • Uses relatively simple switches built from merchant silicon and drives links to near full utilization.
  • Splits application flows among multiple paths to balance capacity against application priority/demands.

In short: centralized control + distinct traffic classes + demand- and priority-aware routing

Inter Data Center Transfer:

Three types of traffic, all from first-party applications:

  1. Replicating user data across different campuses (low priority, very large volume, elastic traffic that is flexible and can tolerate delay)
  2. Applications accessing data in other campuses
  3. State synchronization between distributed applications

Takeaway:

  1. Since all traffic comes from first-party applications, Google controls both the applications and the network
  2. Lower-priority traffic is large in volume and elastic
  3. The lower the volume, the higher the priority

B4 Approach:

Elements of Design:

  • WAN routers built from merchant silicon
  • Hierarchical SDN
  • Centralized Traffic Engineering

WAN Router:

Uses a two-stage Clos topology:

  • to build a 128x10G router
  • from 16x10G merchant silicon
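
As a back-of-envelope check (my own arithmetic, not taken from the paper), assuming the standard non-blocking split of half the chip ports facing outward and half facing the spine:

```python
# Back-of-envelope chip count for a non-blocking two-stage Clos built from
# 16x10G merchant-silicon chips (illustrative arithmetic only).
CHIP_PORTS = 16         # 10G ports per merchant-silicon chip
TARGET_PORTS = 128      # external 10G ports the router must expose

edge_external = CHIP_PORTS // 2              # 8 external ports per edge chip
edge_chips = TARGET_PORTS // edge_external   # 16 edge chips
uplinks = edge_chips * (CHIP_PORTS // 2)     # 128 uplinks toward the spine
spine_chips = uplinks // CHIP_PORTS          # 8 spine chips

print(edge_chips, spine_chips, edge_chips + spine_chips)   # 16 8 24
```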

The router runs an OpenFlow Agent:

  • the agent exposes a single-switch abstraction to the controller
  • but it must program the individual switch chips underneath
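
A toy sketch of what that abstraction might look like (the `Chip` class, method names, and rule format here are hypothetical, not B4's actual agent API): one logical flow-mod from the controller fans out to every underlying chip.

```python
# Hypothetical sketch of the single-switch abstraction: the agent accepts one
# logical flow rule and programs it on every merchant-silicon chip underneath.
class Chip:
    def __init__(self, name):
        self.name = name

    def program_rule(self, match, actions):
        print(f"{self.name}: install {match} -> {actions}")

class OpenFlowAgent:
    """Exposes one logical switch to the OpenFlow controller."""
    def __init__(self, chips):
        self.chips = chips

    def install_flow(self, match, actions):
        # One flow-mod becomes a rule on every chip, so the rule holds
        # no matter which chip a packet enters the Clos on.
        for chip in self.chips:
            chip.program_rule(match, actions)

agent = OpenFlowAgent([Chip(f"chip{i}") for i in range(24)])
agent.install_flow({"ipv4_dst": "10.0.0.0/24"}, ["output:ecmp"])
```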

B4 Architecture:

  • SDN architecture consists of three layers: switch hardware layer, site controller layer, and network control servers.
  • Switch hardware layer primarily forwards traffic without complex control software.
  • Site controller layer includes Network Control Servers (NCS) hosting OpenFlow controllers and Network Control Applications (NCAs).
  • OpenFlow controllers maintain network state and instruct switches on forwarding table entries based on changing network conditions.
  • Per-site infrastructure ensures fault tolerance for individual servers and control processes.

B4 Topology Sites & Architecture:

  • Paxos elects one of multiple available software replicas (placed on different physical servers) as the primary instance.
  • The global layer (an SDN Gateway + the TE server) enables central control of the entire network via the site-level NCAs
  • The Gateway abstracts the details of OpenFlow and the switch hardware away from the central TE server: it acts like an assistant, gathering and arranging information for the TE server so that it can make decisions
  • Each cluster contains a set of BGP routers that peer with the B4 switches at its WAN site (eBGP)
  • BGP because of its isolation properties between domains and operator familiarity with the protocol. The SDN-based B4 then had to support existing distributed routing protocols, both for interoperability with our non-SDN WAN implementation, and to enable a gradual rollout, fallback.
  • To scale, TE abstracts each site into a single node with a single edge of given capacity to each remote site (see the sketch after this list)
  • Quagga (open-source BGP/IS-IS routing software; later replaced by Google's in-house BGP speaker, Raven)
  • Orion (the later-generation SDN controller) runs on the on-site controllers
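
A minimal illustration of that site-level abstraction (site names and capacities are invented): parallel trunks between two sites collapse into a single edge whose capacity is their sum.

```python
from collections import defaultdict

# Illustrative sketch: collapse per-trunk capacities into single site-to-site
# edges, the abstraction the TE server operates on. Numbers are made up.
trunks = [
    ("A", "B", 80), ("A", "B", 80),   # two parallel trunks between A and B
    ("A", "C", 160),
    ("B", "C", 80), ("B", "C", 40),
]

site_edges = defaultdict(int)
for a, b, cap in trunks:
    site_edges[tuple(sorted((a, b)))] += cap

print(dict(site_edges))  # {('A', 'B'): 160, ('A', 'C'): 160, ('B', 'C'): 120}
```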

Hierarchical SDN:

Two levels of SDN control:

  1. local SDN control at each site
  2. Global centralized traffic engineering

Site-level Control

The OpenFlow controller (later Orion):

  • controls the WAN routers at its site

Two apps:

  • one for BGP/IS-IS routing (Quagga / RAP)
  • one for traffic engineering (the TE Agent)

Controller replicated on three servers:

  • Uses Paxos for leader election

The Role of BGP/IS-IS

  • Determine the shortest paths
  • Fallback
  • Sites: (i) speak iBGP and IS-IS with other sites, (ii) eBGP with their clusters

Integrating Decentralized Routing

  • Challenge: B4 must integrate OpenFlow-based switch control with existing routing protocols (BGP/IS-IS, provided by Quagga running on the NCS).
  • A Routing Application Proxy (RAP) connects Quagga with the OpenFlow switches for route updates and interface changes.
  • RAP translates routing information from the Quagga RIB into OpenFlow table entries for ECMP hashing and next-hop identification (a sketch follows this list).
  • BGP and IS-IS sessions run across B4 hardware ports, with RAP proxying routing-protocol packets between Quagga and the hardware switches.
  • TE is layered on top of the routing protocols: TE entries take priority in the switch forwarding tables, and the routing protocols provide a fault-recovery fallback.
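
A hedged sketch of the RAP idea (the data structures and field names are simplified stand-ins, not Quagga's RIB format or real OpenFlow messages): RIB entries with multiple equal-cost next hops become flow entries that ECMP-hash across those next hops.

```python
# Illustrative sketch of the Routing Application Proxy (RAP) role: translate
# RIB entries (prefix -> next hops) into flow-table entries that ECMP-split
# traffic across the next hops. All structures here are made-up stand-ins.
rib = {
    "10.0.1.0/24": ["portA", "portB"],   # two equal-cost next hops
    "10.0.2.0/24": ["portC"],
}

def rib_to_flow_entries(rib):
    flow_table = []
    for prefix, next_hops in rib.items():
        flow_table.append({
            "match": {"ipv4_dst": prefix},
            # ECMP group: hardware hashes each flow onto one next hop
            "action": {"ecmp_group": next_hops},
        })
    return flow_table

for entry in rib_to_flow_entries(rib):
    print(entry)
```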

Traffic Engineering (TE)

Input:

  1. Network topology
  2. Node-to-node demand

Output:

  1. routing of demand over multiple paths
  2. may not be shortest paths
  3. routing satisfies global objective (e.g., maximize utilization)

Topology is site-to-site topology:

  1. Topology abstraction: helps scale the TE algorithm
  2. Each site is a node

Input demand: flow group (traffic abstraction)

  1. traffic across all apps
  2. between two sites

Output: tunnel groups

  1. site-to-site paths
  2. to split flow group on

How B4 TE Works:

  • The Network Topology graph represents sites as vertices and site-to-site connectivity as edges. The SDN Gateway consolidates topology events from multiple sites and individual switches and forwards them to TE. TE aggregates trunks to compute site-to-site edges. This abstraction significantly reduces the size of the graph input to the TE optimization algorithm.
  • Flow Group (FG): For scalability, TE cannot operate at the granularity of individual applications. Therefore, we aggregate applications to a Flow Group defined as {source site, dest site, QoS} tuple.
  • A Tunnel (T) represents a site-level path in the network, e.g., a sequence of sites (A B C). B4 implements tunnels using IP in IP encapsulation.
  • A Tunnel Group (TG) maps FGs to a set of tunnels and corresponding weights. The weight specifies the fraction of FG traffic to be forwarded along each tunnel.
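
These three abstractions map naturally onto small data types; here is a minimal sketch (field names are mine, not B4's):

```python
from dataclasses import dataclass

# Minimal sketch of the TE abstractions described above; the field names are
# illustrative, not taken from the B4 implementation.

@dataclass(frozen=True)
class FlowGroup:
    src_site: str
    dst_site: str
    qos: str              # traffic class

@dataclass(frozen=True)
class Tunnel:
    sites: tuple          # site-level path, e.g. ("A", "B", "C")

@dataclass
class TunnelGroup:
    fg: FlowGroup
    weights: dict         # Tunnel -> fraction of the FG's traffic

tg = TunnelGroup(
    fg=FlowGroup("A", "C", "BE"),
    weights={Tunnel(("A", "C")): 0.75, Tunnel(("A", "B", "C")): 0.25},
)
print(tg.weights)
```

The weights in a TunnelGroup are what ultimately gets programmed: they tell each site what fraction of the FG's traffic to place on each tunnel.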

Bandwidth Enforcer (BwE): summarizes site-to-site demand and reports it to the TE optimization algorithm as Flow Groups

Input: Flow group

  1. from Bandwidth Enforcer (BwE)
  2. BwE talks to apps

Input: Site-to-site Topology

  1. the Gateway builds it from site-controller events

TE server runs TE algorithm

  1. sends Tunnel groups to TE Database Manager
  2. programs groups across sites

Accounting for Application Priority:

Lower priority applications:

  • can get less bandwidth than their demand
  • can be routed on longer paths

Helps achieve high utilization:

  • longer paths make use of less-utilized links
  • on failure, lower-priority groups can be allocated less than their demand

Problem:

  • a single flow group can contain applications with different priorities
  • the allocation must still reflect their relative importance (solved by bandwidth functions, below)

Bandwidth Functions

Each application is assigned a weight $w_i$:

  • bandwidth allocated to application
  • … proportional to weight

Modeled as a per-application bandwidth function:

  • of fair share

Bandwidth function for Flow Group:

  • piecewise-linear sum of the per-application functions
  • capped at total demand
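
A minimal sketch of how this composition could look (weights, demands, and fair-share values below are invented): each application's bandwidth grows with fair share in proportion to its weight and flattens at its demand, and the flow group's function is the sum, capped at total demand.

```python
# Illustrative bandwidth functions (all numbers are made up).
def app_bandwidth(fair_share, weight, demand):
    # Grows linearly with fair share, slope = weight, flat once demand is met.
    return min(weight * fair_share, demand)

def fg_bandwidth(fair_share, apps):
    # Flow-group function = piecewise-linear sum of the per-app functions,
    # capped at the total demand of the group.
    total_demand = sum(d for _, d in apps)
    return min(sum(app_bandwidth(fair_share, w, d) for w, d in apps),
               total_demand)

apps = [(10, 5.0), (1, 10.0)]   # (weight, demand in Gbps)
for fs in (0.1, 0.5, 2.0, 20.0):
    print(fs, fg_bandwidth(fs, apps))
```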

TE algorithm:

Goal:

  • allocate as much of each flow group's demand as possible
  • without exceeding link capacities
  • keep each group's traffic on the shortest paths possible
  • give each group its fair share on links where groups compete

TE Algorithm: Tunnel group ordering

A worked example of TE allocation with two FGs follows.

For each flow group

  • order its candidate tunnels by increasing length (shortest, most preferred path first)

FG1(A → B)

  • A → B
  • A → C → B
  • A → D → C → B

FG2(A → C)

  • A → C
  • A → D → C
  • Shortest path first
  • At each step, competing FGs receive the same fair share; once the preferred tunnel saturates, traffic spills over to the next tunnel (sketched below)
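
A simplified sketch of this greedy fill on the example above (link capacities, demands, and the fixed fair-share increment are invented; the real algorithm drives allocation with the bandwidth functions described earlier):

```python
# Simplified greedy allocation sketch for FG1 and FG2 above. All numbers are
# made up; this only illustrates "shortest tunnel first, then spill over".
capacity = {("A", "B"): 10, ("A", "C"): 10, ("B", "C"): 10,
            ("A", "D"): 10, ("C", "D"): 10}   # Gbps per site-to-site edge

def links(path):
    """Site-level edges traversed by a tunnel, as sorted pairs."""
    return [tuple(sorted(pair)) for pair in zip(path, path[1:])]

# Tunnels per flow group, ordered shortest-first (most preferred first).
fgs = {
    "FG1 A->B": {"demand": 15, "tunnels": [("A", "B"),
                                           ("A", "C", "B"),
                                           ("A", "D", "C", "B")]},
    "FG2 A->C": {"demand": 15, "tunnels": [("A", "C"),
                                           ("A", "D", "C")]},
}

alloc = {name: 0.0 for name in fgs}
residual = dict(capacity)
STEP = 0.5   # equal increment granted to each FG per round (the "fair share")

progress = True
while progress:
    progress = False
    for name, fg in fgs.items():
        if alloc[name] >= fg["demand"]:
            continue
        # Fill the most preferred tunnel that still has room on every link;
        # once it saturates, spill over to the next tunnel in the ordering.
        for path in fg["tunnels"]:
            if all(residual[l] >= STEP for l in links(path)):
                for l in links(path):
                    residual[l] -= STEP
                alloc[name] += STEP
                progress = True
                break

print(alloc)      # bandwidth granted to each FG
print(residual)   # leftover capacity per edge
```

With these made-up numbers, FG1 fills A → B and FG2 fills A → C first; once those direct links saturate, FG1 spills onto A → D → C → B and FG2 onto A → D → C, and they share the A–D and C–D links equally.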

The Evolution of B4:

  1. Increasing Network Scale
  2. Higher availability requirements
