B4: Google’s Software-Defined WAN
Background & Problems
- High cost: Provisioning WAN links at high capacity to protect against failures leads to unsustainable cost projections (e.g., links average only 30–40% utilization, so 60–70% of the provisioned capacity is wasted).
- Inefficient resource utilization: Treating all applications equally results in suboptimal utilization of network capacity.
- Lack of control: Limited ability to enforce relative application priorities and control bursts at the network edge.
- Scalability issues: Difficulty scaling along multiple dimensions while maintaining cost efficiency and control.
Google WAN Design
- Utilizes Software Defined Networking (SDN) with OpenFlow for control.
- Operates two distinct WANs: user-facing network (B2) and B4 network.
- Supports thousands of applications categorized into different traffic classes.
- Uses relatively simple switches built from merchant silicon and drives links to near full utilization.
- Splits application flows among multiple paths to balance capacity against application priority/demands.
In short: centralized control + distinct traffic classes + demand- and priority-aware routing (a small illustration of the path-splitting idea follows).
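The mechanism behind the last point is splitting a site-to-site aggregate across several paths according to weights. A minimal sketch of that idea, with invented path names and weights (B4 realizes this with tunnel groups and hardware ECMP, covered later):

```python
import hashlib

# Hypothetical site-to-site paths and split weights (illustrative only).
PATHS = [("A-B direct", 0.7), ("A-C-B", 0.2), ("A-D-C-B", 0.1)]

def pick_path(flow_key: str, paths=PATHS) -> str:
    """Deterministically map a flow to a path in proportion to the path weights."""
    # Hash the flow identifier into [0, 1) so all packets of a flow use one path.
    h = int(hashlib.md5(flow_key.encode()).hexdigest(), 16) / 16**32
    cumulative = 0.0
    for name, weight in paths:
        cumulative += weight
        if h < cumulative:
            return name
    return paths[-1][0]  # guard against floating-point rounding

if __name__ == "__main__":
    for i in range(5):
        flow = f"10.0.0.{i} -> 10.1.0.{i}"
        print(flow, "via", pick_path(flow))
```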
Inter Data Center Transfer:
Three types of traffic, all from first-party applications:
- User data replication across campuses (low priority, large volume; the traffic is elastic, i.e., flexible and delay-tolerant)
- Applications accessing data in other campuses
- State synchronization between distributed applications
Takeaway:
- Because all applications are first-party, Google controls both the applications and the network
- Lower-priority traffic is large in volume and elastic
- the lower the volume, the higher the priority
B4 Approach:
Elements of Design:
- WAN router using merchant-silicon
- Hierarchical SDN
- Centralized Traffic Engineering
WAN Router:
Use two-stage Clos:
- to build a 128-port 10G router
- from 16-port 10G merchant-silicon switch chips (see the port-count sketch after this block)
Router runs an OpenFlow Agent:
- the agent exposes a single-switch abstraction
- but must program each underlying chip individually
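The port-count sketch referenced above, assuming a non-blocking design in which each edge chip splits its ports evenly between external links and spine uplinks (the deployed chassis may differ in detail):

```python
def two_stage_clos(chip_ports: int = 16, external_ports: int = 128):
    """Count the chips needed for a non-blocking two-stage Clos router.

    Each edge chip dedicates half its ports to external links and half to
    uplinks toward the spine, so the edge stage is never oversubscribed.
    """
    external_per_edge = chip_ports // 2                       # 8 external ports per edge chip
    edge_chips = external_ports // external_per_edge          # 16 edge chips
    uplinks = edge_chips * (chip_ports - external_per_edge)   # 128 uplinks in total
    spine_chips = uplinks // chip_ports                       # 8 spine chips terminate them
    return edge_chips, spine_chips

if __name__ == "__main__":
    edge, spine = two_stage_clos()
    print(f"{edge} edge + {spine} spine = {edge + spine} chips for a 128x10G router")
```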
B4 Architecture:
- SDN architecture consists of three layers: switch hardware layer, site controller layer, and network control servers.
- Switch hardware layer primarily forwards traffic without complex control software.
- Site controller layer includes Network Control Servers (NCS) hosting OpenFlow controllers and Network Control Applications (NCAs).
- OpenFlow controllers maintain network state and instruct switches on forwarding table entries based on changing network conditions.
- Per-site infrastructure ensures fault tolerance for individual servers and control processes.
B4 Topology Sites & Architecture:
- Paxos elects one of multiple available software replicas (placed on different physical servers) as the primary instance.
- The global layer = an SDN Gateway + TE server: it enables central control of the entire network via the site-level NCAs
- The Gateway abstracts the details of OpenFlow and switch hardware away from the central TE server: like an assistant, it collects and arranges information so that the TE server can focus on making decisions
- Each cluster contains a set of BGP routers that peer with the B4 switches at each WAN site (eBGP)
- BGP was chosen for its isolation properties between domains and for operators' familiarity with the protocol. The SDN-based B4 therefore had to support existing distributed routing protocols, both for interoperability with the non-SDN WAN implementation and to enable a gradual rollout with fallback.
- To scale, the TE abstracts each site into a single node with a single edge of given capacity to each remote site
- Quagga (open-source BGP/IS-IS routing software; the BGP speaker was later replaced by Raven)
- Orion (the later SDN controller) runs on the on-site controllers
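A minimal sketch of the topology abstraction mentioned above, collapsing many parallel switch-to-switch links into a single site-to-site edge whose capacity is the sum of the underlying trunks (link names and capacities are invented):

```python
from collections import defaultdict

# Hypothetical per-switch WAN links: (site_a, switch_a, site_b, switch_b, gbps).
LINKS = [
    ("SiteA", "sw1", "SiteB", "sw1", 10),
    ("SiteA", "sw1", "SiteB", "sw2", 10),
    ("SiteA", "sw2", "SiteB", "sw1", 10),
    ("SiteA", "sw2", "SiteC", "sw3", 10),
]

def abstract_topology(links):
    """Collapse per-switch links into site-to-site edges with summed capacity."""
    edges = defaultdict(int)
    for site_a, _, site_b, _, gbps in links:
        key = tuple(sorted((site_a, site_b)))  # treat each edge as undirected
        edges[key] += gbps
    return dict(edges)

if __name__ == "__main__":
    for (a, b), capacity in abstract_topology(LINKS).items():
        print(f"{a} <-> {b}: {capacity} Gbps")
```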
Hierarchical SDN:
Two levels of SDN control:
- local SDN control at each site
- Global centralized traffic engineering
Site-level Control
OpenFlow controller (later Orion):
- controls the four routers at the site
Two apps:
- for BGP/IS-IS (Quagga/RAP)
- for traffic engineering (TE Agent)
Controller replicated on three servers:
- Uses Paxos for leader election
The Role of BGP/IS-IS
- Determine the shortest paths
- Fallback
- Sites: i) speak iBGP and IS-IS with other sites, ii) eBGP with the clusters
Integrating Decentralized Routing
- Challenge: B4 must integrate OpenFlow-based switch control with existing routing protocols (BGP/IS-IS, provided by Quagga running on the NCS).
- Uses a Routing Application Proxy (RAP) to connect Quagga with OF switches for route updates and interface changes.
- RAP translates routing information from Quagga RIB to OpenFlow tables for ECMP hashing and next-hop identification.
- BGP and ISIS sessions run across B4 hardware ports, with RAP proxying routing-protocol packets between Quagga and hardware switches.
- TE is layered on top of the routing protocols: TE entries are installed as higher-priority forwarding-table entries, while the protocol-computed shortest paths remain available as a fallback for fault recovery.
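A hedged sketch of the RIB-to-flow-table translation described above: a Quagga RIB entry (prefix plus a set of next hops) becomes a rule that hashes across an ECMP group of output ports. The data structures and mapping here are illustrative, not the actual RAP interface:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RibEntry:
    prefix: str            # e.g. "10.1.0.0/16"
    next_hops: List[str]   # neighbor routers learned via BGP/IS-IS

@dataclass
class FlowRule:
    match_prefix: str
    ecmp_group: int        # group of output ports the switch hashes across
    out_ports: List[int]

# Hypothetical mapping from next-hop router to local output port.
NEXT_HOP_PORT = {"rtr-b": 1, "rtr-c": 2, "rtr-d": 3}

def rib_to_flow_rule(entry: RibEntry, group_id: int) -> FlowRule:
    """Translate one RIB entry into an ECMP flow rule (illustrative only)."""
    ports = sorted(NEXT_HOP_PORT[nh] for nh in entry.next_hops)
    return FlowRule(match_prefix=entry.prefix, ecmp_group=group_id, out_ports=ports)

if __name__ == "__main__":
    rib = RibEntry("10.1.0.0/16", ["rtr-b", "rtr-c"])
    print(rib_to_flow_rule(rib, group_id=7))
```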
Traffic Engineering (TE)
Input:
- Network topology
- Node-to-node demand
Output:
- routing of demand over multiple paths
- may not be shortest paths
- routing satisfies global objective (e.g., maximize utilization)
Topology is site-to-site topology:
- Topology abstraction: helps scale the TE algorithm
- Each site is a node
Input demand: flow groups (traffic abstraction)
- traffic across all apps
- between two sites
Output: tunnel groups
- site-to-site paths
- to split flow group on
How B4 TE Works:
- The Network Topology graph represents sites as vertices and site-to-site connectivity as edges. The SDN Gateway consolidates topology events from multiple sites and individual switches and forwards them to TE. TE aggregates trunks to compute site-to-site edges. This abstraction significantly reduces the size of the graph given as input to the TE Optimization Algorithm.
- Flow Group (FG): For scalability, TE cannot operate at the granularity of individual applications, so applications are aggregated into Flow Groups, each defined as a {source site, dest site, QoS} tuple.
- A Tunnel (T) represents a site-level path in the network, e.g., a sequence of sites (A → B → C). B4 implements tunnels using IP in IP encapsulation.
- A Tunnel Group (TG) maps FGs to a set of tunnels and corresponding weights. The weight specifies the fraction of FG traffic to be forwarded along each tunnel.
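These abstractions map naturally onto a few small records; a minimal sketch with illustrative field names (not B4's actual schema):

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class FlowGroup:
    src_site: str
    dst_site: str
    qos_class: str            # all apps sharing {src, dst, QoS} aggregate here

@dataclass(frozen=True)
class Tunnel:
    sites: Tuple[str, ...]    # site-level path, e.g. ("A", "B", "C"), IP-in-IP encapsulated

@dataclass
class TunnelGroup:
    splits: Dict[Tunnel, float]   # tunnel -> fraction of the FG's traffic

# Example: FG(A -> C, low priority) split 0.75 : 0.25 across two tunnels.
fg = FlowGroup("A", "C", "lo")
tg = TunnelGroup({Tunnel(("A", "C")): 0.75, Tunnel(("A", "D", "C")): 0.25})
assert abs(sum(tg.splits.values()) - 1.0) < 1e-9
```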
Bandwidth Enforcer (BwE): aggregates application demands into site-to-site Flow Groups and reports them to the TE Optimization Algorithm
Input: Flow group
- from Bandwidth Enforcer (BwE)
- BwE talks to apps
Input: Site-to-site Topology
- Gateway builds using site controller events
TE server runs TE algorithm
- sends Tunnel groups to TE Database Manager
- programs groups across sites
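When the computed tunnel groups are programmed across sites, the updates presumably have to be sequenced so that no site ever steers traffic onto a tunnel that is not yet installed along its whole path. A hedged sketch of that ordering idea, with tunnels represented as tuples of site names and an invented `send` stand-in for the real site-controller RPC:

```python
def program_sites(new_tunnels, site_tunnel_groups, retired_tunnels, send):
    """Sequence TE updates so no site forwards onto a missing tunnel (sketch)."""
    # 1. Install every new tunnel at every site it traverses.
    for tunnel in new_tunnels:
        for site in tunnel:
            send(site, "add_tunnel", tunnel)
    # 2. Only then repoint flow groups at the new tunnels.
    for site, tunnel_group in site_tunnel_groups.items():
        send(site, "set_tunnel_group", tunnel_group)
    # 3. Finally remove tunnels that nothing references anymore.
    for tunnel in retired_tunnels:
        for site in tunnel:
            send(site, "remove_tunnel", tunnel)

if __name__ == "__main__":
    log = lambda site, op, item: print(f"{site}: {op} {item}")
    program_sites(
        new_tunnels=[("A", "D", "C")],
        site_tunnel_groups={"A": {("A", "C"): 0.75, ("A", "D", "C"): 0.25}},
        retired_tunnels=[("A", "B", "C")],
        send=log,
    )
```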
Accounting for Application Priority:
Lower priority applications:
- can get less bandwidth than their demand
- can be routed on longer paths
Helps achieve high utilization:
- longer paths use less utilized links
- on failure, lower-priority groups receive an allocation smaller than their demand
Problem:
- applications with different priorities
- … in a flow group
Bandwidth Functions
Each application is assigned a weight $w_i$:
- bandwidth allocated to the application
- … is proportional to its weight
Modeled as a per-application bandwidth function:
- of the fair share
Bandwidth function for a Flow Group:
- piecewise-linear sum of the per-application functions
- capped at total demand
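A small sketch of how such functions could compose, assuming the simple weighted model above: each application's allocation grows linearly with the fair share (slope = its weight) until its demand is met, and the flow group's function is the sum, capped at total demand. The weights and demands are invented:

```python
def app_bandwidth(fair_share: float, weight: float, demand: float) -> float:
    """Per-application bandwidth function: weight * fair_share, capped at demand."""
    return min(weight * fair_share, demand)

def fg_bandwidth(fair_share: float, apps) -> float:
    """Flow-group bandwidth function: piecewise-linear sum of the app functions."""
    total_demand = sum(demand for _, demand in apps)
    allocated = sum(app_bandwidth(fair_share, w, d) for w, d in apps)
    return min(allocated, total_demand)   # capped at the FG's total demand

if __name__ == "__main__":
    apps = [(10.0, 5.0), (1.0, 20.0)]     # (weight, demand in Gbps), illustrative
    for share in (0.1, 0.5, 2.0, 30.0):
        print(f"fair share {share:>5}: FG gets {fg_bandwidth(share, apps):.1f} Gbps")
```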
TE algorithm:
Goal:
- allocate as much of each flow group's demand as possible
- without exceeding link capacity
- keep each group's traffic on the shortest paths possible
- give each group its fair share on every contended link
TE Algorithm: Tunnel group ordering
For each flow group:
- order its candidate tunnels by increasing length (shortest first)
FG1(A → B)
- A → B
- A → C → B
- A → D → C → B
FG2(A → C)
- A → C
- A → D → C
- Shorter paths are filled first
- At each step, allocate until competing flow groups reach the same fair share or the tunnel saturates, then move to the next-shortest tunnel
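A toy sketch of this greedy progression, assuming unit weights and ignoring bandwidth functions: flow groups are granted equal small bandwidth quanta in round-robin order, each on its shortest tunnel that still has spare capacity on every link. The topology, demands, and quantum are invented:

```python
# Hypothetical site-to-site link capacities (Gbps); keys are sorted site pairs.
CAPACITY = {("A", "B"): 10, ("A", "C"): 10, ("A", "D"): 10,
            ("B", "C"): 10, ("C", "D"): 10}

# Flow groups: demand (Gbps) and candidate tunnels ordered shortest first.
FLOW_GROUPS = {
    "FG1 A->B": (15, [("A", "B"), ("A", "C", "B"), ("A", "D", "C", "B")]),
    "FG2 A->C": (15, [("A", "C"), ("A", "D", "C")]),
}

def links(tunnel):
    """Undirected links traversed by a site-level tunnel."""
    return [tuple(sorted(pair)) for pair in zip(tunnel, tunnel[1:])]

def allocate(quantum=0.5):
    """Round-robin equal quanta: shortest usable tunnel first for every FG."""
    residual = dict(CAPACITY)
    granted = {fg: 0.0 for fg in FLOW_GROUPS}
    progress = True
    while progress:
        progress = False
        for fg, (demand, tunnels) in FLOW_GROUPS.items():
            if granted[fg] >= demand:
                continue
            # Pick the shortest tunnel whose every link can carry one more quantum.
            for tunnel in tunnels:
                if all(residual[l] >= quantum for l in links(tunnel)):
                    for l in links(tunnel):
                        residual[l] -= quantum
                    granted[fg] += quantum
                    progress = True
                    break
    return granted

if __name__ == "__main__":
    for fg, bandwidth in allocate().items():
        print(f"{fg}: {bandwidth:.1f} Gbps")
```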
The Evolution of B4:
- Increasing Network Scale
- Higher availability requirements
Reference:
- Jain et al., "B4: Experience with a Globally-Deployed Software Defined WAN," SIGCOMM 2013
- 解析Google基于SDN的B4网络 (An analysis of Google's SDN-based B4 network)