Review: Jupiter Evolving: Transforming Google’s Datacenter
Network via Optical Circuit Switches and Software-Defined Networking
Abstract
1. Paper summary: What problem does the paper address (1–2 sentences or bullets)? What are the core novel ideas or technical contributions of the work (1–2 sentences or bullets)? What is the paper’s approach, what specific techniques/mechanisms does it use, and what are its main findings? (3–5 sentences)
The paper addresses two problems with the conventional Clos topology:
a. Spine blocks become the bottleneck when faster aggregation blocks are introduced (e.g., adding a 200G aggregation block to a 100G cluster), so the spine layer limits a heterogeneous fabric.
b. A Clos topology works well when traffic is unpredictable, but measurements show that traffic is predictable over long timescales (it follows the gravity model).
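To make point (a) concrete, here is a minimal arithmetic sketch of the spine bottleneck. It assumes that a link runs at the speed of its slower end; the function name and the uplink count of 512 are hypothetical and chosen only for illustration.

```python
def effective_uplink_gbps(block_speed_gbps, peer_speed_gbps, num_uplinks):
    """Aggregate uplink bandwidth of a block, assuming each link runs at
    the speed of its slower end (hypothetical helper, not from the paper)."""
    return min(block_speed_gbps, peer_speed_gbps) * num_uplinks


# A 200G aggregation block attached to 100G spine blocks: every uplink is
# limited to 100G, so half of the block's potential bandwidth is unused.
via_spine = effective_uplink_gbps(200, 100, num_uplinks=512)

# The same block linked directly to another 200G block keeps full speed.
direct = effective_uplink_gbps(200, 200, num_uplinks=512)

print(via_spine, direct)
```

Under these assumptions the block reaches only half of its potential bandwidth through the older spine, which is the limitation that removing the spine layer is meant to avoid.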
The core novel ideas or technical contributions of the work:
a. It uses optical circuit switches (OCS) to eliminate the spine blocks, which makes a heterogeneous fabric possible.
b. It exploits predictable traffic patterns, coupled with SDN, to perform traffic engineering.
Paper’s approach, specific techniques/mechanisms it uses, and its main findings:
a. It puts OCS into practical use as the interconnect between aggregation blocks.
b. It observes that traffic in Google’s data centers is predictable over long timescales, and uses this fact to optimize both the topology and the routing.
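As an illustration of the gravity model mentioned above, the sketch below estimates block-to-block demand from per-block totals only: the demand from block i to block j is taken to be proportional to the product of i's total egress and j's total ingress. This is the generic textbook form of the model; the function name and the sample numbers are assumptions, not values from the paper.

```python
def gravity_matrix(egress, ingress):
    """Estimate demand[i][j] = egress[i] * ingress[j] / total traffic.

    `egress` and `ingress` are per-block totals and must sum to the same
    value. Hypothetical helper for illustration only.
    """
    total = sum(egress)
    if total <= 0 or abs(total - sum(ingress)) > 1e-9:
        raise ValueError("egress and ingress must have the same positive total")
    return [[e * i / total for i in ingress] for e in egress]


# Three blocks with made-up traffic totals (arbitrary units).
demand = gravity_matrix(egress=[30, 20, 50], ingress=[50, 30, 20])
for row in demand:
    print(row)
```

Each row of the estimate sums back to that block's egress total, so the full matrix can be predicted from a handful of per-block aggregates. That is what makes long-timescale demand easy to forecast when the model holds.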
2. Strengths: 2–4 bulleted points (Explain in more detail in the detailed comments)
a. OCS enables a direct-connect architecture and a heterogeneous fabric, which bring higher throughput and less upgrade effort.
b. Unlike a crossbar topology, OCS can achieve N-to-N connectivity without blocking.
c. OCS lowers energy consumption because it only adjusts mirror angles to steer the light toward the output port.
d. The proposed “variable hedging” strategy mitigates the problem of traffic bursts in load balancing.
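The idea behind hedging in point (d) can be pictured with a simplified sketch. This is not the paper's formulation: it assumes a hypothetical `spread` parameter in (0, 1] that caps how much of one demand any single path may carry, and it fills paths greedily in preference order (direct path first).

```python
def hedged_split(demand, path_capacities, spread):
    """Split `demand` across paths given in preference order.

    Each path may carry at most demand * capacity / (total_capacity * spread).
    spread = 1 forces a capacity-proportional split over all paths; a smaller
    spread lets more traffic concentrate on the preferred paths.
    Simplified illustration with hypothetical names.
    """
    if not 0 < spread <= 1:
        raise ValueError("spread must be in (0, 1]")
    total_capacity = sum(path_capacities)
    remaining = demand
    allocation = []
    for capacity in path_capacities:
        limit = demand * capacity / (total_capacity * spread)
        share = min(remaining, limit)
        allocation.append(share)
        remaining -= share
    return allocation


# One direct path followed by three transit paths of equal capacity.
paths = [100, 100, 100, 100]
for s in (1.0, 0.5, 0.25):
    print(s, hedged_split(100, paths, s))
```

With a spread of 1 the demand is spread evenly, which tolerates bursts and mispredictions at the cost of longer paths. With a small spread almost everything rides the direct path, which is efficient only if the prediction is right. Tuning between these two extremes is the trade-off that hedging exposes.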
3. Weaknesses: 2–4 bulleted points (Explain in more detail in the detailed comments)
a. OCS is not affordable or available to every company. Without it, a direct-connect, non-blocking architecture is hard to achieve.
b. Traffic engineering and topology engineering rely on traffic being predictable. However, the fact that traffic in Google’s data centers follows the gravity model does not imply that the same holds for other companies.
c. System complexity: Combining a direct-connect topology with traffic engineering increases system complexity, because the topology is adjusted dynamically via SDN.
4. Detailed comments on the paper: Elaborate in a few sentences on each of the strengths and weaknesses.
This paper demonstrates that a data center network can be transformed from a uniform Clos topology into a direct-connect topology with a heterogeneous fabric. Moreover, it shows how the gravity model can be used to predict traffic and adapt the topology accordingly. In other words, it achieves dynamic topology reconfiguration.
While the contributions above bring a 30% increase in throughput and a 40% reduction in power consumption, there are still some limitations.
Firstly, OCS, the foundation of the paper’s design, is not within reach of every company. Without it, the proposed methodology is impossible. Secondly, traffic engineering and topology engineering are useful only when traffic is predictable, and we cannot guarantee that the model Google uses to predict its traffic applies to other data centers. Finally, while the combination of a direct-connect topology and traffic engineering brings better performance and lower power consumption, it also increases system complexity.
In conclusion, this paper has many merits. The use of OCS opens the door to a heterogeneous fabric and enables traffic and topology engineering, which bring better performance, lower power consumption, and better handling of traffic bursts in load balancing. On the other hand, OCS and the resulting system complexity are practical only for a few very large companies.