Facebook data centers receive billions of user requests each day, and that number is increasing as the company adds members and introduces new features. That growth is good for Facebook in general, but it is a challenge for the company’s networking crew. For instance, a data center topology that was sufficient five months ago is now inadequate.

So besides building mammoth data centers like the one in Altoona, Iowa, Facebook engineers are constantly optimizing the design of a data center’s network. That said, optimizing might not be the right term for what the engineers came up with and implemented in the Altoona facility. It’s more like they rewrote the networking design guide.

The old Facebook network

Before Altoona, Facebook engineers organized a data center’s server racks into clusters similar to the one shown in Figure A. In real life, instead of three racks, there would be hundreds. The schematic also shows each rack’s Top of Rack (TOR) switch, which acts as an intermediary between the servers and an upstream aggregation switch.

Figure A

This arrangement works, but presents Facebook engineers with several challenges. “First, the size of a cluster is limited by the port density of the cluster switch. To build the biggest clusters we needed the biggest networking devices, and those devices are available only from a limited set of vendors. Additionally, the need for so many ports in a box is orthogonal to the desire to provide the highest bandwidth infrastructure possible,” explains Alexey Andreyev, Facebook network engineer. “Even more difficult is maintaining an optimal long-term balance between cluster size, rack bandwidth, and bandwidth out of the cluster.”

Fabric: the new network topology

With those billions of requests each day as incentive, the engineers decided to eliminate the complicated, bandwidth-robbing, top-down network hierarchy and replace it with a new design called Fabric. The slide in Figure B depicts the new server-rack grouping called a pod. A single pod consists of 48 racks and TOR switches that are meshed to four fabric switches. Andreyev mentions, “Each TOR currently has 4 x 40G uplinks, providing 160G total bandwidth capacity for a rack of 10G-connected servers.”

Figure B

This approach has the following advantages:

  • Ease of deployment of a 48-node pod
  • Scalability is simplified and unlimited
  • Each pod is identical with equal connectivity
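The pod figures quoted above lend themselves to a quick arithmetic check. The following is a minimal sketch, not Facebook's tooling: the constants come from the article, while the function names, the `servers_per_rack` parameter, and the reading that each TOR sends one uplink to each of the four fabric switches are assumptions made for illustration.

```python
# Figures quoted in the article.
RACKS_PER_POD = 48
FABRIC_SWITCHES_PER_POD = 4
UPLINKS_PER_TOR = 4
UPLINK_GBPS = 40
SERVER_PORT_GBPS = 10


def rack_uplink_gbps() -> int:
    """Total uplink capacity of one rack's TOR switch, in Gbps."""
    return UPLINKS_PER_TOR * UPLINK_GBPS


def oversubscription(servers_per_rack: int) -> float:
    """Ratio of server-facing bandwidth to uplink bandwidth for one rack.

    servers_per_rack is a hypothetical input; the article does not state it.
    A ratio of 1.0 or less means the rack is not oversubscribed.
    """
    return servers_per_rack * SERVER_PORT_GBPS / rack_uplink_gbps()


def tor_to_fabric_links_per_pod() -> int:
    """Links inside one pod, assuming one uplink from each TOR to each fabric switch."""
    return RACKS_PER_POD * FABRIC_SWITCHES_PER_POD


if __name__ == "__main__":
    print("Uplink capacity per rack (Gbps):", rack_uplink_gbps())
    print("Oversubscription with 16 servers:", oversubscription(16))
    print("TOR-to-fabric links per pod:", tor_to_fabric_links_per_pod())
```

Under these assumptions, a rack stays non-oversubscribed as long as it holds no more than 16 servers on 10G ports, because 16 x 10G equals the 160G of uplink capacity Andreyev cites.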

The next step is to connect all the fabric switches, and the slide in Figure C depicts how that is accomplished. Andreyev says this is simpler than the earlier design (it is hard to imagine what that used to be like).

Figure C

Facebook engineers stayed with the 48-node theme when adding the spine switches. Andreyev explains, “To implement building-wide connectivity, we created four independent ‘planes’ of spine switches, each scalable up to 48 independent devices within a plane. Each fabric switch of each pod connects to each spine switch within its local plane.”

What Andreyev mentions next is mind-boggling: “Together, pods and planes form a modular network topology capable of accommodating hundreds of thousands of 10G-connected servers, scaling to multi-petabit bisection bandwidth, and covering our data-center buildings with non-oversubscribed rack-to-rack performance.”
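To make the pod-and-plane wiring rule concrete, here is a small sketch that enumerates the fabric-to-spine links it implies. The four planes and the limit of 48 spine switches per plane come from Andreyev's description; the pod count is a hypothetical input, and the sketch assumes that each pod's four fabric switches map one-to-one onto the four planes, which the article implies but does not state outright.

```python
from itertools import product

PLANES = 4                 # from the article
MAX_SPINES_PER_PLANE = 48  # from the article ("scalable up to 48")


def build_fabric(num_pods: int, spines_per_plane: int = MAX_SPINES_PER_PLANE) -> set:
    """Return the set of (fabric_switch, spine_switch) links.

    Assumption: fabric switch number p of every pod belongs to plane p and
    connects to every spine switch in that plane, and to no other plane.
    """
    if num_pods < 1:
        raise ValueError("num_pods must be at least 1")
    if not 1 <= spines_per_plane <= MAX_SPINES_PER_PLANE:
        raise ValueError("spines_per_plane must be between 1 and 48")

    links = set()
    for pod, plane, spine in product(
        range(num_pods), range(PLANES), range(spines_per_plane)
    ):
        fabric_switch = ("fabric", pod, plane)
        spine_switch = ("spine", plane, spine)
        links.add((fabric_switch, spine_switch))
    return links


def spine_paths_between_pods(spines_per_plane: int = MAX_SPINES_PER_PLANE) -> int:
    """Distinct spine switches that can carry traffic between any two pods."""
    return PLANES * spines_per_plane


if __name__ == "__main__":
    links = build_fabric(num_pods=2)  # hypothetical two-pod building
    print("Fabric-to-spine links:", len(links))
    print("Spine paths between two pods:", spine_paths_between_pods())
```

Because every pod is wired the same way, adding a pod only adds links to the existing spine switches, which is the modularity the quote describes.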

Network operations

The Fabric network design standardizes on “layer 3” from the TOR switches to the network’s edge, supports IPv4 and IPv6, and uses Equal-Cost Multi-Path (ECMP) routing. “To prevent occasional ‘elephant flows’ from taking over and degrading an end-to-end path, we’ve made the network multi-speed — with 40G links between all switches, while connecting the servers on 10G ports on the TORs,” adds Andreyev. “We also have server-side means to ‘hash away’ and route around trouble spots if they occur.”
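For readers unfamiliar with ECMP, the idea can be sketched in a few lines: a switch hashes the fields that identify a flow and uses the result to pick one of several equal-cost next hops, so packets of the same flow stay on one path while different flows spread out. The sketch below is illustrative only. The `Flow` fields, the SHA-256 hash, and the uplink names are assumptions, and real switches use their own hardware hash functions, which the article does not describe.

```python
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    """The classic 5-tuple that identifies a flow (IPv4 or IPv6 addresses as text)."""
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


def ecmp_next_hop(flow: Flow, next_hops: Sequence[str]) -> str:
    """Pick one of the equal-cost next hops by hashing the flow's 5-tuple."""
    if not next_hops:
        raise ValueError("at least one next hop is required")
    key = "|".join(str(field) for field in flow).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer and map it onto a hop.
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    # Hypothetical names for a TOR's four 40G uplinks.
    uplinks = ["fabric-1", "fabric-2", "fabric-3", "fabric-4"]
    flow = Flow("10.0.0.5", "10.0.8.9", 6, 49152, 443)
    print(ecmp_next_hop(flow, uplinks))
```

Since a single flow always hashes to the same path, one very large flow cannot be split across links. That is where the multi-speed design Andreyev describes helps, because a flow limited by a 10G server port cannot fill a 40G link on its own.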

Physical layout

Andreyev writes that the new building layout shown in Figure D is not much different from earlier Facebook designs. One difference is that Fabric’s new spine and edge switches are located on the first level between data hall X and data hall Y, with the network connections to the outside world (MPOE) moved above the spine and edge switch area.

Figure D

Challenges overcome

Facebook engineers appear to have surmounted their challenges. Hardware limitations are no longer an issue. The number of different components is reduced, as is complexity. Andreyev says the team embraced the “KISS (Keep It Simple, Stupid) Principle,” adding in the paper’s conclusion, “Our new fabric was not an exception to this approach. Despite the large scale and complex-looking topology, it is a very modular system, with lots of repetitive elements. It’s easy to automate and deploy, and it’s simpler to operate than a smaller collection of customized clusters.”