Researchers from MIT and Facebook developed a new network component that arbitrates traffic flow and timing, eliminating next-hop queue latency in data-center networks.
Content providers and website hosts are looking to cut latency and increase traffic throughput. One way has been to place content and web servers in data centers near their consumers, which is one reason data centers are cropping up all over the place.
However, with the amount of content and the number of users burgeoning, moving content closer to the edge can only help so much. That is why, when a team of researchers from MIT's Computer Science and Artificial Intelligence Laboratory and Facebook discovered a way to decrease latency within the data center — another significant congestion point — content providers and data-center managers took notice.
The ideal data center infrastructure
In their research paper, Fastpass: A Centralized "Zero-Queue" Datacenter Network, the team starts by defining the ideal network infrastructure within a data center. They determined that a data center needs:
- High utilization (throughput) capability
- Low "end to end" latency
- Support for the network operator's objectives
Current data centers address these needs, but not as effectively as the research team deems necessary. The shortfall, according to the team, results from using decentralized communication protocols. The authors of the paper explained: "Current data-center networks inherit the principles that went into the design of the Internet, where packet transmission and path selection decisions are distributed among the endpoints and routers."
Put simply, traffic moving through the data center or across the internet is forwarded from one device to the next (packet switching) until it reaches its destination. Delays can add up at each hop when incoming traffic overwhelms device queues. The researchers see those queue delays as the problem and offer an alternative: "We propose that each sender should delegate control — to a centralized arbiter — of when each packet should be transmitted and what path it should follow."
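To make the additive nature of per-hop queuing concrete, here is a minimal sketch of the delay a single packet might see along a four-hop path. The function name, the queue depths, and the timing values are all hypothetical numbers chosen for illustration; none of them come from the paper.

```python
# Hypothetical model: at each hop a packet waits behind whatever is already
# queued, is transmitted, and then propagates to the next device.
def end_to_end_delay_us(queue_depths, per_packet_tx_us, per_hop_propagation_us):
    """Sum the delay (in microseconds) a packet accumulates along a path.

    queue_depths lists how many packets are already waiting at each hop
    when our packet arrives there.
    """
    total = 0.0
    for depth in queue_depths:
        queueing = depth * per_packet_tx_us   # wait for the packets ahead of us
        transmission = per_packet_tx_us       # then send our own packet
        total += queueing + transmission + per_hop_propagation_us
    return total


# Illustrative values only (not measurements from any real network).
congested = end_to_end_delay_us([10, 25, 5, 0], per_packet_tx_us=1.2,
                                per_hop_propagation_us=0.5)
empty = end_to_end_delay_us([0, 0, 0, 0], per_packet_tx_us=1.2,
                            per_hop_propagation_us=0.5)

print(f"congested path: {congested:.1f} us")  # congested path: 54.8 us
print(f"empty queues:   {empty:.1f} us")      # empty queues:   6.8 us
```

In this toy model nearly all of the congested path's delay comes from waiting in queues, which is the component a "zero-queue" design aims to remove.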
Fastpass: Traffic arbiter
This past August, the research team presented their new network-management system at ACM's annual conference on data communications. The main takeaway from the presentation: the team was able to reduce the average router queue latency in a Facebook data center by 99.6%. The team also reported that when traffic at the Facebook data center was at its heaviest, the delay between a request for data and its arrival dropped from 3.56 milliseconds to 0.23 milliseconds.
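The two figures above measure different things: the 99.6% applies to router queuing, while the 3.56 and 0.23 figures describe the full request-to-arrival delay under heavy load. Taking the reported delay numbers at face value, a quick calculation shows what that second improvement amounts to (the helper names are ours):

```python
before_ms = 3.56  # request-to-arrival delay reported without Fastpass
after_ms = 0.23   # request-to-arrival delay reported with Fastpass


def percent_reduction(before, after):
    """Relative drop, as a percentage of the original value."""
    return (before - after) / before * 100.0


def speedup(before, after):
    """How many times smaller the new value is."""
    return before / after


print(f"{percent_reduction(before_ms, after_ms):.1f}% lower")  # 93.5% lower
print(f"{speedup(before_ms, after_ms):.1f}x faster")           # 15.5x faster
```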
The report described how "Fastpass," the team's replacement network protocol, obtained the rather remarkable reduction in latency:
"We propose that each packet's timing be controlled by a logically centralized arbiter, which also determines the packet's path (illustrated in Figure A). If this idea works, then flow rates can match available network capacity over the time-scale of individual packet transmission times, unlike over multiple round-trip times (RTTs) with distributed congestion control. Not only will persistent congestion be eliminated, but packet latencies will not rise and fall, queues will never vary in size, tail latencies will remain small, and packets will never be dropped due to buffer overflow."
An uncommon approach
Networking 101 teaches that a centralized network controller is a bad idea because it becomes both a bottleneck and a single point of failure. However, the researchers streamlined the arbiter to the point where bottlenecks did not occur, and the decrease in latency more than made up for the round trip between the sending device and the Fastpass arbiter.
Figure B depicts the reduction in latency. The report mentioned, "Our results show that compared to the baseline network, the throughput penalty (round trip to arbiter) is small but queuing delays are reduced dramatically, flows shared resources more fairly and converged quickly, and the software arbiter implementation scaled to multiple cores and handled an aggregate data rate of 2.21 Terabits/s."
How Fastpass works
The Fastpass arbiter controls all network traffic. When an application on a server wants to send traffic, it sends a request to Fastpass specifying the destination and the size of the packet. Fastpass then processes the request in two steps:
- Timeslot allocation: The arbiter assigns the sender a set of timeslots in which to transmit the data. Fastpass keeps track of the source-destination pairs assigned to each timeslot.
- Path selection: The arbiter also chooses a path through the network for each packet and communicates this information to the requesting source.
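The two steps above can be sketched as a toy arbiter. This is a simplified illustration under stated assumptions, not the Fastpass implementation: the class and method names are hypothetical, a request asks for a number of packets with one timeslot per packet, each endpoint may send to or receive from only one peer per timeslot, and "path selection" is reduced to picking the least-loaded of a few interchangeable core switches.

```python
from collections import defaultdict


class ArbiterSketch:
    """Toy centralized arbiter (hypothetical names, not the Fastpass code)."""

    def __init__(self, num_core_switches):
        # timeslot -> endpoints already sending / receiving in that slot
        self.busy_sources = defaultdict(set)
        self.busy_destinations = defaultdict(set)
        # timeslot -> packets assigned to each core switch in that slot
        self.core_load = defaultdict(lambda: [0] * num_core_switches)

    def request(self, src, dst, num_packets, start_slot=0):
        """Return one (timeslot, core_switch) grant per requested packet."""
        grants = []
        slot = start_slot
        while len(grants) < num_packets:
            src_free = src not in self.busy_sources[slot]
            dst_free = dst not in self.busy_destinations[slot]
            if src_free and dst_free:
                # Step 1, timeslot allocation: record the pair for this slot.
                self.busy_sources[slot].add(src)
                self.busy_destinations[slot].add(dst)
                # Step 2, path selection: assumed least-loaded-core policy.
                load = self.core_load[slot]
                core = load.index(min(load))
                load[core] += 1
                grants.append((slot, core))
            slot += 1  # try the next timeslot
        return grants


arbiter = ArbiterSketch(num_core_switches=2)
a = arbiter.request("A", "C", num_packets=2)
b = arbiter.request("B", "C", num_packets=1)  # C is busy in slots 0 and 1
c = arbiter.request("B", "D", num_packets=2)  # shares slots with A via core 1

print(a)  # [(0, 0), (1, 0)]
print(b)  # [(2, 0)]
print(c)  # [(0, 1), (1, 1)]
```

Because the arbiter never grants two senders the same destination in the same timeslot, packets in this model never have to wait behind each other at the receiver, which is the intuition behind the "zero-queue" label.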
To address the single point of failure, the research team included a primary Fastpass arbiter and several secondary arbiters. Primary and secondary arbiters receive every request. The secondary arbiters drop the requests unless the watchdog signal from the primary arbiter is missing. When that happens, one of the secondary arbiters assumes the primary role.
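The failover behavior described above can be sketched as follows. This is a minimal model under assumptions: the timeout value, the class and function names, and the rule that the first live secondary in the list takes over are all hypothetical, since the article does not say how the new primary is chosen. Time is passed in explicitly so the example runs without waiting.

```python
WATCHDOG_TIMEOUT_S = 0.5  # hypothetical timeout, not a value from the paper


class Arbiter:
    def __init__(self, name, is_primary=False):
        self.name = name
        self.is_primary = is_primary
        self.alive = True
        self.last_watchdog_seen = 0.0
        self.handled = []  # requests this arbiter actually processed


def broadcast(arbiters, request, now):
    """Deliver a request to every live arbiter; return who handled it."""
    live = [a for a in arbiters if a.alive]
    primary = next((a for a in live if a.is_primary), None)
    if primary is not None:
        # A healthy primary keeps refreshing the watchdog on all replicas.
        for a in live:
            a.last_watchdog_seen = now
    else:
        # Watchdog missing: the first secondary whose timer has expired
        # promotes itself (assumed tie-break rule).
        for a in live:
            if now - a.last_watchdog_seen > WATCHDOG_TIMEOUT_S:
                a.is_primary = True
                primary = a
                break
    if primary is None:
        return None  # every replica dropped the request
    primary.handled.append(request)  # all other replicas drop it
    return primary.name


arbiters = [Arbiter("primary", is_primary=True), Arbiter("s1"), Arbiter("s2")]
first = broadcast(arbiters, "req-1", now=0.1)
arbiters[0].alive = False                       # the primary fails
second = broadcast(arbiters, "req-2", now=0.3)  # timeout not yet reached
third = broadcast(arbiters, "req-3", now=1.0)   # watchdog expired

print(first, second, third)  # primary None s1
```

In this sketch a request that arrives between the failure and the watchdog timeout is dropped by every replica, so a sender would have to retry it.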
The new networking system developed by the research team has attracted considerable interest. In an MIT press release, George Varghese, principal researcher and partner at Microsoft, said, "It's sufficiently surprising that it (Fastpass) even scales to a pretty large network of 1,000 switches."
Varghese continued, "Nobody probably would have expected that you could have this complete control over both when you send and where you send. And they're scaling by adding multiple cores, which is promising."
For peer-review purposes, the research team has posted the working Fastpass code online.