How will the first generation of enterprise-ready ARM servers stack up against traditional datacentre boxes?
As the first enterprise-ready, ARM-based servers get nearer to release, more details are emerging on what these energy-sipping systems will be capable of.
The upcoming 64-bit machines are being designed to tackle a far broader range of tasks than the few 32-bit ARM-based servers tested out by a handful of companies this year.
Rather than just web serving, these systems are also being built to power data analytics on Hadoop clusters, fetch and put data in NoSQL data stores, stream media and carry out high-performance computing, sharing processing duties with GPUs, FPGAs or ASICs.
Jobs like these can be split into computationally light workloads and processed in parallel by clusters of thousands of wimpy core processors. These dense clusters of low-power servers can handle these parallelisable tasks more efficiently than a smaller number of powerful chips, delivering better performance per watt and per square foot of datacentre space, both important measures for driving down the cost of running a large server estate.
Hence the interest in taking small, energy-thrifty ARM-based chipsets, today more commonly found in mobile phones and tablets, and using them in tightly packed server clusters.
A fair proportion of the software needed to handle these web serving, data analytics, streaming media and other jobs is on track to be ready for production use on ARM-based servers. But what about the hardware?
Powering these servers will be chipsets from a range of companies - but major players in the nascent ARM-based server space will likely be Applied Micro with its X-Gene boards and AMD, which is branching out beyond x86 with its Opteron A1100 processor.
These forthcoming chips are based on the ARM v8 architecture, which introduces support for features considered critical by business. Not only is v8 the first ARM architecture to support 64-bit cores, it also brings additional enterprise-class features, such as error-correcting code (ECC) memory.
The companies behind these server chipsets were at the Hot Chips conference in Cupertino this week to detail the capabilities of their chips and the servers they will power.
Applied Micro X-Gene
When is it out?
Three generations of X-Gene systems on a chip are planned. The first to hit the market in servers will be the X-Gene 1 processor, which is expected to be available in production systems this autumn. The X-Gene processor is already being tested in HP Moonshot servers and has been demoed in HPC and enterprise-targeted systems from Eurotech, E4 and Mitac.
Its successor, the X-Gene 2, is available for sampling now and X-Gene 3 is due to be released for sampling in 2015.
The X-Gene 1 has eight cores running at 2.4 GHz. It is made to a 40nm process - the smaller the process, the more transistors can be crammed onto the chip's surface, allowing for better processing power per watt. The chip's superscalar architecture allows it to handle more than one instruction per processor cycle, with a four-instruction wide processing pipeline that is capable of out-of-order execution, an optimisation that reduces delays in handling instructions. Applied Micro says the chip can handle "more than 100 instructions in flight".
Each pair of processor cores shares L1 instruction and data cache, as well as L2 cache. Connected to the cores via a network link that keeps data coherent between caches is 8MB of L3 cache and two dual-channel DDR3 memory controllers. The chipset can support up to 128GB of DDR memory capable of 1,600 MT/s.
The chipset integrates networking hardware, removing the need for discrete components such as an I/O controller hub, NIC and baseboard management controller - reducing additional cost and power draw.
For I/O the chipset supports four 10 gigabit Ethernet connections and six PCI-E 3.0 slots, as well as multiple SATA 3 ports.
Future releases of the X-Gene will bring further performance improvements and allow servers based on the board to tackle workloads where low application latency is necessary. The X-Gene 2 will add RDMA over Converged Ethernet, or RoCE. RoCE is an important feature in distributed systems as it reduces latency between servers in the cluster. This feature allows one server node in an X-Gene cluster to transfer data directly to and from the memory of another node over 10 Gbps Ethernet, reducing the work carried out by each node's CPU and improving data transfer speed. Using RoCE, the X-Gene 2 has shown itself capable of reducing application latency to about 5 microseconds, up to ten times faster than the X-Gene 1, according to Applied Micro.
X-Gene 2 will be made to a 28nm process, have up to 16 cores clocked at a maximum of 2.8 GHz and support four channels of memory. Architectural changes will be made to the processor core to boost performance.
What is important for the types of workloads suited to being handled in parallel on a cluster of low-energy servers - the likes of web front ends, search engines, NoSQL data stores, data analytics work like Hadoop, and media serving - are factors beyond clock speed. Applied Micro believes the X-Gene delivers on core metrics for these workloads, such as instruction issue width, the number of tiers in the processor cache hierarchy, the size of the cache per CPU and the memory bandwidth of the processor.
The graph shows how the X-Gene 2 compares to competitors on these measures - from left to right is the ThunderX ARM SoC from Cavium, Intel's microserver-targeted eight-core C2000 Atom processor and, in green, the X-Gene 2. On the far right is the Intel Xeon E5-2600 v2 processor, which, while higher performing, costs more.
In the SPEC2006_rate processor benchmarks the X-Gene 2 delivers 55 percent better performance per watt than the X-Gene 1 and a 25 percent performance boost in ApacheBench web serving score.
Compared to Intel servers the X-Gene will be competing against, Applied Micro claims the first generation chipset can deliver the performance of an Ivy Bridge or Haswell Xeon, while the X-Gene 2 will offer greater performance at lower power and be suited to latency-sensitive clustered applications.
Applied Micro says a rack of X-Gene 2 systems will burn about 30 kilowatts and pack 6,480 threads running at 2.8 GHz. The cluster will provide 50 TB of memory and 48 TBps of memory bandwidth. It will handle 750 million transactions per second on the memcached test with 95 percent of the transactions coming in at under 40 milliseconds. A cluster of 80 two-socket machines based on Intel's Xeon E5-2630 v2 processors, with six cores and twelve threads per socket, delivers 1,920 threads and around 400 million transactions per second on the same memcached test in the same power envelope of around 30 kW. These benchmarks are provided by Applied Micro, however, so need to be treated with the appropriate level of skepticism until verified.
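To put those rack-level claims on a common footing, the arithmetic can be sketched in a few lines of Python. This is only an illustration built from the vendor-supplied figures quoted above, which remain unverified; the variable and function names are made up for the example.

```python
# Figures as quoted by Applied Micro in this article; unverified vendor numbers.
RACK_POWER_KW = 30.0   # both setups are said to draw around 30 kW

xgene2_tps = 750e6     # memcached transactions per second, X-Gene 2 rack
xeon_tps = 400e6       # memcached transactions per second, Xeon E5-2630 v2 cluster

# Thread count of the Xeon cluster: 80 machines x 2 sockets x 12 threads per socket.
xeon_threads = 80 * 2 * 12


def tps_per_kw(tps, power_kw):
    """Transactions per second delivered per kilowatt of power drawn."""
    return tps / power_kw


xgene2_eff = tps_per_kw(xgene2_tps, RACK_POWER_KW)
xeon_eff = tps_per_kw(xeon_tps, RACK_POWER_KW)
advantage = xgene2_eff / xeon_eff

print(f"Xeon cluster threads: {xeon_threads}")
print(f"X-Gene 2: {xgene2_eff / 1e6:.1f}M transactions/s per kW")
print(f"Xeon:     {xeon_eff / 1e6:.1f}M transactions/s per kW")
print(f"Claimed X-Gene 2 advantage: {advantage:.3f}x")
```

On these numbers the claimed advantage works out to a little under twice the memcached throughput per kilowatt, which is the figure that would need independent confirmation.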
Intel said Applied Micro's performance estimates are impossible to verify as "no-one has ever seen X-Gene 1-based system benchmarked using industry standard applications" and indicated the Xeon setup used in the comparison could be weighted in the X-Gene's favour.
Intel has its own range of energy-sipping, less powerful SoCs aimed at the server market, the Avoton series in its Intel Atom family, and for its part Intel claims these are more power efficient.
"X-Gene 1 is based on 40nm process and has 8 cores and roughly 35 - 40W TDP [which reflects the maximum power consumption of the machine]. For comparison, Atom C2000 (Avoton) has 8 cores as well with 20W TDP," said an Intel spokeswoman.
"X-Gene is expected to have 35 -40 W TDP for 8 cores, node power 59W, vs 8-cores, 20W Avoton and 28-35W node power. Best case scenario for them - same performance for twice as power."
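Intel's "twice the power" remark can be checked against the ranges it quotes. The sketch below simply divides the X-Gene 1 figures by the Avoton figures, taking the best and worst case of each range; the numbers are Intel's own claims as reported here, and the helper name is hypothetical.

```python
# Figures as quoted by the Intel spokeswoman; each is a (low, high) range in watts.
xgene1_tdp = (35.0, 40.0)
avoton_tdp = (20.0, 20.0)
xgene1_node = (59.0, 59.0)
avoton_node = (28.0, 35.0)


def ratio_range(a, b):
    """Smallest and largest possible ratio a/b given (low, high) ranges."""
    return (a[0] / b[1], a[1] / b[0])


tdp_ratio = ratio_range(xgene1_tdp, avoton_tdp)
node_ratio = ratio_range(xgene1_node, avoton_node)

print(f"TDP ratio, X-Gene 1 vs Avoton:  {tdp_ratio[0]:.2f}x to {tdp_ratio[1]:.2f}x")
print(f"Node power ratio:               {node_ratio[0]:.2f}x to {node_ratio[1]:.2f}x")
```

On Intel's figures the X-Gene 1 would draw roughly 1.7 to 2.1 times the power of an Avoton part, which is consistent with the "twice" in the quote only if performance really is equal.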
By the time the X-Gene 2 hits production servers, Intel is also likely to have refreshed its server chip line-up with its Broadwell-EP and Broadwell-EX Xeon chips - further improving its performance per watt.
X-Gene 3 will increase the core count to a maximum of 64, increase the clock speed to 3GHz and introduce 2nd generation RoCE. It will move the X-Gene to a 16nm manufacturing process, with FinFET transistors.
What can you use them for?
Applied Micro says the X-Gene family will be able to be used for "pretty much anything that runs in the datacentre today".
That includes hosting large-scale web sites and services; web search services such as data serving and harvesting; NoSQL data storage and retrieval; data analytics services such as information classification, filtering and extraction; and hosting and streaming of media.
The X-Gene 2 will be suited to a wider range of cloud and HPC applications than its predecessor, due to its low-latency, inter-server data transfer enabled by RoCE.
The X-Gene 1 has already been demoed tackling HPC and other datacentre workloads when paired with Nvidia Tesla K20 GPU accelerators. The X-Gene/Nvidia Tesla accelerator pairing is being used in servers from Cirrascale, E4 and Eurotech. Each server is designed to specialise in different workloads: the Cirrascale focuses on HPC and enterprise workloads, while the E4 is focused on seismic, signal and image processing, as well as running jobs against big data sets using map-reduce.
AMD "Seattle" Opteron A1100
When is it out?
Due to ship in volume by the fourth quarter of 2014.
System on a chip based around eight ARM Cortex A57 processor cores, clocked at above 2GHz. Each pair of processor cores shares 48KB of L1 instruction and 32KB of L1 data cache, as well as 1MB of L2 cache - providing up to 4MB of L2 cache for the entire chip. A total 8MB of unified L3 cache is shared between the cores.
Support for up to 128GB of DDR3 or DDR4 ECC memory as unbuffered DIMMs, registered DIMMs or SODIMMs.
The chipset uses ARM's System Memory Management Unit that allows various hypervisors to keep guest operating systems in separate pools of RAM.
The SoC, which is made using a 28nm process, also includes support for a wide range of data I/O, including an eight-lane PCI Express 3 controller, two 10 Gbps Ethernet connections and eight SATA 3 ports. It also has a dedicated 1GbE system management port (RGMII).
A system control processor, an ARM Cortex A5-based chip, is used to control power, configure the system, initiate booting, and act as a service processor for system management functions.
A cryptographic co-processor acts as a dedicated accelerator for encryption and decryption algorithms, as well as compression and decompression. It accelerates Advanced Encryption Standard, Elliptic Curve Cryptography, RSA, Secure Hash Algorithm, Zlib compression and Zlib decompression, and includes a true hardware random number generator.
AMD is also working on pin-compatible versions of its ARM and x86 chips - allowing them to plug into the same socket and be swapped out as needed.
Based on comments from AMD, the technology site AnandTech has also estimated the eight-core variant could achieve a score of 80 in the SPECint_rate benchmark, or 10 per core.
Power consumption is unconfirmed but AnandTech estimates a TDP of 25W.
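Taken together, those two estimates give a rough performance-per-watt figure for the chip. The short Python sketch below shows the arithmetic; both inputs are third-party estimates rather than measured results, and the variable names are purely illustrative.

```python
# Estimates attributed to AnandTech in this article; not measured results.
specint_rate_estimate = 80.0   # estimated SPECint_rate score for the eight-core part
cores = 8
tdp_watts_estimate = 25.0      # estimated TDP, unconfirmed by AMD

score_per_core = specint_rate_estimate / cores
score_per_watt = specint_rate_estimate / tdp_watts_estimate

print(f"Estimated SPECint_rate per core: {score_per_core:.1f}")
print(f"Estimated SPECint_rate per watt of TDP: {score_per_watt:.1f}")
```

Because TDP is a design ceiling rather than measured draw, the per-watt number is best read as a lower bound on efficiency under these assumptions.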
What can you use them for?
AMD expects the Opteron A1100 to be suited to handling workloads whose compute demands are light and where data needs to be rapidly shuttled on and off the processor.
"For such workloads, processors like 'Seattle,' with smaller cores and caches, can deliver the equivalent performance as traditional server processors with large cores and caches, but using much less power and area," AMD said in a presentation at the Hot Chips conference.
Possible uses could include LAMP stack web servers, as well as memcached and cold storage servers. Facebook has already experimented with using an ARM-based system as the basis of an OCP Open Vault storage array.
Sean White, an engineer at AMD, was also quoted at the Hot Chips conference in Cupertino as saying the company would consider customising the processor to meet specific industry needs. Intel has also recently expanded the options for large customers who want custom silicon.
What other ARM server boards are coming out?
This year several other ARM-based system-on-a-chip (SoC) processors are planned to launch, designed to carry out a range of datacentre tasks - from handling server workloads, to running storage arrays and virtualised network functions.
To meet these needs, ARM-based SoCs are in the works from various companies, including Broadcom, Cavium and Texas Instruments.