ScyllaDB, an open source NoSQL column store database directly compatible with existing Apache Cassandra deployments, was unveiled on September 22, 2015. Rather than relying on the Java Virtual Machine, as Cassandra does, Scylla is written in C++14 and built with GCC 5.1. Combined with other low-level design decisions, this provides a tenfold improvement in throughput and latency compared to the Java-powered Cassandra.
Why is Cassandra inefficient?
There are many reasons for the inefficiencies in Cassandra, though Cassandra is not particularly bad relative to the intent of the program. The strength of Cassandra compared to other database solutions is in workload scaling and distribution.
In 2012, researchers at the University of Toronto found in a comparison of enterprise database systems (PDF) that "Cassandra achieves the highest throughput for the maximum number of nodes in all experiments... at the price of a high write and read latencies." These latencies can be attributed in part to the choice of Java: the inherent complexities of the Java Virtual Machine, and the memory allocation tricks needed in practice to work around the design of the garbage collector, create barriers between the workload and the power of the hardware it runs on.
How does Scylla improve performance?
Generally speaking, Scylla operates at a lower level than Cassandra. It is written in C++14 and built using GCC 5.1; by forgoing the Java Virtual Machine, it is necessarily closer to the hardware. There are also important changes to the way task management and memory allocation are handled.
Scylla uses the open source Seastar framework (also written by the authors of Scylla), which uses a shared-nothing model: it runs one application thread per core. Transferring requests or information between cores requires explicit message passing rather than shared memory.
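To make the shared-nothing idea concrete, here is a minimal Python sketch, not Seastar code. It assumes a hypothetical `Cluster` of four shards, each standing in for one core, and a made-up `shard_for` routing function. Each shard keeps its data in a private dictionary, and the only way to reach that data is to send the owning shard a message. Note that Python's `queue.Queue` uses locks internally and Python threads are not pinned to cores, so this illustrates only the ownership model, not Seastar's lock-free implementation.

```python
import queue
import threading

NUM_SHARDS = 4  # hypothetical; stands in for the number of CPU cores


def shard_for(key: str) -> int:
    """Pick the owning shard for a key (simple byte-sum routing, an assumption)."""
    return sum(key.encode()) % NUM_SHARDS


def shard_loop(inbox: queue.Queue) -> None:
    """One shard: its data dictionary is touched only by this thread."""
    data = {}  # private state, never shared with other shards
    while True:
        msg = inbox.get()
        if msg is None:  # shutdown signal
            break
        op, key, value, reply = msg
        if op == "put":
            data[key] = value
            reply.put(True)
        else:  # "get"
            reply.put(data.get(key))


class Cluster:
    """Routes every request to the shard that owns the key, by message only."""

    def __init__(self) -> None:
        self.inboxes = [queue.Queue() for _ in range(NUM_SHARDS)]
        self.threads = [
            threading.Thread(target=shard_loop, args=(inbox,), daemon=True)
            for inbox in self.inboxes
        ]
        for thread in self.threads:
            thread.start()

    def _send(self, op: str, key: str, value=None):
        reply = queue.Queue(maxsize=1)  # one-shot reply channel
        self.inboxes[shard_for(key)].put((op, key, value, reply))
        return reply.get()  # block until the owning shard answers

    def put(self, key: str, value) -> bool:
        return self._send("put", key, value)

    def get(self, key: str):
        return self._send("get", key)

    def stop(self) -> None:
        for inbox in self.inboxes:
            inbox.put(None)
        for thread in self.threads:
            thread.join()


if __name__ == "__main__":
    cluster = Cluster()
    cluster.put("user:1", "alice")
    print(cluster.get("user:1"))
    cluster.stop()
```

Because no shard ever reads another shard's dictionary, the data itself needs no locking; all coordination is confined to the message channels.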
The design of Seastar avoids the need for locking between cores, as explained by the developers:
Seastar uses futures, promises, and continuations (f/p/c). Where conventional event-driven programming using epoll and userspace libraries such as libevent has made it very difficult to write complex applications, f/p/c makes it easier to write complex asynchronous code.
For example, the following interaction between a sender core, C0, and a receiver core, C1, can take place with no locking required.
- C0: sender -> wait for queue entry (usually immediate) -> enqueue request, allocate promise.
- C1: dequeue request; execute it -> move result to request object -> enqueue request on response queue
- C0: dequeue request; extract response, use it to fulfill promise; destroy request
Each actual queue, one for requests and a return queue for fulfilled requests, is a simple queue of pointers.
There is one request queue and one return queue per pair of CPU cores on the system. Because a core does not pair with itself, a 16-core system will have 240 request queues and 240 return queues.
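The three steps in the developers' example, and the queue count, can be sketched roughly as follows. This is a single-threaded Python simulation under stated assumptions, not Seastar's API: `Request`, `CorePair`, and `queue_count` are hypothetical names, and `concurrent.futures.Future` stands in for a Seastar promise. In Seastar the two cores run concurrently; here the steps are driven by hand so that the ordering is easy to follow.

```python
from collections import deque
from concurrent.futures import Future
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Request:
    func: Callable[[], Any]  # work to execute on the receiving core
    promise: Future          # fulfilled later, back on the sending core
    result: Any = None       # filled in by the receiving core


class CorePair:
    """One request queue and one return queue for a (sender, receiver) pair."""

    def __init__(self) -> None:
        self.requests: deque = deque()   # sender -> receiver
        self.responses: deque = deque()  # receiver -> sender

    def submit(self, func: Callable[[], Any]) -> Future:
        """Step 1 (C0): enqueue the request and allocate a promise."""
        request = Request(func, Future())
        self.requests.append(request)
        return request.promise

    def run_receiver(self) -> None:
        """Step 2 (C1): dequeue, execute, store the result, enqueue the reply."""
        while self.requests:
            request = self.requests.popleft()
            request.result = request.func()
            self.responses.append(request)

    def complete(self) -> None:
        """Step 3 (C0): dequeue the reply and fulfill the promise."""
        while self.responses:
            request = self.responses.popleft()
            request.promise.set_result(request.result)


def queue_count(cores: int) -> int:
    """Ordered pairs of distinct cores: each core pairs with every other core."""
    return cores * (cores - 1)


if __name__ == "__main__":
    pair = CorePair()
    promise = pair.submit(lambda: 2 + 3)
    pair.run_receiver()
    pair.complete()
    print(promise.result(), queue_count(16))
```

The count follows from the pairing rule: each of the 16 cores sends to the 15 others, so 16 × 15 = 240 request queues, with the same number of return queues.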
The hardware configuration is also of particular consequence: on the benchmark system, the NIC was addressed directly, without the use of the bond interface in the kernel, and rx and tx requests on the NIC were each bound to a separate CPU, increasing throughput. The configuration of the benchmark system provides more detail about the benchmarking server, which is fairly typical of a smaller Cassandra deployment.
In the benchmark tests, the Scylla team found that the current beta version of the software is capable of one million operations per second, which, for a mix of read and write operations, is about 10x faster than Cassandra 2.1.9.
When will Scylla be available, and how do I migrate?
Scylla is not yet available, though the authors are hoping to reach general availability in January 2016. The big advantage of Scylla is that it can be used as a drop-in replacement for Cassandra, with full compatibility — removing any need for changes in database format or server configuration.
In an interview with ZDNet, the CEO of ScyllaDB noted that Apple has 75,000 Cassandra servers in use — something mentioned at Cassandra Summit 2014 — only a fraction of which would be needed with Scylla.
The authors of Scylla are perhaps best known in the world of open source for the KVM hypervisor, which requires an intimate knowledge of the intricacies of Intel processors. This knowledge is instrumental in the transition from a managed language such as Java to a language such as C++, which allows closer access to the bare metal and therefore higher performance.
With this in mind, do you plan on migrating your current Cassandra installation to Scylla? How large an impact will the performance increase have on your IT budget? Share your thoughts in the comments.
- KVM creators open-source fast Cassandra drop-in replacement Scylla (ZDNet)
- Microsoft and DataStax tie up Cassandra on Azure deal as new Titan graph database rolls out (ZDNet)
- Apple's secret NoSQL sauce includes a hefty dose of Cassandra
- NoSQL databases are on a roll
Note: TechRepublic and ZDNet are CBS Interactive sites.
James Sanders is a Java programmer specializing in software as a service and thin client design, and virtualizing legacy programs for modern hardware.