By 2011, device reputation and fraud protection firm iovation had a legacy Oracle database that was beginning to show its limits. With that monolithic system it faced steep licensing and service costs, no effective scaling options, and monthly maintenance that required downtime.
The leadership at iovation decided to migrate to an open-source stack including Apache Cassandra, which they chose for its linear scaling capability and fault tolerance. When they completed the project in 2013, iovation had a redundant, distributed system spanning three data centers, a set of new features beyond device reputation management, faster processing times, and traffic volume more than six times its 2009 level.
Oracle: prohibitive cost and maintenance downtime
“When I got here in 2008,” said iovation CTO Scott Waddell, “iovation had grown up on a pretty monolithic Oracle infrastructure, where most of the business logic lived in PL/SQL code, in the Oracle database.”
“Every time that we had something to update, enhance, or fix with Oracle, we had to take the whole service down for an hour because the code basically is the data, and we had to do things to Oracle, including maintenance, that required the service to be fully off-line,” explained Waddell.
The maintenance downtime came around every month, which, according to Waddell, “our customers didn’t exactly love.” In considering a new database infrastructure, Waddell added, “I did not want to have to take the service down in order to do surgical enhancements to individual components.”
“As our business grew, we began to look at what it was going to cost to scale our infrastructure,” said Waddell, “with an eye on what the cost model would look like over time, how we were going to make this more robust and reliable, and keep up with traffic.”
“We have really made a concerted effort over the last several years to completely move away from Oracle to an open-source stack,” added Waddell. And in going with open-source Apache Cassandra, Waddell explained that “we were looking at, literally, a difference of seven or eight times in cost, compared to growing on Oracle technologies.”
Seeking scalability, availability, flexibility, and lower cost
For the shift to open source, Waddell said the “business drivers were scalability, availability, agility, and flexibility on the development side. And in the long run — cost. I need cost to be able to scale roughly with service growth, and I need to avoid this vendor lock-in situation (with Oracle) that makes me susceptible to predatory pricing.”
Waddell told me iovation’s device reputation solution takes encrypted customer data and feeds it “into a highly available, horizontally scaling data store, which was ultimately chosen to be Apache Cassandra. And then we front that with relational databases where needed, to build data marts that are application-specific.”
He noted that Cassandra provides “a lot of the agility, flexibility, and development to be able to go in and do surgical things to this service-oriented architecture, instead of having the Oracle monolith to battle that we had in the past.”
Choosing Cassandra: linear scaling and fault tolerance
Waddell and his IT team chose Apache Cassandra for two key reasons.
“First, it is well known to support linear scaling,” said Waddell. “So it’s a clustered, service system, where the data, rather than living on a large storage-array network, is now stored across clustered nodes. This means individual servers now have the data.”
“Secondly, that data is replicated across nodes in a fault-tolerant way,” added Waddell. “If individual nodes drop out across the cluster, you still have that data across other nodes, and it is very tunable, and configurable, to fit the designs of the specific service that you are building. So there’s a lot of flexibility.”
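The replication idea Waddell describes can be illustrated with a toy token ring. This is a minimal sketch and not Cassandra's actual implementation: the `Ring` class, the node names, and the MD5-based `token` function are all assumptions made for illustration (Cassandra itself uses its own partitioners and virtual nodes). It shows only the principle that each key lives on several nodes, so losing one node does not lose the data.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring (illustrative hash choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    """Toy token ring: each key is stored on the next `rf` nodes clockwise."""

    def __init__(self, nodes, replication_factor):
        self.rf = replication_factor
        self.ring = sorted((token(n), n) for n in nodes)

    def replicas(self, key):
        tokens = [t for t, _ in self.ring]
        # First node whose token is greater than the key's token, wrapping around.
        start = bisect_right(tokens, token(key)) % len(self.ring)
        count = min(self.rf, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def readable(self, key, down):
        """True if at least one replica of `key` is on a node that is still up."""
        return any(node not in down for node in self.replicas(key))


# Hypothetical six-node cluster keeping three copies of every key.
ring = Ring([f"node{i}" for i in range(6)], replication_factor=3)
owners = ring.replicas("device-123")
print(owners)                                     # three distinct nodes
print(ring.readable("device-123", {owners[0]}))   # one replica down, still readable
```

Raising or lowering `replication_factor` is the kind of tuning knob the quote alludes to: more copies tolerate more node failures at the cost of more storage.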
From fault tolerance to system redundancy
“That fault tolerance allowed us to go to a multi-data center active/active architecture,” explained Waddell. “We currently have three data centers, two in Portland, and one up in Seattle, that are connected via fast fiber and replicate the data three ways across the data centers.”
“And normal flow has transactions 50-50 balanced between the primary data center in Portland and the backup center in Seattle,” added Waddell. “But if we are doing software updates or maintenance, we can take an entire data center off-line seamlessly by asking our customers to follow our short TTLs on DNS. And there is no transaction loss, no hiccup in performance, and we still maintain our redundancy because we have that other replication facility here in Portland.”
“And so this is a really, really nice system that has maximized our ability to be nimble on the development side, and get things out quickly,” said Waddell, “while minimizing any kind of impact to the real-time, 24/7 flow for subscribers. That was the driving force for going with a capability like Cassandra.”
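Why one of three data centers can go off-line without interrupting service can be sketched with simple replica arithmetic. The article does not say which consistency level iovation uses, so the sketch below assumes quorum-style reads and writes, where a majority of replicas must respond, and one replica per data center. The data center names are hypothetical.

```python
def quorum(total_replicas: int) -> int:
    """Majority of replicas: floor(n / 2) + 1."""
    return total_replicas // 2 + 1


def survives_outage(replicas_per_dc: dict, offline: set) -> bool:
    """True if the replicas in the remaining data centers still form a quorum."""
    total = sum(replicas_per_dc.values())
    live = sum(n for dc, n in replicas_per_dc.items() if dc not in offline)
    return live >= quorum(total)


# Assumed layout: one copy of the data in each of three data centers.
layout = {"portland_a": 1, "portland_b": 1, "seattle": 1}

print(quorum(3))                                            # 2
print(survives_outage(layout, {"seattle"}))                 # True
print(survives_outage(layout, {"seattle", "portland_b"}))   # False
```

Under these assumptions, maintenance on any single data center leaves two of three copies available, which is consistent with the redundancy Waddell describes.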
New features, 60% faster processing time, and 600% volume growth
“Back in 2009, when we were on the Oracle stack,” explained Waddell, “we were doing a lot less work, because mainly at that time we were a reputation service. We had an understanding of the relationship between accounts and devices, and we were basically taking a look at whether or not those had been tagged with evidence of past fraud or abuse.”
“Since then, we have added tons of new features,” said Waddell. “There are seven different classes of rules that are applied. There are velocities, geolocation details, anomaly rules, all kinds of things that look for evasion detection, whether the end user is trying to prevent detection and obfuscate their location, and so on.”
“In 2009, the transaction processing time for all of that was in the neighborhood of 220-250 milliseconds (ms),” added Waddell. “And that’s not terrible, but it’s at a pace where at scale, especially if you are a large subscriber, that quarter of a second, along with all the other things that you and the customer are trying to do in an interaction, all add up.”
Waddell said that with Cassandra they have brought their average processing time down to 100 ms, competitive by their industry standards, “while we have simultaneously added tons of additional capabilities over the past several years, along with our new risk, reputation, trust scoring, as well as — if you look at our 2009 traffic volume — growing more than six times in traffic volume between then and now.”
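The processing-time improvement can be checked with a quick calculation using only the figures quoted above (220-250 ms in 2009, about 100 ms after the move). The helper name `reduction_pct` is just for illustration.

```python
def reduction_pct(before_ms: float, after_ms: float) -> float:
    """Percentage drop in processing time from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms * 100


# Figures quoted by Waddell: 220-250 ms in 2009, about 100 ms on Cassandra.
low = reduction_pct(220, 100)
high = reduction_pct(250, 100)
print(f"{low:.1f}% to {high:.1f}% less time per transaction")
```

Depending on which end of the 2009 range is used, the drop works out to roughly 55% to 60%.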