SAS is a data processing and statistical analysis language that was invented by Anthony J. Barr in the 1960s. It later became the mainstay product of the SAS Institute, which was incorporated in 1976 by Barr, Jim Goodnight, John Sall, and Jane T. Helwig.
Today, SAS's analytical strengths and its large base of enterprise programmers trained in the language make it a natural choice for big data analytics queries. Healthcare providers use SAS to run advanced analytics that probe care utilization patterns and quality-of-care issues, and to share that information with their network providers. Bankers use it to analyze their worldwide systems for liquidity gaps, and retailers use it to study the consumer web shopping experience. SAS is also a procedural programming language that runs on IBM mainframes, UNIX, Linux, OpenVMS Alpha, and Microsoft Windows.
I recently interviewed David Smith, a data scientist with Revolution Analytics, which supports both the open source R community and the needs of commercial users of R with consulting, training, and technical support. He posited, "The question is whether all of the changes in the big data evolution over the past five years will require different approaches to programming."
"SAS is one of the legacy systems from the 1970s with an enormous user base, so it is a major big data 'incumbent,'" said Smith. "SAS is widely used, but the analytics it delivers originated in a different era that pre-dated parallel processing, server clusters, and Hadoop. Consequently, SAS is not suited for many modern and emerging big data requirements."
Smith says the steady movement away from the legacy relational databases that languages like SAS operate on, and toward parallel processing with Hadoop, is why companies need to take a fresh look at how they develop analytics programs that can process and produce results from big data. In Smith's account, R is a non-procedural language developed expressly for big data that is being parallel processed, and he believes that more companies will adopt R as part of their big data strategy.
"Much like Linux, R has had a rather slow but steady evolution," said Smith. "R was created when a couple of university professors wanted an open source system that could work on big data that was being parallel processed, and it really took off in the academic community, beginning with research projects. Today, R is being taught at universities throughout the country." Smith also says that R's non-procedural flexibility allows sites to "go outside the box" in their data analysis more readily than they could with SAS. "In some cases, what might take you one whole week to do with SAS, can take just half a day with R," he said.
The catch for many companies is where to find the people to do the work. It's also difficult to cross-train a SAS programmer to use R, since both languages have steep learning curves.
Companies that want to use every programming approach at their disposal in their big data work should follow these four steps.
1: Take a look at R
If you're not familiar with the R analytics language, you should allocate research time to see what it has to offer and where and when it potentially fits in your big data analytics toolkit.
2: Create a joint SAS-R strategy
There is a reason why enterprises have years of investment in SAS: It has a long history of returning value to organizations. At the same time, new computing architectures will require newly architected languages. There is a definite need for non-procedural programming languages like R in the world of Hadoop and parallel processing. Enterprise big data architects should focus on a dual strategy that takes advantage of the strengths of both SAS and R.
3: Consider getting new college grads to do your R work
Because of the steep learning curve, it could be difficult to retrain staff who have a long history with SAS to work in R. You might consider hiring new college graduates who have just studied R.
4: Tune your efforts to the business
Some big data work will continue to thrive in the SAS environment; in other cases, R will be the better tool. Regardless of which programming language you select, your focus should always be on the types of answers that the business needs to derive from its data and how you're going to get them.
Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President of Product Research and Software Development for Summit Information Systems, a computer software company; and Vice President of Strategic Planning and Technology at FSI International, a multinational manufacturing company in the semiconductor industry. Mary is a keynote speaker and has more than 1,000 articles, research studies, and technology publications in print.