Modern business is becoming more complex amidst constant changes, unpredictable events, and dynamic demands by end users — all happening at an unprecedented speed. It’s essential that IT operations and management adopt the right tools to optimize operations and handle the complexity and pace of change.

Doing more with less: Cost cuts in IT

IT budgets fell sharply during the downturn; according to Gartner, they shrank 8.1% in 2009 and another 1.1% in 2010. Though IT budgets started growing again in 2011, they are now only at the level they were in 2005.

At the same time, IT operations teams run with fewer people and resources, not only managing an increasing number of systems but also dealing with the new complexity that comes from hybrid environments and the rapid pace of changes brought by agile processes. Increasing productivity while lowering costs seems like a difficult proposition, especially since increased demands are placed on operations staff to manage a variety of rapidly evolving applications across the environment. How can IT operations teams best manage this situation?

The IT big data challenge: Managing enormous amounts of data

Everything from system successes to system failures and all points in between is logged and saved as IT operations data. IT services, applications, and technology infrastructure generate data every second of every day. All that raw, unstructured, or polystructured data is needed to manage operations successfully. The problem is that doing more with less requires a level of efficiency that can come only from complete visibility and intelligent control based on the detailed information coming out of IT systems.

Frequent changes occur in IT operations

Operations staff are responsible for the health of business services and all the involved layers (applications, network, and storage, for instance), and so it is understandable that they might resist introducing unpredictable changes within applications or the IT infrastructure. In fact, IT operations are often rewarded for consistency and for preventing the unexpected or unauthorized from happening.

However, solving business problems requires creativity and flexibility to meet the frequent changes dictated by business requirements. New agile approaches eschew the standard method of releasing software in infrequent, highly tested, comprehensive increments in favor of a near-constant development cycle that produces frequent yet relatively minor changes to production applications. With hundreds or thousands of dependencies, even if agile iterations are properly tested throughout development, unforeseen problems can arise in production that can seriously affect stability.

Change grows error risk

Since every IT service is based on many parameters from different layers, platforms, and infrastructure, a small change in one parameter among the millions of others can create a significant impact. When this happens, finding the root cause can take hours or days, particularly given the pace and diversity of changes. In many cases, unplanned changes lie at the root of failures and can create business and IT crises that must be resolved quickly to avoid productivity and revenue losses.

Traditional approaches have failed

Problems can be difficult to manage or even identify because so many businesses rely only on monitoring software, which is not sufficient on its own to address the above challenges. In fact, problems are often not detected until they have grown out of control. If these issues are not resolved quickly, the result is downtime, which can be crippling.

The technological infrastructure running an enterprise or organization generates massive streams of data in such an array of unpredictable formats that it can be difficult to leverage using traditional methods or handle in a timely manner. IT operations management based on a collection of limited functions and non-integrated tools lacks the agility, automation, and intelligence required to maintain stability in today’s dynamic data centers. Collecting data, filtering it to make it more manageable, and presenting it in a dashboard is nice but not prescriptive.

Automation is not enough for reducing error

One of the holy grails of IT management is intelligent IT automation. Today, automation targets the repetitive, well-known, mundane activities. This frees up people and resources for more innovative work and offers a more agile, faster response from IT.

However, while automation is an important tool in the kit, it is just one of the tools. The effort required to automate a complex environment grows in proportion to its complexity. Essentially, automation is just another generation of scripting: operational activities codified into scripts that spawn and manage subordinate automated tasks.

New solutions: The rise of IT Operations Analytics (ITOA)

Given that changes to the operational model are almost guaranteed, a change in perspective is needed in which IT operations takes a proactive approach to service management. Applying big data concepts such as automated analytics to the reams of data collected by IT operations tools allows IT management software vendors to efficiently address a wide range of operational decisions. Because of the complexity and dynamics of environments and processes, organizations need automation that is analytics driven: there is an action, there is a reaction, and then there is analysis of the reaction, which produces corrective actions that can be taken in the future.
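The action/reaction/analysis cycle described above can be sketched as a simple feedback loop. All function names, metrics, and thresholds below are illustrative assumptions, not any vendor's actual implementation:

```python
# Illustrative sketch of an analytics-driven automation loop: act, measure
# the environment's reaction, analyze it, and derive a corrective action
# that future automation can apply. All names here are hypothetical.

def analyze(action, baseline, observed):
    """Compare the observed metric to the baseline; suggest a correction."""
    delta = observed - baseline
    if abs(delta) <= 0.1 * baseline:  # within 10% of baseline: no action
        return None
    return {"action": action, "adjust": -delta}  # push back toward baseline

def run_loop(actions, baseline, measure):
    corrections = []
    for action in actions:
        observed = measure(action)               # the reaction
        fix = analyze(action, baseline, observed)  # the analysis
        if fix is not None:
            corrections.append(fix)              # future corrective actions
    return corrections

# Example: a fake measurement where "deploy" spikes response time.
readings = {"deploy": 180.0, "scale-out": 102.0}
fixes = run_loop(["deploy", "scale-out"], baseline=100.0, measure=readings.get)
print(fixes)  # only "deploy" falls outside the 10% band
```

The point of the sketch is the shape of the loop, not the arithmetic: each pass turns observed reactions into recorded corrections rather than leaving the analysis to a human after the fact.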

With all this data, ITOA tools stand as powerful solutions for IT, helping to sift through the big data and generate actionable insights. Analysts at Gartner and other industry experts have voiced their enthusiasm for the technology, saying “ITOA solutions can be used by I&O leaders to analyze previously untapped business-relevant data and context contained within IT components.”

ITOA can surface the necessary insight buried in piles of complex data and help IT operations teams proactively determine the risks, impacts, or potential outages that may result from events in the environment (e.g., application and infrastructure changes). By offering a new way for operations to proactively manage IT system performance, availability, and security in complex, dynamic environments with fewer resources and greater speed, ITOA contributes to both the top and bottom line of any organization, cutting operations costs and increasing business value through a better user experience and more reliable business transactions.

One such provider of ITOA solutions is Evolven, a company whose change management solution I wrote about last November. I spoke about ITOA with Sasha Gilenson, the founder and CEO of Evolven.

Scott Matteson: What’s going on now with IT Operations Analytics?

Sasha Gilenson: We’re going in two directions:

1) Expanding the capabilities of our analytics. We have identified new attributes of changes and assess them historically and statistically. We also added multidimensional risk analysis that calculates the potential risk of detected changes.

2) The other direction for us is blending sources of data through a focus on change. Change is one of the primary sources of operational issues. If you look at change in a broad sense — in applications, infrastructure, workload, and data — your environment will be extremely stable and secure if no change takes place.

If you analyze your environment through APM or event data, you essentially need to reverse engineer it to identify a potential issue or to nail down the root cause of an incident. Your CPU is high and your transactions slow down because there is an extremely slow Java query in your code. Why? There is a good chance that you changed either your application or your infrastructure configuration. However, it’s nearly impossible to deduce this from performance data. By contrast, if you start from a change and assess it using other data sources, this top-down analysis is much more effective. So blending data sources through the prism of change can provide very accurate operational insights.

Scott Matteson: Thinking about IT Operations Analytics now, it seems to be becoming a lot more popular, as you know. What can you say about the topic that makes it different from other trends we’ve seen, that makes it all-encompassing and applicable to numerous environments? In other words, what is really the driving factor behind it?

Sasha Gilenson: Well, there are multiple driving factors making ITOA spread like wildfire. One is people. Traditionally, operations folks are recognized by their ability to solve problems. However, the number of problems is growing due to the complexity and rapid pace of change in IT environments, while the number of people is not (the infamous “do more with less” strategy), so you need to find a way to reduce the number of fires.

Another factor is high complexity. Modern infrastructure combines elements of physical, virtual, and cloud environments. The infrastructure and software used are multi-tiered and mix legacy and modern technologies. On top of that, each environment component comes with tons of configuration to accommodate the needs of different users. We (Evolven) work with a number of Fortune 500 clients. All of them have switched to DevOps and Agile development, changing their environments and systems every day. Complex environments that change all the time increase the number of fires that need to be put out. One approach to reducing fires is to recognize what is happening in your environment that can lead to trouble. Big data and IT analytics are part of that. If you want to identify issues proactively — performance, capability, and security — you need to use all of the IT operational data and analyze it.

Scott Matteson: There’s a lot of interesting components there — lots of moving parts. Since we last spoke about Evolven, can you tell us what has changed since then; what new stuff have you guys developed?

Sasha Gilenson: In terms of developing new stuff, some of what we created is in risk analysis — essentially we look into different dimensions of change. Is the change consistent? Is the change authorized? Is the change frequent? Is its potential impact severe? Is it happening in production or in disaster recovery? We look at all of these dimensions of change to calculate a risk factor. This leads to a proactive approach. You have all sorts of changes happening in your environment, and it’s difficult to know what to focus on. We recommend focusing on the ones the analysis flags as high-risk. So one big piece is multidimensional risk analysis. The second is search. We’ve found that users frequently have an idea of the area or boundaries where things are happening, and they want to be able to search across the big data and make queries.

We’ve developed a search mechanism to narrow down the data used for analysis. We are also developing an engine capable of adding APM and log data, service management data, and other data sources as dimensions for change analysis. We can pull information from change management data to correlate detected changes and see whether they were authorized. We can pull information from log systems to identify who actually made the change. Risk analysis, search, and the blending of various data sources are what we’re developing.
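The multidimensional risk analysis Gilenson describes can be illustrated with a toy scoring function. The dimensions below come from his answer; the weights, threshold, and names are my own assumptions for illustration, not Evolven's actual model:

```python
# Hypothetical sketch of multidimensional change risk scoring. The dimensions
# mirror those named in the interview; weights and threshold are invented.

RISK_WEIGHTS = {
    "inconsistent": 3,   # change differs across supposedly identical hosts
    "unauthorized": 4,   # no matching approved change request
    "infrequent": 2,     # rarely seen changes are more suspicious
    "severe_impact": 4,  # touches a critical parameter
    "production": 2,     # happened in production rather than DR/test
}

def risk_score(change_flags):
    """Sum the weights of every risk dimension the change triggers."""
    return sum(w for dim, w in RISK_WEIGHTS.items() if change_flags.get(dim))

def triage(changes, threshold=6):
    """Return only the high-risk changes, highest score first."""
    scored = [(risk_score(flags), name) for name, flags in changes.items()]
    return sorted([c for c in scored if c[0] >= threshold], reverse=True)

changes = {
    "web.conf timeout": {"inconsistent": True, "production": True},
    "db perms": {"unauthorized": True, "severe_impact": True,
                 "production": True},
}
print(triage(changes))  # [(10, 'db perms')]
```

The design point is the one Gilenson makes: with thousands of changes happening, a combined score across several dimensions tells operators which few changes deserve attention first.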

Scott Matteson: My readers are always interested in specific real-world examples of how product users performed certain functions. Can you share a couple of real-world examples?

Sasha Gilenson: Sure. We have a great example that illustrates the value of the risk analysis component in the analytics approach. We have a client that was required to execute a full failover of its production environment into disaster recovery: its production systems had to run on the disaster recovery infrastructure for a couple of weeks. The first time this test was required, there was concern about moving all of their transactions to an infrastructure that had never been active. They used Evolven to compare the production and disaster recovery environments and analyze their consistency. This helped them discover critical differences in a very short time, and they were able to correct their disaster recovery environment to make it ready for use. This example shows how very complex environments can be aligned to save a lot of pain and trouble.

Another example is a customer that was moving from an old data center to a new one; they were able to compare and analyze configurations between the old and new sites to facilitate the transition. They found an incorrect DLL file in use in the new data center, even though all the scripts had executed and no error messages were recorded. This could have caused a major problem, so catching it saved them hours and hours of availability issues.

A third example involves a change to the permissions of some files in a system. Analytics picked this up and identified it as a high-risk change, finding two files with incorrect permissions out of thousands and thousands of deployed files.

Every day we see customers avoid issues by using analytics on mountains of information to pick out the changes and inconsistencies that can create a massive negative impact.
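The kind of environment comparison running through these examples can be sketched as a configuration diff. Everything here is hypothetical: the parameter names, values, and the permission entry are made up for illustration:

```python
# Minimal sketch of comparing two environments' configuration state, in the
# spirit of the production-vs-DR and data center migration examples above.
# File permissions are modeled as just another configuration parameter.

def diff_environments(reference, target):
    """Return parameters missing from the target or differing in value
    from the reference environment."""
    differences = {}
    for key, ref_value in reference.items():
        tgt_value = target.get(key)  # None if the parameter is missing
        if tgt_value != ref_value:
            differences[key] = (ref_value, tgt_value)
    return differences

production = {"jvm.heap": "8g", "db.pool": 50, "/app/b.key:mode": "600"}
recovery = {"jvm.heap": "4g", "db.pool": 50, "/app/b.key:mode": "777"}

print(diff_environments(production, recovery))
# {'jvm.heap': ('8g', '4g'), '/app/b.key:mode': ('600', '777')}
```

Real tools collect and normalize this state automatically across layers; the sketch only shows why a diff against a known-good environment surfaces a wrong value (or wrong permission) out of thousands of matching ones.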

Scott Matteson: Lots of change is coming, no pun intended. One of the things people always want to know about any given topic is how it fits into the cloud. Is there any specific advantage, benefit, or drawback to using cloud systems versus in-house systems?

Sasha Gilenson: The role of analytics changes in the cloud. Cloud introduces a high level of automation and virtualization, reducing the number of issues related to manual work. You don’t deploy a change manually; you use tools on the cloud infrastructure. You get to the point where your infrastructure runs automatically. However, if something happens, you don’t really know what you have or what has happened. How is the current state of the cloud different from the systems that were working earlier? You have a mix of older and newer server instances that might be creating the issue you’re investigating. Providing a layer that monitors the cloud environment and analyzes all the environment dynamics reduces the number of issues caused by changes and drastically accelerates issue resolution. In the cloud you have fewer issues, but they tend to be bigger than in traditional data centers.

Scott Matteson: What do you guys see coming down the road both in terms of what you’re doing and what’s happening in the data center? Say, five years from now?

Sasha Gilenson: I think we’ll see more and more intelligence in the data center. Analytics will drive this intelligence, allowing us to raise the level of IT automation including self-healing and self-administration of the environments. Within 5 to 10 years multiple vertical analytics solutions will merge into a unified analytics layer serving as a brain of automated IT environments.

Scott Matteson: One thing I like about the IT Operations Analytics environment is that it’s vendor-agnostic; it seems to apply to numerous platforms. You’re not tied into Microsoft or Linux or Dell or HP. Would you say there are any vendors that integrate better with ITOA? Or is the field level?

Sasha Gilenson: There actually are vendors that integrate better with ITOA. Analytics depends on the data and on being accurate. Some technologies are more open in terms of sharing the information they hold. Some vendors are more structured and organized with their information, which makes it more accurate. For example, database vendors are generally very good with configuration content, offering clear access (via API or SQL queries) to the configuration. Application servers are more convoluted and more difficult to work with. For example, Microsoft SharePoint can be a very powerful content management system, but the structure of its configuration data and the interfaces to access it are very complex.

Scott Matteson: Anything else that we didn’t cover; any announcements or possibilities — things that might be good trivia for readers to know about?

Sasha Gilenson: We’re striving to maintain our position among the top analytics vendors. As such, we are a leader in developing IT knowledge of the analytics space.

The question is, what will happen and where are we going? Our approach is to look at CHANGE as the cornerstone of everything that happens in IT environments. Eventually there will be a unified analytics layer. We believe that unification can succeed only if based on our approach of blending all the big data of IT operations through the prism of change.

I’d like to thank Mr. Gilenson for taking the time to speak with me about ITOA. You can check out his company’s website for more information.