Can data mining predict the future of your enterprise?

Data mining is one of the 10 emerging technologies that will change the world, according to MIT's Technology Review. This article provides a basic overview of this powerful technology.

Consolidating your enterprise data into marts and warehouses lets you query that data for high-value insights and information. Online analytical processing (OLAP) tools are quite powerful for conventional analysis and reporting, but they cannot easily identify unusual or complex cause-and-effect relationships. Data mining technology represents a different class of business intelligence tool, one designed to “mine” your data and extract this sort of complex information.

Since data mining tools are becoming more affordable, they are starting to find their way into the mainstream as desktop applications. Although data mining is powerful (one of MIT Technology Review's 10 emerging technologies that will “change the world”), you’ll need to develop a basic understanding of the tools before reaping the benefits of data mining technology. This article covers a few essential definitions, as well as some key issues you’ll want to consider when assessing your data mining options and strategy.

The promise of data mining is compelling. Metaphorically, it’s like having a crystal ball at your disposal to provide you with rich and meaningful business insights. For example, you might use data mining to do such things as:
  • Develop “bull's-eye” contact lists and targeted messages that dramatically improve results of sales and marketing initiatives.
  • Determine that a particular insurance claim is likely fraudulent.
  • Describe your customers and their buying patterns and then stock merchandise accordingly.
  • Predict that customers who purchase product "A" are likely to purchase product "B" at the same time.

Data mining defined
The term data mining is often erroneously used to describe routine slicing and dicing of data. I often hear phrases like, “We slice and dice our data to ‘mine’ information.” But data mining actually refers to a different concept altogether. It involves the application of intelligent and complex software algorithms against your data warehouse (or other data source) to recognize patterns not apparent through simpler analysis methods. Valuable predictive or descriptive models are developed in this way. With data mining, the interrogation of the data is done automatically by software containing the data mining algorithm, while with traditional OLAP, it is done hands-on by the user.

Data mining models
At the core of the data mining process is a model, which is something of a black box: Your data goes in one end, and useful predictions or descriptions come out the other. The model contains complex rules that produce those predictions and descriptions, and the rules must be adjusted and fine-tuned using your data before they become accurate. In this way, models are said to be trained.

Once the models are trained, their accuracy is tested using your data. Because it’s best not to use the same data set for both training and testing, your data set is typically divided into two parts: one used for training and one held back for testing. Testing typically involves running the model against the test data to generate predictions (or descriptions) and then comparing them with known outcomes. Once the model is trained and tested, it may be used to make predictions using a set of fresh input data.
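
To make the train-then-test cycle concrete, here is a minimal Python sketch. It is an illustration only, not how any particular data mining product works: the records are randomly generated, the "model" is a single threshold on a claim amount, and the names make_records, split, train, and accuracy are hypothetical helpers invented for this example.

```python
import random

# Hypothetical toy records: (claim_amount, is_fraudulent). The values are
# randomly generated for illustration and do not come from any real data set.
def make_records(n, seed=0):
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        amount = rng.uniform(0, 10_000)
        # Assumed pattern: large claims are fraudulent more often.
        fraud = rng.random() < (0.8 if amount > 6_000 else 0.1)
        records.append((amount, fraud))
    return records

def split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records, then hold out a fraction for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(threshold, records):
    """Share of records where 'amount > threshold' matches the known outcome."""
    hits = sum((amount > threshold) == fraud for amount, fraud in records)
    return hits / len(records)

def train(records):
    """Training here means picking the threshold that best fits the training data."""
    candidates = sorted(amount for amount, _ in records)
    return max(candidates, key=lambda t: accuracy(t, records))

records = make_records(500)
train_set, test_set = split(records)
threshold = train(train_set)          # fit the rule on the training part only
print(f"learned threshold: {threshold:.0f}")
print(f"training accuracy: {accuracy(threshold, train_set):.2f}")
print(f"test accuracy:     {accuracy(threshold, test_set):.2f}")
```

The number to watch is the test accuracy, because it is measured on records the model never saw during training. A model that scores well on its training data but poorly on the held-out data has memorized rather than learned.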

Types of models
Data mining models fall into two classes: predictive and descriptive. Predictive models predict the value of a particular attribute (a dependent variable) based upon the values of other attributes (independent variables). In each of the following examples, the dependent variable is the quantity the question asks about:
  • What’s the likelihood of a particular long-distance customer switching to a competitor?
  • What’s a particular insurance claim’s likelihood of being fraudulent?
  • How susceptible is a patient to acquiring a certain disease?
  • What’s the likelihood that a particular student will be successful at college?
  • How likely is a certain customer to place an order?
  • What’s the revenue a new customer will generate during the next year?

Descriptive models are the reverse of predictive models—you know the outcome and you’re looking for contributing attributes. These models are used to describe characteristics—for example, those of your long-distance customers (age, gender, income, education, number of children, etc.) who did switch to a competitor.
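
As a rough illustration of the descriptive direction, the Python sketch below starts from a known outcome (switched or stayed) and summarizes the attributes of each group. It is a deliberately simplified stand-in for a real descriptive model: the customer records are invented, and profile is a hypothetical helper written for this example.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records, invented for illustration only.
customers = [
    {"age": 24, "income": 31_000, "education": "college", "switched": True},
    {"age": 29, "income": 38_000, "education": "college", "switched": True},
    {"age": 52, "income": 72_000, "education": "graduate", "switched": False},
    {"age": 47, "income": 65_000, "education": "college", "switched": False},
    {"age": 31, "income": 35_000, "education": "high school", "switched": True},
    {"age": 60, "income": 58_000, "education": "high school", "switched": False},
]

def profile(records):
    """Summarize a group: means of numeric attributes, most common category."""
    return {
        "count": len(records),
        "mean_age": mean(r["age"] for r in records),
        "mean_income": mean(r["income"] for r in records),
        "top_education": Counter(r["education"] for r in records).most_common(1)[0][0],
    }

# The outcome is already known; the question is what each group looks like.
switchers = profile([c for c in customers if c["switched"]])
stayers = profile([c for c in customers if not c["switched"]])
print("switched:", switchers)
print("stayed:  ", stayers)
```

In this made-up sample, the customers who switched are younger and have lower incomes than those who stayed. A real tool would search many more attributes and test whether such differences are statistically meaningful.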

Of course, nothing is ever quite that simple, and there are various types of predictive and descriptive models, as you see here:
  • Classification models predict which class an outcome falls into (e.g., fraudulent or not fraudulent), given a set of input characteristics.
  • Regression models predict a real-number outcome, given a set of input characteristics (e.g., a customer’s expenditures over the next year).
  • Association models predict the occurrence of a second event, given the occurrence of another. For example, beer purchasers buy peanuts 75 percent of the time.
  • Sequencing models predict the sequence of events. For example, those who rent Star Wars then rent Empire Strikes Back, then Return of the Jedi, in that order.
  • Clustering models describe a natural group of things—such as vacation destinations by age group and income. Clustering is used extensively for determining market segments.
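
A figure like the 75 percent in the association example is usually called the confidence of a rule. The short Python sketch below shows one way to compute it. The baskets are invented and were chosen so that the result matches the example; support and confidence are hypothetical helper functions written for this illustration, not part of any specific tool.

```python
# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"beer", "peanuts"},
    {"beer", "peanuts", "chips"},
    {"beer", "peanuts"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"peanuts", "milk"},
]

def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= b for b in with_antecedent) / len(with_antecedent)

# Three of the four beer baskets also contain peanuts.
print(confidence({"beer"}, {"peanuts"}, baskets))
# Three of the six baskets contain both items.
print(support({"beer", "peanuts"}, baskets))
```

Confidence alone can mislead when the second item is popular with everyone, so it is worth looking at support as well before acting on a rule.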

You still need an expert
Does the notion of having an affordable crystal ball at your disposal sound too good to be true? Probably so, and here’s the catch. While most experts agree that data mining can offer major advantages over traditional (manual) statistical approaches, many feel that the optimal use of data mining tools still requires the services of bona fide professionals.

At the heart of these discussions is the potential for reaching invalid conclusions through misuse of the tools and then acting on those conclusions to the detriment of the enterprise. Producing accurate and actionable predictions from raw data is serious business. Choosing the appropriate algorithm for the job is not a beginner activity and neither is the preparation of data to make sure it's worth mining in the first place.

While data mining is an evolving technology, the promise of its insights is often too compelling to ignore. However, there is a real possibility of failure—defined as obtaining no answer, obtaining the wrong answer, or misinterpreting the answer. My advice is to proceed with caution. Start by deciding exactly what you're trying to prove or understand. If you have no clear goal in mind, you'll be flailing in a sea of distracting correlations. Know that it requires skill to acquire clean, high-quality, and nonbiased data for input, to select appropriate mining algorithms, and to validate the reasonableness of any results through proper statistical testing. Perhaps all of data mining's benefits, as well as its difficulty and costs, lie in performing these steps well—and doing so requires more than superficial understanding.

Where’s your crystal ball?
What methods does your enterprise use to forecast future trends and analyze past and present performance? Have you tried data mining? Do you feel a need for more thorough data analysis than you receive from conventional processes? Share your experience by sending us an email or posting a comment below.
