Big Data

Data warehousing and mining basics

Enterprise data is the lifeblood of a corporation, but it's useless if it's left to languish in data silos. Data warehousing and mining provide the tools to bring data out of the silos and put it to use.

Traditionally, enterprise data has been kept in information silos that are physically separate from other data repositories and serve specialized functions. Enterprise-wide reporting was difficult at best, requiring multiple data extracts and reformulation. All this data manipulation exacted a high cost in terms of accuracy and timeliness. Fortunately, the technology sector has anted up new data warehousing and mining tools to provide assistance.

Data warehousing
Data warehouses offer organizations the ability to gather and store enterprise information in a single conceptual enterprise repository. Basic data modeling techniques are applied to create relationship associations between individual data elements or data element groups. These associations, or “models,” often take the form of entity relationship diagrams (ERDs). More advanced techniques include the star schema and snowflake data model concepts. Regardless of the technique chosen, the goal is to build a metadata model that conceptually represents the information usage and relationships within the organization.
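A star schema can be sketched in a few lines of plain Python: a central fact table holds keys and measures, and the surrounding dimension tables hold descriptive attributes. All table names and figures below are hypothetical, invented only to illustrate the shape of the model.

```python
# A minimal star-schema sketch: one fact table whose rows reference
# dimension tables by key. Data are invented for illustration.

# Dimension tables: descriptive attributes, one row per key.
dim_store = {1: {"city": "Detroit"}, 2: {"city": "Atlanta"}}
dim_product = {10: {"line": "rings"}, 11: {"line": "necklaces"}}

# Fact table: one row per sale, holding only keys and measures.
fact_sales = [
    {"store": 1, "product": 10, "revenue": 1200.0},
    {"store": 2, "product": 11, "revenue": 800.0},
    {"store": 2, "product": 10, "revenue": 450.0},
]

# A query joins facts back to dimensions through the keys --
# here, total revenue per city.
by_city = {}
for row in fact_sales:
    city = dim_store[row["store"]]["city"]
    by_city[city] = by_city.get(city, 0.0) + row["revenue"]

print(by_city)  # {'Detroit': 1200.0, 'Atlanta': 1250.0}
```

A snowflake model follows the same pattern, except that the dimension tables are themselves normalized into further lookup tables.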

Leveraging the metadata model, enterprise users can then apply elementary data analysis techniques to gather business knowledge. For example, ad hoc queries can be run against the data warehouse to extract enterprise-level information. These queries would supply information that was impossible to obtain under the legacy system of disparate information silos.

More advanced data warehouse toolsets incorporate the concept of multidimensional data, or data cubes. This data structure allows information to be multi-indexed, which allows for rapid drill-down on data attributes. Data cubes are usually used to perform what-if scenarios over identified data indices. For example, suppose Company X sells jewelry and has offices in Detroit, Pittsburgh, and Atlanta. If the proper attributes were chosen as indices, a user could perform the following analyses.
  • What was the enterprise’s total revenue for 2001?
  • What was Atlanta’s revenue in November?
  • If there were a 30 percent increase in orders during the first quarter of 2002, what would the year-end revenue be for Pittsburgh?
  • If the Detroit office were closed, what would the impact be to the bottom line?
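A toy version of such a cube can be sketched with revenue indexed by (city, month). The figures are invented; the point is that each question above becomes a lookup or an aggregation over the indices.

```python
# A toy data cube for the Company X example: revenue multi-indexed
# by (city, month). All figures are hypothetical.
cube = {
    ("Atlanta", "Nov"): 50_000, ("Atlanta", "Dec"): 70_000,
    ("Pittsburgh", "Nov"): 30_000, ("Pittsburgh", "Dec"): 40_000,
    ("Detroit", "Nov"): 20_000, ("Detroit", "Dec"): 25_000,
}

# Total enterprise revenue: aggregate over every index.
total = sum(cube.values())  # 235000

# Drill down on attribute values: Atlanta's November revenue.
atlanta_nov = cube[("Atlanta", "Nov")]  # 50000

# What-if scenario: bottom-line impact of closing Detroit.
without_detroit = sum(v for (city, _), v in cube.items()
                      if city != "Detroit")  # 190000
```

A production OLAP engine precomputes and indexes these aggregations across many dimensions, but the drill-down logic follows this same shape.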

This multidimensional analysis of multiple business views is called Online Analytical Processing (OLAP). The primary function of OLAP systems is to provide users the ability to perform manual exploration and analysis of enterprise summary and detailed information. It is important to understand that OLAP requires the user to know what information he or she is searching for. OLAP techniques do not process enterprise data for hidden or unknown intelligence.

Data mining
Enter the concept of data mining. During the mid- to late 1990s, commercial vendors began exploring the feasibility of applying traditional statistical and artificial intelligence analysis techniques to large databases for the purpose of discovering hidden data attributes, trends, and patterns. This exploration evolved into formal data-mining toolsets based on a wide collection of statistical analysis techniques.

For a commercial business, the discovery of previously unknown statistical patterns or trends can provide valuable insight into the function and environment of its organization. Data-mining techniques allow businesses to make predictions of future events, whereas OLAP only gives an analysis of past facts. Data-mining techniques can generally be grouped into one of three categories: clustering, classifying, and predictive.

Clustering techniques group information based on a set of input patterns using an unsupervised or undirected algorithm. One example of clustering could be the analysis of business consumers for unknown attribute groupings. Input to this example would be well-defined consumer attributes over which the algorithm would search.
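The consumer example can be sketched with a tiny one-dimensional k-means, a common unsupervised clustering algorithm. The incomes below are invented; note that the algorithm receives no labels and discovers the two groupings itself.

```python
# An unsupervised-clustering sketch: one-dimensional k-means over
# hypothetical consumer incomes (no labels are supplied).
incomes = [21, 23, 25, 80, 85, 90]

# Start with two guessed centers, then iterate: assign each point
# to its nearest center and move each center to its cluster's mean.
# (Assumes neither cluster empties, which holds for this data.)
centers = [incomes[0], incomes[-1]]
for _ in range(10):
    clusters = [[], []]
    for x in incomes:
        nearest = min(range(2), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(clusters[0]), sorted(clusters[1]))
# [21, 23, 25] [80, 85, 90]
```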

Classifying techniques group or assign objects to predetermined groupings based on well-defined attributes. The groupings are often clusters discovered using the above techniques. An example would be assigning a consumer to a particular sales cluster based on their income level.
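The income example can be sketched as nearest-centroid classification: each predetermined grouping is summarized by a centroid, and a new consumer is assigned to the closest one. The group labels and centroid values are hypothetical.

```python
# A classification sketch: assign a new consumer to one of two
# predetermined income groupings (e.g. clusters discovered earlier)
# by nearest centroid. Labels and centroids are invented.
centroids = {"budget": 23.0, "premium": 85.0}

def classify(income):
    # Pick the grouping whose centroid is closest to the input.
    return min(centroids, key=lambda label: abs(income - centroids[label]))

print(classify(30))  # budget
print(classify(70))  # premium
```

The contrast with clustering is that the groupings here are fixed in advance; the algorithm only decides membership.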

Predictive techniques take as input known attributes regarding a particular object or category and apply those attributes to another similar group to identify expected behavior or outcomes. For example, if a group of individuals wearing helmets and shoulder pads is known to be a football team, we can expect another group of individuals with helmets and pads to be a football team as well.
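In the spirit of the football example, a predictive step can be sketched as matching a new group's attributes against groups whose outcomes are already known. The attribute sets and labels below are invented for illustration.

```python
# A predictive sketch: attributes observed on known groups are used
# to predict the category of a new, similar group. Data are invented.
known = [
    ({"helmets", "shoulder pads"}, "football team"),
    ({"skates", "sticks"}, "hockey team"),
]

def predict(attributes):
    # Expect the outcome of the known group sharing the most attributes.
    best = max(known, key=lambda pair: len(pair[0] & attributes))
    return best[1]

print(predict({"helmets", "shoulder pads", "cleats"}))  # football team
```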

Data-mining techniques
The following list describes many data-mining techniques in use today. Each of these techniques exists in several variations and can be applied to one or more of the categories above.
  • Regression modeling—This technique applies standard statistics to data to prove or disprove a hypothesis. One example of this is linear regression, in which variables are measured against a standard or target variable path over time. A second example is logistic regression, where the probability of an event is predicted based on known values in correlation with the occurrence of prior similar events.
  • Visualization—This technique builds multidimensional graphs to allow a data analyst to decipher trends, patterns, or relationships.
  • Correlation—This technique identifies relationships between two or more variables in a data group.
  • Variance analysis—This is a statistical technique to identify differences in mean values between a target or known variable and nondependent variables or variable groups.
  • Discriminant analysis—This is a classification technique used to identify, or “discriminate,” the factors leading to membership within a grouping.
  • Forecasting—Forecasting techniques predict variable outcomes based on the known outcomes of past events.
  • Cluster analysis—This technique reduces data instances to cluster groupings and then analyzes the attributes displayed by each group.
  • Decision trees—Decision trees separate data based on sets of rules that can be described in “if-then-else” language.
  • Neural networks—Neural networks are data models that are meant to simulate cognitive functions. These techniques “learn” with each iteration through the data, allowing for greater flexibility in the discovery of patterns and trends.
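The first technique above, linear regression, can be sketched in a few lines: fit y = a + b·x by ordinary least squares and use the fitted line to forecast. The monthly revenue figures are invented for illustration.

```python
# A regression-modeling sketch: ordinary least-squares fit of
# y = a + b*x over hypothetical monthly revenue figures.
xs = [1, 2, 3, 4]              # month index
ys = [10.0, 12.0, 14.0, 16.0]  # revenue (invented)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by variance of x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(a, b)       # intercept 8.0, slope 2.0
print(a + b * 5)  # forecast for month 5: 18.0
```

The same fitted line doubles as a simple forecasting model: extrapolating it to a future month predicts that month's revenue from past outcomes.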

Organizations today are under tremendous pressure to compete in an environment of tight deadlines and reduced profits. Legacy business processes that require data to be extracted and manipulated prior to use will no longer be acceptable. Instead, enterprises need rapid decision support based on the analysis and forecasting of predictive behavior. Data-warehousing and data-mining techniques provide this capability.
