When you hear or see the words "data warehousing," what comes to mind—a large building that has bits of information stored on shelves or bins waiting for someone to retrieve them, perhaps?
If that's your vision, conceptually you are not far off. In a traditional warehouse, goods are stored in such a way to quickly locate and identify them as well as to maximize the speed in which they can be retrieved. A data warehouse functions in much the same way.
Now you may be saying, "How is this different from a traditional relational database?" The answer is that they are similar but have some defining differences. Typically, in an organization, data is stored in many databases, usually divided up by system. There may be data for marketing, sales, payroll, engineering etc. all in their unique databases, with, very possibly, logical relationships among and between one another, but with no physical relationships.
Answering a business question that requires data from more than one of these disparate pots of data is not an easy exercise. It would require someone to understand multiple data sets and a call to IT to request that DBAs do some data extraction. Typically, this is not something that is easy to do on an ad hoc basis.
A data warehouse is used to consolidate this data into a central store and to arrange it to make analysis easier, which means the data is less normalized than your traditional application database. Data is typically transferred in batch to the central store of the data warehouse during off-peak hours of the owning application, but it can happen in real time as well. Timing these data warehouse refreshes around backup activities and other scheduled batch processes can be quite a daunting task!
Getting the data into the warehouse is only part of the concept. Once in, it has to be easily manipulated in order to answer the business questions from management and other users. This is accomplished by connecting the data to fast and easy-to-use tools known as Online Analytical Processing (OLAP) tools.
Using our warehouse example above, we know that forklifts are often used to move about a warehouse and retrieve products from the shelves. The forklift operators are knowledgeable about where the products are and the fastest way to get to them. In this example, OLAP tools can be thought of as super high-speed forklifts that have knowledge of the warehouse and the operators built into them in order to allow ordinary people off the street to jump in and quickly find products by asking English-like questions.
Within the OLAP server, data is reorganized to meet the reporting and analysis requirements of the business, including:
- Exception reporting
- Ad-hoc analysis
- Actual vs. budget reporting
- Data mining (looking for trends or anomalies in the data)
In order to process business queries at high speed, answers to common questions are preprocessed in some OLAP servers, resulting in exceptional query responses at the cost of having an OLAP database that may be several times bigger than the data warehouse itself.
There are two major categories of OLAPS: those that use multidimensional databases or MOLAPS and those that use relational databases or ROLAPS. A MOLAP is a two-tier client server architecture where the OLAP database serves as both the database layer and the application logic layer. Therefore, both application logic and database access reside on one system.
ROLAP usually uses a three-tier architecture in which the database layer and the application logic layer are separated and the data is stored in relational databases rather than in the multidimensional databases of MOLAP.
Which is better depends on your needs. MOLAPS tend to be faster and more responsive and flexible to user requests than ROLAPS, which require more processing time than MOLAPS. However, ROLAPS generally support greater amounts of data and more users than a MOLAP.
There is a third category of OLAP that is becoming more and more prevalent, although some argue that it is not a category in and of itself. This category is known as hybrid online analytical processing (HOLAP). As you may have already guessed, HOLAP is a fusion or blending of the other two methods in which both multidimensional and relational databases are used. The trick with a HOLAP is to balance performance demands with storage demands to get the most out of the system.
TechRepublic has joined forces with Auerbach Publications to create the most complete data management resource for IT professionals. With IT Professional's Guide to Data Storage Design and Management CD-ROM, you'll learn how to manage your enterprise's data storage needs today and in the future.
Once we have established a data warehouse and chosen a tool for retrieving and manipulating the data, we can begin to ask questions of our data in a meaningful way. One way of doing this is by performing data mining. As I briefly mentioned earlier, looking for trends, patterns, and relationships in data is referred to as data mining. Data mining techniques include:
- Association: Looking for patterns of connected events
- Classification: Looking for new patterns
- Clustering: Finding and visually documenting groups of facts not previously known
- Forecasting: Discovering patterns in data that allow you to predict the future
Using these methods, analysts dig deep into data warehouses to discern valuable information that would not be readily apparent by querying traditional data stores. The data mining technique is often used in genetics research.
Another way of looking at our data is through the use of a decision support system. Decision support systems analyze data, often through the use of expert systems or artificial intelligence, and present it to the user in a form that makes it easier for them to use in making business decisions. These forms can take the shape of charts, graphs, tables, and reports.
There are literally hundreds of vendors who sell data warehousing, OLAP, data mining, and decision support system software with IBM, Oracle, COGNOS, and Microsoft to name a few. There are even more sources of information regarding data warehousing and its associated tools. A quick Web search on OLAP or data warehousing will provide you with volumes of information on the subject—and for good reason.
Data warehousing is a complex subject that requires expertise in order to set it up and use it properly. A tremendous amount of resources can be wasted in creating a data warehouse should the project not receive the proper planning and expert assistance. This article has provided you a base from which you can expand your knowledge of the subject and be able to read and understand the literature without getting hung up in the jargon.