You’ve soaked up all the conceptual hype about data warehouses and business intelligence. You know all the amazing things your users will be able to do with a data warehouse. But what does it take to actually build one?
Well, let’s start out by stretching credibility even further, and throw in a cliche. When you’ve built a data warehouse, you’ve built something better than any application you already have, and you’ve done it with less effort. Why? Because data warehouses and the processes that spring from them are largely made out of creativity, and it will be supplied after the fact by your users.
A data warehouse, of course, isn’t actually an application itself, but an enabling tool that allows select individuals in your user community to develop high-powered forecasting and performance monitoring functions. You haven’t provided an application per se; you’ve given the company a very sophisticated application development kit that can be directly put to work by users.
If you’re an ERP house (running SAP R/3, Oracle, PeopleSoft, etc.), then you have available to you kits for assembling your warehouse, to which you need add only storage (these kits—SAP BW and Oracle Warehouse Builder—will be covered in future articles). But even if you are running on a more conventional database platform, the data warehouse is easily within your reach.
ERP or no, your basic task is this: Create a storage system that consolidates vast amounts of relevant data and stores it in such a way as to maximize its convertibility into useful information. This point is key: In a conventional operational system, applications turn data into information; in a data warehouse, data is converted into useful information at the time it is stored.
Data, data everywhere
Get the data into the warehouse
I won’t belabor the data warehouse or business intelligence concepts. Suffice it to say, you’re constructing a data acquisition system that will regularly pull in vast amounts of data from many sources and leave it stored and configured for optimum accessibility and manipulation. With this in mind, consider that since the whole point of a data warehouse is to consolidate data that is difficult to pull together otherwise, your design must allow for data acquisition from multiple databases in-house, from remote locations in your company WAN, and from external sources.
Transform the data for optimum usage as you load it
As you extract data from various sources, you will process it in several ways as you tuck it away in the warehouse:
- Integration–Common fields used in data structures from different sources must be reconciled, both structurally and content-wise. Differing measurements must be transformed. Fields of differing lengths and formats must be negotiated.
- Condensation–Where possible, condense data at the time of extraction and before loading into the warehouse. How to do this will be negotiated between you and your user community, as you go. Do you need every last transaction, or can you bounce them down to daily summaries? If you fail to ask this at every opportunity, your warehouse will rapidly collapse.
- Stabilization–How often does data change? The name of a product almost never does. Its price often does. Quantity/on-hand changes continuously. You can group data together by attribute, depending on the attribute’s stability. The power to configure data in this way has obvious design benefits.
- Normalization–Your data warehouse will be a high I/O environment, and you need to design it with this in mind. Consider that when data is normalized, you will have occurrences of related items in different locations. This isn’t I/O efficient. What you want to do is identify instances where the number of occurrences of a particular data item is stable enough (and, if you’re lucky, sequential) so that you can grab it with a single I/O. When your warehouse is up and really humming, it will matter.
Pay close attention to these processes. The more effort you apply to make them thorough and efficient, the easier your life will be when the system goes live.
Equip the system with superior tools
In addition to the extra storage you’ll need for warehouse data, you’re going to need several pieces of key software:
Extract-Transform-Load (ETL)–This software, the workhorse of your data warehouse, is self-explanatory. A word of warning: If you try to build this from scratch, you must have an extremely good reason or a touch of insanity. It will wind up costing you more than every other component combined. Go to the store and buy this (actually, you can buy ETL software, and you can buy nifty packages that will generate ETL software for you; check out the Data Warehouse Institute for a list of vendors).
Online Analytical Processing (OLAP)–If ETL is the horse, OLAP is the cart. Though you’ll do other things with your data warehouse, OLAP is the magic wand for the new tricks you want to empower your users to do. Unlike ETL, you can grow OLAP in your own basement.
Data mining and EIS–You know what these are. There are lots of vendors out there selling this software. Most of the products you can find with any search engine and research on the Web are specifically designed to cater to the data warehouse environment.
Low-level design considerations
Your data hierarchy is up to you, but utility for analysis will probably dictate: (1) Data, stored within (2) Tables, grouped by (3) Subject. And within this general hierarchy, you’ll be very conscious to date information at the lowest level possible, as most business analysis uses time as a critical factor.
Major design considerations
Granularity refers to the data’s level of detail. The more detailed a data item is, the more granular it is. One instance of a purchase order, for example, would have high granularity; a summary record of all purchase orders for a sales quarter would have low granularity.
This is the single greatest factor in determining how much storage you’re going to need for your warehouse over time, because it determines the volume of the data your analytics and reporting will require. As a rule, data is going to have high granularity levels (too high) coming in. In your ETL step, you need to break it down. Granularity vs. storage is a trade-off, to be sure, but remember that the more granular the data, the more flexible it is.
You must, for efficiency reasons, break data into physical units that can be easily handled using partitioning. In general, it is best to partition along Subject lines (see above). If you address this in design, life will be easier in development and operation later.
The users’ contribution: Iterative development
Don’t fool yourself (or senior management) into believing that the data warehouse is going to be completed on any particular date. A data warehouse can’t be explicitly defined, functionally, before it’s up and running, because neither you nor your executives, nor your trench-level users really know what you’re looking for. If you already knew what it looked like, you’d have created it in your operational system a long time ago. The end products of a data warehouse are one of those “you’ll-know-it-when-you-see-it” propositions.
For this reason, your data warehouse will never quite be finished. Your analytics will grow in sophistication with time; data will be refined and configured in new and different ways as results improve. A data warehouse is a growing, evolving thing, and so you’ll find it an iterative process, heavily dependent on your user community and their creative efforts to pull out information that improves performance. Don’t worry about defining it beforehand; sit back and enjoy the ride.