At first glance, it seems logical to look to your online transaction processing (OLTP) systems—those that capture data—to service your business intelligence (BI) requirements. In most cases, however, organizations that have tried to combine their OLTP and BI data systems find themselves data rich and information poor.
Part of the problem stems from the traditional Entity-Relationship (ER) data model, upon which the majority of OLTP systems are based. As I discussed in an earlier article, the ER model is great for pulling data in but problematic when you attempt to pull it out.
For effective use of BI data, you need dedicated systems to handle the extraction and timely distribution of high-value information. Such systems are almost always based on the dimensional data model, which is what you’ll want to use to structure your data warehouse (see “Use this architecture to structure your business intelligence solutions”). The dimensional model will help you organize your information efficiently, both in your database and in your mind, and facilitate both rapid system design and flexible information retrieval. In brief, to be effective at BI, you’ll want to learn to think dimensionally.
In this article, I’ll discuss the components of the dimensional data model, and I’ll tell you why building your data warehouse around it makes it much easier to understand and organize your data.
A star is born
The dimensional model is often called a “star schema” because of its shape (see Figure A below). It is simple and symmetrical, especially when contrasted with the ER model I mentioned above. Because of its simplicity, the dimensional model makes it easy to conceptualize your information.
The example shown in Figure A is a simple dimensional data warehouse structure, such as might be used for a chain of retail food stores, for instance. Note the single, prominent facttable shown at the center of the model. The fact table is the core of the model, around which the dimension tables are built. There is never more than one fact table, but there can be many dimensiontables. Because this is a relational database, keys relate the fact table’s facts with the appropriate dimensions. At a minimum, you’ll almost always see three dimension tables, though 10 or more are not unheard of. It can be confusing, however, if there are too many.
The principle of the dimensional model’s operation is simple. Note that each record in the fact table contains numeric measures [e.g., sales dollars, number of units, as well as (foreign) keys] that relate the fact to the appropriate member in each of the dimension tables. In this way, assuming our example is appropriately populated with data, you can state that your Nashville store sold 140 units of “House Brand Aspirin (100 ct.),” on April 14, 2000, totaling $219.50 in sales.
|A star schema representation of the dimensional data model|
A fact table may contain almost any piece of data that is relevant to the business of the enterprise. For example, in a supermarket chain, a fact may represent the total number of items “swiped” at the checkout or a daily summary of sales, by product. Depending on the organization of data, a single business fact can be represented by a single row or a single record. Because of its greater detail, the first (i.e., organization by row) is said to be more granular.
Granular data allows increased precision and flexibility of analysis. However, because it’s more expensive to both store and move around during consolidation, high-granularity data can drive up costs. In most cases, it’s better to err on the side of more granularity because of the flexibility it offers.
Each fact record is composed of a set of foreign keys—one key for each dimension—and one or more numeric measures. The appropriate measures are best determined by your business objectives and may include things like sales, number of defects, votes, employee absences, or hours worked. In the example cited above, units sold (140) and total sales ($219.50) represent measures. The store (Nashville), product (aspirin), and time (April 14, 2000) are dimensions that describe the measures in meaningful ways. The dimensional information comes from dimension tables by way of keys.
Measures are often used to calculate other measures. In our example, dividing yearly aspirin sales by the number of units sold will create an average price. These simple calculations are best done at runtime when the query is processed, so they’re often referred to as virtualmeasures.
It’s also important to consider how your measures combine, or aggregate, so that your summaries make sense. Measures are always numeric, but not always additive. You may sum product sales and arrive at a total sales number, just as you can with campaign contributions or repair costs. However, if you try to sum temperature, blood pressure, or inventory amounts across time, you won’t get meaningful numbers. These sorts of measures are termed nonadditive and make sense only if you average them. Sometimes it’s useful to simply count measures, to determine the number of transactions per day, for example.
Dimensions describe facts, so dimension tables contain many attributes that describe the fact in rich detail. Our food store’s product dimension contains attributes such as SKU number, product description, brand, category, and package size.
Each product dimension record, called a member, corresponds to a specific product, but many products may share a category. In this way, you can recover product sales by category.
Similarly, a store dimension’s attributes may include the store’s name, age, and square footage, as well as the geographic attributes of city, county, and state. Accordingly, you can easily recover product sales by category by state, or even product sales per square foot, if you’re inclined.
Dimensions often contain one or more logical hierarchies representing useful and logical summaries of data. Your store dimension might contain a geography hierarchy: country, state, county, and city. Your time dimension might also have two hierarchies, one representing the actual and one representing the fiscal calendar. Later, as you analyze your information, you can easily drill down along these hierarchies to view your information in greater detail.
General form of dimensional queries
The dimensional model makes it easy to conceptualize your queries by following the general form: Show me <fact> for <dimension> for <dimension> … for <dimension>. Substituting real values from our sample model, you can say Show me <sales> for <House Brand Aspirin (100 ct.)> for <April 14, 2000> for <Nashville>.
Learning this simple general form, along with the basics of dimensional modeling, will help you and your end users better conceptualize the set of possible queries based on your system’s data. Try doing that with an ER data model, and you’ll appreciate what I’m saying. Believe me, this clarity supports good system design and acts as a wonderful reference to help your end users understand the multidimensional BI system once it’s in use.
Armed with the simple terms and concepts I’ve discussed in this article, you’re well on the road to thinking dimensionally and describing your business information requirements in terms of measures and dimensions. Doing so will help tame the otherwise complex process of creating and deploying an effective BI system.
Many enterprises are rich in data but poor in access to useful, high-level information. Is yours one of those organizations? Or have you created a workable BI system? Share your thoughts and experiences with other TechRepublic members. Send us a note or post a comment below.