A database query in an online transaction processing (OLTP) application is generally going to be one of a great many reads, each pulling in a small amount of data: a single record, a single table row, a lone object. A database query in an online analytical processing (OLAP) application is generally going to be one of relatively few reads, each pulling in a huge amount of data, such as a large chunk of sequential records.
In other words, from the standpoint of the database, an OLTP query and an OLAP query are direct opposites. It’s easy to see this polarization when you consider the nature of the applications: OLTP applications are generally used by lots of people, dealing with one business object at a time; OLAP applications are used only by analysts (fewer in number by far than OLTP users) and don’t deal with business objects at all, but with large batches of records over time.
So if you’re working on a data warehouse—the source of OLAP data—you’re going to want your database access to work a little differently.
What difference does it make?
You can’t neglect physical database access when designing a data warehouse that services OLAP apps. Why not? If you do, you’ll never get the warehouse tuned for decent performance. And why is that? Because a great many reads of a little data each is physically very different from a few reads of a lot of data each.
What’s the preferred organization when storing data in tables in a typical relational database environment? Well, we like normalized data, with logical relationships between data items. For transactional processing, this works well, and we’re happy with it. But imagine applying that same reasoning to a data warehouse. Visualize your warehouse data normalized. Now picture yourself pulling in 48 months’ worth of tabled records for an analytic. Count the number of physical reads required to fulfill the query. Is your head spinning?
Objective #1: DO NOT normalize the data in a data warehouse!
If you don’t want normalized data distribution, how do you want to store data in a warehouse? Think of OLTP data as suburbia—sprawling neighborhoods, every house holding a family, some bigger, some smaller, all connected to one another by this road or that—while OLAP data is your basic downtown high-rise, every unit identical, all in one place, perfectly stacked together (no roads required).
If an OLAP analyst is going to pull in 48 months’ worth of tabled records for an analytic, the most efficient read for such a query is to grab an optimum batch of data in any one read, and that means sequential data stored in arrays. That is, since you’re counting physical reads to gauge query efficiency, you want the minimum number of physical reads, and that means having the sequential data as physically proximate as possible.
Objective #2: Sequential logical storage = physically proximate storage!
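The read-counting argument above can be sketched with some back-of-envelope arithmetic. This is a minimal sketch, and every figure in it (rows per month, rows per disk page) is an illustrative assumption, not a measurement:

```python
# Back-of-envelope comparison: physical reads needed to fetch 48 months of
# records when rows are scattered across pages vs. stored sequentially.
ROWS_NEEDED = 48 * 100_000   # 48 months at ~100k rows/month (assumption)
ROWS_PER_PAGE = 100          # rows that fit on one disk page (assumption)

# Worst case for normalized, scattered storage: every wanted row lives on a
# different page, so each row costs its own physical read.
scattered_reads = ROWS_NEEDED

# Sequential storage: wanted rows are physically packed together, so each
# physical read returns a full page of useful rows.
sequential_reads = ROWS_NEEDED // ROWS_PER_PAGE

print(f"scattered:  {scattered_reads:,} reads")    # 4,800,000
print(f"sequential: {sequential_reads:,} reads")   # 48,000
```

Even in this crude model, physically proximate storage cuts the read count by a factor equal to the page packing density, which is why Objective #2 matters so much at warehouse scale.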
If you’re moving data by the gigabyte in and out of a database, then you have to be efficient in both directions: The data must be stored in an extremely efficient structure, and the method of access must match this structure. This should be self-evident, but if you’re used to thinking in relational terms, it’s easy to take this part of the equation for granted—and failure to think of this could lead to crippling inefficiencies (far too many reads-per-query).
Objective #3: Storage structure(s) and access method(s) should match!
Returning to the analogy of the relational database as a sprawling, "normalized" suburban neighborhood and the analytical database as an urban high-rise, you can see that what we’re talking about is opposites in terms of access path. Relational database retrieval works well because we know what the access path is. In a data warehouse, we don’t; hence, our need for a structure that allows us to roll data up rapidly, in a multidimensional "high-rise". Likewise, we can know where we are in a relational database because data objects all bear a certain predetermined relationship to other data objects. In a data warehouse, these relationships are not nearly so well-defined: Proximity is often our primary access feature, because we’re "scooping" data out in bucketfuls, and like objects are located in sequence (physically as well as logically).
Taking the high-rise analogy even further, the ideal storage in a data warehouse is multidimensional. The high-rise is an elongated cube, and cubes are exactly what we want to model in storing warehouse data. Since we want data to be related, but not in the expensive RDBMS fashion, we need to create context for data objects laid out in series. We do this by adding dimensions to our data model that establish that context.
The shape of the data
How do you add context to sequentially stored data objects without imposing a relational structure on them? You can add context with two powerful concepts in data structures: the star schema and the cube.
The star schema
In a star schema, dimension tables are joined to a central "fact" table, where the serious buckets of data are. The fact table contains the measures being analyzed along with the keys by which the peripheral dimension tables are referenced, giving each fact context both alone and in combination with other dimensions. The fact table therefore carries some redundancy, but because of the virtual relational power of this configuration, it’s a small price to pay. (The snowflake schema is a variation on this theme, wherein one or more dimensions are defined by multiple tables, with dimension tables extending other dimension tables.)
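A star schema can be sketched in a few lines of SQL. This is a minimal, hypothetical example using SQLite; the table and column names (fact_sales, dim_product, dim_month) are illustrative assumptions, not a standard:

```python
import sqlite3

# A toy star schema: a central fact table of sales measures, keyed to two
# small dimension tables that give each fact its context.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_month   (month_id   INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        month_id   INTEGER REFERENCES dim_month(month_id),
        units_sold INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO dim_month   VALUES (1, '2024-01'), (2, '2024-02');
    INSERT INTO fact_sales  VALUES (1, 1, 10), (1, 2, 20),
                                   (2, 1, 5),  (2, 2, 15);
""")

# A typical analytic rollup: total units per product, across all months.
rows = con.execute("""
    SELECT p.name, SUM(f.units_sold)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)   # [('gadget', 20), ('widget', 30)]
```

Note the shape: the fact table is long and narrow, the dimension tables small and descriptive, and every analytic query is a join from the center outward.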
The data cube
This structure is a three-dimensional manifestation of the star schema (actually, it can be multidimensional: a cube, a hypercube, or as many dimensions as you need). In such a structure, you have a literal cube of information, tables stacked such that one dimension forms a "fact" table of keys and pointers, pointing in the direction you want to analyze the data.
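The cube idea can be sketched in a few lines. This is a toy model, not a warehouse engine: cells are keyed by (product, region, month), the measure is units sold, and "rolling up" means summing out the dimensions you don't want to see. All names and figures are illustrative assumptions:

```python
from collections import defaultdict

# A toy three-dimensional data cube: each cell is keyed by
# (product, region, month) and holds a sales measure.
cube = {
    ("widget", "east", "Jan"): 10, ("widget", "east", "Feb"): 12,
    ("widget", "west", "Jan"): 7,  ("widget", "west", "Feb"): 9,
    ("gadget", "east", "Jan"): 4,  ("gadget", "east", "Feb"): 6,
    ("gadget", "west", "Jan"): 3,  ("gadget", "west", "Feb"): 5,
}

def rollup(cube, keep):
    """Sum the measure over every dimension not listed in `keep`.

    `keep` holds the axis positions to retain: 0=product, 1=region, 2=month.
    """
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

# Total sales per product (region and month summed out):
print(rollup(cube, keep=(0,)))   # {('widget',): 38, ('gadget',): 18}
# Sales per region per month (product summed out):
print(rollup(cube, keep=(1, 2)))
```

The point of the structure is that any such rollup is a single sweep over physically adjacent cells, rather than a chain of relational joins.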
The goal, then, is to store data in cubes and retrieve it in cubes for maximum logical and physical efficiency: access time is reduced and physical reads are minimized. Because of this efficiency gain, in a follow-up article, I'll examine how to implement the data cube in Microsoft SQL Server 2000.
Scott Robinson is a 20-year IT veteran with extensive experience in business intelligence and systems integration. An enterprise architect with a background in social psychology, he frequently consults and lectures on analytics, business intelligence and social informatics, primarily in the health care and HR industries.