With a strong nod to the increased uptake of business intelligence throughout its user base, SQL Server increased its data warehousing arsenal with the 2012 release. Columnstore indexing may have slipped under the radar for many, but it's a significant advance in warehouse query processing. (If you missed it, it wasn't for lack of marketing by Microsoft; Google Project Apollo.)
However, it's the kind of advance that's really hard to appreciate, because it fixes a problem that doesn't really seem like a problem. Querying a data warehouse is a burden from the get-go, given that such queries typically have to slam through many millions of rows of data. Common queries are optimized, as they should be, and since warehouse data tends to be stable, we just tip into the existing query tool bag to make them better still.
OLAP cubes are a go-to; they dimension data according to its most common utility and summarize across those dimensions. Indexed views and summary tables are other familiar tricks. But these fixes assume that the data being queried doesn't change. There is little or no flexibility in these approaches to query speed, and maintenance is problematic when change does occur, summary data changes, and indexes shift.
A sharp right turn
Now we have a serious alternative, and it's as admirable for its ingenuity as for its performance. Where we've always based our thinking about database processing on rows, this new technique is column-oriented. In a nutshell, data is grouped column by column, and only the required columns are read into memory. The read from disk to memory is much faster, and less memory is consumed overall.
In data warehousing queries, the value of this savings is hard to overstate. Data in warehouses tends to be rich and detailed, and any given query is not likely to be after all of it. A typical warehouse query will pull from less than a fourth of the columns available in a row, so the IO savings in reading only those columns is huge.
This shift from a row-oriented to a column-oriented technique exploits other efficiency potentials, too. Consider row storage, where column values in a given row are contiguous; this organization works well when you care about all the columns in a row. The data you're interested in is stored in efficient pages.
But in column-oriented storage, values are arranged contiguously along the other axis. Where contiguous values across rows vary wildly, contiguous values down columns are often very similar. This makes column storage friendly to compression algorithms, adding another efficiency boost to the scheme.
When it's time to change
It's intuitively obvious that you wouldn't use columnstore indexing on tables that changed; the same issues that prevail with OLAP -- indexed views and so on – also surface in this scenario. But columnstore indexing may still be your top choice, not only for the performance boost, but because of its flexibility in maintenance. Often, just dropping the index, performing an update, and rebuilding it will do.
But the place where columnstore indexes are most useful is the data warehouse, and in data warehouses, partitioning is common. Partitions can be used for quick and flexible update of tables using columnstore indexes. You can partition the table and swap, putting new data into a separate table and building the index there, then dropping the new table into a vacant partition.
Scott Robinson is a 20-year IT veteran with extensive experience in business intelligence and systems integration. An enterprise architect with a background in social psychology, he frequently consults and lectures on analytics, business intelligence and social informatics, primarily in the health care and HR industries.