With a strong nod to the
increased uptake of business intelligence throughout its user base, SQL Server
expanded its data warehousing arsenal with the 2012 release. Columnstore
indexing may have slipped under the radar for many, but it’s a significant
advance in warehouse query processing. (If you missed it, it wasn’t for lack of
marketing by Microsoft.)
However, it’s the kind of
advance that’s really hard to appreciate, because it fixes a problem that
doesn’t really seem like a problem. Querying a data warehouse is a burden from
the get-go, given that such queries typically have to slam through many
millions of rows of data. Common queries are optimized, as they should be, and
since warehouse data tends to be stable, we just dip into the existing query
tool bag to make them better still.
OLAP cubes are a go-to;
they dimension data according to its most common utility and summarize across
those dimensions. Indexed views and summary tables are other familiar tricks.
But these fixes assume that the data being queried doesn’t change. These
approaches offer little or no flexibility, and when change does occur,
maintenance is problematic: summary data must be recomputed and indexes rebuilt.
A sharp right turn
Now we have a serious
alternative, and it’s as admirable for its ingenuity as for its performance.
Where we’ve always based our thinking about database processing on rows, this
new technique is column-oriented. In a nutshell, data is grouped column by
column, and only the required columns are read into memory. The read from disk
to memory is much faster, and less memory is consumed overall.
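To make this concrete, here is a minimal T-SQL sketch of the feature as it appears in SQL Server 2012; the table and index names are illustrative, not drawn from any real schema:

```sql
-- Hypothetical fact table; all object names here are illustrative.
CREATE TABLE dbo.FactSales (
    SaleDate    DATE  NOT NULL,
    StoreID     INT   NOT NULL,
    ProductID   INT   NOT NULL,
    Quantity    INT   NOT NULL,
    SalesAmount MONEY NOT NULL
);

-- SQL Server 2012 offers the nonclustered variety of columnstore index,
-- typically built over all the columns of the fact table.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FactSales_CS
ON dbo.FactSales (SaleDate, StoreID, ProductID, Quantity, SalesAmount);

-- A typical warehouse query touches only a couple of columns; the engine
-- reads just those column segments from disk rather than whole rows.
SELECT StoreID, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
GROUP BY StoreID;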
In data warehousing
queries, the value of this savings is hard to overstate. Data in warehouses
tends to be rich and detailed, and any given query is not likely to be after
all of it. A typical warehouse query will pull from less than a fourth of the
columns available in a row, so the I/O savings in reading only those columns is
substantial.
This shift from a row-oriented
to a column-oriented technique exploits other efficiencies, too.
Consider row storage, where the column values in a given row are contiguous; this
organization works well when you care about all the columns in a row, because
each page read pulls in complete rows.
But in column-oriented
storage, values are arranged contiguously along the other axis. Where contiguous
values across rows vary wildly, contiguous values down columns are often very
similar. This makes column storage friendly to compression algorithms, adding
another efficiency boost to the scheme.
When it’s time to change
It’s intuitively obvious
that you wouldn’t use columnstore indexing on tables that change frequently; the
same issues that prevail with OLAP cubes, indexed views, and similar techniques
also surface in this scenario. In the 2012 release, in fact, a table with a
columnstore index becomes read-only. But columnstore indexing may still be your
top choice, not only for the performance boost but for its flexibility in
maintenance. Often, just dropping the index, performing the update, and
rebuilding the index will do.
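That maintenance cycle can be sketched in T-SQL as follows; the table, index, and staging names are hypothetical:

```sql
-- 1. Drop the columnstore index; the table becomes writable again.
DROP INDEX IX_FactSales_CS ON dbo.FactSales;

-- 2. Perform the update (here, appending a hypothetical new load).
INSERT INTO dbo.FactSales (SaleDate, StoreID, ProductID, Quantity, SalesAmount)
SELECT SaleDate, StoreID, ProductID, Quantity, SalesAmount
FROM dbo.FactSales_NewLoad;

-- 3. Rebuild the columnstore index once the update is complete.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FactSales_CS
ON dbo.FactSales (SaleDate, StoreID, ProductID, Quantity, SalesAmount);
```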
But the place where
columnstore indexes are most useful is the data warehouse, and in data
warehouses, partitioning is common. Partitions can be used for quick and
flexible updates of tables that carry columnstore indexes. You can partition the
table and swap: load new data into a separate staging table, build the
columnstore index there, then switch that staging table into an empty partition
of the main table.
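A sketch of that switch pattern, with all object names hypothetical; note that the staging table must match the target table’s schema, sit on the target partition’s filegroup, and carry a check constraint matching that partition’s boundary values:

```sql
-- 1. Load the new rows into an empty staging table (illustrative source).
INSERT INTO dbo.FactSales_Staging (SaleDate, StoreID, ProductID, Quantity, SalesAmount)
SELECT SaleDate, StoreID, ProductID, Quantity, SalesAmount
FROM dbo.DailyFeed;

-- 2. Build the columnstore index on the small staging table; because the
--    table holds only the new load, this build is fast.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Staging_CS
ON dbo.FactSales_Staging (SaleDate, StoreID, ProductID, Quantity, SalesAmount);

-- 3. Switch the staging table into the empty target partition
--    (partition number is illustrative); this is a metadata-only operation.
ALTER TABLE dbo.FactSales_Staging
SWITCH TO dbo.FactSales PARTITION 5;
```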