Optimize SQL query speed with the Oracle clustering_factor attribute

Optimizing SQL query speed in Oracle involves numerous data characteristics. One of the most important determining characteristics is clustering_factor. Learn more about this attribute before you start your next optimization project.

The cost-based optimizer (CBO) improves with each new release of Oracle, and the most recent enhancement, in Oracle9i, is the consideration of external influences (CPU cost and I/O cost) when formulating an execution plan. As Oracle evolves into Oracle10g we may see even more improvements in the CBO's ability to find the optimal execution plan for a query, but in the meantime, every Oracle developer must understand these mechanisms to properly tune her SQL.

Rules for Oracle indexing
To understand how Oracle chooses the execution plan for a query, you need to first learn the rules Oracle uses when it decides whether or not to use an index.

The CBO knows many characteristics of the column data within tables, but the most important are the clustering factor for the column and the selectivity of the column values. Oracle provides a column called clustering_factor in the dba_indexes view that describes how closely the table rows are synchronized with the index. When the clustering_factor is close to the number of data blocks in the table, the rows are stored in roughly the same order as the index. When the clustering_factor approaches the number of rows in the table, the rows are not ordered by the column value.
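The idea behind this statistic can be illustrated with a small simulation. The sketch below is not Oracle's implementation; it assumes the commonly described approach of walking the index entries in key order and counting how often the table block changes. The function name and the (key, block_id) representation are hypothetical.

```python
def clustering_factor(index_entries):
    """Count block changes while walking index entries in key order.

    index_entries: iterable of (key, block_id) pairs, one per table row.
    This is an illustrative model, not Oracle's actual code.
    """
    changes = 0
    previous_block = None
    # Sorting the tuples orders by key, then by block, roughly like key + rowid.
    for _key, block in sorted(index_entries):
        if block != previous_block:
            changes += 1
            previous_block = block
    return changes


# 12 rows stored in 4 blocks of 3 rows each.
ordered_rows = [(key, key // 3) for key in range(12)]    # rows stored in key order
scattered_rows = [(key, key % 4) for key in range(12)]   # neighbouring keys in different blocks

print(clustering_factor(ordered_rows))    # 4, the number of blocks
print(clustering_factor(scattered_rows))  # 12, the number of rows
```

The ordered table yields a value equal to the number of blocks, and the scattered table yields a value equal to the number of rows, which are the two extremes described above.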

To illustrate, consider a query that filters the result set on a column value with a predicate such as:
customer_state = 'New Mexico';

Here, the decision to use an index vs. a full-table scan is at least partially determined by the percentage of customers in New Mexico. An index scan is faster for this query if the percentage of customers in New Mexico is small and the values are clustered on the data blocks.

Why, then, would the CBO choose to perform a full-table scan when only a small number of rows is retrieved? One likely reason is that the CBO is considering the clustering of column values within the table.

Four factors work together to help the CBO decide whether to use an index or a full-table scan: the selectivity of a column value, the db_block_size, the avg_row_len, and the cardinality. An index scan is usually faster if a data column has high selectivity and a low clustering_factor (Figure A).

Figure A
This column has small rows, large blocks, and a low clustering factor.
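How these statistics interact can be sketched with a rough I/O comparison. This is a simplified approximation, not the CBO's actual cost formula: it assumes an index range scan visits about selectivity times clustering_factor table blocks, and that a full-table scan reads several blocks per I/O. The function names and the multiblock_read parameter are assumptions made for illustration.

```python
import math


def table_blocks(num_rows, avg_row_len, db_block_size):
    """Approximate number of table blocks, ignoring block overhead and free space."""
    rows_per_block = max(1, db_block_size // avg_row_len)
    return math.ceil(num_rows / rows_per_block)


def index_scan_block_visits(selectivity, clustering_factor):
    """Approximate table block visits for an index range scan."""
    return math.ceil(selectivity * clustering_factor)


def full_scan_reads(blocks, multiblock_read=8):
    """Approximate read requests for a full-table scan (multiblock_read is assumed)."""
    return math.ceil(blocks / multiblock_read)


def prefer_index(selectivity, clustering_factor, num_rows,
                 avg_row_len, db_block_size, multiblock_read=8):
    """True when the rough estimate favors the index range scan."""
    blocks = table_blocks(num_rows, avg_row_len, db_block_size)
    return (index_scan_block_visits(selectivity, clustering_factor)
            < full_scan_reads(blocks, multiblock_read))


# One million 80-byte rows in 8 KB blocks; the predicate matches 2% of the rows.
blocks = table_blocks(1_000_000, 80, 8192)
well_clustered = prefer_index(0.02, blocks, 1_000_000, 80, 8192)
scattered = prefer_index(0.02, 1_000_000, 1_000_000, 80, 8192)
print(blocks, well_clustered, scattered)
```

With the same selectivity, the estimate favors the index when the clustering_factor is near the block count and favors the full-table scan when it is near the row count.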

To maintain row order, the DBA can periodically resequence table rows, or use a single-table cluster, in those cases where a majority of the SQL references a column with a high clustering_factor, a large db_block_size, and a small avg_row_len. Resequencing places all adjacent rows in the same data block, which removes the full-table scan in favor of an index range scan and can make the query up to thirty times faster.
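The effect of resequencing can be seen in a toy model. The sketch below assumes a table of 1,000 rows with unique keys, 50 rows per block, and a block-change count as a stand-in for the clustering factor; all names are hypothetical and nothing here touches a real database.

```python
import random


def pack_into_blocks(keys, rows_per_block):
    """Assign rows to blocks in physical storage order; return (key, block_id) pairs."""
    return [(key, position // rows_per_block) for position, key in enumerate(keys)]


def count_block_changes(entries):
    """Walk entries in key order and count how often the block changes."""
    changes, previous = 0, None
    for _key, block in sorted(entries):
        if block != previous:
            changes += 1
            previous = block
    return changes


rng = random.Random(42)          # fixed seed so the run is repeatable
keys = list(range(1000))
rng.shuffle(keys)                # rows stored in random order
rows_per_block = 50

before = count_block_changes(pack_into_blocks(keys, rows_per_block))
# "Resequencing": rewrite the table with the rows sorted by the indexed column.
after = count_block_changes(pack_into_blocks(sorted(keys), rows_per_block))

print(before, after)
```

After the rows are rewritten in key order, the count drops to the number of blocks (20 in this model), while the shuffled layout gives a count far closer to the number of rows.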

On the other hand, as the clustering_factor nears the number of rows in the table, the rows fall out of sync with the index. This high clustering_factor, where the value is close to the number of rows in the table (num_rows), indicates that the rows are out of sequence with the index and an additional I/O may be required for index range scans.
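One convenient way to read the statistic is to place it on a scale between its two extremes. The helper below is a hypothetical convenience, not an Oracle function; it assumes you have already obtained the clustering_factor, the table's block count, and num_rows from the data dictionary.

```python
def clustering_score(clustering_factor, table_blocks, num_rows):
    """Return 0.0 when rows follow the index order and 1.0 when they are fully scattered.

    0.0 corresponds to clustering_factor == table_blocks,
    1.0 corresponds to clustering_factor == num_rows.
    """
    if num_rows <= table_blocks:
        return 0.0  # one row per block or fewer: the two extremes coincide
    score = (clustering_factor - table_blocks) / (num_rows - table_blocks)
    return min(1.0, max(0.0, score))


print(clustering_score(9_804, 9_804, 1_000_000))      # 0.0
print(clustering_score(1_000_000, 9_804, 1_000_000))  # 1.0
```

A score near 1.0 suggests that index range scans on that column will pay the additional I/O described above.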

When a column has high selectivity but also a high clustering_factor and a small avg_row_len, the statistics indicate that the column values are randomly distributed in the table, and an additional I/O will be required to obtain the rows. An index range scan would then cause a huge amount of unnecessary I/O, as shown in Figure B, making a full-table scan more efficient.

Figure B
This column has large rows, small blocks, and a high clustering factor.

In sum, the CBO's decision to perform a full-table scan vs. an index range scan is influenced by the clustering_factor, db_block_size, and avg_row_len. It is important to understand how the CBO uses these statistics to determine the fastest way to deliver the desired rows.
