How a relational database solves and optimizes your query
Some people view a relational database as a mystical oracle that answers questions posed to it by the programmer. However, there are a set of rules that a relational database uses to solve your query. Each relational database approaches the process of solving your query slightly differently; however, the fundamental concepts are the same. The same process must be followed. This article will walk you through a high-level overview of how a query analyzer tries to solve your query.
Goals for query optimization
Before walking through the steps that the query optimizer goes through in solving your query, it’s important to understand the query optimizer’s goal. Obviously, one goal is to solve the query that you’ve provided but more than that, its goal is to execute that query with as few resources as possible. In database terms, this means reducing the number of I/O operations which must go to disk.
Nearly every decision that the query optimizer will make will err on the side of reducing the number of physical I/O operations that have to be satisfied with a disk read. Disks are substantially slower than any other resource that the query optimizer has available to it. As a result, if there’s a decision to be made, it will be made in a way that is perceived to be less costly from a disk I/O perspective.
The tools that the relational database has for solving your query are the indices that have been set up on the tables. The index is the relational database’s way of presorting the data into multiple perspectives at the same time. This is done by maintaining the details on the fields in the index and maintaining pointers where the actual data is stored
With the exception of a clustered index, every index comes with a disk storage cost. Clustered indexes are essentially free from a disk read and disk space costs because the clustered index is really the order that the data is stored into the table.
When using an index, the database engine must perform two data reads for every record that is desired out of the database. The first read is into the index to locate the pointer to the actual record. The second is to read the location specified by the pointer. This is a cost consideration that the database server will take into account as it tries to solve your query. It’s also the primary reason why the query analyzer will choose not to use an index. In a later section, Covering Indices, you’ll learn a technique that can eliminate the need for the double read—however, in most cases using an index means doing nearly double the number of reads per record.
The information that the SQL server uses to determine whether it will be worth the extra reads to use an index is the statistics page. Each index has a page of information which tells the query optimizer the distribution of the index key values in the overall data of the table. This is used to approximate how many rows will be returned from a query.
The query optimizer needs to know how many rows will be returned so that it can decide whether using the index will be valuable or not. If the query optimizer determines from the index statistics page that the query will result in a few rows, it will likely use the index. However, if the query optimizer determines that there will be a relatively large number of rows, the query optimizer will likely decide to use a table scan to resolve the query.
The field order in an index is of immense concern when it comes to using indices. When deciding whether or not to use an index, the server must work its way from the first field to the last field. This means that the query optimizer can only consider indices valuable when the fields that it needs are on top of fields it doesn’t need. Any fields that are irrelevant to the query—which are above the fields the query analyzer needs—cause the index to be removed from consideration.
When planning indices, you should only place fields in the top of an index which are used frequently by queries. Even one field that isn’t a part of the fields the query optimizer needs will cause the entire index to be ignored.
Evaluating a WHERE clause
The most important part of a select statement for determining which index will be used is the WHERE clause. The WHERE clause filters the number of records displayed and is the easiest way for the query optimizer to find value in using an index. There are several different ways that a WHERE clause can be used. The most common forms are explained in detail below:
The most common example of how a WHERE clause is used is in matching a record or set of records. When you specify that a specific field is equal to a value, the query optimizer knows that it can go to a single entry in the index, or a small range of entries and identify the records in the main table that satisfy your query. This allows it to dramatically filter the number of reads that it needs to do from the physical table in order to reduce the execution time of the query. The query analyzer will locate indices that contain the fields referred to with the equality operator.
Greater or less than
Although matching and equality are by far the most common types of entry, WHERE clause situations in which ranges are requested are common as well. In this case, the query analyzer knows that it can look at the area of the index either greater or less than the specified value. Generally, the query analyzer can determine from multiple individual statements what percentage of the index must be read and can decide from there whether it makes sense to use the index or not.
Using functions in a WHERE clause can limit the query analyzer’s ability to determine how useful any indexes may be. Because it won’t be apparent to the query analyzer what the result of the function will be—particularly if executed on a field rather than on constants—using functions in a WHERE clause will minimize the chances that the query analyzer will think any index is valuable, regardless of the fields involved.
Evaluating the ORDER BY clause
Once the query analyzer has worked its way through the WHERE clause, it starts looking at the ORDER BY clause. What the query optimizer is looking for is a way that it won’t need to re-sort the results once it’s done. If the query optimizer can find an index that already has the rows in the right order and it’s compatible with the WHERE clause, it will use the index so that it can avoid the sort at the end.
In order to facilitate the use of indexes, your ORDER BY clauses shouldn’t contain unnecessary fields. Because the query analyzer doesn’t know if a field is meaningful or not, in an ORDER BY clause it assumes that it is. As a result, if you include fields in the ORDER BY clause the optimizer must try to find an index which contains all of those fields. Listing every field in your ORDER BY clause will essentially prevent the query optimizer from using an index.
Above, I mentioned that the query analyzer assumes there’s a penalty to be paid for accessing an index and, because of that, it will sometimes avoid using indexes which are ordered by the right fields. This is true, except in the special case where an index contains all of the fields necessary to satisfy the fields selected.
If, for instance, you are selecting only User ID, First Name, LastName, and EmailAddress from a user record, and you have an index which contains all of these fields, then the query analyzer can use the index and skip the underlying data table altogether.
This can be really useful particularly when you have a cross reference table that is used in both directions. You do a clustered index in one direction and then create an index with the fields in the reverse order. The result is that SQL server can use the physical table when going in the first direction and can exclusively use the index in the reverse order.
Covering indexes requires additional space in the database and requires more time to update because of the longer key values; however, they are a great way to help the query analyzer out if you have a relatively static cross reference table that gets a lot of use.
Help the query optimizer
There are a bunch of hoops that the query analyzer goes through every time you submit a query. These hoops are designed to help you get the quickest results; however, that only works if you can help the query optimizer do it’s job by specifying only what you need in the queries and creating the right indexes.