I have examined the differences between scaling up and scaling out your SQL Server environment and explained how to scale out your system with either horizontal data partitioning or vertical data partitioning. In this final installment of the series, I provide a more in-depth look at scaling out your system with distributed partitioned views.
A distributed partitioned view joins data from one or more SQL Server databases. When developing a horizontally partitioned database environment, you can use distributed partitioned views to join your partitioned tables from different servers so it appears that all of the data is on the same server. You can design these views so that, if your underlying table structure is partitioned correctly, the query optimizer will know which table to pull data from in your query, speeding up execution time. A correctly designed distributed partitioned view also allows updates, inserts, and deletes to occur on the view. I take a closer look at how this occurs later in the article.
In the example, I assume that the SalesHistory table is very large and that it would be beneficial to horizontally partition the rows from the table onto different servers. The structure of the SalesHistory table is identical on each server. However, one server holds sales information from the eastern region of the country, while the other server holds sales information from the western region of the country.
I identify these records in the table based upon a Region field, along with the SaleID field, which is an integer field that I seed differently for the different regions of the country. The SaleID field is very important to the design concept because that is the field I will use as the partition key. (Note: It is extremely important to design the tables in your scale-out scenario so that the rows in the table are uniquely identifiable against the tables on the other servers.) This set of field(s) is the partition key. The SalesHistory table is designed so that the SaleID will always be uniquely identifiable based on the table that it is in. I am able to do this through CHECK constraints.
I will be working with two separate SQL Server instances, and for the example, they will be on the same machine. The server is named Chapman, and the instances are named InstanceA and InstanceB. Both instances are SQL Server 2005 Developer Edition and allow remote connections, along with Windows and SQL Server authentication.
This script creates the SalesDB database and enables lazy schema validation on each server. Lazy schema validation is a SQL Server setting that optimizes performance by ensuring that data is not requested from a linked server until that data is actually required from that server. The script in Listing A needs to be run on both database instances. The script in Listing B creates the reader login and user in the SalesDB database. This script needs to be run on both of the database instances.
Listing C will run separately on each database instance. The script creates a linked server on each server that points to the opposite server. A linked server allows SQL Server to execute commands against an OLEDB data source, such as another SQL Server. I use the login I created in the script above as the security context for the linked server connection. These linked servers allow me to query each server from the other server.
Listing D creates the SalesHistory tables on each server and loads data into those tables. I am creating a check constraint on each table on the SaleID column; this constraint will ensure that only certain SaleID values get into the SalesHistory table. This field is my parition key and will allow the query optimizer to determine whether to pull data from the local table or the remote table when I query the distributed partitioned view.
Now I need to give the reader sql login the ability to insert, update, delete, and view the metadata on the SalesHistory table. Execute the script in Listing E on each server.
The script in Listing F creates the distributed paritioned views on each server. Notice that the view created on ServerA queries the local SalesHistory table and unions that resultset to the SalesHistory table on ServerB.
Giving the query optimizer the ability to easily know which server to query by using partition keys can greatly enhance performance in a distributed environment. You can test this by looking at the execution plan when running queries against the view. For queries that return records from the remote server, it will pull those records; but if your query is satisfied by local data only, the remote server will not be queried at all.
How does this help when you need to insert, update, or delete data? It actually helps in the same way because the partition key is the primary key of the table (it always has to at least be part of the primary key); SQL Server will always know where to insert a new record and will perform the action either locally or remotely. This also works with updates and deletes — SQL Server uses the partition key for updating and deleting data, which is why I granted the user the privileges that I defined earlier in the article. Since that user has the ability to read and write on the SalesHistory table as necessary, and because this view exists on both servers, I could essentially perform operations on that view through my application layer so the application layer is unaware of my underlying table structure. The application layer only needs to be concerned with updating my views.
Tim Chapman is a SQL Server database administrator who works for a bank in Louisville, KY, and has more than 7 years of IT experience. He is also Microsoft certified in SQL Server 2000 and SQL Server 2005. If you would like to contact Tim, please e-mail him at firstname.lastname@example.org.
Tim Chapman is a SQL Server MVP, a database architect, and an administrator who works as an independent consultant in Raleigh, NC, and has more than nine years of IT experience.