By Susan Sales Harkins and Arthur Fuller
A logical model helps clarify your application’s domain by defining the necessary tables, the relationships between those tables, and their cardinality. At the logical level, you define each column’s attributes: whether it accepts null values, its default value, and any formulas, checks, and constraints. Then the fun begins as the design is applied to a particular relational database management system (RDBMS)—it’s like putting square pegs in round holes.
In part one of this article, we presented an example of a travel agency that has four kinds of customers:
- Retail customers who pay the standard price
- Other agencies that receive a commission
- Corporate accounts that receive no discount and may have any number of travelers
- A franchise or club, where the franchise receives a commission and the member receives a discount
We considered creating a customer table for each type, but that arrangement created several problems. A better solution is a single customer table that identifies each customer by type, which will require a second table that defines customer types. At this point, our logical model resembles the diagram in Figure A. A real application of this sort would be much more involved, but we’ve purposely kept our example simple.
Figure A: A simplified view of a logical data model
After combining the client tables, our design includes a new table that defines customer types.
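To make the combined design concrete, here is a minimal sketch using SQLite (through Python’s sqlite3 module) as a stand-in for the eventual target RDBMS. The table and column names are our own illustrative assumptions, not the exact ones in Figure A:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One lookup table defines the customer types...
cur.execute("""
    CREATE TABLE ClientTypes (
        TypeID   INTEGER PRIMARY KEY,
        TypeName TEXT NOT NULL UNIQUE
    )""")
cur.executemany("INSERT INTO ClientTypes VALUES (?, ?)",
                [(1, "Retail"), (2, "Agency"), (3, "Corporate"), (4, "Franchise")])

# ...and a single Clients table references it, instead of one table per type.
cur.execute("""
    CREATE TABLE Clients (
        ClientID INTEGER PRIMARY KEY,
        Name     TEXT NOT NULL,
        TypeID   INTEGER NOT NULL REFERENCES ClientTypes(TypeID)
    )""")
cur.execute("INSERT INTO Clients VALUES (1, 'Acme Travel', 2)")

row = cur.execute("""
    SELECT c.Name, t.TypeName
    FROM Clients c JOIN ClientTypes t ON c.TypeID = t.TypeID""").fetchone()
print(row)  # ('Acme Travel', 'Agency')
```

Adding a new customer type is now a row insert, not a schema change.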
Resolving normalization problems
Committing to one customer table doesn’t bring us to the end of the normalization process. There are still many issues to consider:
- The name field in CorporateTravelers isn’t atomic (reduced to the smallest meaningful unit), which violates First Normal Form (1NF). These travelers are individuals, not companies. The problem is simple enough to solve by creating two fields, FirstName and LastName, and it’s something we’ll definitely do.
- The new Clients table presents a bigger problem where the name field is concerned. Retail clients could be individuals, who would require FirstName and LastName fields. To keep our example simple, we’ll invoke a business rule restricting clients to companies, not individuals, so there’s no need to resolve this issue at this stage of the game.
- The address information violates Second Normal Form (2NF): the address fields depend on one another, when every field should fully describe the client. For this reason, some developers would move the address fields to a new table. That move might yield more than one additional table if the design is normalized to Boyce-Codd Normal Form (BCNF); in that case, you could end up with separate address, city, and state tables. In addition, a client could have several addresses: one for mailing, one for delivering tickets and agenda information, and so on.
- The discount and commission fields in the Clients table present a special problem. They don’t break any normalization rules, but many of them will be blank. Empty fields aren’t necessarily bad, but as a rule they should be avoided; too many blanks usually signal that something in the normalization is amiss. Here, both fields fully describe the primary key, and there are no apparent dependencies on nonkey fields. Because of the potential blanks, however, we’ll move these fields to a new table that identifies each record by client and flags each percentage value as either a commission or a discount.
- Will each client have only one telephone number? If not, the current telephone number field in the Clients table violates 1NF, which prohibits multivalued fields: multiple telephone numbers cannot be entered in one field. The solution is to remove the telephone number field from the Clients table and create a new table just for telephone numbers, which in turn might require a telephone type table.
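The fixes above can be sketched as follows, again in SQLite via Python’s sqlite3; all names (CorporateTravelers, ClientFees, ClientPhones) are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    -- 1NF: split the traveler's name into atomic fields.
    CREATE TABLE CorporateTravelers (
        TravelerID INTEGER PRIMARY KEY,
        FirstName  TEXT NOT NULL,
        LastName   TEXT NOT NULL,
        ClientID   INTEGER NOT NULL
    );

    -- Mostly-blank discount/commission columns move to their own table;
    -- each row flags its percentage as a commission or a discount.
    CREATE TABLE ClientFees (
        ClientID INTEGER NOT NULL,
        FeeType  TEXT NOT NULL CHECK (FeeType IN ('commission', 'discount')),
        Rate     REAL NOT NULL
    );

    -- Multiple phone numbers per client: one row per number.
    CREATE TABLE ClientPhones (
        ClientID  INTEGER NOT NULL,
        PhoneType TEXT NOT NULL,
        Number    TEXT NOT NULL
    );
""")
cur.execute("INSERT INTO ClientFees VALUES (1, 'commission', 0.10)")
cur.execute("INSERT INTO ClientPhones VALUES (1, 'office', '555-0100')")
cur.execute("INSERT INTO ClientPhones VALUES (1, 'fax', '555-0101')")
cnt = cur.execute(
    "SELECT COUNT(*) FROM ClientPhones WHERE ClientID = 1").fetchone()[0]
print(cnt)  # 2
```

Note that a client now has a row in ClientFees only when a commission or discount actually applies, so the blank fields disappear entirely.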
To normalize or not to normalize
Normalizing the address table is a major undertaking if the goal is to normalize through BCNF. Initially, we need to identify each address by type—business, delivery, and so on. The first step is to remove the address fields from the Clients table. Then, create the new address table, based on the initial list shown in Figure B. But we still have the dependencies between the state and city fields to resolve.
This situation brings two issues to light. First, the entity list in Figure B isn’t fully normalized. The table violates 3NF because of dependencies between nonkey fields: the city and state depend on the ZIP code rather than on the address itself. To normalize this table:
- Remove the ZIP code, city, state, and country fields from the address list.
- Create a ZIP code table and move the fields you just removed to this new table.
- Insert the ZIP code primary key value in the address table as a foreign key.
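The three steps above might look like this in SQLite (via Python’s sqlite3), with hypothetical table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    -- City, state, and country depend on the ZIP code, not on the address,
    -- so they move into a ZIP table of their own.
    CREATE TABLE ZipCodes (
        ZipID   INTEGER PRIMARY KEY,
        ZipCode TEXT NOT NULL,
        City    TEXT NOT NULL,
        State   TEXT NOT NULL,
        Country TEXT NOT NULL
    );

    -- The address table keeps only what depends on the address itself,
    -- plus a foreign key into ZipCodes.
    CREATE TABLE Addresses (
        AddressID INTEGER PRIMARY KEY,
        ClientID  INTEGER NOT NULL,
        AddrType  TEXT NOT NULL,        -- business, delivery, ...
        Street    TEXT NOT NULL,
        ZipID     INTEGER NOT NULL REFERENCES ZipCodes(ZipID)
    );
""")
cur.execute("INSERT INTO ZipCodes VALUES (1, '40601', 'Frankfort', 'KY', 'USA')")
cur.execute("INSERT INTO Addresses VALUES (1, 1, 'business', '1 Main St', 1)")
city = cur.execute("""
    SELECT z.City FROM Addresses a
    JOIN ZipCodes z ON a.ZipID = z.ZipID""").fetchone()[0]
print(city)  # Frankfort
```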
Some developers would argue that taking this table to BCNF is overkill; others believe it is almost always simpler and less expensive to fully normalize the design. In fact, some would take this step even further, creating city, state, and even country tables to avoid repeating groups. For now, though, we’re content to resolve just the dependency issue. There’s no right or wrong here, only the balance of resources against needs.
The second issue is one we’ll mention but not resolve in this article. An application that serves a small geographical area probably doesn’t need a large table of city, state, ZIP code, and country entries. On the other hand, if the business is likely to expand, consider designing for that success. You must decide whether the application warrants a huge, resource-consuming, performance-reducing table full of values the client might never use. After all the above changes, our design has a few new tables, as shown in Figure C.
The move from logical to physical
Once the tables are normalized, the next step is moving from the logical to the physical model by choosing a target database. Despite the theoretical portability of SQL, different systems offer different capabilities, extensions, and, unfortunately, bugs. Much of this added value is really marketing: each vendor extends the language, hoping to seduce developers into relying on its unique and powerful syntax. After a while, sharing data becomes tortuous and expensive because applications are heavily invested in those extensions (or so the vendor hopes).
Microsoft SQL Server 2000 offers user-defined functions (UDFs), a welcome extension to the language, though one unsupported by competing products. Microsoft is not alone in this desire for uniqueness: Oracle has its NESTED TABLE syntax, and MySQL provides impressive extensions such as the ENUM and SET column types.
All the vendors play this game, and it isn’t necessarily a bad one, since it gives consumers plenty of choices. But it’s the main reason to use a data modeling tool, especially in a large application: a place where product particulars don’t matter, a database heaven, so to speak. In the data-modeling tool of your choice, listen to the harp music, think about the design, select a target, and then simply click a button to generate a script that creates the database in the language of the chosen target. Otherwise, time is lost in purgatory (or worse), fixing problems missed during implementation.
Logical tables don’t always make the switch to physical
At this stage, you must reexamine the design in light of the capabilities of the target database. Occasionally, an entity in the logical model never becomes an actual table. For instance, the logical model has us creating several new tables to identify types: customers, telephone numbers, fees, and addresses. Some or even all of these additional tables might be unnecessary if a simple check constraint provides a more efficient solution. Fortunately, our target system, SQL Server, supports check constraints. A check constraint is a rule that limits the values permitted in a column. For example, a tinyint column can accept 256 values (0 through 255), but a check constraint could limit the set even further, say to just 20 values.
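Here is a minimal sketch of the idea, using SQLite’s check-constraint support through Python’s sqlite3 (the syntax is close to SQL Server’s); the phone-type list is an invented example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Instead of a separate type table, a check constraint limits the column
# to a short list of allowed values.
cur.execute("""
    CREATE TABLE ClientPhones (
        ClientID  INTEGER NOT NULL,
        PhoneType TEXT NOT NULL
                  CHECK (PhoneType IN ('office', 'fax', 'mobile')),
        Number    TEXT NOT NULL
    )""")

cur.execute("INSERT INTO ClientPhones VALUES (1, 'office', '555-0100')")  # accepted
blocked = False
try:
    cur.execute("INSERT INTO ClientPhones VALUES (1, 'pager', '555-0102')")
except sqlite3.IntegrityError:
    blocked = True  # the constraint rejects values outside the list
print("rejected" if blocked else "accepted")
```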
There’s almost always a trade-off, though, and check constraints are no exception. Adding a new item to the list means altering the constraint itself, a schema change (and possibly downtime) that could be burdensome in many applications, whereas a lookup table needs only a new row. Refer to Figure D at the end of this section to compare the pros and cons of the two solutions, tables versus check constraints.
UDF vs. stored calculations
One reason SQL Server developers have responded so enthusiastically to UDFs is that you can use several of them in a single SELECT statement. Our example application illustrates this easily. Assume we pay commissions based on monthly sales, reviewed a year at a time. A simple UDF could accept an agency’s primary key (and possibly a date) and return a 12-column table containing that agency’s total monthly sales for the past year. This approach has two advantages:
- It requires no additional storage.
- It risks no skew between stored totals and actual sums. (It’s possible that a bug could cause problems such as failing to update the sums table correctly.)
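SQLite lacks table-valued UDFs, so we can’t reproduce the SQL Server feature directly, but a small Python function over an assumed Orders table shows the principle: the totals are computed on demand from the raw rows, so nothing extra is stored and nothing can drift out of sync.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Orders (
        ClientID INTEGER NOT NULL,
        SaleDate TEXT NOT NULL,   -- ISO yyyy-mm-dd
        Amount   REAL NOT NULL
    )""")
cur.executemany("INSERT INTO Orders VALUES (?, ?, ?)", [
    (1, "2024-01-15", 100.0),
    (1, "2024-01-20",  50.0),
    (1, "2024-02-03", 200.0),
])

def monthly_sales(cur, client_id):
    """Return {month: total} computed on demand from the raw orders,
    the same role the article's table-valued UDF would play."""
    rows = cur.execute("""
        SELECT strftime('%Y-%m', SaleDate) AS Month, SUM(Amount)
        FROM Orders WHERE ClientID = ?
        GROUP BY Month ORDER BY Month""", (client_id,)).fetchall()
    return dict(rows)

sales = monthly_sales(cur, 1)
print(sales)  # {'2024-01': 150.0, '2024-02': 200.0}
```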
On the other hand, a UDF would certainly be slower than storing the sales totals. If performance is a chief concern and a table is chosen instead, a new choice appears:
- A 13-column table: the client’s primary key plus a column for each month’s sales total.
- A three-column table: the client’s primary key, the month, and that month’s sales total, with a row added for each month.
The latter means visiting a dozen rows rather than one, so it will be slightly slower, though noticeably so only with many thousands of records. This approach has one more virtue: no data is destroyed. There’s no need to overwrite rows; simply add a new one as each month passes. Should management later change the rules, for instance taking the last two years into account instead of one, the model requires no changes: simply visit 24 rows per client instead of 12.
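A sketch of the three-column, append-only option, with invented names and data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Three columns, one row per client per month; rows are appended,
# never overwritten, so history survives a change in the rules.
cur.execute("""
    CREATE TABLE MonthlySales (
        ClientID INTEGER NOT NULL,
        Month    TEXT NOT NULL,   -- 'yyyy-mm'
        Total    REAL NOT NULL,
        PRIMARY KEY (ClientID, Month)
    )""")
cur.executemany("INSERT INTO MonthlySales VALUES (?, ?, ?)", [
    (1, "2024-01", 150.0),
    (1, "2024-02", 200.0),
])

# Widening the window from 12 to 24 months changes only the query's
# cutoff date, not the model.
total = cur.execute("""
    SELECT SUM(Total) FROM MonthlySales
    WHERE ClientID = ? AND Month >= ?""", (1, "2023-03")).fetchone()[0]
print(total)  # 350.0
```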
Of course, the calculated sales total table presents a problem of its own. Although storing derived totals isn’t a clear violation of Third Normal Form (3NF), it’s a solution that requires new checks and balances: anytime someone updates an order, for any reason, the system must recalculate the monthly sales total for that client. For this reason, the UDF, though slower, might be the wiser choice. It runs only when needed, perhaps just once a month or so, and the resulting values always reflect the most up-to-date data.
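Those checks and balances could take the form of triggers that keep the stored totals in sync. Here is a simplified SQLite sketch with invented names; a real system would also need UPDATE and DELETE triggers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE Orders (
        ClientID INTEGER NOT NULL,
        Month    TEXT NOT NULL,
        Amount   REAL NOT NULL
    );
    CREATE TABLE MonthlySales (
        ClientID INTEGER NOT NULL,
        Month    TEXT NOT NULL,
        Total    REAL NOT NULL DEFAULT 0,
        PRIMARY KEY (ClientID, Month)
    );

    -- The "checks and balances": every new order must update the
    -- stored total, here via an AFTER INSERT trigger.
    CREATE TRIGGER sync_total AFTER INSERT ON Orders
    BEGIN
        INSERT OR IGNORE INTO MonthlySales (ClientID, Month, Total)
        VALUES (NEW.ClientID, NEW.Month, 0);
        UPDATE MonthlySales SET Total = Total + NEW.Amount
        WHERE ClientID = NEW.ClientID AND Month = NEW.Month;
    END;
""")
cur.execute("INSERT INTO Orders VALUES (1, '2024-01', 100.0)")
cur.execute("INSERT INTO Orders VALUES (1, '2024-01', 50.0)")
synced = cur.execute("SELECT Total FROM MonthlySales").fetchone()[0]
print(synced)  # 150.0
```

Every such trigger is extra machinery that can harbor bugs, which is exactly the skew risk the UDF approach avoids.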
So here’s the modeling dilemma: choose between performance and additional work. Choose the UDF, and you (maybe) forfeit performance. Choose a sales total table, and you invest more development time ensuring the accuracy of all the data feeding those final commission figures. There’s no right or wrong; the business, and sometimes the users, will dictate the right solution. But this is exactly why the modeling process works so well: fully normalized tables aren’t always the right choice. Figure D compares how SQL Server handles these issues with two other systems, MySQL and Oracle.
This isn’t where I was headed when I started
We’ve dealt with just a few of the issues surrounding the implementation of a logical model. More than anything, we’ve shown that many of the original design elements may never make it into the finished application. In light of that, we’d like to conclude with two thoughts:
- A logical table should become a physical table only if it must. That’s helpful, isn’t it? If you appreciate the sarcasm, you have a good grasp of the logical-to-physical design process.
- A smart designer invests in a data modeler because at any moment, for reasons beyond the designer's control, the target could change.