Who is Codd and why should you care? Discover the origins of the modern relational data model, how Codd helped develop the rules for relational database theory, and what every developer should know about this theory.
Trying to use a relational database management system (RDBMS) without applying relational database theory to your design is like trying to drive a standard transmission without using the clutch: You're not going to get very far. In fact, your application may come to a grinding halt. If you're lucky, you'll end up with just an inefficient database. But it could be much worse. You could create a database that erroneously reports data or even destroys data. In this series of articles, you'll learn how to apply relational rules and, in the process, develop an efficient design that protects the validity of your data. Let's start the series off with a look at the origins of the relational data model.
In the beginning…
Although we're surrounded by new technologies, many of us still depend on an old workhorse—the RDBMS and relational database theory. Both were born of need because 30 years ago resources for storing large amounts of data were expensive.
Prior to the relational system we use today, data was stored in flat-file format. All data was stored with each record in one large table. Imagine entering each customer's address, phone number, and other vital information over and over—every time that customer placed an order. The flat-file arrangement required redundant data, which in turn produced large files. In addition, flat-file data can be difficult to work with and demands a large pool of human resources to maintain, which eventually equates to more salaries, more office space, and more equipment. In a nutshell, relational database theory was inspired by a problem—the expense and complexity of maintaining data and its associated programs.
The solution to the problem
Redundant data was the problem solved by the relational model. Once the theory existed, the tools followed. The solution appeared after Dr. E. F. Codd published a paper entitled “A Relational Model of Data for Large Shared Data Banks” in 1970 (which appeared in the June 1970 issue of Communications of the ACM). In that paper, Dr. Codd introduced a set of rules intended to eliminate the need to store redundant data. These rules formed the genesis of relational database theory.
The resulting relational model is a data model that represents data in the form of relations. A data model is a conceptual collection of the data, relationships, and constraints on the data. Therefore, it follows that the relational data model is a conceptual representation of the objects, events, and associations in a relational database system. What all that really means is that the relational data model requires that data be stored in relations.
A relation is represented by a table that stores information about an entity—a conceptual collection of one type of data. Each row corresponds to one tuple and a column corresponds to an attribute. In this context, an attribute is simply a named column in a relation. Each attribute is associated with a domain (i.e., the possible set of values for that attribute). For example, Table A lists the domains for a few attributes of a relation that stores data about books.
The domain defines which values can be stored in a particular column. For instance, you might want to restrict a name column to 35 or fewer characters. The set of names accepted is the domain, and that domain has the potential to contain any text string of up to 35 characters. A domain is conceptual; an RDBMS doesn't support domains in a technical sense beyond data types. You must use system features, such as a constraint to limit the number of characters, to place further limits on the type of data a column will accept.
You’ll usually encounter the terms relation, tuple, and attribute only in extremely technical discussions. Throughout the remainder of this series, I’ll use the less formal and more widely recognized equivalent terms: table, record, and column, respectively. Some people prefer the terms file, row, and field, but they’re all talking about the same thing.
Relational tables store data about one thing, concept, or event. A table is a collection of related data stored in rows and columns. A column or field defines one category of data. Together, all the columns constitute a record, which contains all the related data for one item or entity.
Today's RDBMSs put relational database theory into action. Technically, an RDBMS contains the data and the interface tools with which you'll manipulate the data. That means you won't need additional tools to add, delete, update, or view the data you store using the system. Everything you need is already there. The system also stores what’s known as metadata: a description of the stored data itself.
Relational theory was created out of need at a time when memory, both volatile and persistent, was expensive and computers required a lot of physical space. Obviously those problems no longer exist, so why bother with relational data models at all? Utilizing the relational data model allows you to:
- Eliminate redundant data.
- Reduce data inconsistencies.
- Protect data integrity.
Because memory is cheap, redundant data isn’t that much of a problem anymore from a storage standpoint. However, reducing the number of times you must enter each item has the side benefit of reducing data entry errors. Each item is stored only once, so even if you do make an error, you must correct only that one entry. The other advantages the relational data model has over alternative data models will become clearer later in this article series.
Getting started--know your data
You won't find this rule listed in Codd's paper, but being familiar with the data and having a thorough understanding of the database's purpose is the single most important step you can take toward creating an efficient and flexible database. Talk with the users and the managers and gather all the paper forms and reports currently being used and issued. The users know what it takes to gather and process the data. If necessary, sit with the users while they process all this data and observe their manual solutions. The managers know how they want the data presented. Both perspectives are important, so give both groups your ear.
After meeting with the users and managers, write a mission statement. It can be as short or as long as necessary, but it should state the purpose of the database and clarify the general process of achieving that purpose, without stating every detail. Share the mission statement with everyone involved in the project, users and managers, to make sure you have the crux of the project. Don't be surprised or upset if you're asked to refine the statement. The more you refine the statement, the more you reduce your chances of misunderstanding, which could leads to mistakes you'll just have to fix later.
I've explained the origins of the relational data model, but you can't just throw together a bunch of tuples, attributes, and relations, bake them for an hour, and call the results a relational database. There is, of course, more to it than that. You’ll find out more in part two of this series, when we dig into the rules for normalization.