Normalization

Codd proposed Normalization in his paper that outlined the relational model. It is based on the concept of Normal forms. A relation/table is said to be in a particular normal form if it satisfies constraints specified for that normal form. See pages 126-128 for an overview of normal forms (i.e 1NF - 5NF and DK/NF).

Normalization is essentially the process of removing redundant data from relational tables by splitting the tables into smaller tables by projection. The idea is to do this without loss of information. That is, we should be able to combine the new tables by a natural join to recreate the original tables.

To accomplish this we employ a series of normalization rules. We will address rules that allow us to construct tables in third normal form.

First, Second & Third Normal Forms

A relational table is by definition in first normal form (1NF). That is, the attributes/columns of the table must be single valued and so cannot contain repeating groups or arrays. However, tables in 1NF may be susceptible to a variety of update anomalies. To avoid these anomalies, all relational tables should be in, at least, third normal form (3NF).

Second normal form (2NF) and third normal form deal with the dependence between non-key and key attributes/columns. Under 2NF and 3NF, the following general rule applies:

a non-key attribute must provide a fact about the key, the whole key, and nothing but the key

In addition, the table must satisfy 1NF. Following is a discussion of 2NF and 3NF, with rules to transform tables that violate these normal forms into compliant tables.

2NF

2NF is violated when a non-key attribute/column is a fact about a subset of a key. Notice that this is only relevant when the key is a composite key (i.e consists of several attributes/columns). Consider the following relation/table:

INVENTORY(PartNumber, WhseNumber, Quantity, WhseAddress)

The attribute/column WhseAddress is a fact about WhseNumber alone and so is a fact about a subset of the primary key of the relation/table INVENTORY. The following problems may arise from this situation:

  1. WhseAddress is repeated for every tuple that refers to a part in that warehouse. That is, the redundant data problem. Every tuple referring to a part in a particular warehouse would have to be updated if the warehouse address changes. Also, if only some tuples are updated then the database would be in an inconsistent state.
  2. If at some point there are no parts for a particular warehouse then we may have no way of determining the warehouse address from the database.

To address this weakness we may apply the following transformation:

  1. Identify violators. That is, identify any subsets of the primary key (i.e. determinants) and the columns they determine. That is, for the INVENTORY example, WhseNumber is a determinant that determines WhseAddress.
  2. Create a new table for each determinant and the columns it determines. Let the determinant become the primary key of the new table. That is, for the INVENTORY example, create a WAREHOUSE table:

    WAREHOUSE(WhseNumber, WhseAddress)

  3. Remove the determined columns from the original table.

Hence, the INVENTORY table is transformed by projection into two tables.

INVENTORY(PartNumber, WhseNumber, Quantity)

WAREHOUSE(WhseNumber, WhseAddress)

3NF

3NF is violated when a non-key attribute/column is a fact about another nonkey attribute. Notice that this is quite different than the situation presented for 2NF above where a non-key attribute was a fact about a subset of a key. Consider the following relation/table:

EMPLOYEE(EmpID, Department, Location)

Let us assume that each department is located in one place. For example, the IT department is located on the 5th floor of the CNA building. Location, is a fact about Department (in addition to being a fact about employee). The problems that may arise from this situation are similar to those above.

  1. Location is repeated for every tuple that refers to a Department. Again, the redundant data problem.
  2. If at some point the department has no employees then we may have no way of determining the departments location from the database.

To address this weakness we may apply the following transformation:

  1. Identify violators. That is, identify any determinants (in this case they are non-key determinants) and the columns they determine. That is, for the EMPLOYEE example, Department is a determinant that determines Location.
  2. Create a new table for each determinant and the columns it determines. Let the determinant become the primary key of the new table. That is, for the EMPLOYEE example, create a DEPARTMENT table:

    DEPARTMENT(Department, Location)

  3. Remove the determined columns from the original table.

Hence, the EMPLOYEE table is transformed by projection into two tables.

EMPLOYEE(EmpID, Department)

DEPARTMENT(Department, Location)

 

Readings: Chapter 5, pg 126-130 (optionally, 131-145)