Wednesday, October 16, 2013

Benefits of a Synthetic Primary Key

Why make up a primary key, when there's a perfectly good primary key already available?

Great question. I love it. It shows thought and a recognition that there might be a consequence to the choice. 

For a field to be a primary key, all we need is to make sure that every record has a value in that field, and that there are no duplicates, i.e., UNIQUE and NOT NULL.

There is no requirement that a primary key field 
  1. have any meaning in the real world
  2. be mnemonic (remind you of something)
  3. be printed out on anything
  4. be derivable from other information
  5. be available or of interest to anyone other than a database professional
Each of the items above is helpful and useful, but is not strictly needed from a primary key. In fact, some of these helpful characteristics can actually cause practical problems in fields that are used as a primary key.

(There are significant performance implications for this choice, but I'm going to speak to more practical reasons why a completely made-up primary key has benefits.)

Consider the following table called Person, with the following fields:
  • firstName
  • lastName
  • SSN
  • username
For a field to be usable as a primary key, it must minimally be NOT NULL, and UNIQUE. This table, potentially, has fields that we might consider enforcing these requirements on:
  • SSN
  • username
Let's look first at SSN. This would normally be CHAR(9). As the database designer, here are some things to consider about SSN:
  • it means something in other systems
  • it requires that the Person exist in another system
  • it's not a strictly necessary attribute of a Person entity, i.e., People exist that don't have SSNs
  • there's no reasonable validity check on SSN values
Let's assume that our database designer has decided to use SSN as the primary key for the Person table. Somehow, the value '112456678' is listed as John Smith's SSN. How do we know that this is really John Smith's SSN? Suppose it isn't? John contacts us and states that the last character that is '8' should be '3'. So we make the change, and realize that all foreign key references must also be updated. OK, we cascade updates (manually or automatically) through all the foreign key references. What about last month's data from last month that is in the data warehouse, and not "cascadable"? What about the archive data from last year that's not online?

It's not that it's impossible to make all those changes, it's just expensive. Or, we can make the economically sensible decision to not update all the cascading data. Effectively, in our system, John Smith will have become two completely different people, because his primary key will have changed. 

SSN is a great example of what I call an External Key. It's a value that's useful in identifying a person in someone else's system; in this case, the IRS. We have to give it to the IRS when we want to reference a specific Person in their database. However, because it's really the IRS's data, we cannot know, definitively, that it is accurate. 

If it was our data, we could definitively verify that it was correct. Or, we could just decide that it is correct. For example, supposer that we were the IRS, and SSN was our data, and we found that John Smith's last SSN character was '8', but he thought it should be '3'. As long as '112456678' was unique and not null, we could just leave it as is.

So how about using username as a primary key? Let's assume that username is a value that we have complete control over - that we have some method for determining it, and no other system uses it. 

Since we have complete control over it, how should we do it? What makes a good username? It's common to use some scheme like 'SmithJ', i.e., last name, concatenated with the first letter of the first name. People like that, mainly because of certain characteristics:
  1. it has some meaning in the real world - someone's lastname...
  2. it's mnemonic - it helps us remember which person goes with the username
  3. it would be helpful useful information to print out on things
  4. it's derivable from other data
  5. it's useful for people in the real world (not db professionals)
However, all the things that make it useful also make it problematic as a primary key. We still have the problems that SSN had:
  • it means something in other systems (in this case, the social system - we call people by their name)
  • there's no reasonable validity check on names
What happens when someone changes their name? All the same things that happen when we have a mistake in a SSN. We don't control names - people control their own names.

So let's create usernames that don't mean something in the real world! Then it would make a good Primary Key!

Then we lose all the good attributes of a username.

My default recommendation is to have a primary key that is for the database professionals, and has no meaning to anyone else. Something completely non-meaningful; a completely Synthetic Key, that means nothing to anyone else but a db professional.

Why? 

Data Integrity. When a key changes, we lose Referential Integrity. If we cannot control a key value, we cannot guarantee data integrity.

Database professionals need dependable (unchanging) primary keys so that they can be used to relate data. 

Of course, we also want primary keys that will help with performance, but that's for another post...



No comments:

Post a Comment

Followers