Saturday, November 9, 2013

Benefits of an "Autonumber" Synthetic Key

I've taken to creating all my tables with an auto-number, integer primary key called "ID".

In other words:

CREATE TABLE Person
(
    ID INT IDENTITY(1,1) PRIMARY KEY,
    ...
)



In another post, I talked about the good characteristics of a synthetic, non-meaningful key. In summary, meaningful keys are likely to need to change at some point, which can cause a problem because:
  • all foreign keys will need to be updated
  • some foreign keys may not be easily updatable, for instance in older records that are archived or offline
So if you are going to synthesize, or completely invent, a field to be the primary key, what do you want out of that field?

In order to guarantee Entity Integrity, and satisfy the Guaranteed Access Rule (GAR), all you really need is that a key field be UNIQUE and NOT NULL. (The GAR requires that every value in a database be uniquely addressable. This is accomplished by tablename.fieldname WHERE primarykey=x.)
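
For example (assuming the Person table above has a lastName column, and 42 is just an arbitrary key value), any single value in the database can be addressed by naming the table, the column, and a primary key value:

SELECT lastName
FROM   Person
WHERE  ID = 42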

Another necessity is to have enough unique values for the table in question. Remember that data can live a long time (years, decades). Ask this question: How many records might be inserted in this table over the next ten years? That might seem hard, but you can start with how many per day, and do the math. Also consider that your traffic may increase over time, especially if your system is useful.
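
As a rough, made-up example: 20,000 inserts per day works out to about 20,000 x 365 x 10 ≈ 73 million rows over ten years, comfortably inside the roughly 2.1 billion positive values an INT can hold. If the math lands anywhere near that limit, a BIGINT identity is the safer sketch (the table and column names here are hypothetical):

CREATE TABLE HighVolumeEvent
(
    ID BIGINT IDENTITY(1,1) PRIMARY KEY,
    eventDate DATETIME NOT NULL
)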

What else would be nice?

First, the field value for a record should not change. As pointed out above, a changing value in a primary key field will require cascading updates to all foreign keys. This can be difficult to accomplish, especially as data becomes older, and perhaps not online.

Second, the field should be small. Using the number of bytes necessary, but no more, keeps your records concise and increases the number of records that fit on a page. When the database engine has to do an expensive disk IO to get a page, you'd like that page to carry as many records as possible; you're paying for the IO either way. Would you rather get 10 pencils for a dollar, or 12?
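
For a rough illustration (the sizes here are assumptions, not measurements): with roughly 8,096 bytes of usable space on an 8 KB page, an 82-byte record with a 4-byte INT key fits about 98 records per page, while widening the key to a 16-byte GUID grows the record to 94 bytes and drops that to about 86 records per page. And since the key is repeated in every foreign key and every index that carries it, the extra bytes multiply.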

Third, it might be nice to arrange it so that new records enter in an orderly fashion. This is where an "AutoNumber" field is helpful (IDENTITY in MS SQL Server).

Here's a layman's example. Think of a line of cars at a tollbooth. Usually, cars arrive at the tollbooth and go to the end of the line. There's no question about where they go, and they don't affect any other cars in the line. But suppose that as cars arrive at the tollbooth, they must be arranged in license plate order. First, it must be decided where each car belongs in the line. Then which other cars have to move, then how to move them, and so on. That's a lot of work, and it will take a lot of time.

In the database setting it is also convenient to have new records placed "at the end". If your table uses an ascending "AutoNumber" field as the primary key, new records get placed at the end. Furthermore, the database doesn't even have to ask the question "where does this record belong?"; it just puts it at the end. No previously inserted records need to be relocated.
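
A minimal sketch (assuming the Person table above has a lastName column, and that its other columns allow NULLs or have defaults): the INSERT never mentions ID; the engine assigns the next value and appends the row at the end of the clustered index.

INSERT INTO Person (lastName)
VALUES ('Smith')

SELECT SCOPE_IDENTITY() AS newID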

And the tollbooth analogy goes further. Cars go through the tollbooth, leaving the line. The "oldest" cars go through the tollbooth first. Suppose we were making cars line up by license plate number, and then we said, "The car that has been here longest gets to go through the tollbooth first!" We'd have to figure out which one was oldest, then make other cars move, etc. That's crazy!

In a database, records can leave the transactional table as they are archived. These are typically the "oldest" records. If you are using an AutoNumber primary key, all the records that need to be archived are together, at the beginning of the list.
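
Here's a sketch of what that archiving can look like (PersonArchive and the cutoff value are hypothetical, and PersonArchive's ID column is a plain INT without the IDENTITY property):

DECLARE @cutoffID INT
SET @cutoffID = 500000

INSERT INTO PersonArchive
SELECT *
FROM   Person
WHERE  ID < @cutoffID

DELETE FROM Person
WHERE  ID < @cutoffID

Everything to be archived sits in one contiguous range at the low end of the key, so both statements touch one tidy slice of the table.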

Finally, using an AutoNumber tends to keep "hot" records in RAM rather than on disk, improving performance. In a transactional database the newest records tend to be accessed more often, and the older records less often. As an example, suppose I place an order at an e-tailer. I'm much more likely to change my mind and modify the order within 10 minutes of placing it than after a week.

Using an AutoNumber means that the newest records are all together, on the same data pages. The database engine is smart enough to keep frequently accessed data pages in RAM to avoid expensive disk accesses. As an example, I've noticed that my dentist's office pulls all the files for the current day and keeps them at the check-in desk, so they are convenient and quick to access. Yesterday's files are filed in the back, because we won't need them for a while.

To summarize, Autonumber fields are helpful because:

  • new records get inserted in order, without affecting other records
  • old records get archived in order, without affecting other records
  • they help the database keep current records in RAM
Of course, there are exceptions to every rule. The advice above would apply to transactional systems with a variety of operations (INSERT, UPDATE, DELETE, SELECT).


Wednesday, October 16, 2013

Benefits of a Synthetic Primary Key

Why make up a primary key, when there's a perfectly good primary key already available?

Great question. I love it. It shows thought and a recognition that there might be a consequence to the choice. 

For a field to be a primary key, all we need is to make sure that every record has a value in that field, and that there are no duplicates, i.e., UNIQUE and NOT NULL.

There is no requirement that a primary key field 
  1. have any meaning in the real world
  2. be mnemonic (remind you of something)
  3. be printed out on anything
  4. be derivable from other information
  5. be available or of interest to anyone other than a database professional
Each of the items above is helpful and useful, but none of them is strictly required of a primary key. In fact, some of these helpful characteristics can actually cause practical problems in fields that are used as a primary key.

(There are significant performance implications for this choice, but I'm going to speak to more practical reasons why a completely made-up primary key has benefits.)

Consider the following table called Person, with the following fields:
  • firstName
  • lastName
  • SSN
  • username
For a field to be usable as a primary key, it must minimally be NOT NULL and UNIQUE. This table has two fields on which we might consider enforcing those requirements:
  • SSN
  • username
Let's look first at SSN. This would normally be CHAR(9). As the database designer, here are some things to consider about SSN:
  • it means something in other systems
  • it requires that the Person exist in another system
  • it's not a strictly necessary attribute of a Person entity, i.e., People exist that don't have SSNs
  • there's no reasonable validity check on SSN values
Let's assume that our database designer has decided to use SSN as the primary key for the Person table. Somehow, the value '112456678' is listed as John Smith's SSN. How do we know that this is really John Smith's SSN? Suppose it isn't? John contacts us and states that the last character, '8', should be '3'. So we make the change, and realize that all foreign key references must also be updated. OK, we cascade updates (manually or automatically) through all the foreign key references. What about last month's data that is in the data warehouse, and not "cascadable"? What about the archive data from last year that's not online?
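
Here's a sketch of the mechanics with a hypothetical Orders table (assuming SSN really is the primary key of Person, and that the foreign key was declared with ON UPDATE CASCADE):

CREATE TABLE Orders
(
    orderID     INT IDENTITY(1,1) PRIMARY KEY,
    customerSSN CHAR(9) NOT NULL
        REFERENCES Person (SSN) ON UPDATE CASCADE
)

UPDATE Person
SET    SSN = '112456673'
WHERE  SSN = '112456678'

The UPDATE ripples through every online foreign key, but the copies sitting in the warehouse and the offline archive never see the change.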

It's not that it's impossible to make all those changes, it's just expensive. Or, we can make the economically sensible decision to not update all the cascading data. Effectively, in our system, John Smith will have become two completely different people, because his primary key will have changed. 

SSN is a great example of what I call an External Key. It's a value that's useful in identifying a person in someone else's system; in this case, the IRS. We have to give it to the IRS when we want to reference a specific Person in their database. However, because it's really the IRS's data, we cannot know, definitively, that it is accurate. 

If it was our data, we could definitively verify that it was correct. Or, we could just decide that it is correct. For example, suppose that we were the IRS, SSN was our data, and we found that John Smith's last SSN character was '8', but he thought it should be '3'. As long as '112456678' was unique and not null, we could just leave it as is.

So how about using username as a primary key? Let's assume that username is a value that we have complete control over - that we have some method for determining it, and no other system uses it. 

Since we have complete control over it, how should we do it? What makes a good username? It's common to use some scheme like 'SmithJ', i.e., last name, concatenated with the first letter of the first name. People like that, mainly because of certain characteristics:
  1. it has some meaning in the real world - someone's lastname...
  2. it's mnemonic - it helps us remember which person goes with the username
  3. it's useful information to print out on things
  4. it's derivable from other data
  5. it's useful for people in the real world (not db professionals)
However, all the things that make it useful also make it problematic as a primary key. We still have the problems that SSN had:
  • it means something in other systems (in this case, the social system - we call people by their name)
  • there's no reasonable validity check on names
What happens when someone changes their name? All the same things that happen when we have a mistake in a SSN. We don't control names - people control their own names.

So let's create usernames that don't mean anything in the real world! Then they would make good Primary Keys!

Then we lose all the good attributes of a username.

My default recommendation is to have a primary key that is for the database professionals and has no meaning to anyone else: a completely Synthetic Key.

Why? 

Data Integrity. When a key changes, we lose Referential Integrity. If we cannot control a key value, we cannot guarantee data integrity.

Database professionals need dependable (unchanging) primary keys so that they can be used to relate data. 
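
Putting that together, here is a minimal sketch of the Person table (the column sizes are assumptions): the synthetic ID is the primary key, and the meaningful values are ordinary columns that are free to change.

CREATE TABLE Person
(
    ID        INT IDENTITY(1,1) PRIMARY KEY, -- synthetic, meaningless, never changes
    firstName NVARCHAR(25) NOT NULL,
    lastName  NVARCHAR(25) NOT NULL,
    SSN       CHAR(9) NULL,                  -- external key: stored, but not trusted as the PK
    username  NVARCHAR(30) NOT NULL UNIQUE   -- meaningful and mnemonic, and free to change
)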

Of course, we also want primary keys that will help with performance, but that's for another post...



Tuesday, September 24, 2013

How indexes can help even without a where clause

I always see examples on indexing that focus on the WHERE clause, and writing SARG-able queries.

And rightly so. It's easy to demonstrate huge benefits when you can match an index to a particular WHERE clause. In other words, it's easy to convert a Table Scan operation to an Index Seek operation by the addition of a specific index.
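
(As a quick sketch of what SARG-able means, assuming an index keyed on lastName: a predicate the optimizer can match directly to the index key can seek, while wrapping the column in a function cannot.)

-- SARG-able: can become an Index Seek on an index keyed on lastName
SELECT ID
FROM   Person
WHERE  lastName = 'Adams'

-- not SARG-able: the function applied to the column forces a scan
SELECT ID
FROM   Person
WHERE  UPPER(lastName) = 'ADAMS'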

But indexes can help quite a bit even without a WHERE clause. In other words, converting a Table Scan to an Index Scan can be quite helpful too.

So let me demonstrate some of this with a table that I have described below.

CREATE TABLE [dbo].[Person](
 [ID] [int] IDENTITY(1,1) NOT NULL,
 [salutory] [nvarchar](15) NULL,
 [firstName] [nvarchar](25) NULL,
 [middleInitial] [nchar](1) NULL,
 [lastName] [nvarchar](25) NOT NULL,
 [nameSuffix] [nvarchar](8) NULL,
 [email] [nvarchar](255) NULL,
 [entryDate] [datetime] NOT NULL,
 [lastUpdateDate] [datetime] NOT NULL,
 [weight] [float] NULL,
 [gender] [char](1) NULL,
 [dateOfBirth] [date] NULL,
 CONSTRAINT [PK_Person] PRIMARY KEY CLUSTERED 
(
 [ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
) ON [PRIMARY]

To start with, the only index is the clustered index on the primary key, ID. And there is a lot of cooked / simulated data in the table:

SELECT COUNT(*)
FROM   Person

Which yields: 677,025

First, I'll make sure that the table is defragmented as much as possible:

DBCC DBREINDEX (Person, PK_Person, 100)

And to make sure that happened:

DBCC SHOWCONTIG (Person, PK_Person)

Which gives:

DBCC SHOWCONTIG scanning 'Person' table...
Table: 'Person' (2137058649); index ID: 1, database ID: 6
TABLE level scan performed.
- Pages Scanned................................: 6884
- Extents Scanned..............................: 861
- Extent Switches..............................: 860
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 100.00% [861:861]
- Logical Scan Fragmentation ..................: 0.01%
- Extent Scan Fragmentation ...................: 0.93%
- Avg. Bytes Free per Page.....................: 40.6
- Avg. Page Density (full).....................: 99.50%
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

From this, we can tell a few things about this table I've cooked up:
  • The table has 6884 pages in it
  • The pages are almost entirely full, 99.50% page density
  • 677,025 records spread across 6884 pages gives about 98 records per page
  • Since each page holds roughly 8,096 bytes of data, each record is roughly 8096/98 ≈ 83 bytes
So let's see what it will take to do a basic query:

SET STATISTICS IO ON

SELECT lastName
FROM   Person

The SET command will return some IO statistics for the query. In this case, the statistics message is:

(677033 row(s) affected)
Table 'Person'. Scan count 1, logical reads 6898, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Here's what this tells me:
  • zero physical reads tells me that all my records are in cache (this is not particularly realistic - normally you'd have more data than RAM)
  • 6898 logical reads tells me that 6898 pages that were in cache were accessed to give the result
  • In a more realistic setting, I'd have experienced somewhere between 0 and 6898 physical reads
For purposes of discussion, we'll assume that "logical reads" and "physical reads" are equivalently undesirable. In other words, fewer reads means better performance.

Also, see the query plan that the SQL Engine came up with:

In this case, scanning the Clustered Index is effectively a table scan of Person: the engine goes to the leaf level of the Clustered Index (the data pages of the index) and scans across from data page to data page.

Now let's add an index on lastname:

CREATE NONCLUSTERED INDEX [IX_Person_LastName] ON [dbo].[Person] 
(
 [lastName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

and look at some statistics on it:

DBCC SHOWCONTIG (Person, IX_Person_LastName)

DBCC SHOWCONTIG scanning 'Person' table...
Table: 'Person' (2137058649); index ID: 13, database ID: 6
LEAF level scan performed.
- Pages Scanned................................: 1980
- Extents Scanned..............................: 250
- Extent Switches..............................: 251
- Avg. Pages per Extent........................: 7.9
- Scan Density [Best Count:Actual Count].......: 98.41% [248:252]
- Logical Scan Fragmentation ..................: 0.20%
- Extent Scan Fragmentation ...................: 1.60%
- Avg. Bytes Free per Page.....................: 23.4
- Avg. Page Density (full).....................: 99.71%
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Now let's estimate some things about the index:
  • The index has 1980 pages in it
  • The pages are almost entirely full, 99.71% page density
  • 677,025 records spread across 1980 pages gives about 341 records per page
  • Since each page holds roughly 8,096 bytes of data, each index record is roughly 8096/341 ≈ 24 bytes
Now let's do the same select statement, and see what happens:

SELECT lastName
FROM   Person

Which gives this message about the IO:


(677041 row(s) affected)
Table 'Person'. Scan count 1, logical reads 1996, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

And this query plan:
So, here is what the index accomplished for us:
  • 1996 reads vs 6898 reads without the index (about 29% of the cost)
  • A scan of the smaller non-clustered index, and no use of the base table at all
OK, so maybe you saw that coming.

If we can get everything we need from a smaller data structure, then it should cost less. This is known as a Covering Index. In this case, our SELECT statement that only references the lastName column is covered by the index that has lastName in it.
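
As a side note (a sketch, not something I demonstrate below): if the query also needed firstName, the index could still cover it by carrying firstName at the leaf level with INCLUDE. The index name here is hypothetical.

CREATE NONCLUSTERED INDEX IX_Person_LastName_Covering
ON dbo.Person (lastName)
INCLUDE (firstName)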

Here's one that you might not see coming. Consider this statement:

SELECT COUNT(*)
FROM   Person

Guess what? Without the index, it costs 6898 reads (a table scan), and with the index it costs 1979 (a scan of the smaller nonclustered index).

Let's consider a different index. This time, I'm going to index on lastName and weight, replacing the earlier IX_Person_LastName (note DROP_EXISTING = ON):

CREATE NONCLUSTERED INDEX [IX_Person_LastName] ON [dbo].[Person] 
(
 [lastName] ASC,
 [weight] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = ON, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

And here is a SELECT statement to consider:

SELECT  AVG(weight)
FROM    Person

This results in 2919 reads, and this query plan:


Hey, that's an Index Scan on IX_Person_LastName! And look, no WHERE clause!

Again, this shows the power of a pure covering index. The database engine is able to get every weight in the Person table, by scanning a smaller data structure, i.e., the index.

Make special note that the index had a benefit, even though there was no WHERE clause, and no Index Seek. Indexes can help, even without the WHERE clause, and even without accomplishing a seek.

But if you really want to see the Seek, and a really big performance improvement, here you go:

SELECT  AVG(weight)
FROM    Person
WHERE   LastName = 'Adams'

9 reads, and the Index Seek that is so IO-efficient:



Saturday, June 22, 2013

Google Music Manager Install on Windows 8

Had a bit of trouble getting Google's Music Manager to install on Windows 8.

In short, the solution was to "unblock" the executable after downloading it, but I couldn't find any guidance on this. 

Here's what I was seeing, and how to solve it.

Google's Music Manager is used to upload music into Google Play's My Music area. I've liked this service, and use it on a number of machines, mainly running Windows 7 and Windows Vista.

So, from within Google Play, My Music, you choose Upload Music:

Which brings you to the Download Music Manager screen:

When you click on the Download Music Manager button, you download an installer:

So, if you just click on this file, you will get a message that says something like "connecting to the internet". But it just stays there. 

Strangely enough, I was unable to reproduce this to provide the screen shot. Even after un-installing Music Manager, and checking for new Windows Firewall holes, I couldn't reproduce the problem with installation.

To fix the issue, open your Downloads folder by clicking on the context menu for the downloaded file and choosing "Show in Folder".

Once in the downloads folder, right click on the musicmanagerinstall.exe file, and choose properties in the pop-up menu.

Click the Unblock button.

Click on OK, then double click the musicmanagerinstall.exe to begin the install. It should work fine now.

I think that this is a good safety feature - executable files off the internet should not have a high level of privileges when they are executed. 

It appears that this installer is a very small file, but that it actually goes back out to the internet to download more files for the installation. In contrast, when I installed iTunes, it downloaded a fairly large (gigabytes) installer, which did not go back out to the internet for anything else, until it was completely installed.

Strange that I couldn't find any help on the internet on this. Not even in the Google Help area.

Wednesday, March 20, 2013

Don't use a temp table, use a Common Table Expression


When Common Table Expressions (CTEs) were first added in SQL Server 2005, I glossed over them as a construct for recursive queries over hierarchies of self-referential records. That does appear to be the main use case, but I see more and more uses for them.

In particular, I see situations where temp tables are used that could be phrased as a CTE.

Here is the situation. A software developer who naturally thinks in a procedural manner may have trouble with the declarative nature of a subquery. The natural, procedural way to handle this is to break the problem into two sequential steps, saving the intermediate result in a temp table. This can have performance implications, and may not be particularly self-documenting.

Suppose a software developer wants to find the person who weighs the most in the Person table. (In my environment, this table has 421,958 records on about 7,000 pages.)

The classic SQL pattern for this is a simple subquery:

SELECT   ID, 
         firstname, 
         lastname
FROM     Person
WHERE    weight =
          (SELECT MAX(weight)
           FROM   Person)

However, the developer may not be familiar with this pattern and, thinking in a step-wise manner, writes the following instead:

DECLARE @maxWeight as float

SELECT  @maxWeight = MAX(weight)
FROM    Person

SELECT  ID,
        firstname,
        lastname
FROM    Person
WHERE   weight = @maxWeight 


There are several problems with this:
  • in a dynamic data environment, there is no guarantee that the actual maximum weight will not change between the two SELECT statements which can result in
    • the wrong person selected (if the maximum weight changes between the select statements)
    • no record selected (if the record has been deleted between the select statements)
  • this is a significantly different problem for the SQL Engine to solve
To see this, assume two simple indexes on the Person table:
  • Clustered index on ID (PK_Person)
  • Non-Clustered index on weight (IX_weight)
The query plan for the subquery version looks like this:

The script version generates two separate query plans that look like this:

 
Regardless of the actual performance difference between the two query plans, it's clear that the software developer has forced the SQL Engine to perform two separate operations. This prevents the query optimizer from combining the two into, perhaps, a lower-cost form. Further, two separate query plans must be optimized and cached.

A common table expression (CTE) may be a more natural form for a procedural developer.

Here's how it looks:

;WITH maxWeight (weight)
   AS (  SELECT  MAX(weight)
         FROM   Person)    


SELECT   Person.ID,
         Person.firstname,
         Person.lastname
FROM     Person, maxWeight
WHERE    Person.weight = maxWeight.weight


Which generates the exact same query plan as the subquery version:

You can see that it is phrased as a two-step process: Create the CTE using the WITH construct, then reference it in the following SELECT statement. This is in contrast to the subquery, which states the record set in a more declarative fashion. And the subquery may be perceived as happening second, because of the phrasing.

Of course, that was a logically simple example - the subquery just returns a single scalar value.
So what about this:

SELECT   Person.ID,
         Person.firstname,
         Person.lastname,
         Person.weight
FROM     Person
WHERE    Person.weight =
          (SELECT MAX(weight)
           FROM   Person p
           WHERE  p.lastName = Person.lastName)


This is a correlated subquery which gives the heaviest person by last name. In other words, the heaviest person with the last name of Adams, the heaviest person with the last name of Brown, etc. We know this is a correlated subquery because the reference to Person in the predicate p.lastName = Person.lastName refers to the Person table in the outer query; within the inner query the table has been aliased to p. (In this database, there are 67,726 distinct last names. I generated the data with a last name list from the U.S. census.)

So, this is a logically more sophisticated SQL language pattern that an average procedural developer may not be familiar with. A procedural rephrasing would look like this:

SELECT   lastname,
         MAX(weight) AS maxWeight
INTO     #temp        
FROM     Person
GROUP BY lastName      

SELECT   Person.ID,
         Person.firstName,
         Person.lastName,
         Person.weight
FROM     Person, #temp
WHERE    Person.lastName = #temp.lastName
  AND    Person.weight   = #temp.maxWeight

DROP TABLE #temp


The subquery version of this results in 14,054 logical IOs. The procedural version results in 7,027 IOs in the first SELECT and 7,027 in the second, which also sums to 14,054. But it also resulted in 291 additional IOs against the #temp table in the second SELECT statement. And, of course, the #temp table had to be allocated and dropped.


This could be rephrased as a CTE, which is more "procedural" looking:

;WITH maxWeightByLastName (lastName, maxWeight)
AS    (SELECT   lastname,
               MAX(weight) AS maxWeight
       FROM     Person
       GROUP BY lastName)


SELECT Person.ID,
       Person.firstName,
       Person.lastName,
       Person.weight
FROM   Person, maxWeightByLastName
WHERE  Person.lastName = maxWeightByLastName.lastName
  AND  Person.weight = maxWeightByLastName.maxWeight


Which results in 14,054 logical IOs, and has a query plan identical to the correlated subquery:


A more subtle reason to avoid the #temp table version is that it can be difficult to tune, because the query optimizer is forced to treat the two operations in isolation.

To show this, suppose I create a non-clustered index on lastname and weight. To summarize, there will be two indexes:
  • Clustered index on ID (PK_Person)
  • Non-Clustered index on lastName and weight (IX_lastName_weight) 
Also, to create the perfect situation, we'll NOT include firstname in the SELECT clause. In other words, the new index covers all the fields we need.

Running the #temp table version generates 1,842 IOs in the first SELECT, 1,842 in the second, and 291 IOs against the #temp table, for a total of 3,975. The two query plans look like this:


Note that both query plans use our IX_lastName_weight index - so the index is helpful.

Both the correlated subquery version and the CTE version generate the same query plan:

And this version results in 1847 logical IOs. In other words, the #temp table version requires more than twice the IOs, and also has the overhead of temp table creation and dropping. Of course, this is something of a perfect storm example, but do you really want to take away the ability of the query optimizer to do its job?

So why the difference?

Notice that the temp table version requires two separate scans of the index, while the CTE/subquery version only requires one. The query optimizer is able to accomplish the task with a single scan of the index, but because of the #temp table phrasing, it is forced to scan twice.

The added index scan is entirely due to the procedural temp table logic - it forces two scans of the index when a single scan would have sufficed.


In summary, I believe that CTEs may be a good option for software developers who tend to think in a procedural manner. In this case, a CTE might feel like a more natural representation than a subquery, and avoid some of the overhead of a temporary table.
