Wednesday, March 20, 2013

Don't use a temp table, use a Common Table Expression


When Common Table Expressions (CTEs) were first added in SQL 2005, I glossed them over as a construct for achieving recursive queries for hierarchies of self-referential records. Of course, this appears to be the main use case, but I see more and more use for them.

In particular, I see situations where temp tables are used that could be phrased as a CTE.

Here is the situation. A software developer who naturally thinks in a procedural manner may have trouble with the declarative nature of a subquery. The natural, procedural, way to handle this is to break the problem into two sequential steps saving the intermediate result in a temp table. This can have performance implications, and may not be particularly self-documenting.

Suppose a software developer wants to find the person who weighs the most from the Person table as shown below. (In my environment, this table has 421,958 records on about 700 pages.)




The classic SQL pattern for this is a simple subquery:

SELECT   ID, 
         firstname, 
         lastname
FROM     Person
WHERE    weight =
          (SELECT MAX(weight)
           FROM   Person)

However, the developer may not be familiar with this pattern, and tends to think in a step-wise manner, so  writes the following:

DECLARE @maxWeight as float

SELECT  @maxWeight = MAX(weight)
FROM    Person

SELECT  ID,
        firstname,
        lastname
FROM    Person
WHERE   weight = @maxWeight 


There are several problems with this:
  • in a dynamic data environment, there is no guarantee that the actual maximum weight will not change between the two SELECT statements which can result in
    • the wrong person selected (if the maximum weight changes between the select statements)
    • no record selected (if the record has been deleted between the select statements)
  • this is a significantly different problem for the SQL Engine to solve
 To see this, assume two simple indexes on the Person table
  • Clustered index on ID (PK_Person)
  • Non-Clustered index on weight (IX_weight)
The query plan for the subquery version looks like this:





The script version generates two separate query plans that look like this:

 
Regardless of the actually performance change in the two query plans, it's clear that the software developer has forced the SQL Engine to perform two separate operations. This prevents the query optimizer from combining the two into, perhaps, a lower cost form. Further, two separate query plans are optimized and cached.

A common table expression (CTE) may be a more natural form for a procedural developer.

Here's how it looks:

;WITH maxWeight (weight)
   AS (  SELECT  MAX(weight)
         FROM   Person)    


SELECT   Person.ID,
         Person.firstname,
         Person.lastname
FROM     Person, maxWeight
WHERE    Person.weight = maxWeight.weight


Which generates the exact same query plan as the subquery version:



You can see that it is phrased as a two-step process: Create the CTE using the WITH construct, then reference it in the following SELECT statement. This is in contrast to the subquery, which states the record set in a more declarative fashion. And the subquery may be perceived as happening second, because of the phrasing.

Of course, that was a really logically simple example - really it's just a scalar, not a subquery.
So what about this:

SELECT   Person.ID,
         Person.firstname,
         Person.lastname,
         Person.weight
FROM     Person
WHERE    Person.weight =
          (SELECT MAX(weight)
           FROM   Person p
           WHERE  p.lastName = Person.lastName)


This is a correlated subquery which gives the heaviest person, by last name. In other words, the heaviest person with last name of Adams, the heaviest person with the last name of Brown, etc. We know this is a correlated subquery because the blue-highlighted reference to the Person table is referencing the Person table in the outer query. The Person table does not exist in the inner query, because it was renamed to p. (In this database, there are 67,726 distinct last names. I generated this with a last name list from the U.S. census.)

So, this is a logically more sophisticated SQL language pattern that an average procedural developer may not be familiar with. A procedural rephrasing would look like this:

SELECT   lastname,
         MAX(weight) AS maxWeight
INTO     #temp        
FROM     Person
GROUP BY lastName      

SELECT   Person.ID,
         Person.firstName,
         Person.lastName,
         Person.weight
FROM     Person, #temp
WHERE    Person.lastName = #temp.lastName
  AND    Person.weight   = #temp.maxWeight

DROP TABLE #temp


The subquery version of this results in 14,054 logical IOs. The procedural version results in 7027 in the first SELECT, then 7027 in the second SELECT which sums to 14,054. But it ALSO, resulted in 291 additional IOs on the #temp table in the second SELECT statement. And, of course, the #temp table had to be allocated and dropped.


This could be rephrased as a CTE, which is more "procedural" looking:

;WITH maxWeightByLastName (lastName, maxWeight)
AS    (SELECT   lastname,
               MAX(weight) AS maxWeight
       FROM     Person
       GROUP BY lastName)


SELECT Person.ID,
       Person.firstName,
       Person.lastName,
       Person.weight
FROM   Person, maxWeightByLastName
WHERE  Person.lastName = maxWeightByLastName.lastName
  AND  Person.weight = maxWeightByLastName.maxWeight


Which results in 14,054 logical IOs, and has a query plan identical to the correlated subquery:


A more subtle advantage to motivate avoiding the #temp table version is that it can be difficult to tune, because the query optimizer is forced to isolate the two separate operations.

To show this, suppose I create a non-clustered index on lastname and weight. To summarize, there will be two indexes:
  • Clustered index on ID (PK_Person)
  • Non-Clustered index on lastName and weight (IX_lastName_weight) 
Also, to create the perfect situation, we'll NOT include firstname in the SELECT clause. In other words, the new index covers all the fields we need.

Running the #temp table version generates a 1842 IOs in the first select, 1842 in the second, and 291 IOs to the #temp table for a total of 3975. The two query plans look like:


Note that both query plans uses our IX_lastName_weight index - so the index is helpful.

Both the correlated subquery version and the CTE version generate the same query plan:

And this version results in 1847 logical IOs. In other words, the #temp table version requires more than twice the IOs, and also has the overhead of temp table creation and dropping. Of course, this is something of a perfect storm example, but do you really want to take away the ability of the query optimizer to do its job?

So why the difference?

Notice that the temp table version requires two separate scans of the index, while the CTE/subquery version only requires one. The query optimizer has been able to accomplish the task with a single scan of the table, but because of the #temp table phrasing, is forced to do it twice.

The added index scan is entirely because of the procedural temp table logic - it forces two scans of the index, when a single would have sufficed.


In summary, I believe that CTEs may be a good option for software developers who tend to think in a procedural manner. In this case, a CTE might feel like a more natural representation than a subquery, and avoid some of the overhead of a temporary table.

Followers