When Common Table Expressions (CTEs) were first added in SQL Server 2005, I glossed over them as a construct for writing recursive queries over hierarchies of self-referential records. That does appear to be their main use case, but I keep finding more and more uses for them.
In particular, I see situations where temp tables are used that could be phrased as a CTE.
Here is the situation. A software developer who naturally thinks in a procedural manner may have trouble with the declarative nature of a subquery. The natural, procedural way to handle the problem is to break it into two sequential steps, saving the intermediate result in a temp table. This can have performance implications, and it may not be particularly self-documenting.
Suppose a software developer wants to find the person who weighs the most from the Person table as shown below. (In my environment, this table has 421,958 records on about 700 pages.)
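For reference, a stripped-down version of the Person table looks roughly like this (only the columns used in this post are shown, and the data types are approximations):
CREATE TABLE Person
(
    ID        int IDENTITY(1, 1) NOT NULL,
    firstname varchar(50)        NOT NULL,
    lastname  varchar(50)        NOT NULL,
    weight    float              NULL,
    CONSTRAINT PK_Person PRIMARY KEY CLUSTERED (ID)
);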
The classic SQL pattern for this is a simple subquery:
SELECT ID,
       firstname,
       lastname
FROM Person
WHERE weight = (SELECT MAX(weight)
                FROM Person)
However, a developer who is not familiar with this pattern, and who tends to think in a step-wise manner, might write the following instead:
DECLARE @maxWeight AS float

SELECT @maxWeight = MAX(weight)
FROM Person

SELECT ID,
       firstname,
       lastname
FROM Person
WHERE weight = @maxWeight
There are several problems with this:
- In a dynamic data environment, there is no guarantee that the actual maximum weight will not change between the two SELECT statements, which can result in:
  - the wrong person being selected (if the maximum weight changes between the two statements)
  - no record being selected (if that record is deleted between the two statements; see the illustration just after this list)
- This is a significantly different problem for the SQL Engine to solve.
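To illustrate that second failure mode, imagine (purely hypothetically) that another session happens to run something like this in the gap between the script's two statements:
-- Hypothetical activity from another session, executed between the
-- statement that captures @maxWeight and the final SELECT:
DELETE FROM Person
WHERE weight = (SELECT MAX(weight) FROM Person);
The final SELECT ... WHERE weight = @maxWeight now returns nothing, because @maxWeight was captured before the heaviest row was deleted.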
For reference, the Person table has two indexes:
- Clustered index on ID (PK_Person)
- Non-Clustered index on weight (IX_weight)
The script version generates two separate query plans that look like this:
Regardless of the actual performance difference between the two query plans, it's clear that the software developer has forced the SQL Engine to perform two separate operations. This prevents the query optimizer from combining them into, perhaps, a lower-cost form. Further, two separate query plans must be optimized and cached.
A common table expression (CTE) may be a more natural form for a procedural developer.
Here's how it looks:
;WITH maxWeight (weight)
AS (SELECT MAX(weight)
    FROM Person)
SELECT Person.ID,
       Person.firstname,
       Person.lastname
FROM Person, maxWeight
WHERE Person.weight = maxWeight.weight
Which generates the exact same query plan as the subquery version:
You can see that it is phrased as a two-step process: first create the CTE using the WITH construct, then reference it in the SELECT statement that follows. This is in contrast to the subquery version, which states the record set in a more declarative fashion, and where the subquery may be perceived as happening second simply because it appears later in the statement.
Of course, that was a logically simple example; the inner query just returns a single scalar value.
So what about this:
SELECT Person.ID,
       Person.firstname,
       Person.lastname,
       Person.weight
FROM Person
WHERE Person.weight = (SELECT MAX(weight)
                       FROM Person p
                       WHERE p.lastName = Person.lastName)
This is a correlated subquery that gives the heaviest person for each last name - in other words, the heaviest person with the last name Adams, the heaviest person with the last name Brown, and so on. We know it is correlated because the reference to Person.lastName inside the inner query refers to the Person table of the outer query; within the inner query itself, the Person table has been aliased as p. (In this database, there are 67,726 distinct last names, generated from a last name list from the U.S. census.)
So, this is a logically more sophisticated SQL language pattern that an average procedural developer may not be familiar with. A procedural rephrasing would look like this:
SELECT lastName,
       MAX(weight) AS maxWeight
INTO #temp
FROM Person
GROUP BY lastName

SELECT Person.ID,
       Person.firstName,
       Person.lastName,
       Person.weight
FROM Person, #temp
WHERE Person.lastName = #temp.lastName
  AND Person.weight = #temp.maxWeight

DROP TABLE #temp
The subquery version of this results in 14,054 logical IOs. The procedural version results in 7,027 in the first SELECT and 7,027 in the second, which also sums to 14,054 - but it adds another 291 IOs against the #temp table in the second SELECT statement. And, of course, the #temp table has to be allocated and dropped.
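These logical IO counts are the sort of numbers you can capture with SET STATISTICS IO; here is a sketch of how you might compare the versions yourself (your exact counts will depend on your data and indexes):
SET STATISTICS IO ON;

-- Run the correlated subquery version here, then the #temp table version.
-- The Messages output reports one line per table touched, along the lines of:
--   Table 'Person'. Scan count 1, logical reads ...
-- Summing the logical reads across statements gives totals like those quoted above.

SET STATISTICS IO OFF;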
This could be rephrased as a CTE, which is more "procedural" looking:
;WITH maxWeightByLastName (lastName, maxWeight)
AS (SELECT lastName,
           MAX(weight) AS maxWeight
    FROM Person
    GROUP BY lastName)
SELECT Person.ID,
       Person.firstName,
       Person.lastName,
       Person.weight
FROM Person, maxWeightByLastName
WHERE Person.lastName = maxWeightByLastName.lastName
  AND Person.weight = maxWeightByLastName.maxWeight
Which results in 14,054 logical IOs, and has a query plan identical to the correlated subquery:
A more subtle reason to avoid the #temp table version is that it can be difficult to tune, because the query optimizer is forced to treat the two operations in isolation.
To show this, suppose I create a non-clustered index on lastName and weight (the CREATE INDEX statement is sketched just after this list). To summarize, there will be two indexes:
- Clustered index on ID (PK_Person)
- Non-Clustered index on lastName and weight (IX_lastName_weight)
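Here is roughly what that index creation looks like (default index options assumed):
CREATE NONCLUSTERED INDEX IX_lastName_weight
    ON Person (lastName, weight);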
Running the #temp table version generates 1,842 IOs in the first SELECT, 1,842 in the second, and 291 IOs against the #temp table, for a total of 3,975. The two query plans look like this:
Note that both query plans use our IX_lastName_weight index - so the index is helpful.
Both the correlated subquery version and the CTE version generate the same query plan:
And this version results in 1,847 logical IOs. In other words, the #temp table version requires more than twice the IOs, and also carries the overhead of creating and dropping the temp table. Of course, this is something of a perfect storm example, but do you really want to take away the ability of the query optimizer to do its job?
So why the difference?
Notice that the temp table version requires two separate scans of the index, while the CTE/subquery version requires only one. The query optimizer is able to accomplish the entire task with a single scan, but the #temp table phrasing forces it to scan twice.
That extra index scan is entirely due to the procedural temp table logic; a single scan would have sufficed.
In summary, I believe that CTEs may be a good option for software developers who tend to think in a procedural manner. In this case, a CTE might feel like a more natural representation than a subquery, and avoid some of the overhead of a temporary table.