Wednesday, January 28, 2015

Always define a Unique Grouping

I've written before about providing a unique ordering. The idea being that downstream systems and developers could begin to depend on records always coming back in the same order. Without having something truly unique in your ORDER BY clause (like a primary key), results could be coming back in different orders.

When using GROUP BY, its even more important to ensure that the grouping is always going to be unique.

Let's look at a fairly obvious example using the Products table from the Northwind database.


Consider this statement:

    SELECT   ProductName,
             AVG(unitPrice) AS [AvgUnitPrice]
    FROM     Products
    GROUP BY ProductName
    ORDER BY ProductName


This is perfectly valid syntax. The resulting records will look fine also.

However, there is a chance that two (or five) products will have the same name, and will get placed into the same group. In this case, the results of this statement are not correct.

The safest thing to do, for this query is to add the ProductID to the GROUP BY clause like this:

    SELECT   ProductName,
             AVG(unitPrice) AS [AvgUnitPrice]
    FROM     Products
    GROUP BY ProductName,
             ProductID
    ORDER BY ProductName


This ensures that the grouping is correct, and that the averages will be correct. However, it still has the problem that the ordering is now not necessarily unique. (As addressed here.)

To also provide a unique ordering, you'd add the ProductID to the ORDER BY also:

    SELECT   ProductName,
             AVG(unitPrice) AS [AvgUnitPrice]
    FROM     Products
    GROUP BY ProductName,
             ProductID
    ORDER BY ProductName,
             ProductID





Wednesday, January 21, 2015

Always define a unique ordering

Your SELECT statements should always have a unique ordering.

This ensures that "downstream" systems/code that consumes the data doesn't assume (wrongly) that the data is always in the same order.

Let me describe this using the Products table from the MS Northwind database.


Consider this SQL:

SELECT   ProductName
FROM     Products
ORDER BY ProductName

This is syntactically correct SQL, and we'll assume that it returns the correct records.

If all the products have a different ProductName, this will provide a unique ordering of the records. In other words, each time the SELECT statement is executed, the records come back in the same order.

However, in a dynamic environment, with a large number of records, it is likely that two products could have the same ProductName, e.g., "Socks", "Lipstick", "Hammer". (Unless uniqueness is enforced on ProductName.)

The danger here is that downstream systems or consumers of the data from this SELECT statement might actually depend on the records always being in the exact same order. And when they are not, things can go wrong.

So why would a programmer downstream assume that the records are in order? Well, why not? At first glance, they look like they are in order. But with a million records that are changing, how could the programmer be sure? And why sort a million records that look like they are already in order?

A simple safe SQL practice is to always specify a unique ordering. In this case, adding ProductID to the ORDER BY statement would be easy:

SELECT   ProductName
FROM     Products
ORDER BY ProductName, ProductID

This is not likely to be noticed by anyone, and may never make a difference. But it can prevent future errors.

By the way, I encountered a very similar situation on a large software project. The database people said the data was sorted, the programmers wrote their code to depend on the sort. And visual inspection of the data made us think it was sorted. We ended up noticing the changing order of records as we were stepping through code in a debugger. In other words, we were convinced it was a logic error in the code.

Followers