Thursday, June 30, 2016

SQL Coding Style, IMHO

Here's a longish rant/explanation of my personal SQL coding style.

 It's incomplete, but covers basics, and gets a beginner started.

Any fool can write code that a computer can understand. Good programmers write code that humans can understand.
~Martin Fowler
Programs must be written for people to read, and only incidentally for machines to execute.  
~Harold Abelson
-- formatting SQL
-- version 2, with fixed formatting

-- Any fool can write code that a computer can understand. 
-- Good programmers write code that humans can understand. 
-- ~Martin Fowler

-- Programs must be written for people to read, and only incidentally for machines to execute.  
-- ~Harold Abelson

-- here's a slightly long rant/explanation 
-- of my personal coding style

-- So here is something a company might want to know
-- how much of each product did we sell in 1997
-- plus which category and supplier
-- fairly simple and small relative to real-world queries

-- this is produced by this query, directly from the SSMS query designer

SELECT     Products.ProductID, Products.ProductName, Categories.CategoryName, Suppliers.CompanyName, SUM(([Order Details].Quantity * [Order Details].UnitPrice) 
                      * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM         [Order Details] INNER JOIN
                      Products ON [Order Details].ProductID = Products.ProductID INNER JOIN
                      Orders ON [Order Details].OrderID = Orders.OrderID LEFT OUTER JOIN
                      Categories ON Products.CategoryID = Categories.CategoryID LEFT OUTER JOIN
                      Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Products.ProductID, Products.ProductName, Categories.CategoryName, Suppliers.CompanyName, YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY Products.ProductID

-- here are a few questions,
-- can you answer them with a quick scan of the code?
---- which tables are involved?
---- what are the foreign keys for the joins?
---- how many columns will come back?
---- which columns?
---- any outer joins?
---- how many joins?

-- all of the above questions should be *easily* readable from the code
-- but they aren't 
-- so here's some formatting modifications to help
-----------------------

-- vertically align a single item from each list on its own line:
-- --lists are anything with commas. They are typically in the SELECT, 
-- --GROUP BY, and ORDER BY
-- --this makes the elements in the list clear. 
-- --I don't have to depend on spotting commas for list item separation

SELECT  Products.ProductID, 
        Products.ProductName, 
        Categories.CategoryName, 
        Suppliers.CompanyName, 
        SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM         [Order Details] INNER JOIN
                      Products ON [Order Details].ProductID = Products.ProductID INNER JOIN
                      Orders ON [Order Details].OrderID = Orders.OrderID LEFT OUTER JOIN
                      Categories ON Products.CategoryID = Categories.CategoryID LEFT OUTER JOIN
                      Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Products.ProductID, 
         Products.ProductName, 
         Categories.CategoryName, 
         Suppliers.CompanyName, 
         YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY Products.ProductID

-- vertically align the tables in the FROM clause, and lead each with its JOIN
---- for me, this brings the really critical information about the type of joins to the front
---- I won't miss an outer join by accident

SELECT  Products.ProductID, 
        Products.ProductName, 
        Categories.CategoryName, 
        Suppliers.CompanyName, 
        SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM              [Order Details] 
  INNER JOIN      Products ON [Order Details].ProductID = Products.ProductID 
  INNER JOIN      Orders ON [Order Details].OrderID = Orders.OrderID 
  LEFT OUTER JOIN Categories ON Products.CategoryID = Categories.CategoryID 
  LEFT OUTER JOIN Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Products.ProductID, 
         Products.ProductName, 
         Categories.CategoryName, 
         Suppliers.CompanyName, 
         YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY Products.ProductID

-- remove verbose JOIN syntax
---- In general, extra words don't help readability
---- all important information about the type of join is retained
---- also vertically align components of each clause
---- with this, the overall structure/components
---- of the query is/are very apparent


SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY    Products.ProductID

-- align ON clause elements, and space to the right
---- the tables involved become more clear
---- the "details" of the join are less important at first glance, 
---- but are clearly presented if needed

SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders          ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories      ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers       ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY    Products.ProductID

-- break long formulas by the operators
---- note also the separation of parentheses with a space
---- and alignment of the outer beginning and ending parentheses
---- this is judgement call material - every formula is different
---- you may choose to do this differently
---- but clearly, formatting of longer calculations helps readability

SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM( (  [Order Details].Quantity 
                  * [Order Details].UnitPrice) 
                  * (1.0 - [Order Details].Discount)
                ) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders          ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories      ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers       ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY    Products.ProductID

-- move the condition from the HAVING to the WHERE clause
-- and remove verbose/unnecessary parentheses
---- condition in HAVING or WHERE is logically equivalent, but should absolutely be changed
---- this is subtly, but immensely stylistically better
---- because it conveys the fact that YEAR(orderdate) is 
---- a property of each Order record, not of each product group
---- WHERE is for conditions on records
---- HAVING is for conditions on groups
---- It is also potentially performance-improving, because
---- records will be eliminated prior to grouping, not after
---- hopefully the query optimizer catches this, 
---- but why not make sure instead of hope?

SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM( (  [Order Details].Quantity 
                  * [Order Details].UnitPrice) 
                  * (1.0 - [Order Details].Discount)
                ) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders          ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories      ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers       ON Products.SupplierID = Suppliers.SupplierID
WHERE       YEAR(Orders.OrderDate) = 1997
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
ORDER BY    Products.ProductID

-- the above I would say is far superior to the original
-- I can, with a quick scan, answer all the important questions I had above
-- small/typo errors become apparent
-- people new to the data model and codebase can read it

-- I think the above is sufficient, but some would go further
-- these changes I think improve the readability, but may not be worth the time/effort
-- essentially, this is further "pretty-printing" of the code
-- I think it helps in some places, but hurts in others
-- to me, the code looks "choppy", and certain expression don't appear cohesive
-- so because I can't *consistently* do this, I choose not to


SELECT      Products   .ProductID, 
            Products   .ProductName, 
            Categories .CategoryName, 
            Suppliers  .CompanyName, 
            SUM( (  [Order Details].Quantity 
                  * [Order Details].UnitPrice) 
                  * (1.0 - [Order Details].Discount)
                ) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID  = Products   .ProductID 
  JOIN      Orders          ON [Order Details].OrderID    = Orders     .OrderID 
  LEFT JOIN Categories      ON Products       .CategoryID = Categories .CategoryID 
  LEFT JOIN Suppliers       ON Products       .SupplierID = Suppliers  .SupplierID
WHERE       YEAR(Orders.OrderDate) = 1997
GROUP BY    Products   .ProductID, 
            Products   .ProductName, 
            Categories .CategoryName, 
            Suppliers  .CompanyName, 
            YEAR(Orders.OrderDate)
ORDER BY    Products   .ProductID

-- some people will also use table aliases,
-- effectively creating a short "variable name" for each table
-- notice in the example below that I've still been careful with alignment
-- all table aliases are aligned vertially, so I
-- have an easy reference for each "variable name"

SELECT      p.ProductID, 
            p.ProductName, 
            c.CategoryName, 
            s.CompanyName, 
            SUM( (  od.Quantity 
                  * od.UnitPrice) 
                  * (1.0 - od.Discount)
                ) AS totalRevenue
FROM        [Order Details] od
  JOIN      Products        p   ON od.ProductID = p.ProductID 
  JOIN      Orders          o   ON od.OrderID   = o.OrderID 
  LEFT JOIN Categories      c   ON p.CategoryID = c.CategoryID 
  LEFT JOIN Suppliers       s   ON p.SupplierID = s.SupplierID
WHERE       YEAR(o.OrderDate) = 1997
GROUP BY    p.ProductID, 
            p.ProductName, 
            c.CategoryName, 
            s.CompanyName, 
            YEAR(o.OrderDate)
ORDER BY    p.ProductID


-- the main benefits in readability, to me, 
-- is fewer characters
-- and some vertical alignments are easier,
-- since each "variable name" is a single letter
-- it also has benefits in coding, by making "auto-complete" quicker

-- however,
-- there is additional cognitive overhead in translating
-- the "variable name" to the table
-- this gets worse the more tables are involved
-- and if you have 6 tables that all begin with p, you have a problem
-- and if you are not intimately familiar with the database/codebase,
-- your cognitive overhead is very high
-- so, can other people read the code? (see quotes at top)
-- 

-- personally, I'm not a fan of this
-- even with many years (*gasp* decades) of coding
-- when I drop into a new environment/datamodel/codebase
-- it takes me too long to adapt to their "variable names"

Wednesday, June 29, 2016

Simple Recursion in SQL with Common Table Expressions

Simple Recursion in SQL


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 6/29/2016
-- Recursion

USE Northwind

-- data in the world is commonly *hierarchical*
-- this is modeled using a table that relates to itself in a 1-many manner
-- in other words, it has a foreign key that references itself

-- to get records from the entire hierarchy, we need recursion

SELECT *
FROM   Employees

-- focusing on just the columns of interest
SELECT   EmployeeID,
         LastName,
         reportsTo 
FROM     Employees
ORDER BY EmployeeID

-- notice that reportsTo is a foreign key
-- that references the primary key in
-- the same table
-- in other words, this relationship is
--   self-referential

-- maybe better like this:
SELECT   EmployeeID,
         LastName,
         reportsTo AS [employeeID of Boss]
FROM     Employees
ORDER BY EmployeeID

-- or
SELECT   EmployeeID,
         LastName,
         reportsTo AS [bossID]
FROM     Employees
ORDER BY EmployeeID


-- so to join a table to itself,
-- we need to use a table alias

SELECT   Employee.EmployeeID,
         Employee.LastName  AS [LastName],
         Employee.ReportsTo AS [ReportsToID],
         Boss.EmployeeID    AS [BossID],
         Boss.LastName      AS [Boss LastName]
FROM     Employees   AS [Employee]
  JOIN   Employees      AS [Boss]  ON Employee.ReportsTo = Boss.EmployeeID
ORDER BY Employee.EmployeeID

-- notice that we lost someone, employeeID=2, 
-- because their reportsTo was NULL
-- we can get them back with an outer JOIN

SELECT      Employee.EmployeeID,
            Employee.LastName  AS [LastName],
            Employee.ReportsTo AS [ReportsToID],
            Boss.EmployeeID    AS [BossID],
            Boss.LastName      AS [Boss LastName]
FROM        Employees          AS [Employee]
  LEFT JOIN Employees          AS [Boss] ON Employee.ReportsTo= Boss.EmployeeID
ORDER BY    Employee.EmployeeID

-- the above has been from the Employee perspective, and their bosses
-- lets do this the other way
-- Bosses, and their employees

SELECT      Boss.EmployeeID  AS [BossID],
            Boss.LastName  AS [Boss LastName],
            Employee.EmployeeID,
            Employee.LastName AS [LastName],
            Employee.ReportsTo AS [ReportsToID]
FROM        Employees   AS [Employee]
  LEFT JOIN Employees   AS [Boss] ON Employee.ReportsTo= Boss.EmployeeID
ORDER BY    Boss.EmployeeID

-- look closely, the JOIN hasn't changed
-- the only difference is the order of the columns
-- and the ORDER BY statement

-- both of the above only represent 
-- a single level in the hierarchy

-- an employee, and their direct supervisor
-- or a boss, and their direct supervisee

-- look at it another way
SELECT   Boss.EmployeeID    AS [BossID],
         Boss.LastName      AS [Boss LastName],
         COUNT(Employee.EmployeeID) AS [Supervisees]
FROM     Employees AS [Employee]
  JOIN   Employees AS [Boss] ON Employee.ReportsTo= Boss.EmployeeID
GROUP BY Boss.employeeID,
         Boss.LastName
ORDER BY Boss.EmployeeID

-- so this shows that Fuller has 5 *direct* reportees
-- and Buchanan has 3 direct reportees

-- however, Buchanan reportees to Fuller
-- so really, Fuller has 8 reportees "under him"

-- and with more full data, 
-- there could be many levels
-- in the hierarchy

-- to do this, we'll need a
-- Common Table Expression, aka CTE

-- clean version-----------------------------------
WITH EmployeesAndTheirBosses 
  (EmployeeID,
  LastName,
  ReportsTo,
  BossLastName,
  depth)
AS
(
SELECT  EmployeeID,
        LastName,
        ReportsTo,
        LastName,
        0 
FROM    Employees
WHERE   reportsTo IS NULL 

UNION ALL 

SELECT   Employees.EmployeeID,
         Employees.LastName,
         Employees.ReportsTo,
         EmployeesAndTheirBosses.LastName,
         EmployeesAndTheirBosses.depth + 1
FROM     Employees
   JOIN  EmployeesAndTheirBosses ON employees.reportsTo = EmployeesAndTheirBosses.EmployeeID
)

SELECT   *
FROM     EmployeesAndTheirBosses
ORDER BY depth




-- annotated version-----------------------------------

-- CTEs define a table that can be referenced in only
-- the very next statement
-- WITH defines what the table looks like: name and columns
WITH EmployeesAndTheirBosses 
         (EmployeeID,
          LastName,
          ReportsTo,
          BossLastName,
          depth)
AS
-- now define what records go into the CTE
(
-- PART A
SELECT   EmployeeID,
         LastName,
         ReportsTo,
         LastName,
         0 -- the person without a boss has a depth of zero
FROM     Employees
WHERE    reportsTo IS NULL -- no boss!
-- the part above is the base condition
-- it's the employee with no boss

UNION ALL -- the ALL part is important for the recursion to work

-- the part below is the recursion
-- PART B
SELECT   Employees.EmployeeID,
         Employees.LastName,
         Employees.ReportsTo,
         EmployeesAndTheirBosses.LastName,
         EmployeesAndTheirBosses.depth + 1
FROM     Employees
   JOIN  EmployeesAndTheirBosses ON employees.reportsTo = EmployeesAndTheirBosses.EmployeeID
   -- notice that this references the CTE: EmployeesAndTheirBosses
   -- this is the recursion
)
-- now that the CTE is defined, select records from it
-- the CTE only exists for this next SELECT statement
-- in other words, the scope of the CTE is the next statement
SELECT   *
FROM     EmployeesAndTheirBosses
ORDER BY depth

-- I've ordered the records by depth, 
-- so you can see the recursion easily

--EmployeeID  LastName             ReportsTo   BossLastName         depth
------------- -------------------- ----------- -------------------- -----------
--2           Fuller               NULL        Fuller               0
--1           Davolio              2           Fuller               1
--3           Leverling            2           Fuller               1
--4           Peacock              2           Fuller               1
--5           Buchanan             2           Fuller               1
--8           Callahan             2           Fuller               1
--6           Suyama               5           Buchanan             2
--7           King                 5           Buchanan             2
--9           Dodsworth            5           Buchanan             2
--11          Kline                6           Suyama               3
--12          Jones                11          Kline                4
--13          Johnson              11          Kline                4

-- think of the recursion happening this way:

--EmployeeID  LastName             ReportsTo   BossLastName         depth
------------- -------------------- ----------- -------------------- -----------
-- ** base condition ** 
--2           Fuller               NULL        Fuller               0
-- ** first level of recursion, all those who report to level above
--1           Davolio              2           Fuller               1
--3           Leverling            2           Fuller               1
--4           Peacock              2           Fuller               1
--5           Buchanan             2           Fuller               1
--8           Callahan             2           Fuller               1
-- ** second level of recursion, all those who report to first level employees
--6           Suyama               5           Buchanan             2
--7           King                 5           Buchanan             2
--9           Dodsworth            5           Buchanan             2
-- ** third level of recursion, all those who report to 2nd level employees
--11          Kline                6           Suyama               3
-- ** fourth level of recursion, all those who report to 3rd level employees
--12          Jones                11          Kline                4
--13          Johnson              11          Kline                4
-- ** fifth level of recursion, all those who report to 4th level employees
-- ** no one

-- STEP 1: Base condition
-- The base condition is from PART A
-- it returns only fuller
-- Fuller is the person with no boss; NULL for reportsTo
-- depth=0 by decision - we could use any number, zero makes sense

-- STEP 2: First level of recursion
-- This joins Part A and Part B
-- So it joins Fuller with all people who reportTo Fuller
-- These are the records with depth= 1
-- depth = depth of fuller plus 1

-- STEP 3: Second level of recursion
-- this Joins First level of recursion with all Employees
-- So it joins Davolio thru Callahan with all employees who report to them
-- These are the records with depth = 2
-- depth = depth of Davolio thru Callahan plus 1

-- STEP 4: Third level of recursion
-- this Joins second level of recursion with all Employees
-- this Joins Suyama thru Dodsworth with employees who report to them
-- These are the records with depth = 3

-- etc.

-- in this case, we only have 4 levels
-- but this would continue if there were more levels

-- Database by Doug
-- Douglas Kline
-- 6/29/2016
-- Recursion

Followers