Thursday, June 30, 2016

SQL Coding Style, IMHO

Here's a longish rant/explanation of my personal SQL coding style.

 It's incomplete, but covers basics, and gets a beginner started.

Any fool can write code that a computer can understand. Good programmers write code that humans can understand.
~Martin Fowler
Programs must be written for people to read, and only incidentally for machines to execute.  
~Harold Abelson
-- formatting SQL
-- version 2, with fixed formatting

-- Any fool can write code that a computer can understand. 
-- Good programmers write code that humans can understand. 
-- ~Martin Fowler

-- Programs must be written for people to read, and only incidentally for machines to execute.  
-- ~Harold Abelson

-- here's a slightly long rant/explanation 
-- of my personal coding style

-- So here is something a company might want to know
-- how much of each product did we sell in 1997
-- plus which category and supplier
-- fairly simple and small relative to real-world queries

-- this is produced by this query, directly from the SSMS query designer

SELECT     Products.ProductID, Products.ProductName, Categories.CategoryName, Suppliers.CompanyName, SUM(([Order Details].Quantity * [Order Details].UnitPrice) 
                      * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM         [Order Details] INNER JOIN
                      Products ON [Order Details].ProductID = Products.ProductID INNER JOIN
                      Orders ON [Order Details].OrderID = Orders.OrderID LEFT OUTER JOIN
                      Categories ON Products.CategoryID = Categories.CategoryID LEFT OUTER JOIN
                      Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Products.ProductID, Products.ProductName, Categories.CategoryName, Suppliers.CompanyName, YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY Products.ProductID

-- here are a few questions,
-- can you answer them with a quick scan of the code?
---- which tables are involved?
---- what are the foreign keys for the joins?
---- how many columns will come back?
---- which columns?
---- any outer joins?
---- how many joins?

-- all of the above questions should be *easily* readable from the code
-- but they aren't 
-- so here's some formatting modifications to help
-----------------------

-- vertically align a single item from each list on its own line:
-- --lists are anything with commas. They are typically in the SELECT, 
-- --GROUP BY, and ORDER BY
-- --this makes the elements in the list clear. 
-- --I don't have to depend on spotting commas for list item separation

SELECT  Products.ProductID, 
        Products.ProductName, 
        Categories.CategoryName, 
        Suppliers.CompanyName, 
        SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM         [Order Details] INNER JOIN
                      Products ON [Order Details].ProductID = Products.ProductID INNER JOIN
                      Orders ON [Order Details].OrderID = Orders.OrderID LEFT OUTER JOIN
                      Categories ON Products.CategoryID = Categories.CategoryID LEFT OUTER JOIN
                      Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Products.ProductID, 
         Products.ProductName, 
         Categories.CategoryName, 
         Suppliers.CompanyName, 
         YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY Products.ProductID

-- vertically align the tables in the FROM clause, and lead each with its JOIN
---- for me, this brings the really critical information about the type of joins to the front
---- I won't miss an outer join by accident

SELECT  Products.ProductID, 
        Products.ProductName, 
        Categories.CategoryName, 
        Suppliers.CompanyName, 
        SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM              [Order Details] 
  INNER JOIN      Products ON [Order Details].ProductID = Products.ProductID 
  INNER JOIN      Orders ON [Order Details].OrderID = Orders.OrderID 
  LEFT OUTER JOIN Categories ON Products.CategoryID = Categories.CategoryID 
  LEFT OUTER JOIN Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Products.ProductID, 
         Products.ProductName, 
         Categories.CategoryName, 
         Suppliers.CompanyName, 
         YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY Products.ProductID

-- remove verbose JOIN syntax
---- In general, extra words don't help readability
---- all important information about the type of join is retained
---- also vertically align components of each clause
---- with this, the overall structure/components
---- of the query is/are very apparent


SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY    Products.ProductID

-- align ON clause elements, and space to the right
---- the tables involved become more clear
---- the "details" of the join are less important at first glance, 
---- but are clearly presented if needed

SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders          ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories      ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers       ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY    Products.ProductID

-- break long formulas by the operators
---- note also the separation of parentheses with a space
---- and alignment of the outer beginning and ending parentheses
---- this is judgement call material - every formula is different
---- you may choose to do this differently
---- but clearly, formatting of longer calculations helps readability

SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM( (  [Order Details].Quantity 
                  * [Order Details].UnitPrice) 
                  * (1.0 - [Order Details].Discount)
                ) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders          ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories      ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers       ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY    Products.ProductID

-- move the condition from the HAVING to the WHERE clause
-- and remove verbose/unnecessary parentheses
---- condition in HAVING or WHERE is logically equivalent, but should absolutely be changed
---- this is subtly, but immensely stylistically better
---- because it conveys the fact that YEAR(orderdate) is 
---- a property of each Order record, not of each product group
---- WHERE is for conditions on records
---- HAVING is for conditions on groups
---- It is also potentially performance-improving, because
---- records will be eliminated prior to grouping, not after
---- hopefully the query optimizer catches this, 
---- but why not make sure instead of hope?

SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM( (  [Order Details].Quantity 
                  * [Order Details].UnitPrice) 
                  * (1.0 - [Order Details].Discount)
                ) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders          ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories      ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers       ON Products.SupplierID = Suppliers.SupplierID
WHERE       YEAR(Orders.OrderDate) = 1997
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
ORDER BY    Products.ProductID

-- the above I would say is far superior to the original
-- I can, with a quick scan, answer all the important questions I had above
-- small/typo errors become apparent
-- people new to the data model and codebase can read it

-- I think the above is sufficient, but some would go further
-- these changes I think improve the readability, but may not be worth the time/effort
-- essentially, this is further "pretty-printing" of the code
-- I think it helps in some places, but hurts in others
-- to me, the code looks "choppy", and certain expression don't appear cohesive
-- so because I can't *consistently* do this, I choose not to


SELECT      Products   .ProductID, 
            Products   .ProductName, 
            Categories .CategoryName, 
            Suppliers  .CompanyName, 
            SUM( (  [Order Details].Quantity 
                  * [Order Details].UnitPrice) 
                  * (1.0 - [Order Details].Discount)
                ) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID  = Products   .ProductID 
  JOIN      Orders          ON [Order Details].OrderID    = Orders     .OrderID 
  LEFT JOIN Categories      ON Products       .CategoryID = Categories .CategoryID 
  LEFT JOIN Suppliers       ON Products       .SupplierID = Suppliers  .SupplierID
WHERE       YEAR(Orders.OrderDate) = 1997
GROUP BY    Products   .ProductID, 
            Products   .ProductName, 
            Categories .CategoryName, 
            Suppliers  .CompanyName, 
            YEAR(Orders.OrderDate)
ORDER BY    Products   .ProductID

-- some people will also use table aliases,
-- effectively creating a short "variable name" for each table
-- notice in the example below that I've still been careful with alignment
-- all table aliases are aligned vertially, so I
-- have an easy reference for each "variable name"

SELECT      p.ProductID, 
            p.ProductName, 
            c.CategoryName, 
            s.CompanyName, 
            SUM( (  od.Quantity 
                  * od.UnitPrice) 
                  * (1.0 - od.Discount)
                ) AS totalRevenue
FROM        [Order Details] od
  JOIN      Products        p   ON od.ProductID = p.ProductID 
  JOIN      Orders          o   ON od.OrderID   = o.OrderID 
  LEFT JOIN Categories      c   ON p.CategoryID = c.CategoryID 
  LEFT JOIN Suppliers       s   ON p.SupplierID = s.SupplierID
WHERE       YEAR(o.OrderDate) = 1997
GROUP BY    p.ProductID, 
            p.ProductName, 
            c.CategoryName, 
            s.CompanyName, 
            YEAR(o.OrderDate)
ORDER BY    p.ProductID


-- the main benefits in readability, to me, 
-- is fewer characters
-- and some vertical alignments are easier,
-- since each "variable name" is a single letter
-- it also has benefits in coding, by making "auto-complete" quicker

-- however,
-- there is additional cognitive overhead in translating
-- the "variable name" to the table
-- this gets worse the more tables are involved
-- and if you have 6 tables that all begin with p, you have a problem
-- and if you are not intimately familiar with the database/codebase,
-- your cognitive overhead is very high
-- so, can other people read the code? (see quotes at top)
-- 

-- personally, I'm not a fan of this
-- even with many years (*gasp* decades) of coding
-- when I drop into a new environment/datamodel/codebase
-- it takes me too long to adapt to their "variable names"

No comments:

Post a Comment

Followers