Wednesday, February 10, 2016

Group By Part 3: uniqueness and multi-column grouping

A short video covering the GROUP BY clause in SQL.

Part 3: uniqueness and multi-column grouping
For beginners.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 2/4/2016
-- GROUP BY - Part 3 uniqueness and multi-column Grouping

USE Northwind

-- here's a basic query joining Categories with Products
SELECT   Categories.CategoryName,
         Products.ProductName,
         Products.Unitprice
FROM     Products
  JOIN   Categories    ON Products.CategoryID = Categories.CategoryID
ORDER BY Categories.CategoryName

-- now let's aggregate
-- show the average unit price, by Category

SELECT   Categories.CategoryName,
         AVG(Products.unitprice) AS [average unit price]
FROM     Products
  JOIN   Categories    ON Products.CategoryID = Categories.CategoryID
GROUP BY Categories.CategoryName
ORDER BY Categories.CategoryName

-- awesome! runs fine, looks reasonable...

-- or is it

-- let me add one more item
SELECT   Categories.CategoryName,
         AVG(Products.unitprice) AS [average unit price]
FROM     Products
  JOIN   Categories    ON Products.CategoryID = Categories.CategoryID
GROUP BY Categories.CategoryName,
         Categories.CategoryID -- this is the added piece
ORDER BY Categories.CategoryName

-- notice that now I have an additional group
-- let's add a column to see what's going on
SELECT   Categories.CategoryID,
         Categories.CategoryName,
         AVG(Products.unitprice) AS [average unit price]
FROM     Products
  JOIN   Categories    ON Products.CategoryID = Categories.CategoryID
GROUP BY Categories.CategoryName,
         Categories.CategoryID -- this is the added piece
ORDER BY Categories.CategoryName

-- aha! there are two categories with the same name
-- is that OK? 
-- well, CategoryName is not the primary key
-- and apparently the schema allows duplicate values in the CategoryName

-- so, the lesson here is to always specify a *unique* grouping
-- in general, this means adding a primary key 
-- to the GROUP BY statement, even if it is not displayed

-- for example:
SELECT   Categories.CategoryName,
         AVG(Products.unitprice) AS [average unit price]
FROM     Products
  JOIN   Categories    ON Products.CategoryID = Categories.CategoryID
GROUP BY Categories.CategoryName,
         Categories.CategoryID 
ORDER BY Categories.CategoryName,
         Categories.CategoryID

-- or another example
SELECT   Suppliers.CompanyName,
         AVG(Products.unitprice) AS [average unit price]
FROM     Products
   JOIN  Suppliers      ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Suppliers.CompanyName,
         Suppliers.SupplierID
ORDER BY Suppliers.CompanyName,
         Suppliers.SupplierID

-- notice in the second example that there are no duplicate
-- company names
-- so should we still add the SupplierID to the group by?
-- absolutely
-- your query needs to work on any data that might be in the database
-- not just what's there now

-- now for multi-level grouping
-- consider this query

SELECT   Suppliers.CompanyName,
         Categories.CategoryName,
         Products.ProductName,
         Products.Unitprice
FROM     Suppliers
   JOIN  Products    ON Suppliers.SupplierID =Products.SupplierID
   JOIN  Categories  ON Categories.CategoryID=Products.CategoryID
ORDER BY Suppliers.CompanyName,
         Categories.CategoryName

-- notice that Exotic Liquids supplies us products in two categories
-- same with Forets...

-- so, let's calculate average unitprice by Supplier AND Category

SELECT   Suppliers.CompanyName,
         Categories.CategoryName,
         AVG(Products.Unitprice)  AS [avg unit price]
FROM     Suppliers
   JOIN  Products    ON Suppliers.SupplierID =Products.SupplierID
   JOIN  Categories  ON Categories.CategoryID=Products.CategoryID
GROUP BY Suppliers.CompanyName,
         Categories.CategoryName
ORDER BY Suppliers.CompanyName,
         Categories.CategoryName

-- notice that now Exotic liquids has two groups
-- one for the Beverages category 
-- and one for the Spreads category

-- so now, the grouping is based on the *combination* of the two fields
-- note that the order of the fields in the GROUP BY doesn't matter
-- but for clarity it is best to be consistent with the ORDER BY 
--  (the order of the fields in the ORDER BY *does* matter)

-- and of course, to make sure our groupings are *unique* we should add
-- a unique field to the GROUP BY

SELECT   Suppliers.CompanyName,
         Categories.CategoryName,
         AVG(Products.Unitprice)  AS [avg unit price]
FROM     Suppliers
   JOIN  Products    ON Suppliers.SupplierID =Products.SupplierID
   JOIN  Categories  ON Categories.CategoryID=Products.CategoryID
GROUP BY Suppliers.CompanyName,
         Categories.CategoryName,
         Suppliers.SupplierID,
         Categories.CategoryID
ORDER BY Suppliers.CompanyName,
         Categories.CategoryName,
         Suppliers.SupplierID,
         Categories.CategoryID

-- on a side note
-- it's also important to have a unique ordering
-- so that downstream consumers of your data aren't surprised
-- when the ordering changes

-- note that very different information that is provided if 
-- the column-ordering, and the row-ordering is reversed

SELECT   Categories.CategoryName,
         Suppliers.CompanyName,
         AVG(Products.Unitprice)  AS [avg unit price]
FROM     Suppliers
   JOIN  Products    ON Suppliers.SupplierID =Products.SupplierID
   JOIN  Categories  ON Categories.CategoryID=Products.CategoryID
GROUP BY Categories.CategoryName,
         Suppliers.CompanyName,
         Suppliers.SupplierID,
         Categories.CategoryID
ORDER BY Categories.CategoryName,
         Suppliers.CompanyName,
         Suppliers.SupplierID,
         Categories.CategoryID

-- all the data in the result set is the same as the previous example

-- in the first example, the focus was on the Supplier
-- and secondarily, on the categories

-- in the second example, the focus is on the Category
-- and secondarily, on the Suppliers

-- in summary, make sure that your GROUP BY statement
-- specifies a unique grouping
-- this is generally done by adding the primary key

-- you can specify multiple columns in the GROUP BY
-- then groups are determined by *unique combinations*
-- of the values in the columns

-- finally, merely changing the column ordering
-- and the row ordering can produce quite different
-- information for human use

-- Database by Doug
-- Douglas Kline
-- 2/4/2016
-- GROUP BY - Part 3 uniqueness & multi-column Grouping

No comments:

Post a Comment

Followers