Wednesday, February 10, 2016

GROUP BY Part 1: Introduction

A short video covering the GROUP BY clause in SQL.

For beginners.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 2/4/2016
-- GROUP BY - Part 1 Introduction

USE Northwind

-- here's a simple SELECT statement from the Products table

SELECT   ProductID,
         productname,
         unitprice,
         UnitsInStock
FROM     Products
ORDER BY ProductID

-- notice that I get all the rows, because there is not a WHERE clause

-- if there were 10 million rows, I would see all of them

-- but sometimes I'd like to see a *summary* of the products

-- consider this:

SELECT   AVG(unitprice)  AS [average unit price]
FROM     Products

-- in this case, we have summarized all the prices of all products
-- into a single number -  the average

-- this summarization is known as "aggregating" or "aggregation"
-- we aggregate all the product records into a single group,
-- then we can ask questions about the group

-- what else can we find out about the group?

SELECT   AVG(unitprice)    AS [average unit price],
         MIN(unitprice)    AS [minimum unit price],
         MAX(reorderlevel) AS [maximum reorder level],
         COUNT(productID)  AS [number of products],
         SUM(unitsinStock) AS [total number of units in stock],
         STDEV(unitprice)  AS [standard deviation of unit price],
         VAR(unitprice)    AS [variance of unit price]
FROM     Products

-- we can also include expressions in an aggregate function

SELECT   SUM(unitprice * unitsInStock) AS [total value of inventory]
FROM     Products



-- in the above examples we were aggregating over the whole table
-- but you can limit the aggregation

SELECT   AVG(UnitPrice)     AS [average unit price]
FROM     Products
WHERE    CategoryID IN (2, 4, 6)

-- in this example, the AVG is calculated over 
-- products in the stated categories

-- think of the above example as happening in this order
-- 1 - the WHERE clause - records are eliminated first
-- 2 - the averaging - aggregate calculated over remaining records

-- now consider these two queries:
SELECT   AVG(unitprice) AS [Avg price of category 1 products]
FROM     Products
WHERE    categoryID = 1

SELECT   AVG(unitprice) AS [Avg price of category 2 products]
FROM     Products
WHERE    categoryID = 2

-- wouldn't it be nice to have these in the same table?
-- or a table that shows a list of categories, and the average for each category?
-- like this:
--**********************
--* CategoryID *  Avg  *
--**********************
--*         1  *  31.8 *
--*         2  *  23.1 *
--*         3  *  15.2 *
--*       ...  *  ...  *
--**********************

-- here it is:

SELECT   CategoryID,
         AVG(unitprice) AS [Avg unitprice]
FROM     Products
GROUP BY CategoryID
ORDER BY CategoryID

-- the new part is the GROUP BY clause
-- it has to be before the ORDER BY, and after the WHERE

-- the logic is this:
-- arrange the products into groups, using the CategoryID
-- all CategoryID =1 records go into a group
-- all CategoryID =2 records go into a group
-- etc.
-- now, answer a question about each group - what is the average unit price?

SELECT   CategoryID,
         AVG(unitprice) AS [Avg unitprice]
FROM     Products
GROUP BY CategoryID
ORDER BY CategoryID

-- and once the grouping is established, I can ask other aggregate questions:

SELECT   CategoryID,
         AVG(unitprice)    AS [average unit price],
         MIN(unitprice)    AS [minimum unit price],
         MAX(reorderlevel) AS [maximum reorder level],
         COUNT(productID)  AS [number of products],
         SUM(unitsinStock) AS [total number of units in stock],
         STDEV(unitprice)  AS [standard deviation of unit price],
         VAR(unitprice)    AS [variance of unit price]
FROM     Products
GROUP BY CategoryID
ORDER BY CategoryID
-- 
-- so how can we visualize what's going on?

SELECT   CategoryID
FROM     Products
ORDER BY CategoryID

-- in the above, notice the groupings of
-- the records, by categoryID

-- first two records are one group, with a NULL categoryID
-- next twelve records are the next group, where categoryID=1
-- etc.

-- now group them
SELECT   CategoryID
FROM     Products
GROUP BY CategoryID
ORDER BY CategoryID

-- each of the 9 resulting records represents
-- a group of products that have a common categoryID

-- now there are groups of products
-- one group for each different categoryID
-- (you can also do this with DISTINCT,
--  but that's for another video)

-- now let's add some attributes that we're interested in
SELECT   CategoryID,
         ProductID,
         ProductName,
         Unitprice
FROM     Products
ORDER BY CategoryID,
         ProductID

-- looking at the left-most column,
-- we can see values that will become groups
-- the two NULL categoryIDs will become a group
-- the next twelve records with categoryID=1 will become the next group

-- we're looking to get the average unit price by category
-- so let's pare down to just the columns we need
SELECT   CategoryID,
         Unitprice
FROM     Products
ORDER BY CategoryID

-- so let's "convert" the above into an aggregate
-- and I'm expecting that there will be two products for categoryID NULL
-- and twelve products for categoryID=1

SELECT   CategoryID,
         AVG(unitprice) AS [average unit price] -- the AVG function is new
FROM     Products
GROUP BY CategoryID    -- the GROUP BY is new
ORDER BY CategoryID

-- the changes above are:
-- * add the GROUP BY clause with the CategoryID
-- * add the AVG function to the unitprice

-- in the english language, a common idiom is to ask for
-- "a count of products by category" or
-- "the average unit price by supplier"
-- when someone asks "... by category" 
-- they usually mean that the category should be in the leftmost column
-- and that the records should be GROUP BY'ed category
-- and that the records should be ORDER BY'ed category

-- a couple of examples

-- Manager Bob: "show me the total value of inventory by category"
-- here's what Manager Bob most likely wants (but avoiding JOINs for now):
SELECT      CategoryID,     
            SUM(unitprice * unitsinstock) AS [value of inventory]
FROM        Products
GROUP BY    CategoryID
ORDER BY    CategoryID

-- Manager Barb: "show me the lowest unitprice of products by supplier"
-- here's what Manager Barb most likely wants (but avoiding JOINs again...):

SELECT      SupplierID,
            MIN(unitprice) AS [lowest unitprice]
FROM        Products
GROUP BY    SupplierID
ORDER BY    SupplierID

-- more clearly
-- "by Supplier"
SELECT      SupplierID,    -- supplier in first column
            MIN(unitprice) AS [lowest unitprice]
FROM        Products       
GROUP BY    SupplierID     -- supplier is the basis for grouping
ORDER BY    SupplierID     -- records ordered by supplier



-- throwing in a JOINs for clearer suppliers:
SELECT      Suppliers.CompanyName,
            MIN(Products.unitprice) AS [lowest unitprice]
FROM        Products
   JOIN     Suppliers      ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Suppliers.CompanyName,
            Suppliers.SupplierID   -- this is important, but not covered here
ORDER BY    Suppliers.CompanyName
 

-- Database by Doug
-- Douglas Kline
-- 2/4/2016
-- GROUP BY introduction

No comments:

Post a Comment

Followers