Wednesday, February 17, 2016

The EXISTS clause in SQL

A short video on how to use the EXISTS clause in SQL.
For beginners. How and when to use the EXISTS clause to find records in one table that match records in another table. How it compares to JOINing and the IN clause.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 10/27/2015
-- EXISTS

use northwind

-- the EXISTS clause is a logical operator that evaluates a subquery for existence of records
-- if the subquery has one or more records, EXISTS returns true
-- if the subquery has no records, EXISTS returns false

-- since EXISTS returns a logical value,
-- it is common to use EXISTS in the WHERE clause, which expects logical expressions

-- it can make some queries faster
-- and sometimes makes your query clearer

-- here is a simple example

SELECT *
FROM   Products
WHERE  EXISTS ( SELECT 17 )

-- which returns all records in the Products table
-- the logic is that EXISTS is evaluated for each record in the Products table
-- for each record, EXISTS (SELECT 17) is evaluated,
-- and since (SELECT 17) always returns a 1-row, 1-column table, 
-- EXISTS returns true for all records

-- that's not a very practical example

-- here's a very common use of the EXISTS 

SELECT CategoryName
FROM   Categories
WHERE  EXISTS 
         (SELECT *
          FROM   Products
          WHERE  Products.CategoryID = Categories.CategoryID)

-- this is "Categories that have products"
-- 
-- the subquery is a "correlated subquery", 
-- which means it references a table outside the subquery
-- in this case, the correlation is on Categories.CategoryID
-- note that Categories is not in the FROM clause of the subquery

-- you can do this same thing like this:

SELECT   DISTINCT CategoryName
FROM     Categories
   JOIN  Products    ON Products.CategoryID = Categories.CategoryID

-- note that we need to use DISTINCT to remove duplicates

-- so why not always use the JOIN version?

-- well, notice we don't really need to get any 
-- fields from the Products table
-- it's only there to figure out which categories to show

-- consider a case where there are 2 categories
-- and 20 billion products

-- the EXISTS version does this:
---- for category 1
---- look through products until it finds a product in category 1
---- stop
---- for category 2
---- look through products until it finds a product in category 2
---- stop

-- the JOIN w/DISTINCT version does this:
---- join the 2 categories with all 20 billion products
---- remove the 19,999,999,998 duplicates
   
-- so you can see that the EXISTS version can be much faster
-- depending on the situation

-- EXISTS is logically very similar to IN

-- here's another way to do the query above
SELECT CategoryName
FROM   Categories
WHERE  CategoryID IN 
         (SELECT CategoryID
          FROM   Products)

-- reminder of EXISTS
SELECT CategoryName
FROM   Categories
WHERE  EXISTS 
         (SELECT *
          FROM   Products
          WHERE  Products.CategoryID = Categories.CategoryID)

-- the work that the database engine has to do
-- is very similar here
-- the "IN" can stop as soon as it finds the first instance
-- it doesn't need to search through all the products

-- however, NOT EXISTS (and NOT IN) does not usually have the same performance benefit:
SELECT CategoryName
FROM   Categories
WHERE  NOT EXISTS 
         (SELECT *
          FROM   Products
          WHERE  Products.CategoryID = Categories.CategoryID)

-- logically, this will give you the correct records
-- however, proving that something DOES NOT exist means checking all of them
-- in our example above with 2 categories, and 20 billion products
-- assume that category 2 doesn't have any products
-- to prove that category 2 doesn't exist in Products,
-- all 20 billion products need to be checked

-- in summary, EXISTS can sometimes improve the performance 
-- of a query like this, over a JOIN
SELECT CategoryName
FROM   Categories
WHERE  EXISTS 
         (SELECT *
          FROM   Products
          WHERE  Products.CategoryID = Categories.CategoryID)

-- also, EXISTS can sometimes to be used to more clearly
-- phrase a query:
SELECT CategoryName
FROM   Categories
WHERE  EXISTS 
         (SELECT 
          WHERE  Products.CategoryID = Categories.CategoryID)

-- thanks for watching

-- Database by Doug
-- Douglas Kline
-- 10/27/2015
-- EXISTS

The HAVING clause in SQL

A short video covering the HAVING clause in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 2/17/2016
-- HAVING - how to limit which groups are shown

USE Northwind

-- you might want to check out
-- GROUP BY part 1 - introduction
-- GROUP BY part 2 - NULL handling and COUNTs
-- GROUP BY part 3 - unique groupings and multi-column groups

-- a simple query from Products, leaving out any JOINs for clarity

SELECT    CategoryID,
          ProductID,
          ProductName,
          Unitprice
FROM      Products
ORDER BY  CategoryID,
          ProductID

-- now grouping by Category, and calculating average unit price
SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
ORDER BY  CategoryID

-- now suppose that we'd like to only see
-- "high priced categories"

-- we could maybe do this:
SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
ORDER BY  [average unit price] DESC

-- now the higher priced categories are at the top

-- I might even be able to use the non-ANSI-standard
-- TOP clause like this:

SELECT    TOP 5
          CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
ORDER BY  [average unit price] DESC

-- but what if I want a cut-off
-- only show categories whose average unitprice is above $30
-- there's a statement for that: HAVING

SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
HAVING    AVG(Unitprice) > 30   -- this is the new piece
ORDER BY  [average unit price] DESC

-- the new part is the HAVING clause
-- the HAVING clause is similar to the WHERE clause
-- in that it specifies a logical expression 
-- that can be evaluated as true or false

-- however, the logical expression is evaluated 
-- for each *group* (in our case, each group represents a category)

-- with the WHERE clause
-- the logical expression is evaluated
-- for each *record* (in our case, each record represents a product)

-- consider this example
SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
HAVING    COUNT(productID) > 2   -- this is the new piece
ORDER BY  [average unit price] DESC

-- here we are saying, only show categories
-- that have at least two products in them
-- notice that the HAVING doesn't have to be
-- something in the SELECT clause

-- consider this example
SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
HAVING    MIN(unitprice) > 5  -- this is the new piece
ORDER BY  [average unit price] DESC

-- here, only show categories
-- with minimum-priced products greater than $5

-- note that in the last three examples,
-- the average unit price does not change for each category
-- what changes, is which categories are displayed

-- now consider these:
SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
ORDER BY  [average unit price] DESC

SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
WHERE     unitsInStock > 0 -- this is the changed part
GROUP BY  CategoryID
ORDER BY  [average unit price] DESC

-- notice that the average unit prices change
-- and potentially, the order of the rows could change

-- this is because only products with unitsInStock > 0
-- are included in the groups

-- and therefore, the aggregations (AVG, MIN, MAX, COUNT, etc.)
--  change

-- so, here is a good way 
-- to think about the order
-- in which the operations are performed

-- WHERE comes first - records are eliminated before the grouping
-- GROUP BY comes next - remaining records are arranged into groups
-- HAVING comes next - groups are eliminated
-- ORDER BY comes last - what's left gets ordered

-- a larger last example
SELECT   Suppliers.CompanyName,
         SUM(Products.unitprice 
             * Products.unitsInStock
            )        AS [Value of Inventory]
FROM     Products
   JOIN  Suppliers   ON Products.SupplierID = Suppliers.SupplierID
WHERE    Products.CategoryID IN (1, 3)
GROUP BY Suppliers.CompanyName,
         Suppliers.SupplierID
HAVING   COUNT(ProductID) > 1
ORDER BY [Value of Inventory] DESC

-- in english:
-- show the dollar value of inventory, by supplier
-- only include products in categories 1 and 3
-- only include suppliers that supply more than one product
-- order the suppliers by descending value of inventory

-- a few things to note
-- the GROUP BY has a Suppliers.SupplierID to ensure a unique grouping
-- The WHERE clause could be changed without affecting any other part
-- THE HAVING clause could be changed without affecting any other part
-- (of course the results change...)

-- another interesting note: HAVING is not strictly needed
-- you can accomplish a HAVING using a subquery
-- going back to my first example

SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
HAVING    AVG(Unitprice) > 30   
ORDER BY  [average unit price] DESC

-- could equivalently be done 
-- using a subquery in the FROM clause
-- like this:
SELECT   CategoryID,
         [average unit price]         
FROM     (SELECT   CategoryID,
                   AVG(Unitprice) AS [average unit price]
          FROM     Products
          GROUP BY CategoryID) [CategorySubquery]
WHERE    [average unit price] > 30
ORDER BY [average unit price] DESC

-- but the HAVING clause is much clearer
-- especially with situations like one-to-many aggregation joins
-- like the one above:

SELECT   Suppliers.CompanyName,
         SUM(Products.unitprice 
             * Products.unitsInStock
            )        AS [Value of Inventory]
FROM     Products
   JOIN  Suppliers   ON Products.SupplierID = Suppliers.SupplierID
   JOIN  (SELECT   SupplierID,
                   COUNT(ProductID) AS [product count]
          FROM     Products
          WHERE    Products.CategoryID IN (1,3)
          GROUP BY SupplierID
         ) [SupplierGroup]
                     ON Suppliers.SupplierID = SupplierGroup.SupplierID
WHERE    [SupplierGroup].[product count] > 1
GROUP BY Suppliers.CompanyName,
         Suppliers.SupplierID
ORDER BY [Value of Inventory] DESC

-- compared to:
SELECT   Suppliers.CompanyName,
         SUM(Products.unitprice 
             * Products.unitsInStock
            )        AS [Value of Inventory]
FROM     Products
   JOIN  Suppliers   ON Products.SupplierID = Suppliers.SupplierID
WHERE    Products.CategoryID IN (1, 3)
GROUP BY Suppliers.CompanyName,
         Suppliers.SupplierID
HAVING   COUNT(ProductID) > 1
ORDER BY [Value of Inventory] DESC

-- thanks for watching and / or reading

-- Database by Doug
-- Douglas Kline
-- 2/17/2016
-- HAVING - how to limit which groups are shown


Monday, February 15, 2016

Simple Scalar Subqueries

A short video covering simple scalar subqueries in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- 2/15/2016
-- Intro to Simple Scalar Subqueries

use northwind

-- 
SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products

-- suppose we want to show all products 
--  with above-average unit price

-- we can find the average unitprice

SELECT   AVG(unitprice)    AS [average unitprice]
FROM     Products

-- but which products are they?

-- this doesn't work:
SELECT   ProductID,
         AVG(unitprice)   AS [average unitprice]
FROM     Products

-- I'd have to group, which gives this:
SELECT   ProductID,
         AVG(unitprice)   AS [average unitprice]
FROM     Products
GROUP BY ProductID

-- then limit it? that doesn't work?
SELECT   ProductID,
         AVG(unitprice)   AS [average unitprice]
FROM     Products
WHERE    unitprice > AVG(unitprice)
GROUP BY ProductID

-- how about having? darn!
SELECT   ProductID,
         AVG(unitprice)   AS [average unitprice]
FROM     Products
GROUP BY ProductID
HAVING   unitprice > AVG(unitprice)

-- none of those things works
-- so let's look at the average again:
SELECT   AVG(unitprice)    AS [average unitprice]
FROM     Products

-- then use it in a second query
-- value = 28.6501

SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice> 28.6501
ORDER BY unitprice

-- however, it took two steps
-- and in a dynamic environment
-- the average value would have changed
-- in the time between execution of the two SELECT statements

-- what we'd like to do is have these two queries 
-- executed as a single query 

-- well, consider this again
-- notice that it returns a 
-- 1-row, 1-column table

SELECT   AVG(unitprice)    AS [average unitprice]
FROM     Products

-- and convince yourself that it 
-- will *always* return a
-- 1-row, 1-column table

-- even this:
SELECT  AVG(unitprice)     AS [avg of a NULL]
FROM    Products
WHERE   unitprice IS NULL

SELECT  COUNT(productID)     AS [proof of a NULL unitprice]
FROM    Products
WHERE   unitprice IS NULL

-- and this:
SELECT   AVG(unitprice)    AS [average of no records]
FROM     Products
WHERE    1 = 0

-- in SQL, you can 
-- interchange a 1-row 1-col table for a scalar value

-- which is what we need in this query:

SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice> (28.6501) -- scalar value in parens
ORDER BY unitprice


-- so here's the simple scalar subquery
SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice > ( SELECT   AVG(unitprice) 
                       FROM     Products ) 
ORDER BY unitprice

-- notice that this is executed as a single query
-- no delay between calculating the average
-- and producing the record that has the average

-- this is a Simple Scalar Subquery

-- it's a subquery, because it is a fully-formed SELECT statement
-- it's scalar, because it is used in place of a scalar value
-- it's simple, because it can be executed independently of outer query
--  (it's *not* a correlated subquery - topic for another video)

-- a couple of caveats
-- * you can't put an ORDER BY in the subquery
-- * you can't rename columns in the subquery
-- these items don't make much sense int he context of a subquery

-- more examples
-- the idea with these Simple Scalar Subqueries,
-- is that you are SELECTing records that are special relative to the group

-- highest price product
SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice = ( SELECT   MAX(unitprice) 
                       FROM     Products ) 
ORDER BY unitprice

-- lowest price product
SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice = ( SELECT   MIN(unitprice) 
                       FROM     Products ) 
ORDER BY unitprice

-- below average price product
SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice < ( SELECT   AVG(unitprice) 
                       FROM     Products ) 
ORDER BY unitprice

-- products more than two standard deviations 
-- above the average

SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice > ( SELECT   AVG(unitprice)
                              + (2.0 * STDEV(unitprice))
                       FROM     Products ) 
ORDER BY unitprice

-- outlier products
-- products more than two standard deviations
-- away from the mean, above or below

SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice > ( SELECT   AVG(unitprice)
                              + (2.0 * STDEV(unitprice))
                       FROM     Products ) 
    OR   unitprice < ( SELECT   AVG(unitprice)
                              - (2.0 * STDEV(unitprice))
                       FROM     Products ) 
ORDER BY unitprice

-- thanks for watching!

-- Database by Doug
-- 2/15/2016
-- Intro to Simple Scalar Subqueries

Friday, February 12, 2016

Introduction to Outer Joins

A short video covering Outer Joins in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- 2/12/2016
-- Intro to Outer Joins

use northwind

-- refresher
-- a join can be restrictive
-- it can filter records that don't satisfy the join

SELECT COUNT(productID) AS [Number of Products]
FROM   Products
-- 79 products

SELECT   COUNT(products.productID) AS [Number of Products]
FROM     Products
   JOIN  Categories ON Products.CategoryID=Categories.CategoryID

-- 77 products joined with Categories?


-- the "ON Products.CategoryID=Categories.CategoryID"
-- is known as the "join condition"

-- we missed the products with the NULL CategoryID

SELECT   ProductID,
         ProductName,
         CategoryID
FROM     Products
ORDER BY CategoryID

-- in particular, these products don't satisfy the join

SELECT   ProductID,
         ProductName,
         CategoryID
FROM     Products
WHERE    CategoryID IS NULL

-- so we get this, 77 products:
SELECT   Products.ProductID,
         Products.ProductName,
         Products.CategoryID,
         Categories.CategoryID,
         Categories.CategoryName
FROM     Products
   JOIN  Categories ON Products.CategoryID = Categories.CategoryID
ORDER BY Products.ProductID

-- that's why there are 77 instead of 79

-- so what if we want to see that categoryID?
--  the boss says she wants to see 
-- *every* product
-- *even if* it has a NULL value for the categoryID

-- in other words
-- "show *every* product, and if that product has a category
--  associated with it, show me that too"

-- this is where an outer join comes in

-- 
-- the above queries have been using a shorthand
-- they are really INNER JOINS
-- by default "JOIN" means "INNER JOIN"
-- the following two queries are logically equivalent
-- and ordering by categoryID

SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products
       JOIN    Categories ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID
--
SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products
   INNER JOIN  Categories ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID

-- and reformatting ...
-- this is the same as above, just reformatted, more specific
-- this is to get the two tables on the same line
-- and one table is on the left of the "INNER JOIN" phrase
-- and one table is on the right of the "INNER JOIN" phrase

SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products INNER JOIN Categories  
                  ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID

-- now, a small change to see the one product
-- that doesn't satisfy the join
-- (at the bottom)

SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products LEFT JOIN Categories   -- the change is in this line
                  ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID

-- the LEFT JOIN says:
-- "include all records from the table on the left"
-- in this case, Products is on the left, 
-- so every product record is shown, 
-- even if it doesn't satisfy the join condition

-- of course there is a RIGHT JOIN also
-- the next two queries are logically the same

SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products LEFT JOIN Categories 
                  ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID
-- 
SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Categories RIGHT JOIN Products 
                  ON Products.CategoryID = Categories.CategoryID
ORDER BY       Products.ProductID

-- these are actually shorthand for
SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products LEFT OUTER JOIN Categories 
                  ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID

-- but the OUTER keyword is typically omitted

-- ** notice that the ON statement does not change
--    across INNER, LEFT, LEFT OUTER, RIGHT, RIGHT OUTER joins

-- OUTER joins are also known as "inclusive" joins,
-- because they *include* records that don't satisfy the join condition

-- INNER joins are also know as "exclusive" joins,
-- because they *exclude* records that don't satisfy the join condition


-- so the above query finds the product with a NULL
-- categoryID
-- the product record doesn't satisfy the join

-- what about a category that doesn't satisfy the join?

SELECT COUNT(CategoryID) AS [Number of Categories in Categories table]
FROM   Categories
-- ten categories

SELECT COUNT(DISTINCT CategoryID) AS [Number of Categories in Products table]
FROM   Products
-- only eight appear in the products table
-- but which 8?
-- we can't look for a NULL foreign key...

-- with few records, we can inspect the actual records...

SELECT   CategoryID
FROM     Categories
ORDER BY CategoryID

SELECT   DISTINCT CategoryID
FROM     Products
ORDER BY CategoryID

-- but what if there are thousands of categories
-- and millions of products?

-- an outer join..
-- from the other side...
-- "inclusive" of all Categories
SELECT   Categories.CategoryName,
         Categories.CategoryID,
         Products.CategoryID,
         Products.productID,
         Products.productName
FROM     Products RIGHT JOIN Categories 
            ON Products.CategoryID= Categories.CategoryID
ORDER BY Categories.CategoryID,
         Products.ProductID

-- so this includes *every* category, 
-- even the ones that don't satisfy the join
-- and also shows the category/product combinations that do satisfy the join

-- a common use of an outer join is 
-- to find records in one table that don't match
--  records from another table

-- this is sometimes called a "find unmatched" query

-- looking at the previous query again...

SELECT   Categories.CategoryName,
         Categories.CategoryID,
         Products.CategoryID,
         Products.productID,
         Products.productName
FROM     Products RIGHT JOIN Categories 
            ON Products.CategoryID= Categories.CategoryID
ORDER BY Categories.CategoryID,
         Products.ProductID

-- and rephrasing it as a LEFT JOIN...

SELECT   Categories.CategoryName,
         Categories.CategoryID,
         Products.CategoryID,
         Products.productID,
         Products.productName
FROM     Categories LEFT JOIN Products 
            ON Products.CategoryID= Categories.CategoryID
ORDER BY Categories.CategoryID,
         Products.ProductID

-- note the two categories with no matching products
-- they both have a NULL primary key, which can't be, right?
-- which is a sign that there really is no product record
-- so let's limit to only that 
-- to limit to only the categories with no products

SELECT   Categories.CAtegoryName,
         Products.productID,
         Products.productName
FROM     Categories LEFT JOIN  Products
            ON Products.CategoryID= Categories.CategoryID
WHERE    Products.productID IS NULL
ORDER BY Categories.CAtegoryName,
         Products.productName

-- and formatting for readability
SELECT      Categories.CAtegoryName,
            Products.productID,
            Products.productName
FROM        Categories 
  LEFT JOIN Products ON Products.CategoryID = Categories.CategoryID
WHERE       Products.productID IS NULL
ORDER BY    Categories.CategoryName,
            Products.productName

-- and now, a FULL OUTER JOIN
-- this shows *every* category
-- and *every* product
-- even the ones that don't satisfy the join
-- its the left and the right together

SELECT   Categories.CategoryName,
         Categories.CategoryID,
         Products.CategoryID,
         Products.productID,
         Products.productName
FROM     Products FULL OUTER JOIN Categories 
            ON Products.CategoryID= Categories.CategoryID
ORDER BY Categories.CAtegoryName,
         Products.productName

-- and formatted consistently with the above examples

SELECT               Categories.CategoryName,
                     Categories.CategoryID,
                     Products.CategoryID,
                     Products.productID,
                     Products.productName
FROM                 Products 
   FULL OUTER JOIN   Categories ON Products.CategoryID= Categories.CategoryID
ORDER BY             Categories.CAtegoryName,
                     Products.productName


-- Database by Doug
-- 2/12/2016
-- Intro to Outer Joins

Followers