Wednesday, February 17, 2016

Fixing Amazon Music Player blocked by Group Policy

A few months ago, a new Active Directory Group Policy was rolled out to my network in response to some nasty viruses going around.

And my Amazon Music Player stopped working, with a message:

C:\Users\...\AppData\Local\Amazon Music\Amazon Music.exe
This program is blocked by group policy. For more information, contact your system administrator.

Error message for Amazon Music blocked by Group Policy


Not sure of the technical details, but the Group Policy effectively blocked any exe files from executing in the \users\AppData\local folder. And Amazon Music Player installs, by default into this folder.

I did some google searches, and found some complaints by system administrators. I also found some information on how to more selectively block executables, rather than block the entire folder. I also submitted a ticket to my help desk to see if they could unblock this particular executable, since I *must* work to music. But they declined, citing security concerns. They've got more important things to worry about.

So months went by, while I made due with Amazon's web-based player. However, I have also been unable to upload my large music library into Amazon's Music Library. The only way to do it is through Amazon Music Player, which was now blocked by the group policy.

And since I recently gave in to a subscription for the Amazon Music Library, I needed to get this working.

So here's what I did. I copied the entire folder containing the executable to another location. This is the folder I copied:
 C:\Users\...\AppData\Local\Amazon Music\
I copied it to a spare drive I had, so it ended up here:
H:\Amazon Music\
Then I double-clicked on the executable here:
H:\Amazon Music\Amazon Music.exe
 Worked fine. So I put a shortcut to this copy of the executable on my desktop, and that's what I'm using.

A couple of caveats.


  • Copying just the executable did not work. 
    • Apparently there are some dll files whose relative paths must be maintained.
  • The cached files remain in the original location in appdata\local
    • I noticed that there are cache folders (that grow quite large) contained in the \Amazon Music\data folder. I cleared them, but even though I'm running the executable from a different location, the cache files still show up in the \Amazon Music\data folder.
Well, this got me running. I imagine it is not a permanent or ideal solution, but I can upload my mp3s into my Amazon Library now.


Getting your music off Amazon cloud drive

A few years ago, I had a free trial of Amazon Music, but wasn't ready to make the commitment to the paid version. However, I had moved a ton of music up into Amazon Music.

When I didn't sign up for the paid subscription to Amazon Music, they moved all but 250(?) uploaded songs into my Amazon Cloud Drive. They neatly placed it into a folder called Archived Music.

Well, since then I've gone ahead and committed to the paid subscription for Amazon Music. However, most of my music isn't in my Amazon Music library, it's in my Amazon Cloud Drive. You'd think there would be an easy way to move it back, but I couldn't find one. 

So, how to get my 10GB of mp3 files (that I painstakingly ripped from my CD collection) out of my cloud drive and into my Amazon Music library?

Unfortunately, there's a cap on downloads. Based on internet searches, it seems that you can only download 500 items at a time. So I couldn't select my entire Archived Music folder, since it had thousands of files.

Well, I couldn't find a solution online, so this is how I did it.

The music was arranged in the Archived Music folder by artist, so I first sorted by the folder names.

Then I selected the first 10 folders, and downloaded them. It was somewhere between 200 and 300 items, well under the cap.

Selecting and downloading artist folders from Amazon Cloud Drive

The download arrives as a zip file called "AmazonCloudDriveDownload.zip". 

Then, I selected the next set of artists and downloaded them. It arrived in my download folder as "AmazonCloudDriveDownload(1).zip".

How the zip files end up in your downloads folder


I was able to get 4 downloads going simultaneously, and going from three concurrent downloads to four didn't seem to slow any of them down. I probably could have done more simultaneously, but I didn't push it.

Depending on my artists, and the bit rate at which I ripped the mp3 files, the zip files were anywhere between 200MB and 2.2GB. It took 21 downloads to get all the files. Not sure if there was any throttling along the way, by Amazon or by my service provider. 

Once I got them all downloaded, I unzipped all the artist folders into a common folder. I wanted to make sure I didn't miss any artist folders, so I pulled up my local artists folders and compared side-by-side with the folders on the cloud drive. I did find that I had missed a whole download somehow, so I was glad I double-checked.
Comparing my local artist folders with those on Amazon Cloud Drive

 I was able to get all of them downloaded in an afternoon, on a pretty fast connection. All in all it was not as painful as I thought it would be.

Next step is to upload them into my Amazon Music Library. I'll let you know how that goes.

The EXISTS clause in SQL

A short video on how to use the EXISTS clause in SQL.
For beginners. How and when to use the EXISTS clause to find records in one table that match records in another table. How it compares to JOINing and the IN clause.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 10/27/2015
-- EXISTS

use northwind

-- the EXISTS clause is a logical operator that evaluates a subquery for existence of records
-- if the subquery has one or more records, EXISTS returns true
-- if the subquery has no records, EXISTS returns false

-- since EXISTS returns a logical value,
-- it is common to use EXISTS in the WHERE clause, which expects logical expressions

-- it can make some queries faster
-- and sometimes makes your query clearer

-- here is a simple example

SELECT *
FROM   Products
WHERE  EXISTS ( SELECT 17 )

-- which returns all records in the Products table
-- the logic is that EXISTS is evaluated for each record in the Products table
-- for each record, EXISTS (SELECT 17) is evaluated,
-- and since (SELECT 17) always returns a 1-row, 1-column table, 
-- EXISTS returns true for all records

-- that's not a very practical example

-- here's a very common use of the EXISTS 

SELECT CategoryName
FROM   Categories
WHERE  EXISTS 
         (SELECT *
          FROM   Products
          WHERE  Products.CategoryID = Categories.CategoryID)

-- this is "Categories that have products"
-- 
-- the subquery is a "correlated subquery", 
-- which means it references a table outside the subquery
-- in this case, the correlation is on Categories.CategoryID
-- note that Categories is not in the FROM clause of the subquery

-- you can do this same thing like this:

SELECT   DISTINCT CategoryName
FROM     Categories
   JOIN  Products    ON Products.CategoryID = Categories.CategoryID

-- note that we need to use DISTINCT to remove duplicates

-- so why not always use the JOIN version?

-- well, notice we don't really need to get any 
-- fields from the Products table
-- it's only there to figure out which categories to show

-- consider a case where there are 2 categories
-- and 20 billion products

-- the EXISTS version does this:
---- for category 1
---- look through products until it finds a product in category 1
---- stop
---- for category 2
---- look through products until it finds a product in category 2
---- stop

-- the JOIN w/DISTINCT version does this:
---- join the 2 categories with all 20 billion products
---- remove the 19,999,999,998 duplicates
   
-- so you can see that the EXISTS version can be much faster
-- depending on the situation

-- EXISTS is logically very similar to IN

-- here's another way to do the query above
SELECT CategoryName
FROM   Categories
WHERE  CategoryID IN 
         (SELECT CategoryID
          FROM   Products)

-- reminder of EXISTS
SELECT CategoryName
FROM   Categories
WHERE  EXISTS 
         (SELECT *
          FROM   Products
          WHERE  Products.CategoryID = Categories.CategoryID)

-- the work that the database engine has to do
-- is very similar here
-- the "IN" can stop as soon as it finds the first instance
-- it doesn't need to search through all the products

-- however, NOT EXISTS (and NOT IN) does not usually have the same performance benefit:
SELECT CategoryName
FROM   Categories
WHERE  NOT EXISTS 
         (SELECT *
          FROM   Products
          WHERE  Products.CategoryID = Categories.CategoryID)

-- logically, this will give you the correct records
-- however, proving that something DOES NOT exist means checking all of them
-- in our example above with 2 categories, and 20 billion products
-- assume that category 2 doesn't have any products
-- to prove that category 2 doesn't exist in Products,
-- all 20 billion products need to be checked

-- in summary, EXISTS can sometimes improve the performance 
-- of a query like this, over a JOIN
SELECT CategoryName
FROM   Categories
WHERE  EXISTS 
         (SELECT *
          FROM   Products
          WHERE  Products.CategoryID = Categories.CategoryID)

-- also, EXISTS can sometimes to be used to more clearly
-- phrase a query:
SELECT CategoryName
FROM   Categories
WHERE  EXISTS 
         (SELECT 
          WHERE  Products.CategoryID = Categories.CategoryID)

-- thanks for watching

-- Database by Doug
-- Douglas Kline
-- 10/27/2015
-- EXISTS

The HAVING clause in SQL

A short video covering the HAVING clause in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 2/17/2016
-- HAVING - how to limit which groups are shown

USE Northwind

-- you might want to check out
-- GROUP BY part 1 - introduction
-- GROUP BY part 2 - NULL handling and COUNTs
-- GROUP BY part 3 - unique groupings and multi-column groups

-- a simple query from Products, leaving out any JOINs for clarity

SELECT    CategoryID,
          ProductID,
          ProductName,
          Unitprice
FROM      Products
ORDER BY  CategoryID,
          ProductID

-- now grouping by Category, and calculating average unit price
SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
ORDER BY  CategoryID

-- now suppose that we'd like to only see
-- "high priced categories"

-- we could maybe do this:
SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
ORDER BY  [average unit price] DESC

-- now the higher priced categories are at the top

-- I might even be able to use the non-ANSI-standard
-- TOP clause like this:

SELECT    TOP 5
          CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
ORDER BY  [average unit price] DESC

-- but what if I want a cut-off
-- only show categories whose average unitprice is above $30
-- there's a statement for that: HAVING

SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
HAVING    AVG(Unitprice) > 30   -- this is the new piece
ORDER BY  [average unit price] DESC

-- the new part is the HAVING clause
-- the HAVING clause is similar to the WHERE clause
-- in that it specifies a logical expression 
-- that can be evaluated as true or false

-- however, the logical expression is evaluated 
-- for each *group* (in our case, each group represents a category)

-- with the WHERE clause
-- the logical expression is evaluated
-- for each *record* (in our case, each record represents a product)

-- consider this example
SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
HAVING    COUNT(productID) > 2   -- this is the new piece
ORDER BY  [average unit price] DESC

-- here we are saying, only show categories
-- that have at least two products in them
-- notice that the HAVING doesn't have to be
-- something in the SELECT clause

-- consider this example
SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
HAVING    MIN(unitprice) > 5  -- this is the new piece
ORDER BY  [average unit price] DESC

-- here, only show categories
-- with minimum-priced products greater than $5

-- note that in the last three examples,
-- the average unit price does not change for each category
-- what changes, is which categories are displayed

-- now consider these:
SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
ORDER BY  [average unit price] DESC

SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
WHERE     unitsInStock > 0 -- this is the changed part
GROUP BY  CategoryID
ORDER BY  [average unit price] DESC

-- notice that the average unit prices change
-- and potentially, the order of the rows could change

-- this is because only products with unitsInStock > 0
-- are included in the groups

-- and therefore, the aggregations (AVG, MIN, MAX, COUNT, etc.)
--  change

-- so, here is a good way 
-- to think about the order
-- in which the operations are performed

-- WHERE comes first - records are eliminated before the grouping
-- GROUP BY comes next - remaining records are arranged into groups
-- HAVING comes next - groups are eliminated
-- ORDER BY comes last - what's left gets ordered

-- a larger last example
SELECT   Suppliers.CompanyName,
         SUM(Products.unitprice 
             * Products.unitsInStock
            )        AS [Value of Inventory]
FROM     Products
   JOIN  Suppliers   ON Products.SupplierID = Suppliers.SupplierID
WHERE    Products.CategoryID IN (1, 3)
GROUP BY Suppliers.CompanyName,
         Suppliers.SupplierID
HAVING   COUNT(ProductID) > 1
ORDER BY [Value of Inventory] DESC

-- in english:
-- show the dollar value of inventory, by supplier
-- only include products in categories 1 and 3
-- only include suppliers that supply more than one product
-- order the suppliers by descending value of inventory

-- a few things to note
-- the GROUP BY has a Suppliers.SupplierID to ensure a unique grouping
-- The WHERE clause could be changed without affecting any other part
-- THE HAVING clause could be changed without affecting any other part
-- (of course the results change...)

-- another interesting note: HAVING is not strictly needed
-- you can accomplish a HAVING using a subquery
-- going back to my first example

SELECT    CategoryID,
          AVG(Unitprice) AS [average unit price]
FROM      Products
GROUP BY  CategoryID
HAVING    AVG(Unitprice) > 30   
ORDER BY  [average unit price] DESC

-- could equivalently be done 
-- using a subquery in the FROM clause
-- like this:
SELECT   CategoryID,
         [average unit price]         
FROM     (SELECT   CategoryID,
                   AVG(Unitprice) AS [average unit price]
          FROM     Products
          GROUP BY CategoryID) [CategorySubquery]
WHERE    [average unit price] > 30
ORDER BY [average unit price] DESC

-- but the HAVING clause is much clearer
-- especially with situations like one-to-many aggregation joins
-- like the one above:

SELECT   Suppliers.CompanyName,
         SUM(Products.unitprice 
             * Products.unitsInStock
            )        AS [Value of Inventory]
FROM     Products
   JOIN  Suppliers   ON Products.SupplierID = Suppliers.SupplierID
   JOIN  (SELECT   SupplierID,
                   COUNT(ProductID) AS [product count]
          FROM     Products
          WHERE    Products.CategoryID IN (1,3)
          GROUP BY SupplierID
         ) [SupplierGroup]
                     ON Suppliers.SupplierID = SupplierGroup.SupplierID
WHERE    [SupplierGroup].[product count] > 1
GROUP BY Suppliers.CompanyName,
         Suppliers.SupplierID
ORDER BY [Value of Inventory] DESC

-- compared to:
SELECT   Suppliers.CompanyName,
         SUM(Products.unitprice 
             * Products.unitsInStock
            )        AS [Value of Inventory]
FROM     Products
   JOIN  Suppliers   ON Products.SupplierID = Suppliers.SupplierID
WHERE    Products.CategoryID IN (1, 3)
GROUP BY Suppliers.CompanyName,
         Suppliers.SupplierID
HAVING   COUNT(ProductID) > 1
ORDER BY [Value of Inventory] DESC

-- thanks for watching and / or reading

-- Database by Doug
-- Douglas Kline
-- 2/17/2016
-- HAVING - how to limit which groups are shown


Monday, February 15, 2016

Simple Scalar Subqueries

A short video covering simple scalar subqueries in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- 2/15/2016
-- Intro to Simple Scalar Subqueries

use northwind

-- 
SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products

-- suppose we want to show all products 
--  with above-average unit price

-- we can find the average unitprice

SELECT   AVG(unitprice)    AS [average unitprice]
FROM     Products

-- but which products are they?

-- this doesn't work:
SELECT   ProductID,
         AVG(unitprice)   AS [average unitprice]
FROM     Products

-- I'd have to group, which gives this:
SELECT   ProductID,
         AVG(unitprice)   AS [average unitprice]
FROM     Products
GROUP BY ProductID

-- then limit it? that doesn't work?
SELECT   ProductID,
         AVG(unitprice)   AS [average unitprice]
FROM     Products
WHERE    unitprice > AVG(unitprice)
GROUP BY ProductID

-- how about having? darn!
SELECT   ProductID,
         AVG(unitprice)   AS [average unitprice]
FROM     Products
GROUP BY ProductID
HAVING   unitprice > AVG(unitprice)

-- none of those things works
-- so let's look at the average again:
SELECT   AVG(unitprice)    AS [average unitprice]
FROM     Products

-- then use it in a second query
-- value = 28.6501

SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice> 28.6501
ORDER BY unitprice

-- however, it took two steps
-- and in a dynamic environment
-- the average value would have changed
-- in the time between execution of the two SELECT statements

-- what we'd like to do is have these two queries 
-- executed as a single query 

-- well, consider this again
-- notice that it returns a 
-- 1-row, 1-column table

SELECT   AVG(unitprice)    AS [average unitprice]
FROM     Products

-- and convince yourself that it 
-- will *always* return a
-- 1-row, 1-column table

-- even this:
SELECT  AVG(unitprice)     AS [avg of a NULL]
FROM    Products
WHERE   unitprice IS NULL

SELECT  COUNT(productID)     AS [proof of a NULL unitprice]
FROM    Products
WHERE   unitprice IS NULL

-- and this:
SELECT   AVG(unitprice)    AS [average of no records]
FROM     Products
WHERE    1 = 0

-- in SQL, you can 
-- interchange a 1-row 1-col table for a scalar value

-- which is what we need in this query:

SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice> (28.6501) -- scalar value in parens
ORDER BY unitprice


-- so here's the simple scalar subquery
SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice > ( SELECT   AVG(unitprice) 
                       FROM     Products ) 
ORDER BY unitprice

-- notice that this is executed as a single query
-- no delay between calculating the average
-- and producing the record that has the average

-- this is a Simple Scalar Subquery

-- it's a subquery, because it is a fully-formed SELECT statement
-- it's scalar, because it is used in place of a scalar value
-- it's simple, because it can be executed independently of outer query
--  (it's *not* a correlated subquery - topic for another video)

-- a couple of caveats
-- * you can't put an ORDER BY in the subquery
-- * you can't rename columns in the subquery
-- these items don't make much sense int he context of a subquery

-- more examples
-- the idea with these Simple Scalar Subqueries,
-- is that you are SELECTing records that are special relative to the group

-- highest price product
SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice = ( SELECT   MAX(unitprice) 
                       FROM     Products ) 
ORDER BY unitprice

-- lowest price product
SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice = ( SELECT   MIN(unitprice) 
                       FROM     Products ) 
ORDER BY unitprice

-- below average price product
SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice < ( SELECT   AVG(unitprice) 
                       FROM     Products ) 
ORDER BY unitprice

-- products more than two standard deviations 
-- above the average

SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice > ( SELECT   AVG(unitprice)
                              + (2.0 * STDEV(unitprice))
                       FROM     Products ) 
ORDER BY unitprice

-- outlier products
-- products more than two standard deviations
-- away from the mean, above or below

SELECT   ProductID,
         ProductName,
         UnitPrice
FROM     Products
WHERE    unitprice > ( SELECT   AVG(unitprice)
                              + (2.0 * STDEV(unitprice))
                       FROM     Products ) 
    OR   unitprice < ( SELECT   AVG(unitprice)
                              - (2.0 * STDEV(unitprice))
                       FROM     Products ) 
ORDER BY unitprice

-- thanks for watching!

-- Database by Doug
-- 2/15/2016
-- Intro to Simple Scalar Subqueries

Friday, February 12, 2016

Introduction to Outer Joins

A short video covering Outer Joins in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- 2/12/2016
-- Intro to Outer Joins

use northwind

-- refresher
-- a join can be restrictive
-- it can filter records that don't satisfy the join

SELECT COUNT(productID) AS [Number of Products]
FROM   Products
-- 79 products

SELECT   COUNT(products.productID) AS [Number of Products]
FROM     Products
   JOIN  Categories ON Products.CategoryID=Categories.CategoryID

-- 77 products joined with Categories?


-- the "ON Products.CategoryID=Categories.CategoryID"
-- is known as the "join condition"

-- we missed the products with the NULL CategoryID

SELECT   ProductID,
         ProductName,
         CategoryID
FROM     Products
ORDER BY CategoryID

-- in particular, these products don't satisfy the join

SELECT   ProductID,
         ProductName,
         CategoryID
FROM     Products
WHERE    CategoryID IS NULL

-- so we get this, 77 products:
SELECT   Products.ProductID,
         Products.ProductName,
         Products.CategoryID,
         Categories.CategoryID,
         Categories.CategoryName
FROM     Products
   JOIN  Categories ON Products.CategoryID = Categories.CategoryID
ORDER BY Products.ProductID

-- that's why there are 77 instead of 79

-- so what if we want to see that categoryID?
--  the boss says she wants to see 
-- *every* product
-- *even if* it has a NULL value for the categoryID

-- in other words
-- "show *every* product, and if that product has a category
--  associated with it, show me that too"

-- this is where an outer join comes in

-- 
-- the above queries have been using a shorthand
-- they are really INNER JOINS
-- by default "JOIN" means "INNER JOIN"
-- the following two queries are logically equivalent
-- and ordering by categoryID

SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products
       JOIN    Categories ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID
--
SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products
   INNER JOIN  Categories ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID

-- and reformatting ...
-- this is the same as above, just reformatted, more specific
-- this is to get the two tables on the same line
-- and one table is on the left of the "INNER JOIN" phrase
-- and one table is on the right of the "INNER JOIN" phrase

SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products INNER JOIN Categories  
                  ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID

-- now, a small change to see the one product
-- that doesn't satisfy the join
-- (at the bottom)

SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products LEFT JOIN Categories   -- the change is in this line
                  ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID

-- the LEFT JOIN says:
-- "include all records from the table on the left"
-- in this case, Products is on the left, 
-- so every product record is shown, 
-- even if it doesn't satisfy the join condition

-- of course there is a RIGHT JOIN also
-- the next two queries are logically the same

SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products LEFT JOIN Categories 
                  ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID
-- 
SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Categories RIGHT JOIN Products 
                  ON Products.CategoryID = Categories.CategoryID
ORDER BY       Products.ProductID

-- these are actually shorthand for
SELECT         Products.ProductID,
               Products.ProductName,
               Products.CategoryID,
               Categories.CategoryID,
               Categories.CategoryName
FROM           Products LEFT OUTER JOIN Categories 
                  ON Products.CategoryID = Categories.CategoryID
ORDER BY       Categories.CategoryID

-- but the OUTER keyword is typically omitted

-- ** notice that the ON statement does not change
--    across INNER, LEFT, LEFT OUTER, RIGHT, RIGHT OUTER joins

-- OUTER joins are also known as "inclusive" joins,
-- because they *include* records that don't satisfy the join condition

-- INNER joins are also know as "exclusive" joins,
-- because they *exclude* records that don't satisfy the join condition


-- so the above query finds the product with a NULL
-- categoryID
-- the product record doesn't satisfy the join

-- what about a category that doesn't satisfy the join?

SELECT COUNT(CategoryID) AS [Number of Categories in Categories table]
FROM   Categories
-- ten categories

SELECT COUNT(DISTINCT CategoryID) AS [Number of Categories in Products table]
FROM   Products
-- only eight appear in the products table
-- but which 8?
-- we can't look for a NULL foreign key...

-- with few records, we can inspect the actual records...

SELECT   CategoryID
FROM     Categories
ORDER BY CategoryID

SELECT   DISTINCT CategoryID
FROM     Products
ORDER BY CategoryID

-- but what if there are thousands of categories
-- and millions of products?

-- an outer join..
-- from the other side...
-- "inclusive" of all Categories
SELECT   Categories.CategoryName,
         Categories.CategoryID,
         Products.CategoryID,
         Products.productID,
         Products.productName
FROM     Products RIGHT JOIN Categories 
            ON Products.CategoryID= Categories.CategoryID
ORDER BY Categories.CategoryID,
         Products.ProductID

-- so this includes *every* category, 
-- even the ones that don't satisfy the join
-- and also shows the category/product combinations that do satisfy the join

-- a common use of an outer join is 
-- to find records in one table that don't match
--  records from another table

-- this is sometimes called a "find unmatched" query

-- looking at the previous query again...

SELECT   Categories.CategoryName,
         Categories.CategoryID,
         Products.CategoryID,
         Products.productID,
         Products.productName
FROM     Products RIGHT JOIN Categories 
            ON Products.CategoryID= Categories.CategoryID
ORDER BY Categories.CategoryID,
         Products.ProductID

-- and rephrasing it as a LEFT JOIN...

SELECT   Categories.CategoryName,
         Categories.CategoryID,
         Products.CategoryID,
         Products.productID,
         Products.productName
FROM     Categories LEFT JOIN Products 
            ON Products.CategoryID= Categories.CategoryID
ORDER BY Categories.CategoryID,
         Products.ProductID

-- note the two categories with no matching products
-- they both have a NULL primary key, which can't be, right?
-- which is a sign that there really is no product record
-- so let's limit to only that 
-- to limit to only the categories with no products

SELECT   Categories.CAtegoryName,
         Products.productID,
         Products.productName
FROM     Categories LEFT JOIN  Products
            ON Products.CategoryID= Categories.CategoryID
WHERE    Products.productID IS NULL
ORDER BY Categories.CAtegoryName,
         Products.productName

-- and formatting for readability
SELECT      Categories.CAtegoryName,
            Products.productID,
            Products.productName
FROM        Categories 
  LEFT JOIN Products ON Products.CategoryID = Categories.CategoryID
WHERE       Products.productID IS NULL
ORDER BY    Categories.CategoryName,
            Products.productName

-- and now, a FULL OUTER JOIN
-- this shows *every* category
-- and *every* product
-- even the ones that don't satisfy the join
-- its the left and the right together

SELECT   Categories.CategoryName,
         Categories.CategoryID,
         Products.CategoryID,
         Products.productID,
         Products.productName
FROM     Products FULL OUTER JOIN Categories 
            ON Products.CategoryID= Categories.CategoryID
ORDER BY Categories.CAtegoryName,
         Products.productName

-- and formatted consistently with the above examples

SELECT               Categories.CategoryName,
                     Categories.CategoryID,
                     Products.CategoryID,
                     Products.productID,
                     Products.productName
FROM                 Products 
   FULL OUTER JOIN   Categories ON Products.CategoryID= Categories.CategoryID
ORDER BY             Categories.CAtegoryName,
                     Products.productName


-- Database by Doug
-- 2/12/2016
-- Intro to Outer Joins

Wednesday, February 10, 2016

Group By Part 3: uniqueness and multi-column grouping

A short video covering the GROUP BY clause in SQL.

Part 3: uniqueness and multi-column grouping
For beginners.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 2/4/2016
-- GROUP BY - Part 3 uniqueness and multi-column Grouping

USE Northwind

-- here's a basic query joining Categories with Products
SELECT   Categories.CategoryName,
         Products.ProductName,
         Products.Unitprice
FROM     Products
  JOIN   Categories    ON Products.CategoryID = Categories.CategoryID
ORDER BY Categories.CategoryName

-- now let's aggregate
-- show the average unit price, by Category

SELECT   Categories.CategoryName,
         AVG(Products.unitprice) AS [average unit price]
FROM     Products
  JOIN   Categories    ON Products.CategoryID = Categories.CategoryID
GROUP BY Categories.CategoryName
ORDER BY Categories.CategoryName

-- awesome! runs fine, looks reasonable...

-- or is it

-- let me add one more item
SELECT   Categories.CategoryName,
         AVG(Products.unitprice) AS [average unit price]
FROM     Products
  JOIN   Categories    ON Products.CategoryID = Categories.CategoryID
GROUP BY Categories.CategoryName,
         Categories.CategoryID -- this is the added piece
ORDER BY Categories.CategoryName

-- notice that now I have an additional group
-- let's add a column to see what's going on
SELECT   Categories.CategoryID,
         Categories.CategoryName,
         AVG(Products.unitprice) AS [average unit price]
FROM     Products
  JOIN   Categories    ON Products.CategoryID = Categories.CategoryID
GROUP BY Categories.CategoryName,
         Categories.CategoryID -- this is the added piece
ORDER BY Categories.CategoryName

-- aha! there are two categories with the same name
-- is that OK? 
-- well, CategoryName is not the primary key
-- and apparently the schema allows duplicate values in the CategoryName

-- so, the lesson here is to always specify a *unique* grouping
-- in general, this means adding a primary key 
-- to the GROUP BY statement, even if it is not displayed

-- for example:
SELECT   Categories.CategoryName,
         AVG(Products.unitprice) AS [average unit price]
FROM     Products
  JOIN   Categories    ON Products.CategoryID = Categories.CategoryID
GROUP BY Categories.CategoryName,
         Categories.CategoryID 
ORDER BY Categories.CategoryName,
         Categories.CategoryID

-- or another example
SELECT   Suppliers.CompanyName,
         AVG(Products.unitprice) AS [average unit price]
FROM     Products
   JOIN  Suppliers      ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Suppliers.CompanyName,
         Suppliers.SupplierID
ORDER BY Suppliers.CompanyName,
         Suppliers.SupplierID

-- notice in the second example that there are no duplicate
-- company names
-- so should we still add the SupplierID to the group by?
-- absolutely
-- your query needs to work on any data that might be in the database
-- not just what's there now

-- now for multi-level grouping
-- consider this query

SELECT   Suppliers.CompanyName,
         Categories.CategoryName,
         Products.ProductName,
         Products.Unitprice
FROM     Suppliers
   JOIN  Products    ON Suppliers.SupplierID =Products.SupplierID
   JOIN  Categories  ON Categories.CategoryID=Products.CategoryID
ORDER BY Suppliers.CompanyName,
         Categories.CategoryName

-- notice that Exotic Liquids supplies us products in two categories
-- same with Forets...

-- so, let's calculate average unitprice by Supplier AND Category

SELECT   Suppliers.CompanyName,
         Categories.CategoryName,
         AVG(Products.Unitprice)  AS [avg unit price]
FROM     Suppliers
   JOIN  Products    ON Suppliers.SupplierID =Products.SupplierID
   JOIN  Categories  ON Categories.CategoryID=Products.CategoryID
GROUP BY Suppliers.CompanyName,
         Categories.CategoryName
ORDER BY Suppliers.CompanyName,
         Categories.CategoryName

-- notice that now Exotic liquids has two groups
-- one for the Beverages category 
-- and one for the Spreads category

-- so now, the grouping is based on the *combination* of the two fields
-- note that the order of the fields in the GROUP BY doesn't matter
-- but for clarity it is best to be consistent with the ORDER BY 
--  (the order of the fields in the ORDER BY *does* matter)

-- and of course, to make sure our groupings are *unique* we should add
-- a unique field to the GROUP BY

SELECT   Suppliers.CompanyName,
         Categories.CategoryName,
         AVG(Products.Unitprice)  AS [avg unit price]
FROM     Suppliers
   JOIN  Products    ON Suppliers.SupplierID =Products.SupplierID
   JOIN  Categories  ON Categories.CategoryID=Products.CategoryID
GROUP BY Suppliers.CompanyName,
         Categories.CategoryName,
         Suppliers.SupplierID,
         Categories.CategoryID
ORDER BY Suppliers.CompanyName,
         Categories.CategoryName,
         Suppliers.SupplierID,
         Categories.CategoryID

-- on a side note
-- it's also important to have a unique ordering
-- so that downstream consumers of your data aren't surprised
-- when the ordering changes

-- note that very different information that is provided if 
-- the column-ordering, and the row-ordering is reversed

SELECT   Categories.CategoryName,
         Suppliers.CompanyName,
         AVG(Products.Unitprice)  AS [avg unit price]
FROM     Suppliers
   JOIN  Products    ON Suppliers.SupplierID =Products.SupplierID
   JOIN  Categories  ON Categories.CategoryID=Products.CategoryID
GROUP BY Categories.CategoryName,
         Suppliers.CompanyName,
         Suppliers.SupplierID,
         Categories.CategoryID
ORDER BY Categories.CategoryName,
         Suppliers.CompanyName,
         Suppliers.SupplierID,
         Categories.CategoryID

-- all the data in the result set is the same as the previous example

-- in the first example, the focus was on the Supplier
-- and secondarily, on the categories

-- in the second example, the focus is on the Category
-- and secondarily, on the Suppliers

-- in summary, make sure that your GROUP BY statement
-- specifies a unique grouping
-- this is generally done by adding the primary key

-- you can specify multiple columns in the GROUP BY
-- then groups are determined by *unique combinations*
-- of the values in the columns

-- finally, merely changing the column ordering
-- and the row ordering can produce quite different
-- information for human use

-- Database by Doug
-- Douglas Kline
-- 2/4/2016
-- GROUP BY - Part 3 uniqueness & multi-column Grouping

Group By Part 2: Null Handling and Count

A short video covering the GROUP BY clause in SQL.

Part 2: covering how NULL values are handled and Counts For beginners.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 2/4/2016
-- GROUP BY - Part 2 NULL Handling & Count
USE Northwind

-- see GROUP BY - Part 1 Introduction
-- what about NULL values? How are they handled?

SELECT SUM(unitprice)/COUNT(ProductID)                AS [manual average unitprice],
       AVG(unitprice)                                 AS [correct average unitprice],
       SUM(unitprice)/COUNT(ProductID)-AVG(unitprice) AS [difference]
FROM   products

-- normally, you would expect these two averages to agree
-- but I've added a product with a NULL unitprice:

SELECT *
FROM   Products
WHERE  unitprice IS NULL

-- so let's look at all the pieces to see what's happening

SELECT SUM(unitprice)                  AS [total of all unitprices],
       COUNT(ProductID)                AS [number of records],
       SUM(unitprice)/COUNT(ProductID) AS [manual average unitprice],
       AVG(unitprice)                  AS [correct average unitprice]
FROM   products

-- the problem here is that there are 79 total records
-- but only 78 non-NULL unitprices
-- the AVG function handles the NULL appropriately
-- and only divides by 78

-- in general, only non-NULL values are included in MIN, MAX, SUM, STDEV, VAR, AVG, etc.

-- SO, depend on the functions, rather than writing your own
-- they will appropriately handle NULL values
SELECT      AVG(unitprice),
            SUM(unitprice),
            STDEV(unitprice),
            VAR(unitprice)
FROM        Products

-- COUNT needs more explanation
-- note the difference in the two number returned below:

SELECT   COUNT(ProductID)   AS [number of non-NULL productIDs],
         COUNT(unitprice)   AS [number of non-NULL unitPrices]
FROM     Products

-- COUNT requires a field to be specified
-- it will count records that have a non-NULL value in that field

-- but what about this?

SELECT   COUNT(*)          AS [count of products]
FROM     Products

-- recall that an asterisk in the SELECT clause is shorthand for 'all fields'
-- so the above can be interpreted as 
-- "check all fields for non-NULL values, and count if any field has a non-NULL value"

-- of course, if a table has a primary key, then the table can be guaranteed
-- to have at least one field that is non-NULL (the primary key field)

-- for code clarity, the asterisk should be avoided
-- it is much clearer to count the primary key of the entity that you intend to count

-- as an example:
SELECT         COUNT(*)    AS [what is being counted here?]
FROM           Products  
   INNER JOIN  Categories  ON Products.CategoryID = Categories.CategoryID

-- this is much clearer
-- we're counting products, because we are counting the primary key
-- of the products table
SELECT         COUNT(Products.ProductID) AS [product count]
FROM           Products  
   INNER JOIN  Categories  ON Products.CategoryID = Categories.CategoryID

-- but why are their only 77 records, and not 79?
-- because only 77 products satisfy the JOIN with Categories:

SELECT      ProductID,
            CategoryID
FROM        Products
ORDER BY    CategoryID

-- Database by Doug
-- Douglas Kline
-- 2/4/2016
-- GROUP BY - Part 2 NULL Handling & Count

GROUP BY Part 1: Introduction

A short video covering the GROUP BY clause in SQL.

For beginners.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 2/4/2016
-- GROUP BY - Part 1 Introduction

USE Northwind

-- here's a simple SELECT statement from the Products table

SELECT   ProductID,
         productname,
         unitprice,
         UnitsInStock
FROM     Products
ORDER BY ProductID

-- notice that I get all the rows, because there is not a WHERE clause

-- if there were 10 million rows, I would see all of them

-- but sometimes I'd like to see a *summary* of the products

-- consider this:

SELECT   AVG(unitprice)  AS [average unit price]
FROM     Products

-- in this case, we have summarized all the prices of all products
-- into a single number -  the average

-- this summarization is known as "aggregating" or "aggregation"
-- we aggregate all the product records into a single group,
-- then we can ask questions about the group

-- what else can we find out about the group?

SELECT   AVG(unitprice)    AS [average unit price],
         MIN(unitprice)    AS [minimum unit price],
         MAX(reorderlevel) AS [maximum reorder level],
         COUNT(productID)  AS [number of products],
         SUM(unitsinStock) AS [total number of units in stock],
         STDEV(unitprice)  AS [standard deviation of unit price],
         VAR(unitprice)    AS [variance of unit price]
FROM     Products

-- we can also include expressions in an aggregate function

SELECT   SUM(unitprice * unitsInStock) AS [total value of inventory]
FROM     Products



-- in the above examples we were aggregating over the whole table
-- but you can limit the aggregation

SELECT   AVG(UnitPrice)     AS [average unit price]
FROM     Products
WHERE    CategoryID IN (2, 4, 6)

-- in this example, the AVG is calculated over 
-- products in the stated categories

-- think of the above example as happening in this order
-- 1 - the WHERE clause - records are eliminated first
-- 2 - the averaging - aggregate calculated over remaining records

-- now consider these two queries:
SELECT   AVG(unitprice) AS [Avg price of category 1 products]
FROM     Products
WHERE    categoryID = 1

SELECT   AVG(unitprice) AS [Avg price of category 2 products]
FROM     Products
WHERE    categoryID = 2

-- wouldn't it be nice to have these in the same table?
-- or a table that shows a list of categories, and the average for each category?
-- like this:
--**********************
--* CategoryID *  Avg  *
--**********************
--*         1  *  31.8 *
--*         2  *  23.1 *
--*         3  *  15.2 *
--*       ...  *  ...  *
--**********************

-- here it is:

SELECT   CategoryID,
         AVG(unitprice) AS [Avg unitprice]
FROM     Products
GROUP BY CategoryID
ORDER BY CategoryID

-- the new part is the GROUP BY clause
-- it has to be before the ORDER BY, and after the WHERE

-- the logic is this:
-- arrange the products into groups, using the CategoryID
-- all CategoryID =1 records go into a group
-- all CategoryID =2 records go into a group
-- etc.
-- now, answer a question about each group - what is the average unit price?

SELECT   CategoryID,
         AVG(unitprice) AS [Avg unitprice]
FROM     Products
GROUP BY CategoryID
ORDER BY CategoryID

-- and once the grouping is established, I can ask other aggregate questions:

SELECT   CategoryID,
         AVG(unitprice)    AS [average unit price],
         MIN(unitprice)    AS [minimum unit price],
         MAX(reorderlevel) AS [maximum reorder level],
         COUNT(productID)  AS [number of products],
         SUM(unitsinStock) AS [total number of units in stock],
         STDEV(unitprice)  AS [standard deviation of unit price],
         VAR(unitprice)    AS [variance of unit price]
FROM     Products
GROUP BY CategoryID
ORDER BY CategoryID
-- 
-- so how can we visualize what's going on?

SELECT   CategoryID
FROM     Products
ORDER BY CategoryID

-- in the above, notice the groupings of
-- the records, by categoryID

-- first two records are one group, with a NULL categoryID
-- next twelve records are the next group, where categoryID=1
-- etc.

-- now group them
SELECT   CategoryID
FROM     Products
GROUP BY CategoryID
ORDER BY CategoryID

-- each of the 9 resulting records represents
-- a group of products that have a common categoryID

-- now there are groups of products
-- one group for each different categoryID
-- (you can also do this with DISTINCT,
--  but that's for another video)

-- now let's add some attributes that we're interested in
SELECT   CategoryID,
         ProductID,
         ProductName,
         Unitprice
FROM     Products
ORDER BY CategoryID,
         ProductID

-- looking at the left-most column,
-- we can see values that will become groups
-- the two NULL categoryIDs will become a group
-- the next twelve records with categoryID=1 will become the next group

-- we're looking to get the average unit price by category
-- so let's pare down to just the columns we need
SELECT   CategoryID,
         Unitprice
FROM     Products
ORDER BY CategoryID

-- so let's "convert" the above into an aggregate
-- and I'm expecting that there will be two products for categoryID NULL
-- and twelve products for categoryID=1

SELECT   CategoryID,
         AVG(unitprice) AS [average unit price] -- the AVG function is new
FROM     Products
GROUP BY CategoryID    -- the GROUP BY is new
ORDER BY CategoryID

-- the changes above are:
-- * add the GROUP BY clause with the CategoryID
-- * add the AVG function to the unitprice

-- in the english language, a common idiom is to ask for
-- "a count of products by category" or
-- "the average unit price by supplier"
-- when someone asks "... by category" 
-- they usually mean that the category should be in the leftmost column
-- and that the records should be GROUP BY'ed category
-- and that the records should be ORDER BY'ed category

-- a couple of examples

-- Manager Bob: "show me the total value of inventory by category"
-- here's what Manager Bob most likely wants (but avoiding JOINs for now):
SELECT      CategoryID,     
            SUM(unitprice * unitsinstock) AS [value of inventory]
FROM        Products
GROUP BY    CategoryID
ORDER BY    CategoryID

-- Manager Barb: "show me the lowest unitprice of products by supplier"
-- here's what Manager Barb most likely wants (but avoiding JOINs again...):

SELECT      SupplierID,
            MIN(unitprice) AS [lowest unitprice]
FROM        Products
GROUP BY    SupplierID
ORDER BY    SupplierID

-- more clearly
-- "by Supplier"
SELECT      SupplierID,    -- supplier in first column
            MIN(unitprice) AS [lowest unitprice]
FROM        Products       
GROUP BY    SupplierID     -- supplier is the basis for grouping
ORDER BY    SupplierID     -- records ordered by supplier



-- throwing in a JOINs for clearer suppliers:
SELECT      Suppliers.CompanyName,
            MIN(Products.unitprice) AS [lowest unitprice]
FROM        Products
   JOIN     Suppliers      ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Suppliers.CompanyName,
            Suppliers.SupplierID   -- this is important, but not covered here
ORDER BY    Suppliers.CompanyName
 

-- Database by Doug
-- Douglas Kline
-- 2/4/2016
-- GROUP BY introduction

Followers