Thursday, October 1, 2015

Introduction to JOINs

A short video covering the JOINs within a SQL SELECT statement.

For beginners. How to show columns from more than one table in a single SELECT statement. Covers joining with the WHERE clause and the JOIN statement, and some best practices.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 10/1/2015
-- Introduction to JOINs

USE Northwind

-- suppose we want to look at our products
-- and what categories they are in

SELECT   *
FROM     Products

-- focus more on just a few columns

SELECT   ProductID,
         ProductName,
         CategoryID
FROM     Products
ORDER BY ProductID

-- so what category is CategoryID 1? 4?

SELECT   *
FROM     Categories

-- focus in on the categoryName

SELECT   CategoryID,
         CategoryName
FROM     Categories

-- great, categoryID=1 is 'Beverages'
-- categoryID=4 is 'Dairy Products'

-- we used the CategoryID field in the Products table
-- to "look up" the name of the Category

-- but we don't want to do that for lots of records

SELECT   ProductID,
         ProductName,
         CategoryID
FROM     Products
ORDER BY ProductID

-- why not just add the CategoryName field to the SELECT clause?

SELECT   ProductID,
         ProductName,
         CategoryID,
         CategoryName
FROM     Products
ORDER BY ProductID

-- CategoryName is not a field in the Products table

-- OK, let's add the Categories table to the FROM statement

SELECT   ProductID,
         ProductName,
         CategoryID,
         CategoryName
FROM     Products,
         Categories
ORDER BY ProductID

-- that got rid of the syntax error on CategoryName
-- but now CategoryID is ambiguous??

-- look at the columns list for each table
-- there's a CategoryID in both the Products table 
-- and the Products table

-- so, we need to be more specific - which one do we mean

-- here's how

SELECT   ProductID,
         ProductName,
         Products.CategoryID,
         CategoryName
FROM     Products,
         Categories
ORDER BY ProductID

-- great, now we don't have any syntax errors

-- but we have 702 rows?
-- where is that coming from?

SELECT COUNT(CategoryID) AS [CategoryCount]
FROM   Categories

-- 9 categories

SELECT COUNT(ProductID) AS [ProductCount]
FROM   Products

-- 78 products

SELECT 9 * 78

-- 702 - just like the number of records

SELECT   ProductID,
         ProductName,
         Products.CategoryID,
         CategoryName
FROM     Products,
         Categories
ORDER BY ProductID

-- let's look closer
-- adding the categoryID from the Categories table

SELECT   ProductID,
         ProductName,
         Products.CategoryID,
         Categories.CategoryID,
         CategoryName
FROM     Products,
         Categories
ORDER BY ProductID

-- that's odd, why is chai being shown next to the wrong category?
-- in fact, it's being shown next to *every* category

-- all products are shown next to every category

-- because we haven't stated how the products and categories should be related
-- the database engine returns all possible combinations of products and categories
-- this is called a CROSS JOIN
-- this is sometime what we want, but not usually

-- so , how do we limit the returned records below 
-- so that only the ones with matching categoryIDs are kept?

-- using the WHERE clause
--

SELECT   ProductID,
         ProductName,
         Products.CategoryID,
         Categories.CategoryID,
         CategoryName
FROM     Products,
         Categories
WHERE    Products.CategoryID = Categories.CategoryID -- ******** 
ORDER BY ProductID

-- this looks a lot better
-- chai has a categoryID = 1, and it is shown next to the category with categoryID = 1

-- but the record count is 77?
-- aren't there 78 products?

-- looking at the products again...

SELECT   ProductID,
         ProductName,
         CategoryID
FROM     Products
ORDER BY ProductID

-- scrolling down, we can see that Dougs Mustard has a NULL categoryID
-- and looking back our combined query

SELECT   ProductID,
         ProductName,
         Products.CategoryID,
         Categories.CategoryID,
         CategoryName
FROM     Products,
         Categories
WHERE    Products.CategoryID = Categories.CategoryID -- ******** 
ORDER BY ProductID

-- scrolling down, note that Dougs Mustard doesn't appear
-- because this Products.CategoryID = Categories.CategoryID
-- is essentially this: NULL = Categories.CategoryID
-- which is false (NULLs in another video)

-- ok, looks like we have the logic right with this:
SELECT   ProductID,
         ProductName,
         Products.CategoryID,
         Categories.CategoryID,
         CategoryName
FROM     Products,
         Categories
WHERE    Products.CategoryID = Categories.CategoryID 
ORDER BY ProductID

-- but humans don't really want to see the CAtegoryID 
-- so we can remove them

SELECT   ProductID,
         ProductName,
         CategoryName
FROM     Products,
         Categories
WHERE    Products.CategoryID = Categories.CategoryID 
ORDER BY ProductID

-- the above is called a JOIN of the Products and Categories table
-- the above is stated as a constraint in the WHERE clause
-- but it is a special kind of a constraint
-- between the Primary Key of one table, and the Foreign Key in another table

-- so one way to look at a JOIN is as a constraint
-- you make all possible combinations of the records from two tables
-- then remove the ones that don't have equal PK--FK combinations

-- another way to view a JOIN is as a specific combination of tables
-- combine the tables in a specific way to create a specific set of records
-- then show fields from the combined set of records

-- this is done in the FROM clause

SELECT   ProductID,
         ProductName,
         CategoryName
FROM     Products JOIN Categories ON Products.CategoryID = Categories.CategoryID 
ORDER BY ProductID

-- this is logically equivalent
-- it will always return the same records as the WHERE clause version

-- I like to format it like this:

SELECT   ProductID,
         ProductName,
         CategoryName
FROM     Products 
   JOIN  Categories     ON Products.CategoryID = Categories.CategoryID 
ORDER BY ProductID

-- I suggest you learn both ways
-- so you can read either way when you encounter it
-- also, sometimes one way is clearer than another in a certain context

-- the JOIN syntax signals your *intention* better than the WHERE clause
-- and generally organizes your statement better 

-- conceptually, it is an binary operator between two tables, 
-- with an ON clause to specify details that the operator will use
-- it is *not* a list of tables (separated by commas),
-- but an expression that states how to combine the tables

-- one last thing
-- safe SQL coding practice
-- right now the statement below works fine
-- it is syntactically correct

SELECT   ProductID,
         ProductName,
         CategoryName
FROM     Products 
   JOIN  Categories     ON Products.CategoryID = Categories.CategoryID 
ORDER BY ProductID

-- but what would happen if in the future
-- someone put a CategoryName field in the Products table?
-- then there would be a CategoryName field in both Products and Categories

-- they *should* be able to do that without breaking code
-- even if it doesn't make sense

-- then categoryName would become ambiguous and the code would not execute

-- so how do we prevent that possible future bug?

-- by *qualifying* every field like this

SELECT   Products.ProductID,
         Products.ProductName,
         Categories.CategoryName
FROM     Products 
   JOIN  Categories     ON Products.CategoryID = Categories.CategoryID 
ORDER BY Products.ProductID

-- safe coding rule:
-- whenever there is more than one table 
-- fully qualify every field in every clause: SELECT, FROM, WHERE, ORDER BY, etc.


-- intro to JOIN...

No comments:

Post a Comment

Followers