Wednesday, October 10, 2018

Using DISTINCT in SQL

-- Database by Doug
-- Douglas Kline
-- 10/9/2018
-- DISTINCT - removing duplicates, but in a "dumb" way

-- consider this

SELECT ProductID,
       ProductName,
       SupplierID
FROM   Products

-- this is a list of all products
-- note the repeats in the SupplierID

-- suppose I want top know the list of suppliers
-- in the Products table

SELECT   SupplierID
FROM     Products

-- again, notice the repeats
-- to remove the repeats, I can do this:

SELECT   DISTINCT SupplierID
FROM     Products

-- you might be saying, why not just do this:
SELECT   SupplierID
FROM     Suppliers

-- my response is: that's a different list
-- it's the list of all suppliers in the supplier table

-- What I'm looking for is the list of all suppliers
-- in the products table
-- in other words, all suppliers which we actually *use*

-- Let's look at another example, with a table I've created

SELECT   firstName
FROM     Person
ORDER BY firstName 

-- this is a list of all people's first names
-- as we scroll down, we'll start to see repeats
-- in other words, multiple people have the same first name

-- now let's say we want a list of all first distinct firstnames

SELECT   DISTINCT firstName
FROM     Person
ORDER BY firstName 

-- so now there isn't a record returned for every Person record
-- there's a record returned for every unique firstname
-- also note that NULL is considered to be a unique firstname

-- you might say, why not use Group By to do this?
-- like this:

SELECT   firstName
FROM     Person
GROUP BY firstName
ORDER BY firstName

-- logically, it returns the exact same records, and always will
-- however, GROUP BY does a lot more work
-- it actually sets up groups of records in preparation to 
-- calculate aggregates like SUM, COUNT, AVG, etc.

-- DISTINCT is much faster 
-- if it sees a value it has seen before, it just throws it out
-- in other words, it doesn't group the records
-- it just makes a list of unique values

-- so, don't use GROUP BY when what you really need is DISTINCT

-- alright, what if you want to count stuff?
SELECT   COUNT(ID)                  AS [# of people],
         COUNT(DISTINCT ID)         AS [# of distinct primary key values],
         COUNT(firstName)           AS [# of people with non-NULL firstnames],
         COUNT(DISTINCT firstName)  AS [# of distinct firstNames]
FROM     Person


-- note that the first two values are always the same, 
-- since primary key values are distinct aka unique

-- and also notice that there are 599 distinct first names, but recall

SELECT   DISTINCT firstName
FROM     Person
ORDER BY firstName 

-- so why is the COUNT(DISTINCT firstname) = 599
-- but DISTINCT firstname gives 600 records?

-- remember that COUNT counts non-NULL values

-- finally, DISTINCT is sort of "dumb", in that it doesn't
-- know anything about primary keys
-- or anything about the underlying table(s)
-- it only considers values from the fields you provide

-- consider this:

SELECT   DISTINCT firstname
FROM     Person

-- it doesn't give distinct Person records, just distinct firstnames
-- now this:

SELECT   DISTINCT lastname
FROM     Person

-- and the distinct applies to the *combination* 
-- of all the fields in the SELECT clause
-- in this example,
-- all distinct *combinations* of gender and firstname are shown

SELECT   DISTINCT gender,
         firstname
FROM     Person
ORDER BY gender,
         firstname

-- in summary,
-- distinct removes duplicates
-- it removes duplicates based on all fields in the SELECT list
-- when used with COUNT, it will not count duplicate values

-- thanks for watching!

-- Database by Doug
-- Douglas Kline
-- 10/9/2018
-- DISTINCT - removing duplicates, but in a "dumb" way



No comments:

Post a Comment

Followers