Thursday, January 28, 2016

Obese Data

big data (noun) (Merriam-Webster)
an accumulation of data that is too large and complex for processing by traditional database management tools
obese data (noun)
an accumulation of data that is much larger and more complex than necessary
I made the second one up. After seeing lots of examples of obese databases.

In short, obese databases' effective size is much smaller than their actual size, e.g., a database that consumes 1TB of space, but has only 1GB of data. Further, the schema / architecture / model / framework / product / system / process is more complex than necessary.

Databases that have these problems:
  • they are slow
  • they could be a fraction of the size that they are 
  • their growth rate is higher than it should be (GBs/day rather than Kbs/day)
  • their operations (backup, batch load, maintenance) take much longer than necessary
  • they are inflexible, e.g., not easily adaptable to solve a new problem, write a new report, add a new feature, etc.
All these problems above are leading people to think that they have big data, when they really have obese data. 

Obese data might be viewed as a "database anti-pattern".

Things that contribute to obese data in the context of databases:
  • data types that are larger than they need to be
    • bigint that could be int
    • nchar that could be char
    • guid that could be int
    • datetime2 that could be date
    • char that could be varchar
    • char(50) that could be char(2)
  • unused columns
  • duplicate indexes
  • indexes that are never used
  • full-factorial indexing :-)
    • an index for every single column in the table
    • an index for every two-column combination in the table
    • etc.
  • data that should be archived
  • leftover staging data
  • redundant data
  • large numbers of NULL values
  • inappropriate schema
    • not "normalized" enough
    • too "normalized"
  • missing or inappropriate maintenance operations
    • hourly dbcc dbreindex to fillfactor 100% on a write-heavy table
    • fillfactor 10% on read-heavy table

Sunday, January 24, 2016

Using the IN Operator in SQL

A short video covering the IN operator in the WHERE clause within a SQL SELECT statement.

For beginners. Simple WHERE clauses using IN.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 1/20/2016
-- Using the IN operator

USE northwind

-- refresher

-- here's a simple SELECT statement with a simple WHERE clause

SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    UnitPrice > 10
ORDER BY ProductID

-- recall that the portion in the WHERE clause is a logical expression
-- the logical expression is evaluated for each record
-- records that evaluate (Unitprice > 10) as True (1) are included
-- records that evaluate (Unitprice > 10) as False (0) are not included

-- here's a simple IN example

SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    CategoryID     IN (2, 4)
ORDER BY ProductID

-- all records that have a CategoryID of either 2 or 4 are included in the results
-- so the above could be equivalently rephrased as

SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    CategoryID = 2
   OR    CategoryID = 4
ORDER BY ProductID

-- but this can be inconvenient and verbose if there are lots of desired values, like this:

SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    CategoryID = 2
   OR    CategoryID = 4
   OR    CategoryID = 5
   OR    CategoryID = 8
   OR    CategoryID = 67
   OR    CategoryID = 101
ORDER BY ProductID

-- which would be more succinctly phrased as

SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    CategoryID     IN (2, 4, 5, 8, 67, 101)
ORDER BY ProductID

-- here's another example that can lead to bugs:
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    CategoryID = 2
   OR    CategoryID = 4
   OR    CategoryID = 5
   OR    CategoryID = 8
   OR    CategoryID = 67
   OR    CategoryID = 101
   AND   UnitPrice > 10
ORDER BY ProductID

-- note that the record with ProductID 13, 'Konbu' has a unitprice < 6
-- since the ORs and ANDs are evaluated left-to-right, the Unitprice > 10 has no effect

-- so parentheses are needed to control the order of operations, like this:
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    (
         CategoryID = 2
   OR    CategoryID = 4
   OR    CategoryID = 5
   OR    CategoryID = 8
   OR    CategoryID = 67
   OR    CategoryID = 101
         )
   AND   UnitPrice > 10
ORDER BY ProductID

-- this is MUCH clearer, and less conducive to bugs, phrased with IN
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    CategoryID IN (2, 4, 5, 8, 67, 101)
   AND   UnitPrice   > 10
ORDER BY ProductID

-- here's another way that IN is very convenient
-- back to the long example of ORs
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    CategoryID = 2
   OR    CategoryID = 4
   OR    CategoryID = 5
   OR    CategoryID = 8
   OR    CategoryID = 67
   OR    CategoryID = 101
ORDER BY ProductID

-- what if we want CategoryIDs NOT equal to 2, 4, 5, etc
-- this doesn't work:
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    NOT
         CategoryID = 2
   OR    CategoryID = 4
   OR    CategoryID = 5
   OR    CategoryID = 8
   OR    CategoryID = 67
   OR    CategoryID = 101
ORDER BY ProductID

-- note the categoryID = 8 for productID=10
-- so we would need parentheses like this:
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    NOT
         (
         CategoryID = 2
   OR    CategoryID = 4
   OR    CategoryID = 5
   OR    CategoryID = 8
   OR    CategoryID = 67
   OR    CategoryID = 101
         )
ORDER BY ProductID

-- or more verbosely, but without parentheses
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    NOT CategoryID = 2
   AND   NOT CategoryID = 4
   AND   NOT CategoryID = 5
   AND   NOT CategoryID = 8
   AND   NOT CategoryID = 67
   AND   NOT CategoryID = 101        
ORDER BY ProductID

-- so, again, this is much clearer and less prone to bugs phrase with IN:
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    CategoryID NOT IN (2, 4, 5, 8, 67, 101)
ORDER BY ProductID

-- or equivalently
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    NOT CategoryID IN (2, 4, 5, 8, 67, 101)
ORDER BY ProductID

-- technically, 'IN' is comparison operator, like '=' or '<' 
-- a phrase like '7 > 8' compares the value on the left with the value on the right
-- and returns a true or false

-- a phrase like '7 IN (4, 5, 8) compares the value on the left with 
--  the list of values on the right, and returns a true or false

-- IN can be used with other data types:

SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    ProductName IN ('Chai', 'Tofu', 'Alice Mutton')
ORDER BY ProductID

-- for character based items, the above works, 
-- but you can't combine LIKE and IN, by including LIKE wildcards:
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    ProductName IN ('Chai%', 'Tofu', 'Alice Mutton')
ORDER BY ProductID

-- the above doesn't work - % is interpreted as a regular character, not a wildcard for matching

-- here's one using dates

SELECT   *
FROM     Employees
WHERE    BirthDate IN ('19481208', '19520219')

-- another very cool feature of the IN operator, 
-- is the ability to dynamically create the list of values
-- with a SELECT statement

SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    ProductName IN (  SELECT ProductName 
                           FROM   Products 
                           WHERE  unitprice > 10)
ORDER BY ProductID

-- of course, this is a very simple example,
-- and it would be much more clearer to phrase it like this:
SELECT   ProductID,
         CategoryID,
         SupplierID,
         ProductName, 
         UnitPrice
FROM     Products
WHERE    unitprice > 10
ORDER BY ProductID

-- but the idea is very powerful
-- and allows for greater flexibility for phrasing your SQL
-- in some cases, a sub query with IN is the clearest way to state what you intend

-- in summary
-- IN is a logical operator - it returns True/False (1/0)
-- IN is sometimes used to replace a series of logical expressions connected with ORs
-- IN operates on a reference value and a list of values
-- IN returns True(1) if the reference value is *in* the list of values
-- IN - the list of values can be the result of a SELECT statement

-- Database by Doug
-- Douglas Kline
-- 1/20/2016
-- Using the IN operator


Followers