Thursday, September 8, 2016

Can I use NULL with the IN operator?

Can I use NULL with the IN operator?


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 9/8/2016
-- the IN operator and NULL
-- aka Can I use IN with NULL?



-- the short answer is no
-- but why?

-- a couple of simple examples

SELECT 1
WHERE  NULL IN (1, 2, 3)

-- notice that NULL is not IN a list that contains a NULL

SELECT 1
WHERE  NULL IN (NULL, 1, 2, 3)

-- ok, let's look at a table I have

USE ReferenceDB


-- this is very close to what I saw in a piece of code
SELECT  lastname,
        email
FROM    Person
WHERE   email IN (NULL, '')


-- so are there zero-length strings? yes
SELECT  lastname,
        email
FROM    Person
WHERE   email = ''

-- are there NULLs? no?
SELECT  lastname,
        email
FROM    Person
WHERE   email = NULL

-- ah, to test for a NULL value, you need the IS operator
SELECT  lastname,
        email
FROM    Person
WHERE   email IS NULL

-- and more proof...
SELECT 1
WHERE  NULL = NULL

SELECT 1
WHERE  NULL IS NULL

-- but is an IN *really* just a translation of a sequence of OR'ed equalities?

SELECT  lastname,
        email
FROM    Person
WHERE   email IN (NULL, '')

SELECT  lastname,
        email
FROM    Person
WHERE   email = NULL
   OR   email = ''

SELECT  lastname,
        email
FROM    Person
WHERE   email IN (NULL, '', 'fred')

-- so, the query optimizer is smart enough to
-- remove the NULL from the IN list, since it doesn't matter
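
-- so if the intent was "email is empty *or* missing",
-- the NULL test has to be written separately with IS NULL
-- (a minimal sketch of the corrected query, using the same Person table as above)

SELECT  lastname,
        email
FROM    Person
WHERE   email = ''
   OR   email IS NULL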

Friday, July 8, 2016

Simple indexing and what to look at for database performance

Simple Indexing and what to look at for database performance


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 7/8/2016
-- simple indexing - an introduction

-- the example that's in every book
-- but more about why it works, and what to look at

USE ReferenceDB

SELECT *
FROM   Person


/****** Object:  Table [dbo].[Person]    Script Date: 7/8/2016 10:55:36 AM ******/
--CREATE TABLE [dbo].[Person](
-- [ID] [int] IDENTITY(1,1) NOT NULL,
-- [salutory] [nvarchar](15) NULL,
-- [firstName] [nvarchar](25) NULL,
-- [middleInitial] [nchar](1) NULL,
-- [lastName] [nvarchar](25) NOT NULL,
-- [nameSuffix] [nvarchar](8) NULL,
-- [email] [nvarchar](255) NULL,
-- [entryDate] [datetime] NOT NULL CONSTRAINT [DF_Person_entryDate]  DEFAULT (getdate()),
-- [lastUpdateDate] [datetime] NOT NULL CONSTRAINT [DF_Person_lastUpdateDate]  DEFAULT (getdate()),
-- [weight] [float] NULL,
-- [gender] [char](1) NULL,
-- [dateOfBirth] [date] NULL,
-- [bogusGUID] [uniqueidentifier] ROWGUIDCOL  NULL CONSTRAINT [DF_Person_bogusGUID]  DEFAULT (newid()),
-- [bogusGUID2] [uniqueidentifier] NULL CONSTRAINT [DF_Person_bogusGUID2]  DEFAULT (newid()),
-- [bogusChar] [char](100) NOT NULL CONSTRAINT [DF_Person_bogusChar]  DEFAULT ('fred'),
-- [bogusNChar] [nchar](100) NOT NULL CONSTRAINT [DF_Person_bogusNChar]  DEFAULT (N'george'),
-- CONSTRAINT [PK_Person] PRIMARY KEY CLUSTERED 
--(
-- [ID] ASC
--)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 60) ON [PRIMARY]
--) ON [PRIMARY]

-- typical hard disk seek time:           6.43 ms
-- typical RAM seek time (CAS latency):  14.00 ns

-- note the ms and ns difference
-- in the same units:
-- typical hard disk seek time:        6,430,000.00 ns
-- typical RAM seek time (CAS latency):       14.00 ns

-- so RAM is roughly 500,000 times faster than a spinning hard drive
-- but it is also a lot more expensive, and volatile

-- Databases are built to leverage this difference
-- use both RAM and disk, each for what it's best at

SET STATISTICS IO ON -- see the IO involved in each query

SELECT *
FROM   Person

-- logical reads: 138533
-- physical reads: 0

-- these are "page reads"
-- each page is 8 KB
-- they are the amount of data moved between RAM and disk

-- this means the entire table is in memory

-- but it might not be

-- in a "real" database, you might have 10 TB of data, but 16 GB of RAM
-- you can't keep all data in RAM

-- so think of logical page reads as "gifts"
-- you got lucky, the data you needed happened to be in RAM
-- and therefore came back hundreds of thousands of times faster than it might have

-- the best thing is to reduce the overall number of 
-- page reads, logical or physical
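
-- for a rough sense of scale (simple arithmetic, not part of the original demo):
-- 138,533 pages at 8 KB per page is roughly 1 GB of pages touched for one full scan

SELECT 138533 * 8 / 1024.0 / 1024.0 AS [approx GB per full scan of Person]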

SELECT  firstname
FROM    Person

-- lots faster, but same number of reads
-- this is due to less data coming back over the network

SELECT *
FROM   Person
WHERE  ID = 2417634

-- almost instant
-- page IOs: 3

-- what about this?
SELECT *
FROM   Person
WHERE  lastname = 'kline'

-- page IOs: 138534

-- what is special about ID?

-- there is an index on ID
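
-- one quick way to see which indexes already exist on the table
-- (sp_helpindex is a built-in SQL Server procedure)

EXEC sp_helpindex 'dbo.Person'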

-- what if I want to make the lastname lookup faster?

-- we could create an index that would make that faster

USE [ReferenceDB]
GO

/****** a really simple create index statement
CREATE NONCLUSTERED INDEX [IX_LastName] ON [dbo].[Person]
(
 [lastName] ASC
)   
GO
 ******/

 -- try again?
SELECT *
FROM   Person
WHERE  lastname = 'kline'

-- page IOs: 393, vs. 138534 without the index

-- that's the perfect scenario for an index
-- it's what is covered in all the basic DB books

-- so far, this looks like all upside
-- but there are significant downsides to indexes
-- particularly regarding INSERTs, UPDATEs, and DELETEs
-- more on that in future videos

-- books also tend to focus on the data structure performance
-- in other words, balanced trees and the *computational* efficiency of them

-- the real performance benefits are all in reduced disk IO

-- in the above, I'm not measuring performance by CPU
-- I'm measuring performance by disk IOs

-- Why? 

-- because CPU is 1000(?) times faster than RAM

-- for DB, generally worry about performance in this order:

-- network, which is 1000(?) times slower than
-- disk,    which is 1000(?) times slower than
-- RAM,     which is 1000(?) times slower than
-- CPU


-- Database by Doug
-- Douglas Kline
-- 7/8/2016
-- simple indexing

Friday, July 1, 2016

SQL Server's bit data type

SQL Server's bit data type


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 7/1/2016
-- bit data type

-- odd implicit conversion
-- bit data type 
-- character data type

-- saw something odd today, and thought I would cover it

USE Northwind

-- consider this from the Products table

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
ORDER BY ProductID

-- discontinued is a bit data type
-- (productID is, of course, an integer data type)
-- bit is generally treated as a very small integer
-- https://msdn.microsoft.com/en-us/library/ms187752.aspx 
-- bits are stored differently internally, but ostensibly behave like integers

-- this fails
UPDATE Products
SET    discontinued = NULL
WHERE  ProductID = 1
-- this is clearly not integer-like

-- so consider these WHERE clauses using the productID field
SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  productID = '1' -- implicit conversion to integer  
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  productID = 'a' -- implicit conversion fails  
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  productID = '12' -- implicit conversion succeeds  
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  productID = '1222222222' -- implicit conversion succeeds 
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  productID = '1.2' -- implicit conversion fails 
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  productID = '-1' -- implicit conversion works, no records 
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  productID = '' -- implicit conversion succeeds, no records 
ORDER BY ProductID

-- but what does it convert to?

SELECT CAST('' AS integer) -- it converts to integer value of zero

-- so what if we are using a bit field
-- and trying to use it like a boolean value
-- but it's really treated as a very small integer (2 values)

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  discontinued = '1' -- implicit conversion to bit  
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  discontinued = '0' -- implicit conversion  
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  discontinued = 'a' -- implicit conversion fails 
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  discontinued = '12' -- implicit conversion with truncation? 
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  discontinued = '122222222' -- implicit conversion with truncation
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  discontinued = '022222222' -- implicit conversion with truncation? not really?
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  discontinued = '0.2' -- implicit conversion fails
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  discontinued = '-1' -- implicit conversion fails - not like an integer
ORDER BY ProductID

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  discontinued = '' -- implicit conversion works
ORDER BY ProductID

-- so what does '' convert to?

SELECT CAST('' as bit) -- a zero

-- so, integer and bit seem inconsistent in how they treat negatives
-- what about tinyint?

SELECT CAST('-1' AS tinyint) -- gives overflow
SELECT CAST('-1' AS bit) -- gives conversion failure
SELECT CAST(-1 AS tinyint) -- gives overflow
SELECT CAST(-1 AS bit) -- succeeds, gives one

-- ok, that's strange

-- so, it appears that bit is neither a boolean nor an "exact numeric" 
-- in the way it is treated by SQL Server

-- just need to know how it behaves

-- in my experience, it is always *used* like it's a boolean/logical
-- personally, I would prefer it to behave consistently as a boolean/logical
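
-- one way to sidestep the string conversions entirely:
-- compare against the integer literals 1 and 0
-- (just an illustration, using the same Products table as above)

SELECT ProductID,
       ProductName,
       discontinued
FROM   Products
WHERE  discontinued = 1 -- no character-to-bit conversion involved
ORDER BY ProductID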

-- Database by Doug
-- Douglas Kline
-- 7/1/2016
-- bit data type

Thursday, June 30, 2016

SQL Coding Style, IMHO

Here's a longish rant/explanation of my personal SQL coding style.

 It's incomplete, but covers basics, and gets a beginner started.

Any fool can write code that a computer can understand. Good programmers write code that humans can understand.
~Martin Fowler
Programs must be written for people to read, and only incidentally for machines to execute.  
~Harold Abelson
-- formatting SQL
-- version 2, with fixed formatting

-- Any fool can write code that a computer can understand. 
-- Good programmers write code that humans can understand. 
-- ~Martin Fowler

-- Programs must be written for people to read, and only incidentally for machines to execute.  
-- ~Harold Abelson

-- here's a slightly long rant/explanation 
-- of my personal coding style

-- So here is something a company might want to know
-- how much of each product did we sell in 1997
-- plus which category and supplier
-- fairly simple and small relative to real-world queries

-- that result is produced by this query, taken directly from the SSMS query designer

SELECT     Products.ProductID, Products.ProductName, Categories.CategoryName, Suppliers.CompanyName, SUM(([Order Details].Quantity * [Order Details].UnitPrice) 
                      * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM         [Order Details] INNER JOIN
                      Products ON [Order Details].ProductID = Products.ProductID INNER JOIN
                      Orders ON [Order Details].OrderID = Orders.OrderID LEFT OUTER JOIN
                      Categories ON Products.CategoryID = Categories.CategoryID LEFT OUTER JOIN
                      Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Products.ProductID, Products.ProductName, Categories.CategoryName, Suppliers.CompanyName, YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY Products.ProductID

-- here are a few questions,
-- can you answer them with a quick scan of the code?
---- which tables are involved?
---- what are the foreign keys for the joins?
---- how many columns will come back?
---- which columns?
---- any outer joins?
---- how many joins?

-- all of the above questions should be *easily* readable from the code
-- but they aren't 
-- so here's some formatting modifications to help
-----------------------

-- vertically align a single item from each list on its own line:
---- lists are anything with commas. They are typically in the SELECT, 
---- GROUP BY, and ORDER BY
---- this makes the elements in the list clear. 
---- I don't have to depend on spotting commas for list item separation

SELECT  Products.ProductID, 
        Products.ProductName, 
        Categories.CategoryName, 
        Suppliers.CompanyName, 
        SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM         [Order Details] INNER JOIN
                      Products ON [Order Details].ProductID = Products.ProductID INNER JOIN
                      Orders ON [Order Details].OrderID = Orders.OrderID LEFT OUTER JOIN
                      Categories ON Products.CategoryID = Categories.CategoryID LEFT OUTER JOIN
                      Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Products.ProductID, 
         Products.ProductName, 
         Categories.CategoryName, 
         Suppliers.CompanyName, 
         YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY Products.ProductID

-- vertically align the tables in the FROM clause, and lead each with its JOIN
---- for me, this brings the really critical information about the type of joins to the front
---- I won't miss an outer join by accident

SELECT  Products.ProductID, 
        Products.ProductName, 
        Categories.CategoryName, 
        Suppliers.CompanyName, 
        SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM              [Order Details] 
  INNER JOIN      Products ON [Order Details].ProductID = Products.ProductID 
  INNER JOIN      Orders ON [Order Details].OrderID = Orders.OrderID 
  LEFT OUTER JOIN Categories ON Products.CategoryID = Categories.CategoryID 
  LEFT OUTER JOIN Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY Products.ProductID, 
         Products.ProductName, 
         Categories.CategoryName, 
         Suppliers.CompanyName, 
         YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY Products.ProductID

-- remove verbose JOIN syntax
---- In general, extra words don't help readability
---- all important information about the type of join is retained
---- also vertically align components of each clause
---- with this, the overall structure/components
---- of the query is/are very apparent


SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY    Products.ProductID

-- align ON clause elements, and space to the right
---- the tables involved become more clear
---- the "details" of the join are less important at first glance, 
---- but are clearly presented if needed

SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM(([Order Details].Quantity * [Order Details].UnitPrice) * (1.0 - [Order Details].Discount)) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders          ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories      ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers       ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY    Products.ProductID

-- break long formulas by the operators
---- note also the separation of parentheses with a space
---- and alignment of the outer beginning and ending parentheses
---- this is judgement call material - every formula is different
---- you may choose to do this differently
---- but clearly, formatting of longer calculations helps readability

SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM( (  [Order Details].Quantity 
                  * [Order Details].UnitPrice) 
                  * (1.0 - [Order Details].Discount)
                ) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders          ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories      ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers       ON Products.SupplierID = Suppliers.SupplierID
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
HAVING      (YEAR(Orders.OrderDate) = 1997)
ORDER BY    Products.ProductID

-- move the condition from the HAVING to the WHERE clause
-- and remove verbose/unnecessary parentheses
---- the condition is logically equivalent in HAVING or WHERE, but it should absolutely be moved
---- this is a subtle change, but an immense stylistic improvement
---- because it conveys the fact that YEAR(orderdate) is 
---- a property of each Order record, not of each product group
---- WHERE is for conditions on records
---- HAVING is for conditions on groups
---- It is also potentially performance-improving, because
---- records will be eliminated prior to grouping, not after
---- hopefully the query optimizer catches this, 
---- but why not make sure instead of hope?

SELECT      Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            SUM( (  [Order Details].Quantity 
                  * [Order Details].UnitPrice) 
                  * (1.0 - [Order Details].Discount)
                ) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID = Products.ProductID 
  JOIN      Orders          ON [Order Details].OrderID = Orders.OrderID 
  LEFT JOIN Categories      ON Products.CategoryID = Categories.CategoryID 
  LEFT JOIN Suppliers       ON Products.SupplierID = Suppliers.SupplierID
WHERE       YEAR(Orders.OrderDate) = 1997
GROUP BY    Products.ProductID, 
            Products.ProductName, 
            Categories.CategoryName, 
            Suppliers.CompanyName, 
            YEAR(Orders.OrderDate)
ORDER BY    Products.ProductID

-- the above I would say is far superior to the original
-- I can, with a quick scan, answer all the important questions I had above
-- small/typo errors become apparent
-- people new to the data model and codebase can read it

-- I think the above is sufficient, but some would go further
-- these changes I think improve the readability, but may not be worth the time/effort
-- essentially, this is further "pretty-printing" of the code
-- I think it helps in some places, but hurts in others
-- to me, the code looks "choppy", and certain expressions don't appear cohesive
-- so because I can't *consistently* do this, I choose not to


SELECT      Products   .ProductID, 
            Products   .ProductName, 
            Categories .CategoryName, 
            Suppliers  .CompanyName, 
            SUM( (  [Order Details].Quantity 
                  * [Order Details].UnitPrice) 
                  * (1.0 - [Order Details].Discount)
                ) AS totalRevenue
FROM        [Order Details] 
  JOIN      Products        ON [Order Details].ProductID  = Products   .ProductID 
  JOIN      Orders          ON [Order Details].OrderID    = Orders     .OrderID 
  LEFT JOIN Categories      ON Products       .CategoryID = Categories .CategoryID 
  LEFT JOIN Suppliers       ON Products       .SupplierID = Suppliers  .SupplierID
WHERE       YEAR(Orders.OrderDate) = 1997
GROUP BY    Products   .ProductID, 
            Products   .ProductName, 
            Categories .CategoryName, 
            Suppliers  .CompanyName, 
            YEAR(Orders.OrderDate)
ORDER BY    Products   .ProductID

-- some people will also use table aliases,
-- effectively creating a short "variable name" for each table
-- notice in the example below that I've still been careful with alignment
-- all table aliases are aligned vertically, so I
-- have an easy reference for each "variable name"

SELECT      p.ProductID, 
            p.ProductName, 
            c.CategoryName, 
            s.CompanyName, 
            SUM( (  od.Quantity 
                  * od.UnitPrice) 
                  * (1.0 - od.Discount)
                ) AS totalRevenue
FROM        [Order Details] od
  JOIN      Products        p   ON od.ProductID = p.ProductID 
  JOIN      Orders          o   ON od.OrderID   = o.OrderID 
  LEFT JOIN Categories      c   ON p.CategoryID = c.CategoryID 
  LEFT JOIN Suppliers       s   ON p.SupplierID = s.SupplierID
WHERE       YEAR(o.OrderDate) = 1997
GROUP BY    p.ProductID, 
            p.ProductName, 
            c.CategoryName, 
            s.CompanyName, 
            YEAR(o.OrderDate)
ORDER BY    p.ProductID


-- the main benefits in readability, to me, 
-- are fewer characters
-- and that some vertical alignments are easier,
-- since each "variable name" is a single letter
-- it also has benefits in coding, by making "auto-complete" quicker

-- however,
-- there is additional cognitive overhead in translating
-- the "variable name" to the table
-- this gets worse the more tables are involved
-- and if you have 6 tables that all begin with p, you have a problem
-- and if you are not intimately familiar with the database/codebase,
-- your cognitive overhead is very high
-- so, can other people read the code? (see quotes at top)
-- 

-- personally, I'm not a fan of this
-- even with many years (*gasp* decades) of coding
-- when I drop into a new environment/datamodel/codebase
-- it takes me too long to adapt to their "variable names"

Wednesday, June 29, 2016

Simple Recursion in SQL with Common Table Expressions

Simple Recursion in SQL


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 6/29/2016
-- Recursion

USE Northwind

-- data in the world is commonly *hierarchical*
-- this is modeled using a table that relates to itself in a 1-many manner
-- in other words, it has a foreign key that references itself

-- to get records from the entire hierarchy, we need recursion

SELECT *
FROM   Employees

-- focusing on just the columns of interest
SELECT   EmployeeID,
         LastName,
         reportsTo 
FROM     Employees
ORDER BY EmployeeID

-- notice that reportsTo is a foreign key
-- that references the primary key in
-- the same table
-- in other words, this relationship is
--   self-referential

-- maybe better like this:
SELECT   EmployeeID,
         LastName,
         reportsTo AS [employeeID of Boss]
FROM     Employees
ORDER BY EmployeeID

-- or
SELECT   EmployeeID,
         LastName,
         reportsTo AS [bossID]
FROM     Employees
ORDER BY EmployeeID


-- so to join a table to itself,
-- we need to use a table alias

SELECT   Employee.EmployeeID,
         Employee.LastName  AS [LastName],
         Employee.ReportsTo AS [ReportsToID],
         Boss.EmployeeID    AS [BossID],
         Boss.LastName      AS [Boss LastName]
FROM     Employees   AS [Employee]
  JOIN   Employees      AS [Boss]  ON Employee.ReportsTo = Boss.EmployeeID
ORDER BY Employee.EmployeeID

-- notice that we lost someone, employeeID=2, 
-- because their reportsTo was NULL
-- we can get them back with an outer JOIN

SELECT      Employee.EmployeeID,
            Employee.LastName  AS [LastName],
            Employee.ReportsTo AS [ReportsToID],
            Boss.EmployeeID    AS [BossID],
            Boss.LastName      AS [Boss LastName]
FROM        Employees          AS [Employee]
  LEFT JOIN Employees          AS [Boss] ON Employee.ReportsTo= Boss.EmployeeID
ORDER BY    Employee.EmployeeID

-- the above has been from the Employee perspective, and their bosses
-- lets do this the other way
-- Bosses, and their employees

SELECT      Boss.EmployeeID  AS [BossID],
            Boss.LastName  AS [Boss LastName],
            Employee.EmployeeID,
            Employee.LastName AS [LastName],
            Employee.ReportsTo AS [ReportsToID]
FROM        Employees   AS [Employee]
  LEFT JOIN Employees   AS [Boss] ON Employee.ReportsTo= Boss.EmployeeID
ORDER BY    Boss.EmployeeID

-- look closely, the JOIN hasn't changed
-- the only difference is the order of the columns
-- and the ORDER BY statement

-- both of the above only represent 
-- a single level in the hierarchy

-- an employee, and their direct supervisor
-- or a boss, and their direct supervisee

-- look at it another way
SELECT   Boss.EmployeeID    AS [BossID],
         Boss.LastName      AS [Boss LastName],
         COUNT(Employee.EmployeeID) AS [Supervisees]
FROM     Employees AS [Employee]
  JOIN   Employees AS [Boss] ON Employee.ReportsTo= Boss.EmployeeID
GROUP BY Boss.employeeID,
         Boss.LastName
ORDER BY Boss.EmployeeID

-- so this shows that Fuller has 5 *direct* reportees
-- and Buchanan has 3 direct reportees

-- however, Buchanan reports to Fuller
-- so really, Fuller has 8 reportees "under him"

-- and with more full data, 
-- there could be many levels
-- in the hierarchy

-- to do this, we'll need a
-- Common Table Expression, aka CTE

-- clean version-----------------------------------
WITH EmployeesAndTheirBosses 
  (EmployeeID,
  LastName,
  ReportsTo,
  BossLastName,
  depth)
AS
(
SELECT  EmployeeID,
        LastName,
        ReportsTo,
        LastName,
        0 
FROM    Employees
WHERE   reportsTo IS NULL 

UNION ALL 

SELECT   Employees.EmployeeID,
         Employees.LastName,
         Employees.ReportsTo,
         EmployeesAndTheirBosses.LastName,
         EmployeesAndTheirBosses.depth + 1
FROM     Employees
   JOIN  EmployeesAndTheirBosses ON employees.reportsTo = EmployeesAndTheirBosses.EmployeeID
)

SELECT   *
FROM     EmployeesAndTheirBosses
ORDER BY depth




-- annotated version-----------------------------------

-- CTEs define a table that can be referenced in only
-- the very next statement
-- WITH defines what the table looks like: name and columns
WITH EmployeesAndTheirBosses 
         (EmployeeID,
          LastName,
          ReportsTo,
          BossLastName,
          depth)
AS
-- now define what records go into the CTE
(
-- PART A
SELECT   EmployeeID,
         LastName,
         ReportsTo,
         LastName,
         0 -- the person without a boss has a depth of zero
FROM     Employees
WHERE    reportsTo IS NULL -- no boss!
-- the part above is the base condition
-- it's the employee with no boss

UNION ALL -- the ALL part is important for the recursion to work

-- the part below is the recursion
-- PART B
SELECT   Employees.EmployeeID,
         Employees.LastName,
         Employees.ReportsTo,
         EmployeesAndTheirBosses.LastName,
         EmployeesAndTheirBosses.depth + 1
FROM     Employees
   JOIN  EmployeesAndTheirBosses ON employees.reportsTo = EmployeesAndTheirBosses.EmployeeID
   -- notice that this references the CTE: EmployeesAndTheirBosses
   -- this is the recursion
)
-- now that the CTE is defined, select records from it
-- the CTE only exists for this next SELECT statement
-- in other words, the scope of the CTE is the next statement
SELECT   *
FROM     EmployeesAndTheirBosses
ORDER BY depth

-- I've ordered the records by depth, 
-- so you can see the recursion easily

--EmployeeID  LastName             ReportsTo   BossLastName         depth
------------- -------------------- ----------- -------------------- -----------
--2           Fuller               NULL        Fuller               0
--1           Davolio              2           Fuller               1
--3           Leverling            2           Fuller               1
--4           Peacock              2           Fuller               1
--5           Buchanan             2           Fuller               1
--8           Callahan             2           Fuller               1
--6           Suyama               5           Buchanan             2
--7           King                 5           Buchanan             2
--9           Dodsworth            5           Buchanan             2
--11          Kline                6           Suyama               3
--12          Jones                11          Kline                4
--13          Johnson              11          Kline                4

-- think of the recursion happening this way:

--EmployeeID  LastName             ReportsTo   BossLastName         depth
------------- -------------------- ----------- -------------------- -----------
-- ** base condition ** 
--2           Fuller               NULL        Fuller               0
-- ** first level of recursion, all those who report to level above
--1           Davolio              2           Fuller               1
--3           Leverling            2           Fuller               1
--4           Peacock              2           Fuller               1
--5           Buchanan             2           Fuller               1
--8           Callahan             2           Fuller               1
-- ** second level of recursion, all those who report to first level employees
--6           Suyama               5           Buchanan             2
--7           King                 5           Buchanan             2
--9           Dodsworth            5           Buchanan             2
-- ** third level of recursion, all those who report to 2nd level employees
--11          Kline                6           Suyama               3
-- ** fourth level of recursion, all those who report to 3rd level employees
--12          Jones                11          Kline                4
--13          Johnson              11          Kline                4
-- ** fifth level of recursion, all those who report to 4th level employees
-- ** no one

-- STEP 1: Base condition
-- The base condition is from PART A
-- it returns only fuller
-- Fuller is the person with no boss; NULL for reportsTo
-- depth=0 by decision - we could use any number, zero makes sense

-- STEP 2: First level of recursion
-- This joins Part A and Part B
-- So it joins Fuller with all people who report to Fuller
-- These are the records with depth= 1
-- depth = depth of fuller plus 1

-- STEP 3: Second level of recursion
-- this Joins First level of recursion with all Employees
-- So it joins Davolio thru Callahan with all employees who report to them
-- These are the records with depth = 2
-- depth = depth of Davolio thru Callahan plus 1

-- STEP 4: Third level of recursion
-- this Joins second level of recursion with all Employees
-- this Joins Suyama thru Dodsworth with employees who report to them
-- These are the records with depth = 3

-- etc.

-- in this case, we only have 4 levels
-- but this would continue if there were more levels
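
-- one caution for deeper hierarchies (not an issue with this small data set):
-- by default, SQL Server stops a recursive CTE after 100 levels of recursion and raises an error
-- the limit can be changed with the MAXRECURSION query hint on the statement that uses the CTE
-- (this just repeats the clean version from above, with the hint added)

WITH EmployeesAndTheirBosses
  (EmployeeID, LastName, ReportsTo, BossLastName, depth)
AS
(
SELECT  EmployeeID, LastName, ReportsTo, LastName, 0
FROM    Employees
WHERE   reportsTo IS NULL

UNION ALL

SELECT  Employees.EmployeeID,
        Employees.LastName,
        Employees.ReportsTo,
        EmployeesAndTheirBosses.LastName,
        EmployeesAndTheirBosses.depth + 1
FROM    Employees
  JOIN  EmployeesAndTheirBosses ON Employees.reportsTo = EmployeesAndTheirBosses.EmployeeID
)
SELECT   *
FROM     EmployeesAndTheirBosses
ORDER BY depth
OPTION   (MAXRECURSION 1000) -- 0 would mean no limit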

-- Database by Doug
-- Douglas Kline
-- 6/29/2016
-- Recursion

Thursday, May 26, 2016

Working with dates and times in SQL

Working with dates and times in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 5/25/2016
-- working with dates

USE Northwind

-- dates/times are very important to organizations
-- one of the fundamental things that a system stores
-- is when something occurred

-- dates/times are complicated in real life
-- different calendars, leap years, time zones, etc.

-- don't try to do your own date/time calculations
-- it's time consuming, and you're likely to get it wrong

-- advice: get to know how dates/times work 
--   on the platform you are using

-- this video will be using the datetime2 data type
-- however, the concepts apply to datetime and smalldatetime

-- something simple
SELECT GETDATE() AS [Current System Date & Time]

-- see that there are many pieces to a date/time value:
-- year, month, day of month
-- hour (24), minute, second, millisecond

-- to get any single piece, you use this function

SELECT DATEPART(YEAR,   GETDATE()) AS [Year],
       DATEPART(MONTH,  GETDATE()) AS [Month],
       DATEPART(DAY,    GETDATE()) AS [Day of Month],
       DATEPART(HOUR,   GETDATE()) AS [Hour],
       DATEPART(MINUTE, GETDATE()) AS [Minute],
       DATEPART(SECOND, GETDATE()) AS [Second],
       DATEPART(MS,     GETDATE()) AS [Millisecond]

-- and there are some additional convenience functions
SELECT YEAR(GETDATE())  AS [Year],
       MONTH(GETDATE()) AS [Month],
       DAY(GETDATE())   AS [Day of Month]
       -- not one for hour
       -- not one for minute
       -- not one for second
       -- not one for millisecond

-- note that these are all whole numbers,
-- i.e., exact, integer values
-- they represent a *part* of the date/time
-- they *don't* represent a duration or period of time
-- so you'll never see a "7.2 days" using DATEPART

-- so let's see this work
SELECT   orderdate
FROM     orders
WHERE    YEAR(orderdate)  = 1997
  AND    MONTH(orderdate) = 2
ORDER BY orderdate

-- note that there are performance implications
-- for the next video
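
-- for reference, an equivalent filter written as a date range
-- (this avoids applying functions to the column; why that matters is a topic for that next video)
-- the unseparated 'yyyymmdd' literal form is unambiguous regardless of language settings

SELECT   orderdate
FROM     orders
WHERE    orderdate >= '19970201'
  AND    orderdate <  '19970301'
ORDER BY orderdate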

-- so how to create a new datetime2?
SELECT DATETIME2FROMPARTS(2016, 2, 15,17,30,1,0,0)
         AS [Feb 15 2016 5:30PM and 1 second]

-- it is possible to *implicitly* make a datetime2
--  starting with a string representation

DECLARE @d  AS datetime2
SET @d = '2016-02-15 17:30:01' -- implicit conversion
SELECT @d AS [datetime2 implicitly converted from varchar]

-- sql server will implicitly convert from
-- a wide variety of formats

-- generally a dangerous thing to do
-- prone to error
-- be *explicit*
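
-- for example, being explicit about both the conversion and the format
-- (style 120 is the ODBC canonical 'yyyy-mm-dd hh:mi:ss' format;
--  the ISO 8601 form with the T separator is also unambiguous)

DECLARE @d2 AS datetime2
SET @d2 = CONVERT(datetime2, '2016-02-15 17:30:01', 120) -- explicit conversion, explicit style
SELECT @d2 AS [datetime2 explicitly converted from varchar]

SELECT CAST('2016-02-15T17:30:01' AS datetime2) AS [ISO 8601 literal]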

-------------------------------

-- what about moving forward or back in time
-- there is a function for that: DATEADD

SELECT   DATEADD(MONTH,          7, GETDATE()) AS [7 months from now],
         DATEADD(DAY,           77, GETDATE()) AS [77 days from now],
         DATEADD(YEAR,         -14, GETDATE()) AS [14 years ago],
         DATEADD(MILLISECOND, 3500, GETDATE()) AS [3.5 seconds from now],
         DATEADD(QUARTER,      -20, GETDATE()) AS [20 quarters ago],
         DATEADD(WEEKDAY,       37, GETDATE()) AS [37 weekdays from now]

-- the arguments are:
--    first: time unit to be used, by name
--    second: positive or negative integer
--             meaning number of time units added/subtracted
--    third: a datetime2 expression - the starting point in time

-- note the milliseconds example above
-- see the difference here
SELECT   GETDATE()                       AS [now],
         DATEADD(SECOND, 3.5, GETDATE()) AS [3 seconds from now]

-- the 3.5 is converted from a floating point to an integer
-- so the 0.5 is effectively ignored

-- how do we find the difference between two dates?
-- we use the datediff function

SELECT orderid,
       orderdate,
       shippedDate,
       DATEDIFF(day,  orderdate, shippedDate) AS [days until shipped],
       DATEDIFF(week, orderdate, shippedDate) AS [weeks until shipped]
FROM   orders

-- the arguments are:
-- first:  the unit of time measure, by name
-- second: the first date
-- third:  the second date

-- you might notice something odd in the weeks column
-- note orderid 10249 (second record)
-- days are 5, but it shows 1 week

SELECT orderid,
       orderdate,
       DATEPART(WEEKDAY,orderdate)            AS [weekday of order],
       shippedDate,
       DATEPART(WEEKDAY, shippeddate)         AS [weekday shipped],
       DATEDIFF(day,  orderdate, shippedDate) AS [days until shipped],
       DATEDIFF(week, orderdate, shippedDate) AS [weeks until shipped]
FROM   orders
WHERE  orderID = 10249

-- DATEDIFF gives a count of 
-- the number of time unit *boundaries* passed between the two dates
-- since the orderdate weekday is 6
-- and the shippedDate weekday is 4
-- a single week boundary was passed
-- i.e., 6, 7, *boundary*, 1, 2, 3, 4

--DATEDIFF returns a COUNT of boundaries passed
-- it does NOT measure time (or at least not accurately...)
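
-- one way to get closer to an actual elapsed measure (just an illustration):
-- count a smaller unit and divide, which can give fractional results like "5.2 days"

SELECT orderid,
       orderdate,
       shippedDate,
       DATEDIFF(hour, orderdate, shippedDate) / 24.0 AS [days until shipped, fractional]
FROM   orders
WHERE  orderID = 10249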

-- in summary:
--  datetime2, datetime, smalldatetime, etc. are complex data types
--  use the functions provided - don't try to roll-your-own calculations
--  GETDATE gives current system time
--  DATEADD moves forward or back from a given point in time
--  DATEDIFF gives the number of time unit boundaries passed between two datetimes

-- Database by Doug
-- Douglas Kline
-- 5/25/2016
-- working with dates

Sunday, May 22, 2016

Advanced use of the LIKE operator in SQL

Advanced use of the LIKE operator in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 5/19/2016
-- LIKE, advanced

use northwind

-- some advanced things to do with LIKE
-- if you are new to using LIKE, see my earlier video
-- previous video covers simple wildcard matching with % and _

-- we'll cover character sets
-- and specific use cases, i.e., how to do....

-- let's start with a simple example from the products table

SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
ORDER BY ProductName

-- notice that there are 77 products

-- now, let's limit that to all the products
-- whose names start with the letter A

SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE 'a%'
ORDER BY ProductName

-- what about begins with A or B or C
SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE 'a%'
  OR     ProductName LIKE 'b%'
  OR     ProductName LIKE 'c%'
ORDER BY ProductName

-- can also be phrased as

SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE '[abc]%'
ORDER BY ProductName

-- the square brackets mean:
-- "a single character from within this set"

SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE '[abc][lho]%'
ORDER BY ProductName

-- this one means
-- first letter is from this set: [abc]
-- second letter is from this set: [lho]
-- then zero or more characters after the first two letters

-- this can also be phrased as a "range", like this:

SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE '[a-c][lho]%'
ORDER BY ProductName

-- the hyphen means,
-- all characters alphabetically between the two letters
-- what about this?

-- products whose name does not start with
-- an a or b or c
SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    NOT   (ProductName LIKE 'a%'
            OR  ProductName LIKE 'b%'
            OR  ProductName LIKE 'c%')
ORDER BY ProductName

-- another option:

SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName NOT LIKE '[abc]%' -- using NOT
ORDER BY ProductName

-- can be rephrased like this:
SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE '[^abc]%'
ORDER BY ProductName

-- note the caret symbol ^ inside the brackets
-- this means
-- "any single character not in this list"

-- and this:
SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE '[^abc][^lho]%'
ORDER BY ProductName

-- this one means
-- first letter is NOT from this set: [abc]
-- second letter is NOT from this set: [lho]
-- then zero or more characters after the first two letters

-- another 
SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE '[a-ce-j]%' -- skips d
ORDER BY ProductName
-- means single letter in the set of letters
-- a thru c OR e thru j
-- then zero or more letters

-- so let's go through some common uses

-- number of characters
-- has at least 6 characters
SELECT   'match'   as [test]
WHERE    '1234567' LIKE '______%' -- 6 underscores

SELECT   'match'   as [test]
WHERE    '12345'   LIKE '______%' -- 6 underscores

-- begins with a letter
SELECT   'match'
WHERE    'product' LIKE '[a-z]%'

SELECT   'match'
WHERE    '3product' LIKE '[a-z]%'


-- begins with a letter
-- and is at least 6 characters
SELECT   'match'
WHERE    'a12345' LIKE '[a-z]_____%' -- 5 underscores

SELECT   'match'
WHERE    'a1234' LIKE '[a-z]_____%' -- 5 underscores

SELECT   'match'
WHERE    '012345' LIKE '[a-z]_____%' -- 5 underscores

-- begins with a number
SELECT   'match'
WHERE     '0abcdefgh' LIKE '[0-9]______%' -- 6 underscores


-- begins with a letter, contains a number
SELECT   'match'
WHERE    'abc5' LIKE '[a-z]%[0-9]%'

-- ending / trailing spaces

SELECT   'match'
WHERE    ' abc' LIKE ' %'

-- ending / trailing punctuation

SELECT   'match'
WHERE    '?abc345' LIKE '[!@#$%?,.]%'

-- whitespace char, i.e., tab

-- note that char(9) returns the ASCII tab character
-- also note that + is the concatenation operator
-- thus the format string is square brackets
-- with a space and a tab character
-- it checks for tabs or spaces

SELECT   'match'
WHERE    'db' + char(9) + 'Doug' LIKE '%[ ' + char(9) + ']%'

-- detect punctuation character
-- ASCII has 
SELECT   'match'
WHERE    'abc{abc' 
   LIKE  '%[' 
         + char(33) + '-' + char(47) -- ASCII '!' thru '/'
         + char(58) + '-' + char(64) -- ASCII ':' thru '@'
         + char(91) + '-' + char(96) -- ASCII '[' thru '`'
         + char(123)+ '-' + char(126)-- ASCII '{' thru '~'
         + ']%'

SELECT   'match'
WHERE    'abc]abc' 
   LIKE  '%[' 
         + char(33) + '-' + char(47) -- ASCII '!' thru '/'
         + char(58) + '-' + char(64) -- ASCII ':' thru '@'
         + char(91) + '-' + char(96) -- ASCII '[' thru '`'
         + char(123)+ '-' + char(126)-- ASCII '{' thru '~'
         + ']%'


-- contains unprintable characters
SELECT   'match'
WHERE    'abc'
         + char(7) -- bell character
         + 'abc' 
   LIKE  '%[^'  -- note the caret here
         + char(32) + '-' + char(126)
         + ']%'

-- same technique can be used for whitespace characters
-- same technique can be used for unicode character set
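
-- one more case the bracket technique handles:
-- matching a literal % or _ by putting it inside brackets
-- (an aside, using the same SELECT 'match' style as above)

SELECT   'match'
WHERE    '50% off' LIKE '%[%] off' -- the bracketed % is a literal percent sign

SELECT   'match'
WHERE    'last_name' LIKE '%[_]%' -- the bracketed _ is a literal underscore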

-- Database by Doug
-- Douglas Kline
-- 5/19/2016
-- LIKE, advanced

Saturday, May 21, 2016

The SQL LIKE Operator

A short video covering the LIKE operator in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 5/11/2016
-- LIKE

use northwind

-- LIKE is a logical operator that evaluates
-- whether a set of characters matches a pattern
-- if it matches, it returns true
-- otherwise it returns false

-- since LIKE returns a logical value,
-- it is common to use LIKE in the WHERE clause, 
-- which expects logical expressions

-- if you don't know the WHERE clause,
-- check out another video on my channel

-- let's start with a simple example from the products table

SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
ORDER BY ProductName

-- notice that there are 77 products

-- now, let's limit that to all the products
-- whose names start with the letter A

SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE 'a%'
ORDER BY ProductName

-- now all the products whose names end in A
SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE '%a'
ORDER BY ProductName

-- now all the products that have an A anywhere
SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName LIKE '%a%'
ORDER BY ProductName

-- and all the ones without an A anywhere
SELECT   ProductName,
         ProductID,
         Unitprice
FROM     Products
WHERE    ProductName NOT LIKE '%a%'
ORDER BY ProductName

-- as you can see, the % character is special
-- it's a "wildcard" character, i.e.,
-- it can "match" on characters

-- before we go on, the general format of the LIKE operator is

-- [characterExpression] LIKE [formatExpression]

-- the [characterExpression] is the part that will be tested
-- the [formatExpression] defines the "test", or the format
--     used to test the [characterExpression] 

-- in the example above:
 --  ProductName LIKE '%a%'
 -- ProductName is the [characterExpression]
 -- '%a%' is the [formatExpression]

 -- ProductName            LIKE   '%a%'
 -- [characterExpression]  LIKE   [formatExpression]
 -- 

 -- it's best to think of LIKE as a comparison operator, 
 -- similar to <, >, <=, >=, !=, etc.
 -- e.g., WHERE age > 35
-- the expressions on the left and right are compared,
--  based on a set of rules

-- there are two main special characters for LIKE

-- special character #1: % can match on zero or more characters

-- for demonstration purposes
-- a record with the word Match! is shown if the WHERE
-- clause evaluates to true
 
SELECT 'Match!'
WHERE  1=1

SELECT 'Match!'
WHERE  1=2

-- some examples...

SELECT 'Match!'
WHERE  'Fred' LIKE 'F%'

-- yes, Fred starts with F
-- the F matches on F, 
--  and the % matches on 'red'

SELECT 'Match!'
WHERE  'Fred' LIKE 'S%' 

-- no Fred doesn't start with S

 -- note that LIKE is case insensitive (with the default, case-insensitive collation)
SELECT 'Match!'
WHERE  'Fred' LIKE 'f%'

SELECT 'Match!'
WHERE  'fred' LIKE 'F%'

-- what about this
SELECT 'Match!'
WHERE  'F' LIKE 'F%'

-- yes, it matches
-- in this case, the F matches on 'F'
-- and the '%' matches on '', i.e., nothing

-- again, % matches on *zero* or more characters

-- how about this one

SELECT 'Match!'
WHERE  'Fred' LIKE 'F%d'

-- must start with F and end with d, any number of characters

-- or this

SELECT 'Match!'
WHERE  'F' LIKE 'F%'

-- or this

SELECT 'Match!'
WHERE  'Fred' LIKE '%r%'

-- must contain an r, any number of characters

-- or this

SELECT 'Match!'
WHERE  'Fred' LIKE '%f%'

-- must contain an f, any number of characters

-- or this

SELECT 'Match!'
WHERE  'F' LIKE '%ed'

-- must end in ed, any number of characters

-- and of course, most characters are NOT special
-- and must match exactly

SELECT 'Match!'
WHERE  'Fred' LIKE 'Fred'

-- but with no special characters 
-- it may make more sense to do this:

SELECT 'Match!'
WHERE  'Fred' = 'Fred'

-- special character #2: _ (underscore) matches exactly one character

SELECT 'Match!'
WHERE  'Fred' LIKE '____' -- 4 underscores

-- must have exactly 4 characters

SELECT 'Match!'
WHERE  '1F7.' LIKE '____' -- 4 underscores

-- must have exactly 4 characters

SELECT 'Match!'
WHERE  'Fred' LIKE '___' -- 3 underscores

-- must have exactly 3 characters

SELECT 'Match!'
WHERE  'Fred' LIKE 'F___' -- 3 underscores

-- must start with F, and have exactly 4 characters

SELECT 'Match!'
WHERE  'Fred' LIKE 'F__d' -- 2 underscores

-- must start with F, end in d, and have exactly 4 characters

-- combining the special characters

SELECT 'Match!'
WHERE  'Frederick' LIKE 'F__%' -- 2 underscores

-- must start with F and have at least 3 characters

SELECT 'Match!'
WHERE  'Frederick' LIKE '%_d_%'

-- must contain a d
-- d must not be the first or last letter
-- implies at least 3 characters

-- in summary
-- LIKE is a comparison operator
-- it compares a character string to a format string
-- % is a special character that matches on zero or more characters
-- _ is a special character that matches on exactly one character
-- other characters must match exactly
-- remember, digits and most punctuation are characters

-- thanks for watching

-- Database by Doug
-- Douglas Kline
-- 5/11/2016
-- LIKE

Friday, April 8, 2016

The DELETE statement in SQL

A short video covering the DELETE statement in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 4/8/2016
-- DELETE - how to remove records from a table

USE Northwind

-- might want to review the SELECT statement
-- before using DELETE
-- in particular, review the WHERE clause,
-- because DELETE uses the WHERE clause

-- public service announcement :-)
/* ***WARNING!*** *****************

*INSERT, UPDATE and DELETE are *data modifying* commands

*INSERT, UPDATE and DELETE can do set-at-a-time operations

**this means that a seemingly simple statement
** can potentially change millions of records

* the database engine will not ask "Are you sure?"
* there is no simple "Undo"

* professionals normally don't, at the command line,
* execute INSERT, UPDATE, and DELETE commands
* on a production database

* treat these statements with respect!

*************************************/
-- the DELETE statement removes *entire* existing rows
-- if you want to remove *values* in a row, use UPDATE
-- (that is considered a change to a row)
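
-- for example, removing a single *value* (not the row) looks like this
-- (just an illustration; ShipRegion is a nullable column in Orders)

UPDATE Orders
SET    ShipRegion = NULL
WHERE  OrderID = 10248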

-- let's look at the Orders table

SELECT *
FROM   Orders
ORDER BY orderID desc  -- note the descending order

-- now insert a bogus row to play with:
INSERT INTO Orders   (ShipCity, CustomerID)
VALUES               ('Akron', 'VINET')

-- see what happened, and note the OrderID
SELECT *
FROM   Orders
ORDER BY orderID desc

-- OrderID=11109 for the new record

-- now that we have a row, here's a simple DELETE statement

DELETE
FROM    ORDERS
WHERE   orderID = 11109

-- NOTE: there is no "are you sure"
-- or "you are about to delete a million records, okay"
-- and SSMS doesn't have an "undo delete" button

-- now check the record
SELECT *
FROM   Orders
ORDER BY orderID desc

-- note that one record was affected
-- because the WHERE clause matched a single primary key value

-- the most possible damage is a single record

-- let's put that record back in...
INSERT INTO Orders   (ShipCity, CustomerID)
VALUES               ('Akron', 'VINET')

-- see what happened, and note the OrderID
SELECT *
FROM   Orders
ORDER BY orderID desc

-- OrderID=11110 for the new record

-- I like to do this before I do a DELETE:
SELECT *
FROM   Orders
WHERE  Shipcity='Akron'
 AND   CustomerID = 'VINET'

-- OK, that's the one (or more) records I want to get rid of
-- copy the statement down, and change to delete
DELETE
FROM   Orders
WHERE  Shipcity='Akron'
 AND   CustomerID = 'VINET'

-- and check the results
SELECT *
FROM   Orders
ORDER BY orderID desc

-- the general format for the DELETE statement is:
/*
DELETE
FROM    [table-name]
WHERE   [conditions identifying the records to delete]
*/

-- notice that the DELETE statement is very similar to SELECT
-- except you don't specify any columns...
-- because DELETE operates on *records*, not on columns

-- but what if you want to specify records to be deleted,
-- based on relationships with other tables,
-- in other words, a join

-- put the record back in:
INSERT INTO Orders   (ShipCity, CustomerID)
VALUES               ('Akron', 'VINET')

-- and another
INSERT INTO Orders   (ShipCity, CustomerID)
VALUES               ('Cleveland', 'RATTC')

-- and another
INSERT INTO Orders   (ShipCity, CustomerID)
VALUES               ('San Diego', 'BONAP')
-- consider this:

SELECT         *
FROM           Orders
   LEFT JOIN   [Order Details] ON Orders.OrderID = [Order Details].orderID
WHERE          [order details].orderID IS NULL

-- this is a classic "find unmatched"
-- in other words, this shows the orders 
-- without any lines on them - no related [order details] records
-- "widow" records because there are no "related" records,
--   and there *should* be, orders don't make sense without lines

-- btw,
-- you might want to check out my 
-- video on outer joins if you don't know this

-- so how do I delete widow records?
-- copy down, and change to DELETE...
DELETE
FROM           Orders
   LEFT JOIN   [Order Details] ON Orders.OrderID = [Order Details].orderID
WHERE          [order details].orderID IS NULL

-- darn, there's an error

-- this is because I can only DELETE from one table at a time
-- and I've got two tables specified
-- which table's records are to be deleted?

-- so here's the syntax:
DELETE         Orders  -- this is the changed part
FROM           Orders
   LEFT JOIN   [Order Details] ON Orders.OrderID = [Order Details].orderID
WHERE          [order details].orderID IS NULL

-- and check the results
SELECT *
FROM   Orders
ORDER BY orderID desc

-- so the more general syntax is
/*
DELETE      [table-name]
FROM        [table-expression]
WHERE       [conditions identifying the records to delete]
*/

-- where the table-expression normally includes
-- the table-name, with joins

-- so, do professionals normally *DELETE* data?
---
--
--
--
--
--
-- NO.
-- they normally *move* it somewhere,
-- after they make sure it made it to the new location safely
-- they remove it from where it was
-- and of course, they do a backup first
-- 
--
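
-- a minimal sketch of that "move it, verify it, then remove it" idea
-- (OrdersArchive is a hypothetical table with the same columns as Orders and no IDENTITY;
--  this also assumes any related [Order Details] rows have already been handled)

BEGIN TRANSACTION

INSERT INTO OrdersArchive
SELECT      *
FROM        Orders
WHERE       OrderDate < '19970101'

-- verify the copy before removing anything, e.g., compare row counts

DELETE
FROM        Orders
WHERE       OrderDate < '19970101'

COMMIT TRANSACTION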
-- Database by Doug                                         
-- Douglas Kline
-- 4/8/2016
-- DELETE - how to remove records from a table
 

Wednesday, March 30, 2016

Why Data Analytics is Different, Part 1: Introduction

The buzzword lately is analytics.

It’s said many different ways, by many different people.
  • Data Analytics
  • Data Science
  • Business Intelligence
Sometimes you see big thrown in there too, i.e., big data analytics.

To many people, this is just a new way to refer to some old, broad techniques such as:

  • Data Analysis
  • Statistics
  • Science
In different disciplines, you find terms like this:
  • Management Science (business)
  • Operations Research (business)
  • Artificial Intelligence (computer science)
  • Machine Learning (computer science)
  • Technical Analysis (finance)
  • Healthcare Analytics (healthcare)
  • Web Analytics (marketing)
  • Statistics (mathematics)
  • Econometrics (economics)

So I’ve been thinking lately, how is data analytics different from all these past things? Is it a superset of all these things? Is it a mash-up of these things? Is it different to different people?

I have a specific perspective on this, so my thinking may not fit for everyone. I’m an Information Systems professor. My undergraduate degree was in math and computer science, I have an MBA, and my doctorate is in business, with a major in Operations Research and a minor in Information Systems. My corporate experience is in software development. Now I’m a professor in a business school, but I continue to consult on database performance and storage.

I have what I’d call traditional training in statistics. I’ve taken the math courses behind statistical models, calculated eigenvalues, and derived maximum likelihood estimators. Then I took the management science and operations research courses in optimization, and applied techniques to things like quality control, queuing, facility location, scheduling, etc.

For several years, I’ve been thinking that analytics is just a re-branding of all this stuff I already know, dreamed up by marketing geniuses at a company like SAS or EMC.

I couldn’t find anything that really disputed that hypothesis. But I had this sneaking suspicion that there was something else there.

I think differently now. I’ve dug into it. I went to a course taught by a professional data scientist, consulting regularly for global corporations. I saw how he was applying it. I saw how he approached things. I worked with the modern software tools: R, SQL Server, SQL Server Analysis Services, Azure Machine Learning. I came back and applied the tools in my own academic and consulting work.

Here’s my conclusion: modern analytics is different. And for good reason. Traditional methods were right for their era. Analytics is right for this era.

However, analytics is not different in trivial, obvious, or superficial ways.

Here’s what’s different:
Cost – data and computation are cheaper now
Timing – we want continuous answers
Context – we want analytics integrated into systems
Paradigm – we want to explore data rather than confirm hypotheses

All of these differences change the approach, the methodology, and the practices. They change which techniques are relevant, taking widely applicable, tried-and-true, bread-and-butter techniques and making them arcane techniques for special cases.

I think the hardest part for traditionalists approaching analytics is understanding why some of the practices that were drilled into us are no longer appropriate. Mathematically, everything I learned is still true and correct, just much less relevant.

For example, hypothesis testing, estimation, confidence intervals, etc. These topics were foundational material in my training, and heavily used techniques in my toolbox. But the main motivation behind their development and use was lack of data. We were “controlling” error with these methods, and making sure we weren’t generalizing too much from small samples. We had rules of thumb like "at least 30 data points for ...".

"Not enough data" is typically not a problem today. In recent years, I’ve found my work involves so much data that p-values are incredibly small by traditional standards. I use techniques to reduce the amount of data. I'm not worried about getting at least 30 points, I'm worried about getting the data file from here to there because it's so big. In other words, I have so much data, that I don’t worry about “controlling error” anymore, so hypothesis testing is not as relevant.

The next few articles will explore these differences. I hope that this will help older, more traditionally trained analysts understand modern analytics better. I also hope that this will help younger, more modern analysts understand the traditional methods better.

Next articles:



Why Data Analytics is Different, Part 4: Context

forthcoming...

Why Data Analytics is Different, Part 3: Timing

Data analytics is different than traditional data analysis because the pace is faster.

Traditional data analysis was episodic in nature. We performed a big study by gathering data, analyzing the data, writing it up, mailing copies to an editor, who mailed it to reviewers, etc. Hopefully it found its way to print after several years. Because this type of project was so big, we often would say “thank goodness it’s done.”

This expresses well the nature of old-school data analysis. It was a one-time effort. You do it, then it’s done. You don’t continue the project because you aren’t gathering data anymore. At the point that the paper was published and available to others, the data that the research was founded on was generally two or more years old. So, in 1995, when I cited a 1988 paper, I was citing ten-year-old research. And I was comfortable citing ten-year-old research.

Because of this single-time-extended-project nature, the pace of analysis was, by today’s standards, leisurely. A PhD candidate might spend years on their dissertation.

Of course, the analysis tools were unwieldy and complex. You might have to learn the FORTRAN-like SAS language, and submit your analysis for processing on the mainframe. And because of the cost (see previous article), you had limited data, and needed to be very careful that you didn’t violate the statistical assumptions necessary for small data sets. Does the data fit a normal distribution? Is there multi-collinearity? Is there a pattern in the residuals? Is that an outlier?

Today, data is gathered on an ongoing basis. We record/track/store everything that happens, as it happens, because we can. We might need that data.

Because data is gathered on an ongoing basis, we can do analytics on an ongoing basis. And we are expected to. With today’s tools and technologies, it’s hard to justify not doing real-time analytics.
For traditionalists, this can be difficult. They used to work for extended periods to get “the definitive answer”. It was an achievement. There was great skill and knowledge in the methodologies necessary for small amounts of data. Traditionalists were used to reading the 1989 work of Schlepsky & Jones, based on a survey of 38 people, which stated that consumers were 6% more likely to whatever … That 6% number based on a linear regression would get cited for years in the literature, because it was the best answer available. And all those citations justified the two-year research project of Schlepsky & Jones.

Today, there is no definitive, one-time answer. There’s the answer at a moment in time. In a few seconds, there will be another moment in time with another answer. For traditionalists, think about it this way: each answer for a moment in time has its own parameter, i.e., the 6%. But it also may have its own model, i.e., the linear regression.

In this new world, where data is cheap and every moment in time has its own model, we have to change our methods. Or rather, we get to change our methods. (I never really liked pouring tons of work into creating a wall of statistical methodology in defense against reviewers trying to poke holes in the wall.)

So what kinds of things do we get to do in this new world with tons of data? We get to stop worrying about multi-collinearity, normality, residual patterns, etc. We get to explore large amounts of data for patterns that “hit you over the head” when you see them. If there isn’t a clear, obvious pattern, we get to stop looking at it and look at something else. We don’t have to spend time trying to make the data support or not support our pre-defined hypothesis.

Personally, I like this new world better. I get to discover new knowledge at a fast pace, rather than confirm theories at a slow pace.


Why Data Analytics is Different, Part 5: Paradigm

forthcoming...

Why Data Analytics is Different, Part 2: Cost

Data used to be expensive.

People like to say today that “space is cheap”, meaning that the cost per byte has dropped dramatically over time: $10,000/1GB in 1990 to about $0.10/1GB in 2010. See a nice article with a graph here.

But that’s only a small part of the cost.

In traditional data analysis, we didn’t really worry about storage cost. We worried about how much it cost to acquire the data.

Only recently have we gained the ability to gather data on an ongoing basis. (There are a number of drivers for this: bandwidth, computing power, personal devices, storage cost.)

In years past, getting data was a project. It could take weeks, months, or years to acquire data for analysis. You talked about the cost per observation, and knew that more data meant more cost. Because the cost per observation was such a concern, you knew that you needed to plan a study/survey/experiment. You might hire an expert in survey or experimental design. In fact, you probably knew what statistical test you were going to perform, and had calculated exactly the number of observations you needed to answer your question with your required confidence level.

Because of the high cost of data acquisition, a data analyst/researcher/scientist would spend much more of their time on planning. In general, the process was like this:
  1. Develop a research question, aka, hypothesis 
  2. Choose the statistical method / test that would answer the question 
  3. Write a proposal to fund data acquisition and analysis, with these items: 
    • Experimental design showing number of observations needed to answer the question, with statistical method specified in step 2 
    • Cost estimate of data acquisition 
    • Justification of the importance of the research question, i.e., is the answer worth the cost of the data acquisition? 
  4. After funding is secured, execute the data acquisition plan 
  5. Perform the statistical analysis specified in Step 2 
  6. Stop, do not perform additional analysis 
Even without external funding, the mental process was much the same. The “cost estimate of data acquisition” was always there, even if it was a sunk cost, e.g., your own man-hours. Many hypotheses never got studied, because data acquisition was too costly.

Step 6 above deserves some discussion. You normally did not re-work data that you had acquired. With limited data, performing multiple statistical tests requires caution. The mathematical reasoning goes something like this: If you are 95% confident of your conclusions for Hypotheses 1, 2, and 3, then you are really only 85.7% (= 0.95*0.95*0.95) confident of all three, with the same data. It's like rolling a six-sided die: the probability of rolling a 1 through 5 is 5/6 = 83.3%, but the probability of doing that three times in a row is (5/6)*(5/6)*(5/6) = 57.8%.

And remember, you have just enough data from your data acquisition to get your 95% confidence on a single hypothesis for your grant. If you want to test 3 hypotheses and be 95% confident, write that in your grant, because you will need more data. There are special methods, like the Bonferroni Method or Scheffé's Contrast of Means, that deal appropriately with multiple comparisons by adjusting confidence.
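
Roughly, the Bonferroni Method splits the allowable error evenly across the tests:

  error per test = 0.05 / 3 ≈ 0.0167, i.e., about 98.3% confidence per test

Run each of the three tests at about 98.3% confidence, and the family of three stays at roughly 95% overall (3 * 0.0167 ≈ 0.05).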

If you just went ahead and ran your three hypotheses without adjusting for the multiple tests, you were called a “Data Miner”, and you were told “you are just mining the data.” And that was a bad thing. Today, we have classes in Data Mining – it has become a good thing. 

Why is data mining okay today? Why don’t we worry anymore about multiple comparisons? Because data is cheap. We have tons of it. We have 99.999999999% confidence, because we have so much data. When data was expensive and limited, you had to worry about these things. Today, if you do three hypothesis tests, or thirty, your confidence is still very high. To the point where you don’t have to explicitly worry about it. 

If you are in the machine learning world, the issue of not having enough data is related to, and sometimes exactly the same as, over-fitting. We say that an over-fitted model has trouble generalizing to observations it hasn't seen before. Over-fitting is a symptom of not having enough data to support accurate estimation of the parameters in your model. Techniques such as jittering, noise injection, and re-sampling were created to address over-fitting, and measures such as the various information criteria (AIC, BIC, etc.) and the Vapnik-Chervonenkis (VC) dimension were developed to compare models for their likelihood of over-fitting.

I'm not trying to bore you with the technical details; I'm making a point. Tons of research and effort went into dealing with the cost of data. Cost of data was a central theme impacting our entire approach to data analysis.

Today, we have a whole set of new techniques because of too much data. For data with too many dimensions, we have developed data reduction techniques such as principal component analysis (PCA), feature elimination, and Bayesian data reduction, to make the data manageable. We have developed visualization techniques like R’s faceting to show many plots across dimensions, letting our eyes spot patterns, eliminate variables, or focus on the relevant ones.

When there are just too many data points, we sample from the data, and do analysis on only a subset. We regularly take the second-by-second stock price data and aggregate it into daily prices, e.g., minimum, maximum, open, close (candlestick). Our practices today embrace reducing the amount of data for analysis.
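
To make that stock-price example concrete, here's a rough T-SQL sketch of this kind of reduction, assuming a hypothetical StockTicks table with one row per trade (symbol, tradeTime, price) and using SQL Server's window functions:

-- collapse tick-level rows into one candlestick row per symbol per day
SELECT DISTINCT
       symbol,
       CAST(tradeTime AS date)  AS tradeDay,
       MIN(price) OVER (PARTITION BY symbol, CAST(tradeTime AS date))  AS lowPrice,
       MAX(price) OVER (PARTITION BY symbol, CAST(tradeTime AS date))  AS highPrice,
       FIRST_VALUE(price) OVER (PARTITION BY symbol, CAST(tradeTime AS date)
                                ORDER BY tradeTime)                    AS openPrice,
       LAST_VALUE(price)  OVER (PARTITION BY symbol, CAST(tradeTime AS date)
                                ORDER BY tradeTime
                                ROWS BETWEEN UNBOUNDED PRECEDING
                                         AND UNBOUNDED FOLLOWING)      AS closePrice
FROM   StockTicks

Millions of second-by-second rows come back as a handful of daily rows, which is exactly the kind of reduction I'm describing.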

Traditional data analysts/scientists/statisticians/researchers need to realize that the drastic reduction in the cost of data calls for changes in methods. All the traditional knowledge about statistics and experimental methods is still relevant and useful, but it is likely to be applied differently.

Modern data scientists will still find themselves in the occasional situation of not having sufficient data. They should educate themselves on traditional methods so they can recognize these situations and deal with them appropriately.

I started with cost of data as the first difference in this series of Why Data Analytics is Different, because it highly influences the other differences. Cost of data influenced traditional analysis and continues to influence modern approaches to data analysis. It impacts the other differences: timing, context, and paradigm.

Monday, March 21, 2016

The UPDATE statement in SQL

A short video covering the UPDATE statement in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 3/21/2016
-- UPDATE - how to change data

USE Northwind

-- might want to review the SELECT statement
-- before using UPDATE
-- in particular, review the WHERE clause,
-- because UPDATE uses the WHERE clause

-- public service announcement
/* ***WARNING!*** *****************

*INSERT, UPDATE and DELETE are *data modifying* commands

*INSERT, UPDATE and DELETE can do set-at-a-time operations

**this means that a seemingly simple statement
** can change millions of records

* the database engine will not ask "Are you sure?"
* there is no simple "Undo"

* professionals normally don't, at the command line,
* execute INSERT, UPDATE, and DELETE commands
* on a production database

* treat these statements with respect!

*************************************/
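
-- one partial safety net: wrap a risky statement in an explicit transaction,
-- check the result, and roll it back if it isn't what you expected
-- a sketch:
/*
BEGIN TRANSACTION

UPDATE Orders
SET    Freight = 0
WHERE  ShipCity = 'Akron'

-- look at the rows affected, re-query the table, etc.

ROLLBACK TRANSACTION   -- undo it; use COMMIT TRANSACTION to keep it
*/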

-- the UPDATE statement changes existing rows
-- UPDATE changes records, not tables
-- again, the row must already exist

-- let's look at the Orders table

SELECT *
FROM   Orders
ORDER BY orderID desc

-- now insert a bogus row to play with:
INSERT INTO Orders   (ShipCity, CustomerID)
VALUES               ('Akron', 'VINET')

-- see what happened, and note the OrderID
SELECT *
FROM   Orders
ORDER BY orderID desc

-- OrderID=11099 for the new record

-- now that we have a row to modify, here's a simple update
-- change the orderdate to right now

UPDATE Orders
SET    OrderDate = GETDATE()
WHERE  orderID = 11099

-- and check the record
SELECT *
FROM   Orders
ORDER BY orderID desc

-- note that one record was affected
-- because the WHERE clause matches the primary key to a single value,
-- at most one record can be affected

-- let's change two fields
UPDATE Orders
SET    RequiredDate = '20160401 17:00:00.00',
       ShipName     = 'Douglas Kline'
WHERE  orderID = 11099

-- and check the record
SELECT *
FROM   Orders
ORDER BY orderID desc

-- so the general form of the UPDATE statement is
/*
UPDATE   <table name>
SET      <assignment statement list>
WHERE    <logical expression>

or

UPDATE   <table name>
SET      <field name 1> = <expression for new value>,
         <field name 2> = <expression for new value>,
         <field name 3> = <expression for new value>,
         ...
WHERE    <logical expression>
*/

-- another example
UPDATE   Orders
SET      Freight        = Freight + 25.0,
         RequiredDate   = DateAdd(week, 1, RequiredDate),
         ShipCountry    = 'U' + 'S' + 'A'
WHERE    CustomerID     = 'VINET'
  AND    EmployeeID     IS NULL

-- and check the results
SELECT *
FROM   Orders
ORDER BY orderID desc

-- notice a few things
-- several fields are set in the same statement
-- the new values are expressions, rather than literal values
-- the new value for the requiredDate field is a function of the existing value
-- the WHERE clause is a logical expression
-- potentially, many records could be updated with this single statement
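
-- a good habit before running a big UPDATE: run a SELECT with the *same*
-- WHERE clause first, to see exactly which rows (and how many) would change

SELECT   *
FROM     Orders
WHERE    CustomerID     = 'VINET'
  AND    EmployeeID     IS NULL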


-- what happens if the UPDATE would violate the table constraints
-- here's the <simplified> definition of the Orders table

/*
CREATE TABLE Orders
 OrderID        int IDENTITY(1,1) NOT NULL,               <-----******
 CustomerID     nchar(5)          NULL,
 EmployeeID     int               NULL,
 OrderDate      datetime          NULL,
 RequiredDate   datetime          NULL,
 ShippedDate    datetime          NULL,
 ShipVia        int               NULL,
 Freight        money             NULL CONSTRAINT [DF_Orders_Freight]  DEFAULT ((0)),      <-----*****
 ShipName       nvarchar(40)      NULL,
 ShipAddress    nvarchar(60)      NULL,
 ShipCity       nvarchar(15)      NULL,
 ShipRegion     nvarchar(15)      NULL,
 ShipPostalCode nvarchar(10)      NULL,
 ShipCountry    nvarchar(15)      NULL,
 CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED 
(
 OrderID ASC
)
*/

UPDATE   Orders
SET      OrderID = 500000
WHERE    OrderID = 11099

-- Cannot update identity column 'OrderID'.
-- IDENTITY columns are not update-able

UPDATE   Orders
SET      CustomerID = 'Barney'
WHERE    OrderID = 11099

-- String or binary data would be truncated.
-- CustomerID is nchar(5), and 'Barney' is 6 characters

UPDATE   Orders
SET      CustomerID = 'Barne'
WHERE    OrderID = 11099

-- The UPDATE statement conflicted with the FOREIGN KEY constraint "FK_Orders_Customers". 
-- The conflict occurred in database "Northwind", table "dbo.Customers", column 'CustomerID'.
-- that would violate the referential integrity constraint
-- 'Barne' doesn't exist in the Customers table

UPDATE   Orders
SET      CustomerID = 'VINET'
WHERE    OrderID =  50000

-- no such record, 0 rows affected
-- no error - you need to check rows affected if it's important
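
-- in T-SQL, @@ROWCOUNT returns the number of rows affected by the
-- previous statement, so you can check it right after the UPDATE:

SELECT @@ROWCOUNT AS rowsAffected   -- 0 here, since OrderID 50000 doesn't exist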

-- in summary,
-- UPDATE modifies fields in existing rows
-- specify values according to their data type - you have to know the data types
-- various things could "block" an UPDATE
-- these "blocks" are your friends! they keep your data clean!
-- sometimes zero rows are updated - that's okay
-- sometimes millions of rows are updated!
-- treat INSERT, UPDATE, and DELETE with respect!


-- Database by Doug
-- Douglas Kline
-- 3/21/2016
-- UPDATE - how to change data

/*
DELETE FROM  Orders
WHERE  OrderID > 11077


*/

Friday, March 18, 2016

The SQL INSERT Statement

A short video covering the INSERT statement in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 3/15/2016
-- INSERT - how to create new data

USE Northwind

-- might want to review the SELECT statement
-- before using INSERT

-- public service announcement
/* ***WARNING!*** *****************

*INSERT is a *data modifying* command

*INSERT can do set-at-a-time operations

**this means that a seemingly simple INSERT statement
** can change millions of records

* the database engine will not ask "Are you sure?"
* there is no simple "Undo"

* professionals normally don't, at the command line,
* execute INSERT, UPDATE, and DELETE commands
* on a production database

* treat these statements with respect!

*************************************/

-- the INSERT statement creates entire rows
-- in a table that already exists

-- the form of a table must be maintained

-- in other words, a table is always a rectangle

-- every row has the same number of columns
-- every column has the same number of rows

-- you can't INSERT half a row
-- you can't INSERT a new value in 
--  a row that already exists (you'd use UPDATE)

-- let's look at the Orders table

SELECT *
FROM   Orders
ORDER BY OrderID desc

-- Here is a basic statement to create 
-- a new record in the Orders table

INSERT INTO Orders   (ShipCity, CustomerID)
VALUES               ('Akron', 'VINET')

-- note that only two fields are specified
-- and they're not listed in table-definition order
--    (CustomerID comes before ShipCity in the table definition)


-- OK, let's see if it worked
SELECT   *
FROM     Orders
ORDER BY OrderID desc

-- notice an entire row was created
-- even though we supplied only two field values

-- most of the unspecified fields got NULL values

-- however, OrderID and Freight got non-NULL values
-- to see why, let's look at the table definition

/*
CREATE TABLE Orders
 OrderID        int IDENTITY(1,1) NOT NULL,               <-----******
 CustomerID     nchar(5)          NULL,
 EmployeeID     int               NULL,
 OrderDate      datetime          NULL,
 RequiredDate   datetime          NULL,
 ShippedDate    datetime          NULL,
 ShipVia        int               NULL,
 Freight        money             NULL CONSTRAINT [DF_Orders_Freight]  DEFAULT ((0)),      <-----*****
 ShipName       nvarchar(40)      NULL,
 ShipAddress    nvarchar(60)      NULL,
 ShipCity       nvarchar(15)      NULL,
 ShipRegion     nvarchar(15)      NULL,
 ShipPostalCode nvarchar(10)      NULL,
 ShipCountry    nvarchar(15)      NULL,
 CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED 
(
 OrderID ASC
)
*/

-- the two key pieces above are:
-- the IDENTITY(1,1) for OrderID and the DEFAULT((0)) for Freight

-- IDENTITY(1,1) for OrderID means that the database engine
--    will provide a unique value for this field
--    beginning with 1, and incrementing by 1 for each new record
--    we *cannot* provide a value for this field
--    it is completely controlled by the database engine
--
--    this is commonly used for primary key fields

-- the DEFAULT((0)) for Freight means that if no value
--    is provided for Freight, the database engine will set the 
--    value to zero
--    

-- so back to the INSERT statement

-- the general form of the INSERT statement is:
/*
INSERT INTO <table name> (<column list>)
VALUES                   (<value list>)
*/

-- more specifically:
/*
INSERT INTO <table name> (col1,       col2,       col3,      ...)
VALUES                   (col1-value, col2-value, col3-value, ...)
*/

-- note that col1, col2, etc, can be specified in any order
-- they don't have to be consistent with the table definition

-- so here's an insert into the Products table

INSERT INTO Products 
           (ProductName, 
            SupplierID, 
            CategoryID, 
            QuantityPerUnit, 
            Unitprice, 
            UnitsInStock,
            UnitsOnOrder,
            ReorderLevel,
            Discontinued)
VALUES     ('Bowzer Biscuits', 
            2,
            3,
            NULL,
            12.0,
            10,
            5,
            8,
            0)

-- I've formatted this differently, for readability

-- let's see if it worked
SELECT   *
FROM     Products
ORDER BY ProductID desc

-- notice that the productID got a value, even though I didn't provide it

-- also notice that the bit field discontinued was specified as 0

-- and let's go back to orders, and specify some dates
INSERT INTO Orders
           (customerID,
            EmployeeID,
            OrderDate,
            RequiredDate,
            ShippedDate,
            Freight)
VALUES     ('VINET',
            3,
            GETDATE(),
            '20160501 17:00:00.000', -- 5:00pm May 1 2016
            NULL,
            NULL)


SELECT      *
FROM        Orders
ORDER BY    OrderID desc

-- notice that character-based data is delimited by single quotes
--    that I can use GETDATE() to provide the current server datetime value for the orderdate
--    and that dates are specified as delimited characters, and converted to datetime
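
-- if you want that conversion to be explicit (and unambiguous),
-- you can CAST or CONVERT the string yourself:

SELECT CAST('20160501 17:00:00.000' AS datetime)      AS castedDate,
       CONVERT(datetime, '2016-05-01T17:00:00', 126)  AS convertedDate   -- style 126 = ISO 8601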

-- so what happens if I provide an invalid value?

INSERT INTO Orders
            (orderID)
VALUES      (1)
-- Cannot insert explicit value for identity column in table 'Orders' 
--    when IDENTITY_INSERT is set to OFF.
-- IDENTITY fields can only be set by the database engine
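
-- the narrow exception is the IDENTITY_INSERT setting that the error message
-- mentions - it's intended for special cases like migrating data - a sketch:
/*
SET IDENTITY_INSERT Orders ON

INSERT INTO Orders   (OrderID, CustomerID)
VALUES               (50000, 'VINET')

SET IDENTITY_INSERT Orders OFF
*/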

INSERT INTO Orders
            (customerID)
VALUES      ('Barney')
-- String or binary data would be truncated.
-- CustomerID is a nchar(5), and 'Barney' has 6 letters

INSERT INTO Orders
            (customerID)
VALUES      ('Barne')

--The INSERT statement conflicted with the FOREIGN KEY constraint "FK_Orders_Customers". 
-- The conflict occurred in database "Northwind", table "dbo.Customers", column 'CustomerID'.
-- 'Barne' is not a valid customerID - it doesn't exist in the Customers table
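
-- one more note: the warning above said INSERT can do set-at-a-time operations
-- that's the INSERT...SELECT form, where the new rows come from a query
-- instead of a VALUES list - a sketch (don't run this casually):
/*
INSERT INTO Orders   (CustomerID, ShipCity)
SELECT                CustomerID, City
FROM                  Customers
WHERE                 Country = 'USA'
*/
-- that single statement would create one new order per US customer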


-- in summary,
-- INSERT creates entire rows
-- specify values according to their data type - you have to know the data types
-- various things could "block" an insert
-- these "blocks" are your friends! they keep your data clean!
-- treat INSERT, UPDATE, and DELETE with respect!


-- Database by Doug
-- Douglas Kline
-- 3/15/2016
-- INSERT - how to create new data

Wednesday, February 17, 2016

Fixing Amazon Music Player blocked by Group Policy

A few months ago, a new Active Directory Group Policy was rolled out to my network in response to some nasty viruses going around.

And my Amazon Music Player stopped working, with a message:

C:\Users\...\AppData\Local\Amazon Music\Amazon Music.exe
This program is blocked by group policy. For more information, contact your system administrator.

Error message for Amazon Music blocked by Group Policy


I'm not sure of the technical details, but the Group Policy effectively blocked any .exe files from executing in the \users\AppData\local folder. And Amazon Music Player installs, by default, into this folder.

I did some Google searches, and found some complaints by system administrators. I also found some information on how to more selectively block executables, rather than blocking the entire folder. I also submitted a ticket to my help desk to see if they could unblock this particular executable, since I *must* work to music. But they declined, citing security concerns. They've got more important things to worry about.

So months went by, while I made do with Amazon's web-based player. However, I had also been unable to upload my large music library into Amazon's Music Library. The only way to do that is through the Amazon Music Player, which was now blocked by the group policy.

And since I recently gave in to a subscription for the Amazon Music Library, I needed to get this working.

So here's what I did. I copied the entire folder containing the executable to another location. This is the folder I copied:
 C:\Users\...\AppData\Local\Amazon Music\
I copied it to a spare drive I had, so it ended up here:
H:\Amazon Music\
Then I double-clicked on the executable here:
H:\Amazon Music\Amazon Music.exe
It worked fine. So I put a shortcut to this copy of the executable on my desktop, and that's what I'm using.

A couple of caveats.


  • Copying just the executable did not work. 
    • Apparently there are some dll files whose relative paths must be maintained.
  • The cached files remain in the original location in appdata\local
    • I noticed that there are cache folders (that grow quite large) contained in the \Amazon Music\data folder. I cleared them, but even though I'm running the executable from a different location, the cache files still show up in the \Amazon Music\data folder.
Well, this got me running. I imagine it is not a permanent or ideal solution, but I can upload my mp3s into my Amazon Library now.


Getting your music off Amazon cloud drive

A few years ago, I had a free trial of Amazon Music, but wasn't ready to make the commitment to the paid version. However, I had moved a ton of music up into Amazon Music.

When I didn't sign up for the paid subscription to Amazon Music, they moved all but 250(?) uploaded songs into my Amazon Cloud Drive. They neatly placed it into a folder called Archived Music.

Well, since then I've gone ahead and committed to the paid subscription for Amazon Music. However, most of my music isn't in my Amazon Music library, it's in my Amazon Cloud Drive. You'd think there would be an easy way to move it back, but I couldn't find one. 

So, how to get my 10GB of mp3 files (that I painstakingly ripped from my CD collection) out of my cloud drive and into my Amazon Music library?

Unfortunately, there's a cap on downloads. Based on internet searches, it seems that you can only download 500 items at a time. So I couldn't select my entire Archived Music folder, since it had thousands of files.

Well, I couldn't find a solution online, so this is how I did it.

The music was arranged in the Archived Music folder by artist, so I first sorted by the folder names.

Then I selected the first 10 folders, and downloaded them. It was somewhere between 200 and 300 items, well under the cap.

Selecting and downloading artist folders from Amazon Cloud Drive

The download arrives as a zip file called "AmazonCloudDriveDownload.zip". 

Then, I selected the next set of artists and downloaded them. It arrived in my download folder as "AmazonCloudDriveDownload(1).zip".

How the zip files end up in your downloads folder


I was able to get 4 downloads going simultaneously, and going from three concurrent downloads to four didn't seem to slow any of them down. I probably could have done more simultaneously, but I didn't push it.

Depending on my artists, and the bit rate at which I ripped the mp3 files, the zip files were anywhere between 200MB and 2.2GB. It took 21 downloads to get all the files. Not sure if there was any throttling along the way, by Amazon or by my service provider. 

Once I got them all downloaded, I unzipped all the artist folders into a common folder. I wanted to make sure I didn't miss any artist folders, so I pulled up my local artists folders and compared side-by-side with the folders on the cloud drive. I did find that I had missed a whole download somehow, so I was glad I double-checked.
Comparing my local artist folders with those on Amazon Cloud Drive

 I was able to get all of them downloaded in an afternoon, on a pretty fast connection. All in all it was not as painful as I thought it would be.

Next step is to upload them into my Amazon Music Library. I'll let you know how that goes.
