Wednesday, March 30, 2016

Why Data Analytics is Different, Part 1: Introduction

The buzzword lately is analytics.

It’s said many different ways, by many different people.
  • Data Analytics
  • Data Science
  • Business Intelligence
Sometimes you see big thrown in there too, as in big data analytics.

To many people, this is just a new name for some old, broad techniques such as:

  • Data Analysis
  • Statistics
  • Science
In different disciplines, you find terms like these:
  • Management Science (business)
  • Operations Research (business)
  • Artificial Intelligence (computer science)
  • Machine Learning (computer science)
  • Technical Analysis (finance)
  • Healthcare Analytics (healthcare)
  • Web Analytics (marketing)
  • Statistics (mathematics)
  • Econometrics (economics)

So I’ve been thinking lately, how is data analytics different from all these past things? Is it a superset of all these things? Is it a mash-up of these things? Is it different to different people?

I have a specific perspective on this, so my thinking may not fit for everyone. I’m an Information Systems professor. My undergraduate degree was in math and computer science, I have an MBA, and my doctorate is in business, with a major in Operations Research and a minor in Information Systems. My corporate experience is in software development. Now I’m a professor in a business school, but I continue to consult on database performance and storage.

I have what I’d call traditional training in statistics. I’ve taken the math courses behind statistical models, calculated eigenvalues, and derived maximum likelihood estimators. Then I took the management science and operations research courses in optimization, and applied techniques to things like quality control, queuing, facility location, scheduling, etc.

For several years, I thought that analytics was just a re-branding of all the stuff I already knew, dreamed up by marketing geniuses at a company like SAS or EMC.

I couldn’t find anything that really disputed that hypothesis. But I had this sneaking suspicion that there was something else there.

I think differently now. I’ve dug into it. I went to a course taught by a professional data scientist who consults regularly for global corporations. I saw how he was applying it. I saw how he approached things. I worked with the modern software tools: R, SQL Server, SQL Server Analysis Services, Azure Machine Learning. I came back and applied the tools in my own academic and consulting work.

Here’s my conclusion: modern analytics is different. And for good reason. Traditional methods were right for their era. Analytics is right for this era.

However, analytics is not different in trivial, obvious, or superficial ways.

Here’s what’s different:
  • Cost – data and computation are cheaper now
  • Timing – we want continuous answers
  • Context – we want analytics integrated into systems
  • Paradigm – we want to explore data rather than confirm hypotheses

All of these differences change the approach, the methodology, and the practices. They change the relevant techniques, taking widely applicable, tried-and-true, bread-and-butter techniques and turning them into arcane techniques for special cases.

I think the hardest part for traditionalists approaching analytics is understanding why some of the practices that were drilled into us are no longer appropriate. Mathematically, everything I learned is still true and correct, just much less relevant.

Take hypothesis testing, estimation, confidence intervals, and so on. These topics were foundational material in my training, and heavily used techniques in my toolbox. But the main motivation behind their development and use was a lack of data. We were “controlling” error with these methods, trying to make sure we weren’t generalizing too much from small samples. We had rules of thumb like "at least 30 data points for ...".
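As a reminder of why sample size dominated our thinking: the margin of error of a standard confidence interval for a mean shrinks only with the square root of the sample size,

\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}

so each additional digit of precision demanded roughly 100 times more (expensive) data.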

"Not enough data" is typically not a problem today. In recent years, I’ve found my work involves so much data that p-values are incredibly small by traditional standards. I use techniques to reduce the amount of data. I'm not worried about getting at least 30 points, I'm worried about getting the data file from here to there because it's so big. In other words, I have so much data, that I don’t worry about “controlling error” anymore, so hypothesis testing is not as relevant.

The next few articles will explore these differences. I hope that this will help older, more traditionally trained analysts understand modern analytics better. I also hope that this will help younger, more modern analysts understand the traditional methods better.

Next articles:
  • Why Data Analytics is Different, Part 2: Cost
  • Why Data Analytics is Different, Part 3: Timing
  • Why Data Analytics is Different, Part 4: Context
  • Why Data Analytics is Different, Part 5: Paradigm



Why Data Analytics is Different, Part 4: Context

forthcoming...

Why Data Analytics is Different, Part 3: Timing

Data analytics is different from traditional data analysis because the pace is faster.

Traditional data analysis was episodic in nature. We performed a big study by gathering data, analyzing the data, writing it up, mailing copies to an editor, who mailed them to reviewers, and so on. With luck, it found its way into print after several years. Because this type of project was so big, we often would say “thank goodness it’s done.”

This expresses well the nature of old-school data analysis. It was a one-time effort. You do it, then it’s done. You don’t continue the project because you aren’t gathering data anymore. By the time the paper was published and available to others, the data that the research was founded on was generally two or more years old. So, in 1995, when I cited a 1988 paper, I was effectively relying on roughly ten-year-old data. And I was comfortable doing so.

Because of this single-time-extended-project nature, the pace of analysis was, by today’s standards, leisurely. A PhD candidate might spend years on their dissertation.

Of course, the analysis tools were unwieldy and complex. You might have to learn the FORTRAN-like SAS language, and submit your analysis for processing on the mainframe. And because of the cost (see previous article), you had limited data, and needed to be very careful that you didn’t violate the statistical assumptions required for small data sets. Does the data fit a normal distribution? Is there multi-collinearity? Is there a pattern in the residuals? Is that an outlier?

Today, data is gathered on an ongoing basis. We record/track/store everything that happens, as it happens, because we can. We might need that data.

Because data is gathered on an ongoing basis, we can do analytics on an ongoing basis. And we are expected to. With today’s tools and technologies, it’s hard to justify not doing real-time analytics.

For traditionalists, this can be difficult. They used to work for extended periods to get “the definitive answer”. It was an achievement. There was great skill and knowledge in the methodologies necessary for small amounts of data. Traditionalists were used to reading the 1989 work of Schlepsky & Jones, based on a survey of 38 people, which stated that consumers were 6% more likely to whatever … That 6% number based on a linear regression would get cited for years in the literature, because it was the best answer available. And all those citations justified the two-year research project of Schlepsky & Jones.

Today, there is no definitive, one-time answer. There’s the answer at a moment in time. In a few seconds, there will be another moment in time with another answer. For traditionalists, think about it this way: each answer for a moment in time has its own parameter, i.e., the 6%. But it also may have its own model, i.e., the linear regression.
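One way to picture this (a sketch, not any particular method): at each moment t, re-fit the model on the window of data W_t available at that moment,

\hat{\beta}_t = \arg\min_{\beta} \sum_{i \in W_t} \left( y_i - x_i^{\top}\beta \right)^2

so both the parameter and, if you like, the model itself are indexed by time. Yesterday’s 6% and today’s estimate are simply \hat{\beta}_{t-1} and \hat{\beta}_t.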

In this new world, where data is cheap and every moment in time has its own model, we have to change our methods. Or rather, we get to change our methods. (I never really liked pouring tons of work into creating a wall of statistical methodology in defense against reviewers trying to poke holes in the wall.)

So what kinds of things do we get to do in this new world with tons of data? We get to stop worrying about multi-collinearity, normality, residual patterns, etc. We get to explore large amounts of data for patterns that “hit you over the head” when you see them. If there isn’t a clear, obvious pattern, we get to stop looking at it and look at something else. We don’t have to spend time trying to make the data support or not support our pre-defined hypothesis.

Personally, I like this new world better. I get to discover new knowledge at a fast pace, rather than confirm theories at a slow pace.


Why Data Analytics is Different, Part 5: Paradigm

forthcoming...

Why Data Analytics is Different, Part 2: Cost

Data used to be expensive.

People like to say today that “space is cheap”, meaning that the cost per byte has dropped dramatically over time: roughly $10,000 per gigabyte in 1990 to about $0.10 per gigabyte in 2010.

But that’s only a small part of the cost.

In traditional data analysis, we didn’t really worry about storage cost. We worried about how much it cost to acquire the data.

Only recently have we gained the ability to gather data on an ongoing basis. (There are a number of drivers for this: bandwidth, computing power, personal devices, storage cost.)

In years past, getting data was a project. It could take weeks, months, or years to acquire data for analysis. You talked about the cost per observation, and knew that more data meant more cost. Because the cost per observation was such a concern, you knew that you needed to plan a study/survey/experiment. You might hire an expert in survey or experimental design. In fact, you probably knew what statistical test you were going to perform, and had calculated exactly the number of observations you needed to answer your question with your required confidence level.
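That pre-calculation was usually a standard sample-size formula; for estimating a mean to within a margin of error E at a given confidence level, for example,

n = \left( \frac{z_{\alpha/2}\,\sigma}{E} \right)^{2}

so halving the margin of error quadrupled the number of observations, and therefore the data acquisition budget.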

Because of the high cost of data acquisition, a data analyst/researcher/scientist would spend much more of their time on planning. In general, the process was like this:
  1. Develop a research question, aka, hypothesis 
  2. Choose the statistical method / test that would answer the question 
  3. Write a proposal to fund data acquisition and analysis, with these items: 
    • Experimental design showing number of observations needed to answer the question, with statistical method specified in step 2 
    • Cost estimate of data acquisition 
    • Justification of the importance of the research question, i.e., is the answer worth the cost of the data acquisition? 
  4. After funding is secured, execute the data acquisition plan 
  5. Perform the statistical analysis specified in Step 2 
  6. Stop, do not perform additional analysis 
Even without external funding, the mental process was much the same. The “cost estimate of data acquisition” was always there, even if it was only an implicit cost, e.g., your own man-hours. Many hypotheses never got studied, because data acquisition was too costly.

Step 6 above deserves some discussion. You normally did not re-work data that you had acquired. With limited data, performing multiple statistical tests requires caution. The mathematical reasoning goes something like this: if you are 95% confident of your conclusions for Hypotheses 1, 2, and 3 individually, then you are really only 85.7% (= 0.95 * 0.95 * 0.95) confident of all three together, with the same data. It’s like rolling a six-sided die: the probability of rolling a 1 through 5 is 5/6 = 83.3%, but the probability of doing that three times in a row is (5/6) * (5/6) * (5/6) = 57.8%.
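In textbook terms: for m independent tests, each at individual confidence 1 - \alpha, the family-wise confidence is

(1-\alpha)^{m}, \qquad \text{e.g. } 0.95^{3} \approx 0.857

so the family-wise error rate, 1-(1-\alpha)^{m}, grows quickly as you add hypotheses.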

And remember, you have just enough data from your data acquisition to get your 95% confidence on a single hypothesis for your grant. If you want to test 3 hypotheses and be 95% confident, write that in your grant, because you will need more data. There are special techniques, like the Bonferroni correction or Scheffé’s method for contrasts of means, that deal appropriately with multiple comparisons by adjusting the confidence level.
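The Bonferroni correction, for instance, simply tests each of the m hypotheses at a stricter level,

\alpha_{\text{per test}} = \frac{\alpha}{m}, \qquad \text{e.g. } \frac{0.05}{3} \approx 0.0167

which keeps the family-wise error rate at or below \alpha, at the price of needing more data to reach significance on any single test.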

If you just went ahead and ran your three hypotheses without adjusting for the multiple tests, you were called a “Data Miner”, and you were told “you are just mining the data.” And that was a bad thing. Today, we have classes in Data Mining – it has become a good thing. 

Why is data mining okay today? Why don’t we worry anymore about multiple comparisons? Because data is cheap. We have tons of it. We have 99.999999999% confidence, because we have so much data. When data was expensive and limited, you had to worry about these things. Today, if you do three hypothesis tests, or thirty, your confidence is still very high. To the point where you don’t have to explicitly worry about it. 

If you are in the machine learning world, the issue of not having enough data is related to, and sometimes exactly the same as, over-fitting. We say that a model is over-fitted when it has trouble generalizing to observations it hasn’t seen before. Over-fitting is a symptom of not having enough data to support accurate estimation of the parameters in your model. Techniques such as jittering, noise injection, and re-sampling were created to address over-fitting, and measures such as the various information criteria (AIC, BIC, etc.) and the Vapnik-Chervonenkis (VC) dimension were developed to compare models for their likelihood of over-fitting.
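For reference, the two most common information criteria trade goodness of fit against model size, with k parameters, n observations, and maximized likelihood \hat{L}:

\text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln n - 2\ln\hat{L}

Lower is better, and the penalty terms on k are exactly the guard against over-fitting when data is scarce.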

I'm not trying to bore you with the technical details, but to make a point: tons of research and effort were poured into dealing with the cost of data. The cost of data was a central theme shaping our entire approach to data analysis.

Today, we have a whole set of new techniques because of too much data. For data with too many dimensions, we have developed data reduction techniques such as principal component analysis (PCA), feature elimination, and Bayesian data reduction to make the data manageable. We have developed visualization techniques, like R’s faceting, that show many plots across dimensions, letting our eyes see patterns and eliminate variables, or focus on the relevant ones.
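PCA, as a reminder, is just the eigenvalue problem from my traditional training: for centered data X, take the sample covariance matrix and its eigenvectors,

\Sigma = \frac{1}{n-1}X^{\top}X, \qquad \Sigma v_j = \lambda_j v_j

then keep only the few components v_j with the largest eigenvalues \lambda_j (the directions of greatest variance) and project the data onto them. Hundreds of correlated columns become a handful of scores.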

When there are just too many data points, we sample from the data, and do analysis on only a subset. We regularly take the second-by-second stock price data and aggregate it into daily prices, e.g., minimum, maximum, open, close (candlestick). Our practices today embrace reducing the amount of data for analysis.
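As a concrete sketch of that kind of aggregation (the table and column names here are hypothetical, not from any particular feed), this is roughly how you might roll tick-level prices up to daily candlesticks in SQL Server:

-- hypothetical table: TickPrices(Symbol, TradeTime, Price)
WITH Ranked AS (
    SELECT  Symbol,
            CAST(TradeTime AS date) AS TradeDate,
            Price,
            ROW_NUMBER() OVER (PARTITION BY Symbol, CAST(TradeTime AS date)
                               ORDER BY TradeTime ASC)  AS rnFirst,
            ROW_NUMBER() OVER (PARTITION BY Symbol, CAST(TradeTime AS date)
                               ORDER BY TradeTime DESC) AS rnLast
    FROM    TickPrices
)
SELECT   Symbol,
         TradeDate,
         MAX(CASE WHEN rnFirst = 1 THEN Price END) AS OpenPrice,   -- first price of the day
         MAX(Price)                                AS HighPrice,
         MIN(Price)                                AS LowPrice,
         MAX(CASE WHEN rnLast  = 1 THEN Price END) AS ClosePrice   -- last price of the day
FROM     Ranked
GROUP BY Symbol, TradeDate
ORDER BY Symbol, TradeDate

Millions of ticks per day collapse into one row per symbol per day, which is usually all the analysis needs.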

Traditional data analysts/scientists/statisticians/researchers need to realize that the drastic reduction in the cost of data calls for changes in methods. All the traditional knowledge about statistics and experimental methods is still relevant and useful, but it is likely to be applied differently.

Modern data scientists will still find themselves in the occasional situation of not having sufficient data. They should educate themselves on traditional methods so they can recognize these situations and deal with them appropriately.

I started with the cost of data as the first difference in this series of Why Data Analytics is Different because it strongly influences the others. The cost of data shaped traditional analysis and continues to shape modern approaches, and it drives the other differences: timing, context, and paradigm.

Monday, March 21, 2016

The UPDATE statement in SQL

A short video covering the UPDATE statement in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 3/21/2016
-- UPDATE - how to change data

USE Northwind

-- might want to review the SELECT statement
-- before using UPDATE
-- in particular, review the WHERE clause,
-- because UPDATE uses the WHERE clause

-- public service announcement
/* ***WARNING!*** *****************

*INSERT, UPDATE and DELETE are *data modifying* commands

*INSERT, UPDATE and DELETE can do set-at-a-time operations

**this means that a seemingly simple statement
** can change millions of records

* the database engine will not ask "Are you sure?"
* there is no simple "Undo"

* professionals normally don't, at the command line,
* execute INSERT, UPDATE, and DELETE commands
* on a production database

* treat these statements with respect!

*************************************/

-- the UPDATE statement changes existing rows
-- UPDATE changes records, not tables
-- again, the row must already exist

-- let's look at the Orders table

SELECT *
FROM   Orders
ORDER BY orderID desc

-- now insert a bogus row to play with:
INSERT INTO Orders   (ShipCity, CustomerID)
VALUES               ('Akron', 'VINET')

-- see what happened, and note the OrderID
SELECT *
FROM   Orders
ORDER BY orderID desc

-- OrderID=11099 for the new record

-- now that we have a row to modify, here's a simple update
-- change the orderdate to right now

UPDATE Orders
SET    OrderDate = GETDATE()
WHERE  orderID = 11099

-- and check the record
SELECT *
FROM   Orders
ORDER BY orderID desc

-- note that only one record was affected
-- because the WHERE clause matches the primary key to a single value,
-- the most damage this statement can do is one record

-- let's change two fields
UPDATE Orders
SET    RequiredDate = '20160401 17:00:00.00',
       ShipName     = 'Douglas Kline'
WHERE  orderID = 11099

-- and check the record
SELECT *
FROM   Orders
ORDER BY orderID desc

-- so the general form of the UPDATE statement is
/*
UPDATE   <table name>
SET      <assignment statement list>
WHERE    <logical expression>

or

UPDATE   <table name>
SET      <field name 1> = <expression for new value>,
         <field name 2> = <expression for new value>,
         <field name 3> = <expression for new value>,
         ...
WHERE    <logical expression>
*/

-- another example
UPDATE   Orders
SET      Freight        = Freight + 25.0,
         RequiredDate   = DateAdd(week, 1, RequiredDate),
         ShipCountry    = 'U' + 'S' + 'A'
WHERE    CustomerID     = 'VINET'
  AND    EmployeeID     IS NULL

-- and check the results
SELECT *
FROM   Orders
ORDER BY orderID desc

-- notice a few things
-- several fields are set in the same statement
-- the new values are expressions, rather than literal values
-- the new value for the requiredDate field is a function of the existing value
-- the WHERE clause is a logical expression
-- potentially, many records could be updated with this single statement


-- what happens if the UPDATE would violate the table constraints
-- here's the <simplified> definition of the Orders table

/*
CREATE TABLE Orders (
 OrderID        int IDENTITY(1,1) NOT NULL,               <-----******
 CustomerID     nchar(5)          NULL,
 EmployeeID     int               NULL,
 OrderDate      datetime          NULL,
 RequiredDate   datetime          NULL,
 ShippedDate    datetime          NULL,
 ShipVia        int               NULL,
 Freight        money             NULL CONSTRAINT [DF_Orders_Freight]  DEFAULT ((0)),      <-----*****
 ShipName       nvarchar(40)      NULL,
 ShipAddress    nvarchar(60)      NULL,
 ShipCity       nvarchar(15)      NULL,
 ShipRegion     nvarchar(15)      NULL,
 ShipPostalCode nvarchar(10)      NULL,
 ShipCountry    nvarchar(15)      NULL,
 CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED 
(
 OrderID ASC
)
)
*/

UPDATE   Orders
SET      OrderID = 500000
WHERE    OrderID = 11099

-- Cannot update identity column 'OrderID'.
-- IDENTITY columns are not update-able

UPDATE   Orders
SET      CustomerID = 'Barney'
WHERE    OrderID = 11099

-- String or binary data would be truncated.
-- CustomerID is nchar(5), and 'Barney' is 6 characters

UPDATE   Orders
SET      CustomerID = 'Barne'
WHERE    OrderID = 11099

-- The UPDATE statement conflicted with the FOREIGN KEY constraint "FK_Orders_Customers". 
-- The conflict occurred in database "Northwind", table "dbo.Customers", column 'CustomerID'.
-- that would violate referential constraint
-- 'Barne' doesn't exist in the Customers table

UPDATE   Orders
SET      CustomerID = 'VINET'
WHERE    OrderID =  50000

-- no such record, 0 rows affected
-- no error - you need to check rows affected if it's important
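
-- for example, @@ROWCOUNT holds the number of rows
-- affected by the most recent statement
-- (a quick sketch - check it immediately after the UPDATE,
--  because the next statement resets it)

UPDATE Orders
SET    CustomerID = 'VINET'
WHERE  OrderID = 50000

IF @@ROWCOUNT = 0
    PRINT 'no rows were updated - check the WHERE clause'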

-- in summary,
-- UPDATE modifies fields in existing rows
-- specify values according to their data type - you have to know the data types
-- various things could "block" an UPDATE
-- these "blocks" are your friends! they keep your data clean!
-- sometimes zero rows are updated - that's okay
-- sometimes millions of rows are updated!
-- treat INSERT, UPDATE, and DELETE with respect!


-- Database by Doug
-- Douglas Kline
-- 3/21/2016
-- UPDATE - how to change data

/*
DELETE FROM  Orders
WHERE  OrderID > 11077


*/

Friday, March 18, 2016

The SQL INSERT Statement

A short video covering the INSERT statement in SQL.


Here is the SQL that goes with the demo.

-- Database by Doug
-- Douglas Kline
-- 3/15/2016
-- INSERT - how to create new data

USE Northwind

-- might want to review the SELECT statement
-- before using INSERT

-- public service announcement
/* ***WARNING!*** *****************

*INSERT is a *data modifying* command

*INSERT can do set-at-a-time operations

**this means that a seemingly simple INSERT statement
** can change millions of records

* the database engine will not ask "Are you sure?"
* there is no simple "Undo"

* professionals normally don't, at the command line,
* execute INSERT, UPDATE, and DELETE commands
* on a production database

* treat these statements with respect!

*************************************/

-- the INSERT statement creates entire rows
-- in a table that already exists

-- the form of a table must be maintained

-- in other words, a table is always a rectangle

-- every row has the same number of columns
-- every column has the same number of rows

-- you can't INSERT half a row
-- you can't INSERT a new value in 
--  a row that already exists (you'd use UPDATE)

-- let's look at the Orders table

SELECT *
FROM   Orders
ORDER BY OrderID desc

-- Here is a basic statement to create 
-- a new record in the Orders table

INSERT INTO Orders   (ShipCity, CustomerID)
VALUES               ('Akron', 'VINET')

-- note that only two fields are specified,
-- and in a different order than the table definition
-- (CustomerID comes before ShipCity in the table)


-- OK, let's see if it worked
SELECT   *
FROM     Orders
ORDER BY OrderID desc

-- notice an entire row was created
-- even though we supplied only two field values

-- most of the unspecified fields got NULL values

-- however, OrderID and Freight got non-NULL values
-- to see why, let's look at the table definition

/*
CREATE TABLE Orders (
 OrderID        int IDENTITY(1,1) NOT NULL,               <-----******
 CustomerID     nchar(5)          NULL,
 EmployeeID     int               NULL,
 OrderDate      datetime          NULL,
 RequiredDate   datetime          NULL,
 ShippedDate    datetime          NULL,
 ShipVia        int               NULL,
 Freight        money             NULL CONSTRAINT [DF_Orders_Freight]  DEFAULT ((0)),      <-----*****
 ShipName       nvarchar(40)      NULL,
 ShipAddress    nvarchar(60)      NULL,
 ShipCity       nvarchar(15)      NULL,
 ShipRegion     nvarchar(15)      NULL,
 ShipPostalCode nvarchar(10)      NULL,
 ShipCountry    nvarchar(15)      NULL,
 CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED 
(
 OrderID ASC
)
)
*/

-- the two key pieces above are:
-- the IDENTITY(1,1) for OrderID and the DEFAULT((0)) for Freight

-- IDENTITY(1,1) for OrderID means that the database engine
--    will provide a unique value for this field
--    beginning with 1, and incrementing by 1 for each new record
--    we *cannot* provide a value for this field
--    it is completely controlled by the database engine
--
--    this is commonly used for primary key fields

-- the DEFAULT((0)) for Freight means that if no value
--    is provided for Freight, the database engine will set the 
--    value to zero
--    

-- so back to the INSERT statement

-- the general form of the INSERT statement is:
/*
INSERT INTO  <table name>  (<column list>)
VALUES                     (<value list>)
*/

-- more specifically:
/*
INSERT INTO  <table name>  (col1,       col2,       col3,...)
VALUES                     (col1-value, col2-value, col3-value,...)
*/

-- note that col1, col2, etc, can be specified in any order
-- they don't have to be consistent with the table definition

-- so here's an insert into the Products table

INSERT INTO Products 
           (ProductName, 
            SupplierID, 
            CategoryID, 
            QuantityPerUnit, 
            Unitprice, 
            UnitsInStock,
            UnitsOnOrder,
            ReorderLevel,
            Discontinued)
VALUES     ('Bowzer Biscuits', 
            2,
            3,
            NULL,
            12.0,
            10,
            5,
            8,
            0)

-- I've formatted this differently, for readability

-- let's see if it worked
SELECT   *
FROM     Products
ORDER BY ProductID desc

-- notice that the productID got a value, even though I didn't provide it

-- also notice that the bit field discontinued was specified as 0

-- and let's go back to orders, and specify some dates
INSERT INTO Orders
           (customerID,
            EmployeeID,
            OrderDate,
            RequiredDate,
            ShippedDate,
            Freight)
VALUES     ('VINET',
            3,
            GETDATE(),
            '20160501 17:00:00.000', -- 5:00pm May 1 2016
            NULL,
            NULL)


SELECT      *
FROM        Orders
ORDER BY    OrderID desc

-- notice that character-based data is delimited by single quotes
--    that I can use GETDATE() to provide the current server datetime value for the orderdate
--    and that dates are specified as delimited characters, and converted to datetime

-- so what happens if I provide an invalid value?

INSERT INTO Orders
            (orderID)
VALUES      (1)
-- Cannot insert explicit value for identity column in table 'Orders' 
--    when IDENTITY_INSERT is set to OFF.
-- IDENTITY fields can only be set by the database engine

INSERT INTO Orders
            (customerID)
VALUES      ('Barney')
-- String or binary data would be truncated.
-- CustomerID is a nchar(5), and 'Barney' has 6 letters

INSERT INTO Orders
            (customerID)
VALUES      ('Barne')

--The INSERT statement conflicted with the FOREIGN KEY constraint "FK_Orders_Customers". 
-- The conflict occurred in database "Northwind", table "dbo.Customers", column 'CustomerID'.
-- 'Barne' is not a valid customerID - it doesn't exist in the Customers table


-- in summary,
-- INSERT creates entire rows
-- specify values according to their data type - you have to know the data types
-- various things could "block" an insert
-- these "blocks" are your friends! they keep your data clean!
-- treat INSERT, UPDATE, and DELETE with respect!


-- Database by Doug
-- Douglas Kline
-- 3/15/2016
-- INSERT - how to create new data
