People like to say today that “space is cheap”, meaning that the cost per byte has dropped dramatically over time: from roughly $10,000 per GB in 1990 to about $0.10 per GB in 2010. See a nice article with a graph here.
But that’s only a small part of the cost.
In traditional data analysis, we didn’t really worry about storage cost. We worried about how much it cost to acquire the data.
Only recently have we gained the ability to gather data on an ongoing basis. (There are a number of drivers for this: bandwidth, computing power, personal devices, storage cost.)
In years past, getting data was a project. It could take weeks, months, or years to acquire data for analysis. You talked about the cost per observation, and knew that more data meant more cost. Because the cost per observation was such a concern, you knew that you needed to plan a study/survey/experiment. You might hire an expert in survey or experimental design. In fact, you probably knew what statistical test you were going to perform, and had calculated exactly the number of observations you needed to answer your question with your required confidence level.
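That up-front sample-size calculation is something statisticians still do today; here is a minimal sketch using statsmodels (the effect size, confidence level, and desired power below are illustrative assumptions, not values from any particular study):

```python
# Minimal sketch of an up-front sample-size (power) calculation with
# statsmodels; the effect size, alpha, and power are illustrative.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,            # assumed standardized effect (Cohen's d)
    alpha=0.05,                 # 95% confidence level
    power=0.8,                  # 80% chance of detecting the effect if it is real
    alternative="two-sided",
)
print(f"Observations needed per group: {n_per_group:.0f}")   # roughly 64
```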
Because of the high cost of data acquisition, a data analyst/researcher/scientist would spend much more of their time on planning. In general, the process was like this:
1. Develop a research question, aka a hypothesis
2. Choose the statistical method / test that would answer the question
3. Write a proposal to fund data acquisition and analysis, with these items:
   - Experimental design showing the number of observations needed to answer the question with the statistical method specified in Step 2
   - Cost estimate of data acquisition
   - Justification of the importance of the research question, i.e., is the answer worth the cost of the data acquisition?
4. After funding is secured, execute the data acquisition plan
5. Perform the statistical analysis specified in Step 2
6. Stop; do not perform additional analysis
Step 6 above deserves some discussion. You normally did not re-work data that you had acquired. With limited data, performing multiple statistical tests requires caution. The mathematical reasoning goes something like this: if you are 95% confident of your conclusions for Hypotheses 1, 2, and 3 individually, then you are really only about 85.7% (= 0.95 * 0.95 * 0.95) confident of all three together, with the same data (assuming the tests are independent). It is like rolling a 1 through 5 on a six-sided die: the probability is 5/6 = 83.3%, but the probability of doing that three times in a row is (5/6) * (5/6) * (5/6) = 57.8%.
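The arithmetic is easy to check; here is a quick sketch in Python:

```python
# Quick check of the joint-confidence arithmetic above.
print(0.95 ** 3)       # 0.857375  -> ~85.7% confident in all three conclusions
print((5 / 6) ** 3)    # 0.5787... -> ~57.8% chance of three "1 through 5" rolls
```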
And remember, you have just enough data from your data acquisition to get your 95% confidence on a single hypothesis for your grant. If you want to test 3 hypotheses and be 95% confident, write that in your grant, because you will need more data. There are special procedures, like the Bonferroni method or Scheffé's method for contrasts of means, that deal appropriately with multiple comparisons by adjusting the confidence level.
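As a rough sketch of how such an adjustment works, the Bonferroni method simply splits the allowed error rate across the tests (the overall confidence level and the number of hypotheses below are illustrative):

```python
# Rough sketch of a Bonferroni adjustment: split the allowed Type I error
# rate across the hypotheses so the family-wise confidence level holds.
alpha_overall = 0.05          # we want 95% confidence across all the tests
n_hypotheses = 3              # illustrative number of hypotheses

alpha_per_test = alpha_overall / n_hypotheses
print(f"Test each hypothesis at alpha = {alpha_per_test:.4f}")   # 0.0167

# The family-wise error rate is then at most n_hypotheses * alpha_per_test,
# i.e., 0.05, so the overall confidence stays at (at least) 95%.
```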
If you just went ahead and ran tests for your three hypotheses without adjusting for the multiple comparisons, you were called a “Data Miner”, and you were told “you are just mining the data.” And that was a bad thing. Today, we have classes in Data Mining – it has become a good thing.
Why is data mining okay today? Why don’t we worry about multiple comparisons anymore? Because data is cheap. We have tons of it, and with so much data our confidence is effectively 99.999999999%. When data was expensive and limited, you had to worry about these things. Today, if you do three hypothesis tests, or thirty, your confidence is still very high, to the point where you don’t have to explicitly worry about it.
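A small simulation hints at why: with a very large sample, a real effect produces a p-value so small that even a strict adjustment for many tests barely matters. (The sample sizes and effect size here are made up for illustration.)

```python
# Illustrative simulation: with lots of data, even a tiny real effect yields a
# p-value far below a heavily Bonferroni-adjusted threshold (0.05 / 30 here).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.00, scale=1.0, size=1_000_000)
group_b = rng.normal(loc=0.01, scale=1.0, size=1_000_000)   # tiny true difference

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value)                  # typically a very small number
print(p_value < 0.05 / 30)      # still significant after adjusting for 30 tests
```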
If you are in the machine learning world, the issue of not having enough data is related to, and sometimes exactly the same as, over-fitting. We say that a model is over-fitted when it has trouble generalizing to observations it hasn’t seen before. Over-fitting is a symptom of not having enough data to support accurate estimation of the parameters in your model. Techniques such as jittering, noise injection, and re-sampling were created to address over-fitting, and measures such as the various information criteria (AIC, BIC, etc.) and the Vapnik-Chervonenkis (VC) dimension were developed to compare models for their likelihood of over-fitting.
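As a toy illustration of over-fitting as a data-scarcity problem (the polynomial degree, sample sizes, and noise level are arbitrary choices for the sketch):

```python
# Toy illustration of over-fitting with too little data: a 9th-degree
# polynomial fit to 12 noisy points tracks them closely but generalizes
# poorly to new observations from the same process.
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)   # true signal + noise
    return x, y

x_train, y_train = make_data(12)       # scarce training data
x_test, y_test = make_data(1000)       # unseen observations

coeffs = np.polyfit(x_train, y_train, deg=9)            # very flexible model
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(train_mse, test_mse)   # training error is small; test error is typically far larger
```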
I'm not trying to bore you with the technical details, but to make a point: tons of research and effort went into dealing with the cost of data. The cost of data was a central theme that shaped our entire approach to data analysis.
Today, we have a whole set of new techniques because of too much data. For data with too many dimensions, we have developed data reduction techniques such as principal component analysis (PCA), feature elimination, and Bayesian data reduction to make the data manageable. We have developed visualization techniques like R’s faceting to show many plots across dimensions, allowing our eyes to see patterns and to eliminate variables or focus on the relevant ones.
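For example, here is a minimal PCA sketch using scikit-learn (the random data and the choice to keep two components are just for illustration):

```python
# Minimal sketch of dimensionality reduction with PCA in scikit-learn:
# project a 50-dimensional dataset down to its 2 strongest directions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))           # 500 observations, 50 dimensions

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (500, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```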
When there are just too many data points, we sample from the data and do the analysis on only a subset. We regularly take second-by-second stock price data and aggregate it into daily prices, e.g., open, high, low, and close (a candlestick). Our practices today embrace reducing the amount of data for analysis.
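A minimal sketch of that kind of aggregation with pandas (the second-by-second tick data here is simulated):

```python
# Sketch of aggregating second-by-second prices into daily OHLC (candlestick)
# bars with pandas; the tick-level price series is simulated as a random walk.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_seconds = 5 * 86_400                                   # five days of ticks
ticks = pd.Series(
    100 + np.cumsum(rng.normal(0, 0.01, n_seconds)),
    index=pd.date_range("2024-01-01", periods=n_seconds, freq="s"),
)

daily = ticks.resample("D").ohlc()       # open, high, low, close per day
print(daily)
```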
Traditional data analysts/scientists/statisticians/researchers need to realize that the drastic reduction in the cost of data calls for changes in methods. All the traditional knowledge about statistics and experimental methods is still relevant and useful, but it is likely to be applied differently.
Modern data scientists will still find themselves in the occasional situation of not having sufficient data. They should educate themselves on traditional methods so they can recognize these situations and deal with them appropriately.
I started with the cost of data as the first difference in this series on Why Data Analytics is Different because it strongly influences the others. The cost of data shaped traditional analysis and continues to shape modern approaches to data analysis, and it feeds into the other differences: timing, context, and paradigm.