Traditional data analysis was episodic in nature. We performed a big study by gathering data, analyzing the data, writing it up, mailing copies to an editor, who mailed it to reviewers, etc. Hopefully it found its way to print after several years. Because this type of project was so big, we often would say “thank goodness it’s done.”
This expresses well the nature of old-school data analysis. It was a one-time effort. You do it, then it’s done. You don’t continue the project because you aren’t gathering data anymore. By the time the paper was published and available to others, the data the research was founded on was generally two or more years old. So in 1995, when I cited a 1988 paper, I was citing research built on data that was nearly ten years old. And I was comfortable with that.
Because analysis was a single, extended, one-time project, its pace was, by today’s standards, leisurely. A PhD candidate might spend years on their dissertation.
Of course, the analysis tools were unwieldy and complex. You might have to learn the FORTRAN-like SAS language and submit your analysis for processing on the mainframe. And because of the cost (see previous article), you had limited data, and needed to be very careful that you didn’t violate the statistical assumptions necessary for small data sets. Does the data fit a normal distribution? Is there multi-collinearity? Is there a pattern in the residuals? Is that an outlier?
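For readers who never lived through that era, here is a rough sketch of what those small-data assumption checks look like, assuming Python with numpy, scipy, and statsmodels, and synthetic data standing in for a small survey sample (the variables and thresholds are illustrative, not from any real study):

```python
# A minimal sketch of the classic small-data assumption checks.
# Synthetic data stands in for a small survey sample.
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 38                                            # a "small data" sample
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)     # deliberately correlated
y = 2.0 + 0.6 * x1 + rng.normal(scale=1.0, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

# Does the data fit a normal distribution? (Shapiro-Wilk on residuals)
print("Shapiro-Wilk p-value:", stats.shapiro(model.resid).pvalue)

# Is there multi-collinearity? (variance inflation factors, skipping the constant)
for i in range(1, X.shape[1]):
    print(f"VIF for predictor {i}:", variance_inflation_factor(X, i))

# Is there a pattern in the residuals? (Durbin-Watson, ~2 means no autocorrelation)
print("Durbin-Watson:", sm.stats.durbin_watson(model.resid))

# Is that an outlier? (Cook's distance above a rough 4/n cutoff)
cooks_d = model.get_influence().cooks_distance[0]
print("Possible outliers at rows:", np.where(cooks_d > 4 / n)[0])
```

Every one of those checks had to pass before a reviewer would let the result stand, which is part of why a single analysis took so long.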
Today, data is gathered on an ongoing basis. We record/track/store everything that happens, as it happens, because we can. We might need that data.
Because data is gathered on an ongoing basis, we can do analytics on an ongoing basis. And we are expected to. With today’s tools and technologies, it’s hard to justify not doing real-time analytics.
For traditionalists, this can be difficult. They used to work for extended periods to get “the definitive answer”. It was an achievement. There was great skill and knowledge in the methodologies necessary for small amounts of data. Traditionalists were used to reading the 1989 work of Schlepsky & Jones, based on a survey of 38 people, which stated that consumers were 6% more likely to whatever … That 6% number based on a linear regression would get cited for years in the literature, because it was the best answer available. And all those citations justified the two-year research project of Schlepsky & Jones.
Today, there is no definitive, one-time answer. There’s the answer at a moment in time. In a few seconds, there will be another moment in time with another answer. For traditionalists, think about it this way: each answer for a moment in time has its own parameter (the 6% in the example above), and it may also have its own model (the linear regression).
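To make that concrete, here is a minimal sketch, assuming Python with numpy and statsmodels and entirely made-up streaming data, of re-fitting a simple regression over a sliding window so that each window gets its own estimate of the effect (and could, in principle, get its own model):

```python
# A minimal sketch of "every moment in time has its own answer":
# re-fit a simple model over a sliding window of observations,
# so the estimated effect is allowed to change as new data arrives.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
t = np.arange(2000)
x = rng.normal(size=t.size)
y = 0.05 * x + 0.0001 * t * x + rng.normal(scale=0.5, size=t.size)  # effect drifts over time

window = 500
for start in range(0, t.size - window + 1, 250):
    xs, ys = x[start:start + window], y[start:start + window]
    fit = sm.OLS(ys, sm.add_constant(xs)).fit()
    print(f"window starting at t={start}: estimated effect = {fit.params[1]:.3f}")
```

The “6%” from one window is simply not the answer for the next window, and nobody expects it to be.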
In this new world, where data is cheap and every moment in time has its own model, we have to change our methods. Or rather, we get to change our methods. (I never really liked pouring tons of work into building a wall of statistical methodology as a defense against reviewers trying to poke holes in it.)
So what kinds of things do we get to do in this new world with tons of data? We get to stop worrying about multi-collinearity, normality, residual patterns, etc. We get to explore large amounts of data for patterns that “hit you over the head” when you see them. If there isn’t a clear, obvious pattern, we get to stop looking at it and look at something else. We don’t have to spend time trying to make the data support or not support our pre-defined hypothesis.
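As a toy illustration of that kind of exploration, here is a sketch assuming Python with pandas and a synthetic table (the column names, segment labels, and the “obvious” threshold are all made up):

```python
# A minimal sketch of looking for patterns that "hit you over the head":
# a quick group-by on a big synthetic table, keeping only effects that are
# too large to argue with and moving on otherwise.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1_000_000
df = pd.DataFrame({
    "segment": rng.choice(["A", "B", "C", "D"], size=n),
    "converted": rng.random(n) < 0.05,
})
is_c = df["segment"] == "C"                       # bake in one obvious effect
df.loc[is_c, "converted"] = rng.random(is_c.sum()) < 0.10

rates = df.groupby("segment")["converted"].mean()
print(rates.round(3))
print("worth a closer look:")
print(rates[rates > 1.5 * df["converted"].mean()].round(3))
```

If nothing jumps out of a summary like that, we move on to the next question instead of torturing the data.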
Personally, I like this new world better. I get to discover new knowledge at a fast pace, rather than confirm theories at a slow pace.