Big data may have just crested the peak of inflated expectations and be barrelling towards the trough of disillusionment, at least if you’re following along with the Gartner Hype Cycle.
In other words, some practitioners are beginning to doubt the marketing jive around big data analysis and starting to take a more critical view of the limits of big data systems.
The promise of big data has been that the more data you collect, the more insights you can get for your organization. An engineer from Google, which has profited as much from big data as anyone, has called that notion “the unreasonable effectiveness of data.”
The latest issue of Science News details the limits of big data in a series of articles, the most recent entitled “Big data studies come with replication challenges.”
The problem, according to Science News, is one of validity. With so much data and so many different tools to analyze it, how can one be sure results are correct?
“Each time a scientist chooses one computer program over another or decides to investigate one variable rather than a different one, the decision can lead to very different conclusions,” Tina Hesman Saey wrote.
The validity problem is not one faced only by big data enthusiasts, but by the science community in general. In an earlier article, Science News tackled the issue of irreplicable results, or the increasing inability of scientists to reproduce the results from previously published studies.
One of the basic tenets of good science is that it can be reproduced by anyone, given the same initial conditions. But an increasing number of researchers have found that even the most carefully designed studies sometimes can’t be reproduced with the same results.
“Replicability is a cornerstone of science, but too many studies are failing the test,” Saey wrote. While dubious science can arise for myriad reasons (the pressure on academicians to publish, for one), at least part of the blame can be placed on a misuse of statistical analysis, which can be subtle and tricky to do correctly, Saey observed.
Other observers have also voiced wariness about the marketing promises of big data offered by the likes of IBM, Hewlett-Packard and others.
“There is this idea endemic to the marketing of data science that big data analysis can happen quickly, supporting an innovative and rapidly changing company,” wrote John Foreman, the data scientist at MailChimp.com, in a recent blog post. “But in my experience and in the experience of many of the analysts I know, this marketing idea bears little resemblance to reality.”
Foreman notes that good statistical modeling requires stable input, at least a few cycles of historical data, and a predicted range of outcomes. Such laborious legwork to get all these elements in place works against the idea, encouraged by many marketing campaigns, that big data systems can deliver fresh results quickly.
Certainly, the validity of big data will be a topic at the O’Reilly Strata + Hadoop World conference, to be held next week in San Jose, California.
In one presentation at the conference, Simon Garland, the chief strategist at database vendor Kx Systems, will talk about how big data is noisy and inconsistent, and cannot be managed well using traditional database analysis systems.
Gartner itself seems to remain bullish about the long-term value of big data systems. In a blog entry for Forbes, research vice president Doug Laney predicted that by 2020, most business functions will be reinvented due to the influence of big data analysis.
Much of the data businesses rely on will come from outside sources, Laney noted. How will weather patterns impact a company’s sales for the next week? How will sentiments about a company’s products expressed on social networks drive sales? Such data, coming from multiple sources and in multiple formats, will indeed be “noisy,” Laney wrote. But it will also be valuable.
“Your company’s biggest database isn’t your transaction, CRM, ERP or other internal database. Rather it’s the Web itself and the world of exogenous data now available from syndicated and open data sources,” Laney wrote.