Organizations hoping to gain an advantage from big data will need to do more than zealously collect as much operational data as possible; they will also need to rethink how they process, analyze and present that material.
“When all this information finally gets to the business, it is difficult for the business to understand what to glean out of the data,” said Sharmila Shahani-Mulligan, CEO and co-founder of big-data startup ClearStory Data. “We know this has been a problem for several years now.”
Shahani-Mulligan was one of a number of speakers at the O’Reilly Strata + Hadoop World conference Thursday in New York to offer tips on making the move from data to big data. She suggested that the executive dashboard is giving way to the emerging technique of interactive storytelling, which gives data more readily apparent context and meaning.
Meanwhile, organizations should watch Google closely, advised M.C. Srivas, chief technology officer of Hadoop distributor MapR Technologies. “Google, with its vast and varied infrastructure, can provide us with a glimpse into the future of where computing is going,” said Srivas, who worked at Google before co-founding MapR.
One of the basic rules to pick up from Google is that “more data beats complex algorithms,” Srivas said. “This is something that Google has demonstrated again and again: The company that can process the most data will have an advantage over everybody else in the future.”
A number of MapR customers are following this principle, Srivas said.
Millennial Media, a leader in the mobile advertising market, collects roughly 4TB of mobile user data each day, combining it with petabytes of data already on hand to build profiles of mobile users.
Cisco collects data from its firewalls worldwide, aggregating about a million events per second, all to better detect security threats. Credit agency TransUnion collects data from multiple sources to provide real-time credit scores.
But once an organization has committed to collecting more data, the question becomes what to do with it.
Visualization is a handy tool, but picking the correct visualization is vitally important, advised Miriah Meyer, an assistant professor in the University of Utah’s School of Computing.
The most challenging and important step in visualization is “gaining an understanding of the user’s needs and then being able to translate that into a set of visualization requirements,” Meyer said.
Meyer worked with one researcher who was comparing the human genome with that of lizards. The researcher tried off-the-shelf data visualization tools, but found that they hid many pertinent details and were not intuitive to work with.
The tool Meyer helped create, called MizBee, allowed the researcher to get insights from the data that couldn’t be gleaned from generic visualization software.
“When done well, visualization has the potential not only to support science but to also influence it,” Meyer said. “We have to move beyond thinking that visualization is just about pretty pictures and instead embrace that it is a deep investigation into sense making.”
Dashboards are one form of visualization that could be used less, Shahani-Mulligan said.
Organizations have been using dashboards for well over a decade and not much has changed with them over that time, Shahani-Mulligan said. While they are fine for capturing key performance indicators and basic performance metrics, they are too brittle for advanced and timely analysis of big data, she said.
Dashboards are biased to look at data from predetermined contexts. They limit the amount of data that can be seen. And they aren’t interactive. “You can’t really dig in and see what is happening underneath the visuals,” Shahani-Mulligan said.
“This is a problem that we need to solve as data gets updated from sources faster, as decision times get down to a day or a week, and as more sources of data become available,” Shahani-Mulligan said. “We need to make it possible for businesses to see more information than they have been able to before.”
An emerging technique, called interactive storytelling, promises to provide a way to interact with data in more natural ways, Shahani-Mulligan said. ClearStory uses the Apache Spark streaming data-processing software as part of an interactive storytelling system.
“Interactive storytelling is about bringing more data to the surface, so [business executives] can actually see it in a way that has context and meaning,” Shahani-Mulligan said. She estimated that interactive storytelling could help businesses make decisions twice as quickly as they could by using traditional tools.
Much of big-data analysis is based on statistics, a discipline few software engineers know in detail, said Pinterest Chief Data Scientist John Rauser, who also worked at Amazon as a chief architect.
“I suspect many people in this audience are faking it when it comes to statistics,” Rauser said, provoking an audible collective gasp from the audience.
Nonetheless, a lack of intimate knowledge of power analysis, generalized linear models or other statistical methods does not mean that meaningful statistical analysis can’t be done, he said. Statistics is a field heavy in dense mathematical formulas, but its basic concepts are intuitive. Rather than reaching for formulas, engineers should look closely at what they are studying and translate the questions being asked into a series of simple computational methods.
“If you can program a computer, you have direct access to the deepest and most fundamental ideas in statistics,” Rauser said.
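Rauser’s point can be illustrated with a resampling sketch (my example, not one from the talk): instead of applying a closed-form significance test, a programmer can simulate the null hypothesis directly by shuffling the data. The function name and sample values below are illustrative assumptions.

```python
import random

def permutation_test(a, b, trials=10_000, seed=42):
    """Estimate the p-value for 'mean(a) differs from mean(b)'
    by repeatedly shuffling the pooled data, rather than using a
    closed-form statistical test."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        # Count how often chance alone produces a gap at least this large
        if diff >= observed:
            extreme += 1
    return extreme / trials

# Hypothetical measurements: the treated group runs visibly higher
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
treated = [12.9, 13.1, 12.8, 13.3, 12.7, 13.0]
p = permutation_test(control, treated)
```

The only tools involved are a loop, a shuffle and a counter, yet the result approximates what a formal two-sample test would report, which is the sense in which programming gives direct access to statistical ideas.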