No Unified Stack Soon for Big Data
Despite the growing interest in big data platforms, it may be some time before organizations can deploy a standardized big data software stack, speakers concluded Wednesday during a virtual panel hosted by GigaOm.
The panelists agreed that a standardized stack of big data analysis software would make it easier to develop large-scale data analysis systems, in much the same way the open source LAMP stack engendered a whole generation of Web 2.0 services over the past decade. But the ways software such as Hadoop can be used vary so much that it may be difficult to settle on one core package of technologies, the panelists said.
LAMP is an abbreviation for a set of software programs that work very well together: Linux, the Apache Web server, the MySQL database and a set of programming languages--Perl, Python and PHP.
LAMP "provided a common framework upon which people could build. It was freely available. It was easily understood. It ran on almost anything. It created a foundation upon which a generation of startups grew up," said independent consultant Paul Miller, who moderated the panel, "Designing for Big Data: The New Architectural Stack."
"As we're beginning to see an explosion of interest in big data, do we need a stack that is similarly ubiquitous? Do we need a LAMP stack for big data?" Miller asked.
All agreed that not having a standardized stack slows deployments of big data systems. "There isn't a standard stack, and people aren't clear which piece works best for a particular workload. There's a trial and error period going on now," said Jo Maitland, a research director covering cloud technology for GigaOm Pro.
One reason LAMP was so popular was that its users all had similar needs, all based around putting services online, pointed out Mark Baker, Canonical Ubuntu server product manager. The needs around analysis, on the other hand, tend to vary from business to business, and change often, he noted.
Large Web services companies that use Hadoop, such as eBay and Twitter, are running in a "continuous beta," and they hire a lot of technically competent staff to handle the pace of rapid change, said Mark Staimer, president of Dragon Slayer Consulting.
"Having a constantly evolving platform and stack is fine for them. They have the process and culture within the company to manage it," Staimer said. The more traditional "brick and mortar" companies are "much more conservative," Staimer added. "They like to see a fully baked solution."
Arriving at such a stack may be difficult, given the variety of technologies available, and the degrees of difficulty inherent in connecting them together in various configurations.
"Now we have loads of different pieces out there that you can plug together. Just in the database space, there is MongoDB, Cassandra, HBase," Maitland said. All this choice "makes it more difficult for people. We're in a mashup situation with all these different components."
Such variety came about to address differing needs among users, Baker said. MySQL, for instance, is really fast at reading data, whereas the Cassandra data store can write data more quickly. The production company behind the U.K. television show "Britain's Got Talent" used a Cassandra database to log the votes of viewers choosing their favorite performer, because it could ingest a high number of writes simultaneously, Baker noted.
A number of companies, such as Cloudera, Hortonworks and MapR, have released commercial Hadoop distributions in which all the software components are integrated. But even Hadoop itself is not suited for all jobs, Maitland argued. It processes data as batch jobs, meaning the full data set must be written to a file before it can be analyzed. Many jobs, however, involve the analysis of continually updated data, such as click streams or Twitter messages.
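The distinction Maitland draws between batch jobs and continually updated data can be illustrated with a minimal Python sketch. This is not Hadoop code: the function and class names (batch_count, StreamingCounter) and the sample click data are hypothetical, chosen only to contrast the two styles of processing.

```python
from collections import Counter


def batch_count(records):
    """Batch style: the complete data set must exist before counting starts."""
    return Counter(records)


class StreamingCounter:
    """Streaming style: running totals are updated as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        # Fold one new event into the running totals and return its new count.
        self.counts[event] += 1
        return self.counts[event]


# Hypothetical click stream: page names visited, in arrival order.
clicks = ["home", "cart", "home", "checkout", "home"]

# Batch: nothing is known until the whole list has been collected.
batch_result = batch_count(clicks)

# Streaming: an up-to-date answer is available after every single event.
stream = StreamingCounter()
for page in clicks:
    stream.update(page)
```

Both approaches end with the same totals. The difference is when an answer becomes available: the batch version only after all the data has been gathered, the streaming version after each event.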
Also, a stack would need to have support from more than one company to be an industry standard, Maitland said. "If there is going to be a stack, it needs to be [managed by] an open source organization and not necessarily managed by a specific company," Maitland said.
Another problem with not having a standardized stack is that it drives up the cost of hiring experts to manage and use such systems. Right now the competition for experts is fierce.
"Trying to build [a big data system] takes knowledge and skill. To plug those into your infrastructure can take time and money," Baker said. "There is no standard roadmap -- it is a feeling along process. Putting it all together is not a simple task."
"You can't have the explosive growth in an industry with so much specialized knowledge that is required as of right now," Maitland said.
"The average business analyst can't write queries against Hadoop," Staimer added.