Hortonworks has released a preview distribution of the next generation of Apache Hadoop, one that promises to broaden the scope of the kinds of analysis that can be carried out on the data processing platform.
“Hadoop 2.0 is truly a fundamental architecture change, one that makes Hadoop significantly more than just a batch platform,” said Arun Murthy, a founder of Hortonworks and one of the core engineers developing Hadoop. The update “will fuel a whole new wave of innovation,” he said.
The Hortonworks Data Platform 2.0 Community Preview contains a number of new components for the Hadoop environment, most notably YARN (Yet Another Resource Negotiator), a successor to Hadoop’s MapReduce job scheduler.
Hadoop started as a “single application platform,” one primarily built for crawling and indexing Web content, Murthy said. Organizations are now looking to use it for other kinds of jobs, such as interactive querying or analysis of real-time streams of data.
YARN improves on MapReduce by expanding the types of jobs that can run on a Hadoop platform. MapReduce could manage little more than batch processing jobs, which execute a data analysis across any number of nodes and return the results only once the work has completed.
In contrast, YARN is a general-purpose resource management framework. It provides a foundation to run nonbatch processing jobs, such as those that run indefinitely on live streams of data, and those that involve interactive queries, in which users interrogate the data on the fly. “You can now have both the batch MapReduce jobs and interactive SQL queries running right next to each other in YARN,” Murthy said.
Using YARN, “you have a cluster that is aware of all the different types of workloads and resource needs, so they can all cohabitate. You don’t get one workload dominating or taking over all the resources of the cluster,” said Shaun Connolly, Hortonworks vice president of corporate strategy. Previously, organizations would have to run separate clusters to execute different styles of jobs.
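The idea of workloads sharing one cluster without any of them taking it over can be illustrated with a toy model. The Python sketch below is not the YARN API and not how Hortonworks implements scheduling; the class names, queue names, memory figures and the fixed-share rule are all assumptions made for illustration. It only shows a single resource manager refusing a request that would push one workload past its share.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    """Hypothetical workload queue with a fixed share of cluster memory."""
    name: str
    share: float      # fraction of total cluster memory this queue may use
    used_mb: int = 0  # memory currently granted to this queue


@dataclass
class ToyResourceManager:
    """Toy stand-in for a cluster-wide resource manager (not the YARN API)."""
    total_mb: int
    queues: dict = field(default_factory=dict)

    def add_queue(self, name: str, share: float) -> None:
        self.queues[name] = Queue(name, share)

    def request(self, queue_name: str, mb: int) -> bool:
        """Grant the request only if the queue stays within its share."""
        queue = self.queues[queue_name]
        limit_mb = int(self.total_mb * queue.share)
        if queue.used_mb + mb > limit_mb:
            return False  # this workload may not crowd out the others
        queue.used_mb += mb
        return True

    def release(self, queue_name: str, mb: int) -> None:
        """Return memory to the cluster when a job's container finishes."""
        queue = self.queues[queue_name]
        queue.used_mb = max(0, queue.used_mb - mb)


# Assumed setup: an 8 GB cluster split evenly between two workload types.
rm = ToyResourceManager(total_mb=8192)
rm.add_queue("batch", share=0.5)
rm.add_queue("interactive", share=0.5)

# The batch workload asks for six 1 GB containers; only four fit its share.
batch_grants = [rm.request("batch", 1024) for _ in range(6)]

# The interactive workload can still get resources on the same cluster.
interactive_granted = rm.request("interactive", 2048)

print(batch_grants)
print(interactive_granted)
```

A production scheduler is far more elaborate than this; the sketch only captures the point Connolly makes about different workloads cohabiting on one cluster.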
HDP 2.0 includes a number of other new components as well, including Apache Tez, an add-on to YARN for speeding up large, interactive jobs, and Stinger, a collection of technologies that provides the ability to run SQL queries against a Hadoop repository.
This preview of HDP 2.0, a full Hadoop distribution, runs in either Oracle VirtualBox or VMware virtual environments.
Hortonworks announced HDP 2.0 at the 2013 Hadoop Summit, being held this week in San Jose, California. Also at the conference, Rackspace announced it would offer Hadoop as a service, with analysis tools from Pentaho. Splunk released a new tool, called Hunk, to explore Hadoop repositories. Data warehouse systems provider Teradata unveiled new Hadoop appliances. And VMware updated its vSphere virtualization management software to support Hadoop clusters.