Yahoo has helped the Indian Institute of Technology Bombay to set up a Hadoop cluster lab in Mumbai by donating a cluster of servers running the open-source Hadoop software.
Apache Hadoop is an open-source distributed-computing project of the Apache Software Foundation that Yahoo supports.
Yahoo runs a large number of its critical operations using Hadoop, and it cannot do all the research required around Hadoop within the company, said Prabhakar Raghavan, senior vice president and head of Yahoo Labs, in a telephone interview on Thursday.
Yahoo announced in June last year its own distribution of Hadoop, citing interest from the Apache Hadoop community that it publish the version of Hadoop it tests and deploys on its own large clusters.
The cluster lab at Mumbai will help researchers at the institute study areas such as searching and ranking techniques, information extraction and natural language processing.
Academic researchers studying Web-related issues have typically not had access to the compute resources and terabytes of data that are required for research into “Web-scale problems”, Raghavan said.
Since providing Hadoop researchers at Carnegie Mellon University with a 4,000-processor supercomputer in 2007, Yahoo has helped other universities in the U.S. to set up Hadoop clusters, he said.
Raghavan did not give more details on the cluster installed at the IIT, saying only that servers with hundreds of CPUs and the capability to handle terabytes of data have been deployed there.
Besides IIT Bombay, Yahoo is helping set up similar clusters at academic institutions in Germany and Singapore, Raghavan said. These are the first three academic institutions outside the U.S. where Yahoo is helping set up such clusters, he added.
Yahoo plans to later network some of the clusters around the world to create a “bigger utility”. Before that, the system administration capabilities of Hadoop have to be strengthened to prevent a student at one institution from crashing the work at another institution, Raghavan said.
Yahoo teamed up in 2008 with Computational Research Laboratories (CRL), a lab run by India’s Tata Group, to offer supercomputing facilities free to academic institutions in India that are researching large-scale computing, particularly around Hadoop. That collaboration continues, but is focused on high-performance supercomputing, Raghavan said.
Partnering with academic institutions on Hadoop helps Yahoo build a pool of engineers familiar with the Hadoop platform, according to Raghavan. Some of them have even been hired by Yahoo, he added.
Yahoo has also benefited from ideas that have come from academic institutions doing research using Hadoop, Raghavan said. He did not, however, give specific instances of ideas that were picked up from this collaboration with academic institutions.