Users of the Hadoop data processing platform now have two more tools to help them sort through their mountains of information.
Hadoop distributor MapR has integrated LucidWorks Search into its own distribution. Cloudera, meanwhile, has launched the first full release of its open source Impala SQL query engine for Hadoop.
“Using search as the user interface for big data is very interesting. Search is well suited to leveraging a lot of different types of information, especially unstructured information,” said Jack Norris, chief marketing officer for MapR. “We’re seeing some really interesting applications with search engines at their core, even if a typical user would not think of them as search engine driven.”
LucidWorks Search is the commercial version of the open source Apache Lucene/Solr full-text search engine. With the new MapR integration, LucidWorks Search can search through data stored either in the Hadoop Distributed File System (HDFS) or in files on other file systems.
LucidWorks Search offers snapshots and mirrors for high availability, and eliminates much of the work required to install Lucene/Solr from scratch. It also offers native support for more data sources, a graphical user interface and a security framework.
The search engine could be used in a dynamic Web application to quickly retrieve photos, advertising, product recommendations, and other information that can be used to populate Web sites on the fly. “This isn’t a lower cost substitute for data warehouses. This is about leveraging new data sources and doing some things that have a dramatic impact on the business,” Norris said.
MapR and LucidWorks have been working together on pairing their technologies since 2011, when they formed a joint marketing agreement. Earlier this year, they released a connector that makes it easy to use Lucene/Solr with the MapR Hadoop distribution.
LucidWorks Search works with MapR’s newly released M7 distribution, in beta form. In addition to supporting LucidWorks Search, the M7 edition has been re-architected to eliminate compactions and background consistency checks, speeding performance.
Also this week, Cloudera released version 1.0 of Cloudera Impala, an open source SQL-compliant query engine for Hadoop. SQL is the database interface language used in relational database management systems (RDBMS) and is well known to database administrators.
Impala was designed to execute queries faster than Hadoop’s Hive, because it doesn’t use the MapReduce framework, which requires intermediate results to be written to disk. Instead, users can query data stored in HDFS and HBase directly, either interactively or through batch processes.
Cloudera first released a version of this engine last October as a beta. Since then, the software has been tested by companies such as 37signals and Expedia.
Impala is the core component of the Cloudera Enterprise RTQ (Real-Time Query) supplemental package for the Cloudera Hadoop platform. Impala can be downloaded at no cost.
Updated May 6 to correct information on the Cloudera Impala technology.