Apache Tajo brings data warehousing to Hadoop by Joab Jackson.
From the post:
Organizations that want to extract more intelligence from their Hadoop deployments might find help from the relatively little known Tajo open source data warehouse software, which the Apache Software Foundation has pronounced as ready for commercial use.
The new version of Tajo, Apache software for running a data warehouse over Hadoop data sets, has been updated to provide greater connectivity to Java programs and third-party databases such as Oracle and PostgreSQL.
While less well-known than other Apache big data projects such as Spark or Hive, Tajo could be a good fit for organizations outgrowing their commercial data warehouses. It could also be a good fit for companies wishing to analyze large sets of data stored on Hadoop data processing platforms using familiar commercial business intelligence tools instead of Hadoop’s MapReduce framework.
Tajo performs the necessary ETL (extract-transform-load process) operations to summarize large data sets stored on an HDFS (Hadoop Distributed File System). Users and external programs can then query the data through SQL.
The latest version of the software, issued Monday, comes with a newly improved JDBC (Java Database Connectivity) driver that its project managers say makes Tajo as easy to use as a standard relational database management system. The driver has been tested against a variety of commercial business intelligence software packages and other SQL-based tools. (Just so you know, I took out the click-tracking stuff and inserted only the link to the Tajo project page.)
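To make the "query HDFS data with plain SQL over JDBC" point concrete, here is a minimal sketch of what that looks like from a Java program. The driver class name, JDBC URL scheme, and the general CREATE EXTERNAL TABLE form follow the Tajo documentation as I read it, but the host, database, table name, schema, delimiter, and HDFS path are all made up for illustration, so treat it as the shape of the workflow rather than a recipe:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TajoJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Driver class and default client port (26002) per the Tajo docs;
        // host and database name here are hypothetical.
        Class.forName("org.apache.tajo.jdbc.TajoDriver");
        String url = "jdbc:tajo://tajo-master.example.com:26002/default";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {

            // Register an external table over files already sitting in HDFS.
            // The path, column names, and delimiter are assumptions.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS web_logs " +
                "(ts TIMESTAMP, url TEXT, bytes BIGINT) " +
                "USING TEXT WITH ('text.delimiter'='|') " +
                "LOCATION 'hdfs:///data/web_logs'");

            // Summarize the data with ordinary SQL, as a BI tool would.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits, SUM(bytes) AS total_bytes " +
                    "FROM web_logs GROUP BY url ORDER BY hits DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.printf("%s\t%d\t%d%n",
                        rs.getString("url"),
                        rs.getLong("hits"),
                        rs.getLong("total_bytes"));
                }
            }
        }
    }
}
```

A BI tool does essentially the same thing behind the scenes: load the Tajo JDBC driver, open a connection, and issue ordinary SQL, which is why an improved driver is the headline feature of the release.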
Surprised by Apache Tajo, I looked at the list of top-level projects at Apache, and while I recognized a fair number of them by name, I could only tell you the status of those I actively follow. Hard to say what other jewels are hidden there.
Joab cites several large data consumers who have found Apache Tajo faster than Hive for their purposes. Certainly an option to keep in mind.