Archive for the ‘Apache Tajo’ Category

Apache Tajo brings data warehousing to Hadoop

Tuesday, March 10th, 2015

Apache Tajo brings data warehousing to Hadoop by Joab Jackson.

From the post:

Organizations that want to extract more intelligence from their Hadoop deployments might find help from the relatively little known Tajo open source data warehouse software, which the Apache Software Foundation has pronounced as ready for commercial use.

The new version of Tajo, Apache software for running a data warehouse over Hadoop data sets, has been updated to provide greater connectivity to Java programs and third party databases such as Oracle and PostgreSQL.

While less well-known than other Apache big data projects such as Spark or Hive, Tajo could be a good fit for organizations outgrowing their commercial data warehouses. It could also be a good fit for companies wishing to analyze large sets of data stored on Hadoop data processing platforms using familiar commercial business intelligence tools instead of Hadoop’s MapReduce framework.

Tajo performs the necessary ETL (extract-transform-load process) operations to summarize large data sets stored on an HDFS (Hadoop Distributed File System). Users and external programs can then query the data through SQL.

The latest version of the software, issued Monday, comes with a newly improved JDBC (Java Database Connectivity) driver that its project managers say makes Tajo as easy to use as a standard relational database management system. The driver has been tested against a variety of commercial business intelligence software packages and other SQL-based tools.

(Note: I removed the click-tracking links from the quoted text and kept only the link to the Tajo project page.)
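To make the JDBC point concrete, here is a minimal sketch of what querying Tajo from Java might look like. It uses only the standard java.sql API. The URL format (jdbc:tajo://host:port/database), the port 26002, and the page_views table are my assumptions for illustration, not details from Joab's post; check the Tajo documentation for your version, and note that the Tajo JDBC driver jar would need to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TajoJdbcSketch {

    // Assumption: Tajo JDBC URLs take the form jdbc:tajo://host:port/database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:tajo://" + host + ":" + port + "/" + database;
    }

    // Runs one SQL query and returns the number of rows in its result.
    static int countRows(String url, String sql) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            int rows = 0;
            while (rs.next()) {
                rows++;
            }
            return rows;
        }
    }

    public static void main(String[] args) {
        // Host, port, and database name are assumed values for a local test setup.
        String url = jdbcUrl("localhost", 26002, "default");
        // page_views is a hypothetical table used only for illustration.
        String sql = "SELECT country, COUNT(*) AS hits FROM page_views "
                   + "GROUP BY country ORDER BY hits DESC";
        try {
            System.out.println(countRows(url, sql) + " rows");
        } catch (SQLException e) {
            // Reached when no driver is on the classpath or no server is running.
            System.err.println("Could not query Tajo: " + e.getMessage());
        }
    }
}
```

Nothing here is specific to Tajo beyond the URL, which is the appeal of a JDBC driver: any tool that already speaks JDBC should be able to issue the same kind of query.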

Surprised by Apache Tajo, I looked at the list of top-level projects at Apache. While I recognized a fair number of them by name, I could tell you the status of only those I actively follow. Hard to say what other jewels are hidden there.

Joab cites several large data consumers who have found Apache Tajo faster than Hive for their purposes. Certainly an option to keep in mind.

Apache Tajo

Wednesday, April 2nd, 2014

Apache Tajo SQL-on-Hadoop engine now a top-level project by Derrick Harris.

From the post:

Apache Tajo, a relational database warehouse system for Hadoop, has graduated to top-level status within the Apache Software Foundation. It might be easy to overlook Tajo because its creators, committers and users are largely based in Korea — and because there’s a whole lot of similar technologies, including one developed at Facebook — but the project could be a dark horse in the race for mass adoption. Among Tajo’s lead contributors are an engineer from LinkedIn and members of the Hortonworks technical team, which suggests those companies see some value in it even among the myriad other options.

It is far too early to be choosing winners in the Hadoop ecosystem.

There are so many contenders, with their individual boosters, that if you don’t like the solutions offered today, wait a week or so, another one will pop up on the horizon.

Which isn’t a bad thing. There isn’t any reason to think IT has uncovered the best data structures or algorithms for your data, any more than you would have thought so twenty years ago.

The caution I would offer is to hold tightly to your requirements and not those of some solution. Compromise may be necessary on your part, but fully understand what you are giving up and why.

The only utility that software can have, for any given user, is that it performs some task they require to be performed. For vendors, adopters, and promoters, software has other utilities, which are unlikely to interest you.

Apache Tajo

Wednesday, March 27th, 2013

Apache Tajo

From the webpage:


Tajo is a relational and distributed data warehouse system for Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation and ETL on large-data sets by leveraging advanced database techniques. It supports SQL standards. Tajo uses HDFS as a primary storage layer and has its own query engine which allows direct control of distributed execution and data flow. As a result, Tajo has a variety of query evaluation strategies and more optimization opportunities. In addition, Tajo will have a native columnar execution and its optimizer.


  • Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
  • Rudiment ETL that transforms one data format to another data format.
  • Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
  • Command line interface to allow users to submit SQL queries
  • Java API to enable clients to submit SQL queries to Tajo

If you ever wanted to get in on the ground floor of a data warehouse project, this could be your chance!

I first saw this at Apache Incubator: Tajo – a Relational and Distributed Data Warehouse for Hadoop by Alex Popescu.