Archive for the ‘Sqoop’ Category

MS SQL Server -> Hadoop

Thursday, January 16th, 2014

Community Tutorial 04: Import from Microsoft SQL Server into the Hortonworks Sandbox using Sqoop

From the webpage:

For a simple proof of concept I wanted to get data from MS SQL Server into the Hortonworks Sandbox in an automated fasion using Sqoop. Apache Sqoop provides a way of efficiently transferring bulk data between Apache Hadoop and relational databases. This tutorial will show you how to use Sqoop to import data into the Hortonworks Sandbox from a Microsoft SQL Server data source.

You’ll have to test this one without me.

I have thought about setting up a MS SQL Server but never got around to it. 😉

The Book on Apache Sqoop is Here!

Monday, July 15th, 2013

The Book on Apache Sqoop is Here! by Justin Kestelyn.

From the post:

Continuing the fine tradition of Clouderans contributing books to the Apache Hadoop ecosystem, Apache Sqoop Committers/PMC Members Kathleen Ting and Jarek Jarcec Cecho have officially joined the book author community: their Apache Sqoop Cookbook is now available from O’Reilly Media (with a pelican the assigned cover beast).

The book arrives at an ideal time. Hadoop has quickly become the standard for processing and analyzing Big Data, and in order to integrate a new Hadoop deployment into your existing environment, you will very likely need to transfer data stored in legacy relational databases into your new cluster.

Sqoop is just the ticket; it optimizes data transfers between Hadoop and RDBMSs via a command-line interface listing 60 parameters. This new cookbook focuses on applying these parameters to common use cases — one recipe at a time, Kate and Jarek guide you from basic commands that don’t require prior Sqoop knowledge all the way to very advanced use cases. These recipes are sufficiently detailed not only to enable you to deploy Sqoop in your environment, but also to understand its inner workings.

Good to see a command with a decent number of options, sixty (60).

A little lite when compared to ps at one hundred and eight-six (186) options and formatting flags.

I didn’t find a quick answer to the question: Which *nix command has the most options and formatting flags?

If you have a candidate, sing out!

Apache Bigtop: The “Fedora of Hadoop”…

Wednesday, June 26th, 2013

Apache Bigtop: The “Fedora of Hadoop” is Now Built on Hadoop 2.x by Roman Shaposhnik.

From the post:

Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes its packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.

And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:

  • Apache Zookeeper 3.4.5
  • Apache Flume 1.3.1
  • Apache HBase 0.94.5
  • Apache Pig 0.11.1
  • Apache Hive 0.10.0
  • Apache Sqoop 2 (AKA 1.99.2)
  • Apache Oozie 3.3.2
  • Apache Whirr 0.8.2
  • Apache Mahout 0.7
  • Apache Solr (SolrCloud) 4.2.1
  • Apache Crunch (incubating) 0.5.0
  • Apache HCatalog 0.5.0
  • Apache Giraph 1.0.0
  • LinkedIn DataFu 0.0.6
  • Cloudera Hue 2.3.0

And we were just talking about YARN and applications weren’t we? 😉


(Participate if you can but at least send a note of appreciation to Cloudera.)

What’s New in Apache Sqoop 1.4.2

Thursday, November 8th, 2012

What’s New in Apache Sqoop 1.4.2 by by Jarek Jarcec Cecho.

Jarek highlights the key features and fixes of this release of Apache Sqoop (its first as a top level project).

Those include:

  • Hadoop 2.0.0 Support
  • Compatibility with Old Connectors
  • Incremental Imports of Free Form Queries
  • Implicit and Explicit Connector Pickup Improvements
  • Exporting Only a Subset of Columns
  • Verbose Logging
  • Hive Imports

From the Apache Sqoop homepage:

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Hortonworks Data Platform v1.0 Download Now Available

Thursday, June 21st, 2012

Hortonworks Data Platform v1.0 Download Now Available

From the post:

If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was only available for evaluation for members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Moving forward and effective immediately, Hortonworks Data Platform is available to the general public.

Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.

Version 1.0 of Hortonworks Data Platform includes Apache Hadoop-1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and Zookeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify the installation and configuration steps in order to improve the experience of getting started with Apache Hadoop.

I’m a member of the general public! And you probably are too! 😉

See the rest of the post for more goodies that are included with this release.

Sqoop Graduation Meetup

Thursday, April 12th, 2012

Sqoop Graduation Meetup by Kathleen Ting.

From the post:

Cloudera hosted the Apache Sqoop Meetup last week at Cloudera HQ in Palo Alto. About 20 of the Meetup attendees had not used Sqoop before, but were interested enough to participate in the Meetup on April 4th. We believe this healthy interest in Sqoop will contribute to its wide adoption.

Not only was this Sqoop’s second Meetup but also a celebration for Sqoop’s graduation from the Incubator, cementing its status as a Top-Level Project in Apache Software Foundation. Sqoop’s come a long way since its beginnings three years ago as a contrib module for Apache Hadoop submitted by Aaron Kimball. As a result, it was fitting that Aaron gave the first talk of the night by discussing its history: “Sqoop: The Early Days.” From Aaron, we learned that Sqoop’s original name was “SQLImport” and that it was conceived out of his frustration from the inability to easily query both unstructured and structured data at the same time. (Emphasis added.)

I don’t think the extra 20 people were present because of Sqoop.

Did you see the picture of the cake?

My vote goes for the cake as explanation. Yours? 😉

Congratulations to Sqoop, Sqoop team and community!

Let’s make sure on its first birthday a bigger cake is required!

Apache Bigtop 0.3.0 (incubating) has been released

Wednesday, April 4th, 2012

Apache Bigtop 0.3.0 (incubating) has been released by Roman Shaposhnik.

From the post:

Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:

  • Apache Hadoop 1.0.1
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Thoughts on what is missing from this ecosystem?

What if you moved from the company where you wrote the scripts? And they needed new scripts?

Re-write? On what basis?

Is your “big data” big enough to need “big documentation?”

Apache Sqoop Graduates from Incubator

Tuesday, April 3rd, 2012

Apache Sqoop Graduates from Incubator by Arvind Prabhakar.

From the post:

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

In its monthly meeting in March of 2012, the board of Apache Software Foundation (ASF) resolved to grant a Top-Level Project status to Apache Sqoop, thus graduating it from the Incubator. This is a significant milestone in the life of Sqoop, which has come a long way since its inception almost three years ago.

For moving data in and out of Hadoop, Sqoop is your friend. Drop by and say hello.

What’s New in Apache Sqoop 1.4.0-incubating

Tuesday, January 3rd, 2012

What’s New in Apache Sqoop 1.4.0-incubating

New features and improvements in the first incubating release:

If you are interested in learning more about the changes, a complete list for Sqoop 1.4.0-incubating can be found here.  You are also encouraged to give this new release a try.  Any help and feedback is more than welcome. For more information on how to report problems and to get involved, visit the Sqoop project website at

BTW, “Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.” (From Apache Sqoop (incubating))