Archive for the ‘Kafka’ Category

Confluence: Mapping @apachekafka connect schema types – to usual suspects

Thursday, March 8th, 2018

Confluence has posted a handy mapping from Kafka connect schema types to MySQL, Oracle, PostgreSQL, SQLite, SQL Server and Vertica.

The sort of information that I will waste 10 to 15 minutes every time I need it. Posting it here means I’ll cut the wasted time down to maybe 5 minutes if I remember I posted about it. 😉

Top 5 Cloudera Engineering Blogs of 2017

Tuesday, January 9th, 2018

Top 5 Cloudera Engineering Blogs of 2017

From the post:

1. Working with UDFs in Apache Spark

2. Offset Management For Apache Kafka With Apache Spark Streaming

3. Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem

4. Up and running with Apache Spark on Apache Kudu

5. Apache Impala Leads Traditional Analytic Database

Kudos to Cloudera for a useful list of “top” blog posts for 2017.

We might disagree on the top five but it’s a manageable number of posts and represents the quality of Cloudera postings all year long.


Streaming SQL for Apache Kafka

Wednesday, December 27th, 2017

Streaming SQL for Apache Kafka by Jojjat Jafarpour.

From the post:

We are very excited to announce the December release of KSQL, the streaming SQL engine for Apache Kafka! As we announced in the November release blog, we are releasing KSQL on a monthly basis to make it even easier for you to get up and running with the latest and greatest functionality of KSQL to solve your own business problems.

The December release, KSQL 0.3, includes both new features that have been requested by our community as well as under-the-hood improvements for better robustness and resource utilization. If you have already been using KSQL, we encourage you to upgrade to this latest version to take advantage of the new functionality and improvements.

The KSQL Github page links to:

  • KSQL Quick Start: Demonstrates a simple workflow using KSQL to write streaming queries against data in Kafka.
  • Clickstream Analysis Demo: Shows how to build an application that performs real-time user analytics.

These are just quick start materials but are your ETL projects ever as simple as USERID to USERID? Or have such semantically transparent fields? Or what Itake to be semantically transparent fields (they may not be).

As I pointed out in Where Do We Write Down Subject Identifications? earlier today, where do I record what I know about what appears in those fields? Including on what basis to merge them with other data?

If you see where KSQL is offering that ability, please ping me because I’m missing it entirely. Thanks!

Apache Kafka: Online Talk Series [Non-registration for 5 out of 6]

Saturday, December 9th, 2017

Apache Kafka: Online Talk Series

From the webpage:

Watch this six-part series of online talks presented by Kafka experts. You will learn the key considerations in building a scalable platform for real-time stream data processing, with Apache Kafka at its core.

This series is targeted to those who want to understand all the foundational concepts behind Apache Kafka, streaming data, and real-time processing on streams. The sequence begins with an introduction to Kafka, the popular streaming engine used by many large scale data environments, and continues all the way through to key production planning, architectural and operational methods to consider.

Whether you’re just getting started or have already built stream processing applications for critical business functions, you will find actionable tips and deep insights that will help your enterprise further derive important business value from your data systems.

Video titles:

1. Introduction To Streaming Data and Stream Processing with Apache Kafka, Jay Kreps, Confluent CEO and Co-founder, Apache Kafka Co-creator.

2. Deep Dive into Apache Kafka by Jun Rao, Confluent Co-founder, Apache Kafka Co-creator.

3. Data Integration with Apache Kafka by David Tucker, Director, Partner Engineering and Alliances.

4. Demystifying Stream Processing with Apache Kafka, Neha Narkhede, Confluent CTO and Co-Founder, Apache Kafka Co-creator.

5. A Practical Guide to Selecting a Stream Processing Technology by Michael Noll, Product Manager, Confluent.

6. Streaming in Practice: Putting Kafka in Production by Roger Hoover, Engineer, Confluent. (Registration required. Anyone know a non-registration version of Hoover’s presentation?)

I was able to find versions of the first five videos that don’t require you to register to view them.

I make it a practice to dodge marketing department registrations whenever possible.


Apache Kafka 8.2.1 (and reasons to upgrade from 8.2)

Thursday, March 12th, 2015

Apache Kafka 8.2.1 has been released!

I don’t normally note point releases but a tweet by Michael G. Noll pointing to the 8.2.1 release notes caused this post.

The release notes read:


  • [KAFKA-1919] – Metadata request issued with no backoff in new producer if there are no topics
  • [KAFKA-1952] – High CPU Usage in 0.8.2 release
  • [KAFKA-1971] – starting a broker with a conflicting id will delete the previous broker registration
  • [KAFKA-1984] – java producer may miss an available partition

More than you can get into a tweet but still important information.

Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka

Thursday, January 22nd, 2015

Webinar: Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka by Helena Edelson.

From the post:

On Tuesday, January 13 I gave a webinar on Apache Spark, Spark Streaming and Cassandra. Over 1700 registrants from around the world signed up. This is a follow-up post to that webinar, answering everyone’s questions. In the talk I introduced Spark, Spark Streaming and Cassandra with Kafka and Akka and discussed wh​​​​y these particular technologies are a great fit for lambda architecture due to some key features and strategies they all have in common, and their elegant integration together. We walked through an introduction to implementing each, then showed how to integrate them into one clean streaming data platform for real-time delivery of meaning at high velocity. All this in a highly distributed, asynchronous, parallel, fault-tolerant system.

Video | Slides | Code | Diagram

About The Presenter: Helena Edelson is a committer on several open source projects including the Spark Cassandra Connector, Akka and previously Spring Integration and Spring AMQP. She is a Senior Software Engineer on the Analytics team at DataStax, a Scala and Big Data conference speaker, and has presented at various Scala, Spark and Machine Learning Meetups.

I have long contended that it is possible to have a webinar that has little if any marketing fluff and maximum technical content. Helena’s presentation is an example of that type of webinar.

Very much worth the time to watch.

BTW, being so content full, questions were answered as part of this blog post. Technical webinars just don’t get any better organized than this one.

Perhaps technical webinars should be marked with TW and others with CW (for c-suite webinars). To prevent disorientation in the first case and disappointment in the second one.

Integrating Kafka and Spark Streaming: Code Examples and State of the Game

Wednesday, October 1st, 2014

Integrating Kafka and Spark Streaming: Code Examples and State of the Game by Michael G. Noll.

From the post:

Spark Streaming has been getting some attention lately as real-time data processing tool, often mentioned alongside Apache Storm. If you ask me, no real-time data processing tool is complete without Kafka integration (smile), hence I added an example Spark Streaming application to kafka-storm-starter that demonstrates how to read from Kafka and write to Kafka, using Avro as the data format and Twitter Bijection for handling the data serialization.

In this post I will explain this Spark Streaming example in further detail and also shed some light on the current state of Kafka integration in Spark Streaming. All this with the disclaimer that this happens to be my first experiment with Spark Streaming.

If mid-week is when you like to brush up on emerging technologies, Michael’s post is a good place to start.

The post is well organized and has enough notes, asides and references to enable you to duplicate the example and to expand your understanding of Kafka and Spark Streaming.

Apache Storm 0.9 Training Deck and Tutorial

Monday, September 15th, 2014

Apache Storm 0.9 Training Deck and Tutorial by Michael G. Noll.

From the post:

Today I am happy to share an extensive training deck on Apache Storm version 0.9, which covers Storm’s core concepts, operating Storm in production, and developing Storm applications. I also discuss data serialization with Apache Avro and Twitter Bijection.

The training deck (130 slides) is aimed at developers, operations, and architects.

What the training deck covers

  1. Introducing Storm: history, Storm adoption in the industry, why Storm
  2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
  3. Operating Storm: architecture, hardware specs, deploying, monitoring
  4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps (with kafka-storm-starter), performance and scalability tuning
  5. Playing with Storm using Wirbelsturm

What a great way to start the week! Well, at least if you were intending to start learning about Storm this week.

BTW, see Michael’s post for links to other resources, such as his tutorial on Kafka.

Apache Kafka for Beginners

Saturday, September 13th, 2014

Apache Kafka for Beginners by Gwen Shapira and Jeff Holoman.

From the post:

When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.

Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was founded, is the most well known user, there are many companies successfully using this technology.

So now that the word is out, it seems the world wants to know: What does it do? Why does everyone want to use it? How is it better than existing solutions? Do the benefits justify replacing existing systems and infrastructure?

In this post, we’ll try to answers those questions. We’ll begin by briefly introducing Kafka, and then demonstrate some of Kafka’s unique features by walking through an example scenario. We’ll also cover some additional use cases and also compare Kafka to existing solutions.

What is Kafka?

Kafka is one of those systems that is very simple to describe at a high level, but has an incredible depth of technical detail when you dig deeper. The Kafka documentation does an excellent job of explaining the many design and implementation subtleties in the system, so we will not attempt to explain them all here. In summary, Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable. (emphasis in original)

A great reference to use for your case to technical management about Kafka. In particular the line:

even a small three-node cluster can process close to a million events per second with an average latency of 3ms.

Sure, there are applications with more stringent processing requirements, but there are far more applications with less than a million events per second.

Does your topic map system get updated more than a million times a second?

Analyzing 1.2 Million Network Packets…

Sunday, June 15th, 2014

Analyzing 1.2 Million Network Packets per Second in Real Time by James Sirota and Sheetal Dolas.

Slides giving an overview of OpenSOC (Open Security Operations Center).

I mention this in case you are not the NSA and simply streaming the backbone of the Internet to storage for later analysis. Some business cases require real time results.

The project is also a good demonstration of building a high throughput system using only open source software.

Not to mention a useful collaboration between Cisco and Hortonworks.

BTW, take a look at slide 18. I would say they are adding information to the representative of a subject, wouldn’t you? While on the surface this looks easy, merging that data with other data, say held by local law enforcement, might not be so easy.

For example, depending on where you are intercepting traffic, you will be told I am about thirty (30) miles from my present physical location or some other answer. 😉 Now, if someone had annotated an earlier packet with that information and it was accessible to you, well, your targeting of my location could be a good deal more precise.

And there is the question of using data annotated by different sources who may have been attacked by the same person or group.

Even at 1.2 million packets per second there is still a role for subject identity and merging.


Friday, May 23rd, 2014

Kafka-Storm-Starter by Michael G. Noll.

From the webpage:

Code examples that show to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using Apache Avro as the data serialization format.

If you aren’t excited already (from their respective homepages):

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Apache Storm is a free and open source distributed realtime computation system.

Apache Avro™ is a data serialization system.

Now are you excited?


Note the superior organization of the project documentation!

Following the table of contents you find:

Quick Start

Show me!

$ ./sbt test

Short of starting up remotely and allowing you to import/keyboard data, I can’t imagine an easier way to start project documentation.

It’s a long weekend in the United States so check out Michael G. Noll’s GitHub repository for other interesting projects.

Building LinkedIn’s Real-time Activity Data Pipeline

Thursday, August 16th, 2012

Building LinkedIn’s Real-time Activity Data Pipeline by Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. (pdf)


One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processing each day. We discuss the origins of this systems, missteps on the path to real-time, and the design and engineering problems we encountered along the way.

More details on Kafka (see Choking Cassandra Bolt).

What if you think about message feeds as being pipelines that are large enough to see and configure?

Chip level pipelines are more efficient but harder to configure.

Perhaps passing messages is efficient and flexible enough for a class of use cases.

Choking Cassandra Bolt

Tuesday, August 7th, 2012

Got your attention? Good!

Brian O’Neill details in A Big Data Trifecta: Storm, Kafka and Cassandra an architecture that was fast enough to choke the Cassandra Bolt component. (And also details how to fix that problem.)

Based on the exchange of tuples. Writing at 5,000 writes per second on a laptop.

More details to follow but I think you can get enough from the post to start experimenting on your own.

I first saw this at: Alex Popesu’s myNoSQL under A Big Data Triefecta: Storm, Kafka and Cassandra.