Announcing Pulsar: Real-time Analytics at Scale by Sharad Murthy and Tony Ng.
From the post:
We are happy to announce Pulsar – an open-source, real-time analytics platform and stream processing framework. Pulsar can be used to collect and process user and business events in real time, providing key insights and enabling systems to react to user activities within seconds. In addition to real-time sessionization and multi-dimensional metrics aggregation over time windows, Pulsar uses a SQL-like event processing language to offer custom stream creation through data enrichment, mutation, and filtering. Pulsar scales to a million events per second with high availability. It can be easily integrated with metrics stores like Cassandra and Druid.
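To make "real-time sessionization" concrete, here is a minimal sketch of gap-based sessionization: a user's events belong to the same session until the user has been inactive for longer than a chosen gap. This is not Pulsar's implementation. The function name `sessionize`, the `(user_id, timestamp)` event shape, and the 30-minute gap are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def sessionize(
    events: Iterable[Tuple[str, int]], gap_seconds: int
) -> Dict[str, List[List[int]]]:
    """Group (user_id, timestamp) events into per-user sessions.

    Assumes events arrive sorted by timestamp. A new session starts
    whenever a user's next event is more than gap_seconds after that
    user's previous event.
    """
    sessions: Dict[str, List[List[int]]] = {}
    for user_id, timestamp in events:
        user_sessions = sessions.setdefault(user_id, [])
        if user_sessions and timestamp - user_sessions[-1][-1] <= gap_seconds:
            # Still within the inactivity gap: extend the current session.
            user_sessions[-1].append(timestamp)
        else:
            # First event for this user, or the gap was exceeded.
            user_sessions.append([timestamp])
    return sessions


# Hypothetical events: (user_id, timestamp in seconds).
sample_events = [("a", 0), ("b", 5), ("a", 10), ("b", 20), ("a", 2000)]
sample_sessions = sessionize(sample_events, gap_seconds=1800)
print(sample_sessions)
```

A streaming system would also need to close sessions on a timer and cope with late or out-of-order events, which this batch-style sketch ignores.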
eBay provides a platform that enables millions of buyers and sellers to conduct commerce transactions. To help optimize eBay end users’ experience, we perform analysis of user interactions and behaviors. Over the past years, batch-oriented data platforms like Hadoop have been used successfully for user behavior analytics. More recently, we have newer use cases that demand collection and processing of vast numbers of events in near real time (within seconds), in order to derive actionable insights and generate signals for immediate action. Here are examples of such use cases:
- Real-time reporting and dashboards
- Business activity monitoring
- Marketing and advertising
- Fraud and bot detection
We identified a set of systemic qualities that are important to support these large-scale, real-time analytics use cases:
- Scalability – Scaling to millions of events per second
- Latency – Sub-second event processing and delivery
- Availability – No cluster downtime during software upgrade, stream processing rule updates, and topology changes
- Flexibility – Ease in defining and changing processing logic, event routing, and pipeline topology
- Productivity – Support for complex event processing (CEP) and a 4GL language for data filtering, mutation, aggregation, and stateful processing
- Data accuracy – 99.9% data delivery
- Cloud deployability – Node distribution across data centers using standard cloud infrastructure
Given our unique set of requirements, we decided to develop our own distributed CEP framework. Pulsar CEP provides a Java-based framework as well as tooling to build, deploy, and manage CEP applications in a cloud environment. Pulsar CEP includes the following capabilities:
- Declarative definition of processing logic in SQL
- Hot deployment of SQL without restarting applications
- Annotation plugin framework to extend SQL functionality
- Pipeline flow routing using SQL
- Dynamic creation of stream affinity using SQL
- Declarative pipeline stitching using Spring IOC, thereby enabling dynamic topology changes at runtime
- Clustering with elastic scaling
- Cloud deployment
- Publish-subscribe messaging with both push and pull models
- Additional CEP capabilities through Esper integration
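As a rough illustration of what declarative stream processing of this kind expresses, the sketch below filters events and counts them per tumbling time window and per combination of dimensions. It is plain Python, not Pulsar's SQL dialect or API. The names `tumbling_counts`, `keep`, and the event fields `timestamp`, `country`, and `device` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Sequence, Tuple

Event = Dict[str, object]


def tumbling_counts(
    events: Iterable[Event],
    window_seconds: int,
    dimensions: Sequence[str],
    keep: Callable[[Event], bool] = lambda event: True,
) -> Dict[Tuple, int]:
    """Count events per (window_start, *dimension values).

    Each event is assigned to the fixed-size, non-overlapping window
    that contains its timestamp. Events rejected by keep are skipped.
    """
    counts: Dict[Tuple, int] = defaultdict(int)
    for event in events:
        if not keep(event):
            continue  # filtering step
        timestamp = int(event["timestamp"])
        window_start = timestamp - timestamp % window_seconds
        key = (window_start,) + tuple(event[name] for name in dimensions)
        counts[key] += 1  # aggregation step
    return dict(counts)


# Hypothetical events with two dimensions.
sample = [
    {"timestamp": 0, "country": "US", "device": "mobile"},
    {"timestamp": 4, "country": "US", "device": "mobile"},
    {"timestamp": 9, "country": "DE", "device": "desktop"},
    {"timestamp": 12, "country": "US", "device": "mobile"},
]
by_country_device = tumbling_counts(sample, 10, ("country", "device"))
mobile_only = tumbling_counts(
    sample, 10, ("country",), keep=lambda e: e["device"] == "mobile"
)
print(by_country_device)
print(mobile_only)
```

In a SQL-like CEP language the same intent would be stated as a query over a time window with a grouping clause. The point of the declarative form is that such a statement can be changed and redeployed without rewriting code like the loop above.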
On top of this CEP framework, we implemented a real-time analytics data pipeline.
That should be enough to capture your interest!
I saw it just after coming off a two-and-a-half-hour conference call. Nice way to decompress.
Other places to look:
If you don’t know Docker already, you will. Courtesy of the Pulsar Get Started page.
Nice to have yet another high-performance data tool.