Archive for the ‘Architecture’ Category

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

Monday, June 1st, 2015

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop by Ted Malaska.

From the post:

Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.

The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems.

In this post, I will outline the four major streaming patterns that we have encountered with customers running enterprise data hubs in production, and explain how to implement those patterns architecturally on Hadoop.

Streaming Patterns

The four basic streaming patterns (often used in tandem) are:

  • Stream ingestion: Involves low-latency persisting of events to HDFS, Apache HBase, and Apache Solr.
  • Near Real-Time (NRT) Event Processing with External Context: Takes actions like alerting, flagging, transforming, and filtering of events as they arrive. Actions might be taken based on sophisticated criteria, such as anomaly detection models. Common use cases, such as NRT fraud detection and recommendation, often demand low latencies under 100 milliseconds.
  • NRT Event Partitioned Processing: Similar to NRT event processing, but deriving benefits from partitioning the data—like storing more relevant external information in memory. This pattern also requires processing latencies under 100 milliseconds.
  • Complex Topology for Aggregations or ML: The holy grail of stream processing: gets real-time answers from data with a complex and flexible set of operations. Here, because results often depend on windowed computations and require more active data, the focus shifts from ultra-low latency to functionality and accuracy.

In the following sections, we’ll get into recommended ways for implementing such patterns in a tested, proven, and maintainable way.

Great post on patterns for near real-time data processing.

What I have always wondered about is how much of a use case is there for “near real-time processing” of data? If human decision makers are in the loop, that is outside of ecommerce and algorithmic trading, what is the value-add of “near real-time processing” of data?

For example, Kai Wähner in Real-Time Stream Processing as Game Changer in a Big Data World with Hadoop and Data Warehouse gives the following as common use cases for “near real-time processing” of data”

  • Network monitoring
  • Intelligence and surveillance
  • Risk management
  • E-commerce
  • Fraud detection
  • Smart order routing
  • Transaction cost analysis
  • Pricing and analytics
  • Market data management
  • Algorithmic trading
  • Data warehouse augmentation

Ecommerce, smart order routing, algorithmic trading all fall into the category of no human involved so those may need real-time processing.

But take network monitoring for example. From the news reports I understand that hackers had free run of the Sony network for months. I suppose you must have network monitoring at all before real-time network monitoring would be useful at all.

I would probe to make sure that “real-time” was necessary for the use cases at hand before simply assuming it. In smaller organizations, access to data and “real-time” results are more often a symptom of control issues as opposed to any actual use case for the data.

Architecture Matters…

Sunday, February 23rd, 2014

Architecture Matters : Building Clojure Services At Scale At SoundCloud by Charles Ditzel.

Charles points to three posts on Clojure services at scale:

Building Clojure Services at Scale by Joseph Wilk.

Architecture behind our new Search and Explore experience by Petar Djekic.

Evolution of SoundCloud’s Architecture by Sean Treadway.

If you aren’t already following Charle’s blog (I wasn’t, am now), you should be.

Getty Art & Architecture Thesaurus Now Available

Saturday, February 22nd, 2014

Art & Architecture Thesaurus Now Available as Linked Open Data by James Cuno.

From the post:

We’re delighted to announce that today, the Getty has released the Art & Architecture Thesaurus (AAT)® as Linked Open Data. The data set is available for download at vocab.getty.edu under an Open Data Commons Attribution License (ODC BY 1.0).

The Art & Architecture Thesaurus is a reference of over 250,000 terms on art and architectural history, styles, and techniques. It’s one of the Getty Research Institute’s four Getty Vocabularies, a collection of databases that serves as the premier resource for cultural heritage terms, artists’ names, and geographical information, reflecting over 30 years of collaborative scholarship. The other three Getty Vocabularies will be released as Linked Open Data over the coming 18 months.

In recent months the Getty has launched the Open Content Program, which makes thousands of images of works of art available for download, and the Virtual Library, offering free online access to hundreds of Getty Publications backlist titles. Today’s release, another collaborative project between our scholars and technologists, is the next step in our goal to make our art and research resources as accessible as possible.

What’s Next

Over the next 18 months, the Research Institute’s other three Getty Vocabularies—The Getty Thesaurus of Geographic Names (TGN)®, The Union List of Artist Names®, and The Cultural Objects Name Authority (CONA)®—will all become available as Linked Open Data. To follow the progress of the Linked Open Data project at the Research Institute, see their page here.

A couple of points of particular interest:

Getty documentation says this is the first industrial application of ISO 25964 Information and documentation – Thesauri and interoperability with other vocabularies..

You will probably want to read AAT Semantic Representation rather carefully.

A great source of data and interesting reading on the infrastructure as well.

I first saw this in a tweet by Semantic Web Company.

Third Age of Computing?

Monday, August 26th, 2013

The ‘third era’ of app development will be fast, simple, and compact by Rik Myslewski.

From the post:

The tutorial was conducted by members of the HSA – heterogeneous system architecture – Foundation, a consortium of SoC vendors and IP designers, software companies, academics, and others including such heavyweights as ARM, AMD, and Samsung. The mission of the Foundation, founded last June, is “to make it dramatically easier to program heterogeneous parallel devices.”

As the HSA Foundation explains on its website, “We are looking to bring about applications that blend scalar processing on the CPU, parallel processing on the GPU, and optimized processing of DSP via high bandwidth shared memory access with greater application performance at low power consumption.”

Last Thursday, HSA Foundation president and AMD corporate fellow Phil Rogers provided reporters with a pre-briefing on the Hot Chips tutorial, and said the holy grail of transparent “write once, use everywhere” programming for shared-memory heterogeneous systems appears to be on the horizon.

According to Rogers, heterogeneous computing is nothing less than the third era of computing, the first two being the single-core era and the muti-core era. In each era of computing, he said, the first programming models were hard to use but were able to harness the full performance of the chips.

(…)

Exactly how HSA will get there is not yet fully defined, but a number of high-level features are accepted. Unified memory addressing across all processor types, for example, is a key feature of HSA. “It’s fundamental that we can allocate memory on one processor,” Rogers said, “pass a pointer to another processor, and execute on that data – we move the compute rather than the data.”
(…)

Rik does a deep dive with references to the HSA Programmers Reference Manual to Project Sumatra that bring data-parallel algorithms to Java 9 (2015).

The only discordant note is that Nivdia and Intel are both missing from the HSA Foundation. Invited but not present.

Customers of Nvidia and/or Intel (I’m both) should contact Nvidia (Contact us) and Intel (contact us) and urge them to join the HSA Foundation. And pass this request along.

Sharing of memory is one of the advantages of HSA (heterogeneous systems architecture) and it is the where the semantics of shared data will come to the fore.

I haven’t read the available HSA documents in detail, but the HSA Programmer’s Reference Manual appears to presume that shared data has only one semantic. (It never says that but that is my current impression.)

We have seen that the semantics of data is not “transparent.” The same demonstration illustrates that data doesn’t always have the same semantic.

Simply because I am pointed to a particular memory location, there is no reason to presume I should approach that data with the same semantics.

For example, what if I have a Social Security Number (SSN). In processing that number for the Social Security Administration, it may serve to recall claim history, eligibility, etc. If I am accessing the same data to compare it to SSN records maintained by the Federal Bureau of Investigation (FBI), it may not longer be a unique identifier in the same sense as at the SSA.

Same “data,” but different semantics.

Who you gonna call? Topic Maps!

PS: Perhaps not as part of the running code but to document the semantics you are using to process data. Same data, same memory location, multiple semantics.

Big Ball of Mud

Saturday, August 27th, 2011

Big Ball of Mud by Brian Foote and Joseph Yoder.

I ran across a reference to this paper by John Schmidt in his reply to a comment on his post Four Canonical Techniques That Really Work (Or Not).

The authors present seven patterns of software systems:

  • BIG BALL OF MUD
  • THROWAWAY CODE
  • PIECEMEAL GROWTH
  • KEEP IT WORKING
  • SHEARING LAYERS
  • SWEEPING IT UNDER THE RUG
  • RECONSTRUCTION

All the superlatives have been used before so I will simply say read it.

Think about topic maps, Semantic Web apps, information systems you have helped write or design. Do you recognize any of them after reading this paper? What would you do differently today?