Archive for the ‘Scalability’ Category

HyperLogLog — Cornerstone of a Big Data Infrastructure

Thursday, May 2nd, 2013

HyperLogLog — Cornerstone of a Big Data Infrastructure

From the introduction:

In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et. al. in the phenomenal paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. This sketch extends upon the earlier Loglog Counting of Large Cardinalities (Durand et. al.) which in turn is based on the seminal AMS work FM-85, Flajolet and Martin’s original work on probabilistic counting. (Many thanks to Jérémie Lumbroso for the correction of the history here. I am very much looking forward to his upcoming introduction to probabilistic counting in Flajolet’s complete works.) UPDATE – Rob has recently published a blog about PCSA, a direct precursor to LogLog counting which is filled with interesting thoughts. There have been a few posts on HLL recently so I thought I would dive into the intuition behind the sketch and into some of the details.

After seeing the HyperLogLog references in Approximate Methods for Scalable Data Mining I started looking for a fuller explanation/illustration of HyperLogLog.

Stumbled on this posting.

Includes a great HyperLogLog (HLL) simulation written in JavaScript.

Enjoy!

Approximate Methods for Scalable Data Mining

Thursday, May 2nd, 2013

Approximate Methods for Scalable Data Mining by Andrew Clegg.

Slides from a presentation at: Data Science London 24/04/13.

To get your interest, a nice illustration of HyperLogLog algorithm, “Billions of distinct values in 1.5KB of RAM with 2% relative error.”

Has a number of other useful illustrations and great references.

Slides and Photos: Bigger Data & Smarter Scaling

Tuesday, November 6th, 2012

Slides and Photos: Bigger Data & Smarter Scaling by Marci Windsheimer.

Report on the TimesOpen event on data and scaling.

You will find here:

Our most recent TimesOpen event was our biggest yet. Here are some highlights:

Should be more than enough inducement to catch the next TimesOpen event!

Exploiting Parallelism and Scalability (XPS) (NSF)

Thursday, October 25th, 2012

Exploiting Parallelism and Scalability (XPS) (NSF)

From the announcement:

Synopsis of Program:

Computing systems have undergone a fundamental transformation from the single-processor devices of the turn of the century to today’s ubiquitous and networked devices and warehouse-scale computing via the cloud. Parallelism has become ubiquitous at many levels. The proliferation of multi- and many-core processors, ever-increasing numbers of interconnected high performance and data intensive edge devices, and the data centers servicing them, is enabling a new set of global applications with large economic and social impact. At the same time, semiconductor technology is facing fundamental physical limits and single processor performance has plateaued. This means that the ability to achieve predictable performance improvements through improved processor technologies has ended.

The Exploiting Parallelism and Scalability (XPS) program aims to support groundbreaking research leading to a new era of parallel computing. XPS seeks research re-evaluating, and possibly re-designing, the traditional computer hardware and software stack for today’s heterogeneous parallel and distributed systems and exploring new holistic approaches to parallelism and scalability. Achieving the needed breakthroughs will require a collaborative effort among researchers representing all areas– from the application layer down to the micro-architecture– and will be built on new concepts and new foundational principles. New approaches to achieve scalable performance and usability need new abstract models and algorithms, programming models and languages, hardware architectures, compilers, operating systems and run-time systems, and exploit domain and application-specific knowledge. Research should also focus on energy- and communication-efficiency and on enabling the division of effort between edge devices and clouds.

Full proposals due: February 20, 2013, (due by 5 p.m. proposer’s local time).

I see the next wave of parallelism and scalability being based on language and semantics. Less so on more cores and better designs in silicon.

Not surprising since I work in languages and semantics every day.

Even so, consider a go-cart that exceeds 160 miles per hour (260 km/h) remains a go-cart.

Go beyond building a faster go-cart.

Consider language and semantics when writing your proposal for this program.

A Provably Correct Scalable Concurrent Skip List

Thursday, August 16th, 2012

From High Scalability, report of Paper: A Provably Correct Scalable Concurrent Skip List.

From the post:

In MemSQL Architecture we learned one of the core strategies MemSQL uses to achieve their need for speed is lock-free skip lists. Skip lists are used to efficiently handle range queries. Making the skip-lists lock-free helps eliminate contention and make writes fast.

If this all sounds a little pie-in-the-sky then here’s a very good paper on the subject that might help make it clearer: A Provably Correct Scalable Concurrent Skip List.

The cited paper is by Maurice Herlihy, Yossi Lev, Victor Luchangco, and Nir Shavit. The authors shared Sun Microsystems as an employer so you know the paper is dated.

For more background on lock-free data structures, including Keir Fraser’s “Practical lock freedom” dissertation, see: Practical lock-free data structures.

Compressive Genomics [Compression as Merging]

Wednesday, July 11th, 2012

Compressive genomics by Po-Ru Loh, Michael Baym, and Bonnie Berger (Nature Biotechnology 30, 627–630 (2012) doi:10.1038/nbt.2241)

From the introduction:

In the past two decades, genomic sequencing capabilities have increased exponentially[cites omitted] outstripping advances in computing power[cites omitted]. Extracting new insights from the data sets currently being generated will require not only faster computers, but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected[cite omitted]; thus, the amount of new sequence information is growing much more slowly.

Here we show that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods we term ‘compressive’ algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. Moreover, its relative advantage over existing algorithms will grow with the accumulation of genomic data. We demonstrate this approach by implementing compressive versions of both the Basic Local Alignment Search Tool (BLAST)[cite omitted] and the BLAST-Like Alignment Tool (BLAT)[cite omitted], and we emphasize how compressive genomics will enable biologists to keep pace with current data.

Software available at: Compression-accelerated BLAST and BLAT.

A new line of attack on searching “big data.”

Making “big data” into “smaller data” and enabling analysis of it while still “smaller data.”

Enabling the searching of highly similar genomes by compression is a form of merging isn’t it? That is a sequence (read subject) that occurs multiple times over similar genomes is given a single representative, while preserving its relationship to all the individual genome instances.

What makes merger computationally tractable here and yet topic may systems, at least some of them, are reported to have scalability issues: Scalability of Topic Map Systems by Marcel Hoyer?

What other examples of computationally tractable merging would you suggest? Including different merging approaches/algorithms. Thinking it might be a useful paper/study to work from scalable merging examples towards less scalable ones. Perhaps to discover what choices have an impact on scalability.

Concurrent Programming for Scalable Web Architectures

Wednesday, June 6th, 2012

Concurrent Programming for Scalable Web Architectures by Benjamin Erb.

Abstract:

Web architectures are an important asset for various large-scale web applications, such as social networks or e-commerce sites. Being able to handle huge numbers of users concurrently is essential, thus scalability is one of the most important features of these architectures. Multi-core processors, highly distributed backend architectures and new web technologies force us to reconsider approaches for concurrent programming in order to implement web applications and fulfil scalability demands. While focusing on different stages of scalable web architectures, we provide a survey of competing concurrency approaches and point to their adequate usages.

High Scalability has a good list of topics and the table of contents.

Or you can jump to the thesis homepage.

Just in case you are thinking about taking your application to “web scale.” ;-)

Paxos Made Moderately Complex

Saturday, May 12th, 2012

Paxos Made Moderately Complex

From the post:

If you are a normal human being and find the Paxos protocol confusing, then this paper, Paxos Made Moderately Complex, is a great find. Robbert van Renesse from Cornell University has written a clear and well written paper with excellent explanations.

The Abstract:

For anybody who has ever tried to implement it, Paxos is by no means a simple protocol, even though it is based on relatively simple invariants. This paper provides imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing various implementation details. The initial description avoids optimizations that complicate comprehension. Next we discuss liveness, and list various optimizations that make the protocol practical.

If you need safety (“freedom from inconsistency”) and fault-tolerant topic map results, you may want to spend some quality time with this paper.

As with most things, user requirements are going to drive the choices you have to make.

Hard for me to think a “loosely consistent” merging system is useful, but for TV entertainment data that may be enough. Who is sleeping with who probably has lag time in reporting anyway.

For more serious data, Paxos may be your protocol of choice.

Cell Architectures (adding dashes of heterogeneity)

Saturday, May 12th, 2012

Cell Architectures

From the post:

A consequence of Service Oriented Architectures is the burning need to provide services at scale. The architecture that has evolved to satisfy these requirements is a little known technique called the Cell Architecture.

A Cell Architecture is based on the idea that massive scale requires parallelization and parallelization requires components be isolated from each other. These islands of isolation are called cells. A cell is a self-contained installation that can satisfy all the operations for a shard. A shard is a subset of a much larger dataset, typically a range of users, for example.

Cell Architectures have several advantages:

  • Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
  • Cell are added in an incremental fashion as more capacity is required.
  • Cells isolate failures. One cell failure does not impact other cells.
  • Cells provide isolation as the storage and application horsepower to process requests is independent of other cells.
  • Cells enable nice capabilities like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
  • Cells can fail, be upgraded, and distributed across datacenters independent of other cells.

The intersection of semantic heterogeneity and scaling remains largely unexplored.

I suggest scaling in a homogeneous environment and then adding dashes of heterogeneity to see what breaks.

Adjust and try again.

Cassandra Radical NoSQL Scalability

Monday, February 27th, 2012

Cassandra Radical NoSQL Scalability by Tim Berglund.

From the description:

Cassandra is a scalable, highly available, column-oriented data store in use use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala and more companies that have large, active data sets. The largest known Cassandra cluster has over 300 TB of data in over 400 machines.

This open source project managed by the Apache foundation offers a compelling combination of a rich data model, a robust deployment track record, and a sound architecture. This video presents the Cassandra’s data model, works through its API in Java and Groovy, talks about how to deploy it and looks at use cases in which it is an appropriate data storage solution.

It explores the Amazon Dynamo project and Google’s BigTable and explains how its architecture helps us achieve the gold standard of scalability: horizontal scalability on commodity hardware. You will be ready to begin experimenting with Cassandra immediately and planning its adoption in your next project.

Take some time to look at CQL – Cassandra Query Language.

BTW, Berglund is a good presenter.

Journal of Web Semantics: Special Issue on Scalability

Wednesday, February 1st, 2012

Journal of Web Semantics: Special Issue on Scalability

Navigation Note:

Title of article goes to download page for the article.

Author’s names are hyperlinks to pages that list the author’s publications in the Journal of Web Semantics.

Editorial – Special Issue “Scalability”
Jeff Heflin, Heiner Stuckenschmidt

Scalable Distributed Indexing and Query Processing over Linked Data
Marcel Karnstedt, Kai-Uwe Sattler, Manfred Hauswirth

Searching Web Data: an Entity Retrieval and High-Performance Indexing Model
Renaud Delbru, Stephane Campinas, Giovanni Tummarello

WebPIE: A Web-scale parallel inference engine using MapReduce
Jacopo Urbani, Spyros Kotoulas, Jason Massen, Frank van Harmelen, Henri Bal

Scalable and Distributed Methods for Entity Matching, Consolidation and Disambiguation over Linked Data Corpora
Aidan Hogan, Antoine Zimmermann, Juergen Umbrich, Axel Polleres, Stefan Decker