Archive for the ‘Erlang’ Category

xquerl (“We always do it nice and rough” Tina Turner)

Thursday, September 14th, 2017


From the webpage:

Erlang XQuery 3.1 Processor

This is a currently a draft/proof-of-concept. Please don’t try to use it for “real” computing!

It is passing about 91% its (~25k) test cases.

Features it has:

  • Module Feature
  • Higher-Order Function Feature

Features it does not have, but might later:

  • XQuery Update Facility
  • Schema Aware Feature
  • Typed Data Feature
  • Static Typing Feature
  • Serialization Feature

If you want to combine an interest in Erlang along with XQuery 3.1, you have arrived!

Decide for yourself which is the “nice” part and which is the “rough.”


Functional Programming in Erlang – MOOC – 20 Feb. 2017

Wednesday, February 8th, 2017

Functional Programming in Erlang with Simon Thompson (co-author of Erlang Programming)

From the webpage:

Functional programming is increasingly important in providing global-scale applications on the internet. For example, it’s the basis of the WhatsApp messaging system, which has over a billion users worldwide.

This free online course is designed to teach the principles of functional programming to anyone who’s already able to program, but wants to find out more about the novel approach of Erlang.

Learn the theory of functional programming and apply it in Erlang

The course combines the theory of functional programming and the practice of how that works in Erlang. You’ll get the opportunity to reinforce what you learn through practical exercises and more substantial, optional practical projects.

Over three weeks, you’ll:

  • learn why Erlang was developed, how its design was shaped by the context in which it was used, and how Erlang can be used in practice today;
  • write programs using the concepts of functional programming, including, in particular, recursion, pattern matching and immutable data;
  • apply your knowledge of lists and other Erlang data types in your programs;
  • and implement higher-order functions using generic patterns.

The course will also help you if you are interested in Elixir, which is based on the same virtual machine as Erlang, and shares its fundamental approach as well as its libraries, and indeed will help you to get going with any functional language, and any message-passing concurrency language – for example, Google Go and the Akka library for Scala/Java.

If you are not excited already, remember that XQuery is a functional programming language. What if your documents were “immutable data?”

Use #FLerlangfunc to see Twitter discussions on the course.

That looks like a committee-drafted hashtag. 😉

SOAP and ODBC Erlang Libraries!

Friday, April 22nd, 2016

Bet365 donates Erlang libraries to GitHub by Cliff Saran.

From the post:

Online bookie Bet365 has released code into the GitHub open-source library to encourage enterprise developers to use the Erlang functional programming language.

The company has used Erlang since 2012 to overcome the challenges of using higher performance hardware to support ever-increasing volumes of web traffic.

“Erlang is a precision tool for developing distributed systems that demand scale, concurrency and resilience. It has been a superb technology choice in a business such as ours that deals in high traffic volumes,” said Chandru Mullaparthi, head of software architecture at Bet365.

I checked: the SOAP library is out and the ODBC library is forthcoming.

Cliff’s post ends with this cryptic sentence:

These releases represent the first phase of a support programme that will aim to address each of the major issues surrounding the uptake of Erlang.

That sounds promising!

Following @cmullaparthi to catch developing news.

Conference Videos for the Holidays

Wednesday, November 18th, 2015

As you know, I saw Alexander Songe’s CRDT: Datatype for the Apocalypse presentation earlier today.

With holidays approaching next week, November 23rd-27th, 2015 in the United States, I thought some of you may need additional high quality video references.

Clojure TV

Elixir Conf 2014

Elixir Conf 2015

Erlang Solutions



No slight intended for any conference videos I didn’t list. I will list different conference videos for the next holiday list, which will appear in December 2015.


PS: I have to apologize for the poor curating of videos by their hosts. With only a little more effort, these videos could be a valuable day-to-day resource.

Solving the Stable Marriage problem…

Friday, August 21st, 2015

Solving the Stable Marriage problem with Erlang by Yan Cui.

With all the Ashley Madison hack publicity, I didn’t know there was a “stable marriage problem.” 😉

Turns out it is like the Eight-Queens problem. It is a “problem” but it isn’t one you are likely to encounter outside of a CS textbook.

Yan sets up the problem with this quote from Wikipedia:

The stable marriage problem is commonly stated as:

Given n men and n women, where each person has ranked all members of the opposite sex with a unique number between 1 and n in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. If there are no such people, all the marriages are “stable”. (It is assumed that the participants are binary gendered and that marriages are not same-sex).

The wording is a bit awkward. I would rephrase it to say that no two people, each married to someone else, both prefer each other to their current partners. One of the partners can prefer someone else, but if that someone else does not share the preference, both marriages are “stable.”

The Wikipedia article does observe:

While the solution is stable, it is not necessarily optimal from all individuals’ points of view.

Yan sets up the problem and then walks through the required code.
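
For readers who want to poke at the definition before reading Yan's Erlang, here is a minimal Python sketch. It is not Yan's code. It assumes the classic Gale-Shapley proposal algorithm (men propose, women keep the best offer so far), and every name in it (stable_marriage, is_stable, the sample preference lists) is made up for illustration.

```python
def stable_marriage(men_prefs, women_prefs):
    """Gale-Shapley sketch: returns a dict mapping each woman to her partner."""
    # rank[w][m] is the position of m in w's list (lower means preferred).
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}  # index of next woman to propose to
    engaged = {}                             # woman -> man
    free_men = list(men_prefs)

    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        current = engaged.get(w)
        if current is None:
            engaged[w] = m                   # w was free, accept
        elif rank[w][m] < rank[w][current]:
            engaged[w] = m                   # w trades up, old partner is free again
            free_men.append(current)
        else:
            free_men.append(m)               # w rejects m, he will try his next choice
    return engaged


def is_stable(matching, men_prefs, women_prefs):
    """True if no man and woman both prefer each other to their partners."""
    husband = matching
    wife = {m: w for w, m in matching.items()}
    for m, prefs in men_prefs.items():
        for w in prefs:
            if w == wife[m]:
                break                        # everyone after this ranks below his wife
            # m prefers w to his wife; does w prefer m to her husband?
            if women_prefs[w].index(m) < women_prefs[w].index(husband[w]):
                return False
    return True


# Hypothetical sample data: three men (a, b, c) and three women (x, y, z).
men_prefs = {"a": ["x", "y", "z"], "b": ["y", "x", "z"], "c": ["x", "y", "z"]}
women_prefs = {"x": ["b", "a", "c"], "y": ["a", "b", "c"], "z": ["a", "b", "c"]}

matching = stable_marriage(men_prefs, women_prefs)
print(matching, is_stable(matching, men_prefs, women_prefs))
```

Note that "c" ends up with his last choice. That is the Wikipedia caveat in action: stable, but not optimal from every individual's point of view.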


dgol – Distributed Game Of Life

Wednesday, August 19th, 2015

dgol – Distributed Game Of Life by Mirko Bonadei and Gabriele Lana.

From the webpage:

This project is an implementation of the Game of life done by Gabriele Lana and me during the last months.

We took it as a “toy project” to explore all the nontrivial decisions that need to be made when you have to program a distributed system (eg: choose the right supervision strategy, how to make sub-systems communicate each other, how to store data to make it fault tolerant, ecc…).

It is inspired by the Torben Hoffman’s version and on the talk Thinking like an Erlanger.

The project is still under development, at the moment we are doing a huge refactoring of the codebase because we are reorganizing the supervision strategy.

Don’t just nod at the Thinking like an Erlanger link. Part of its description reads:

If you find Erlang is a bit tough, or if testing gives you headaches, this webinar is for you. We will spend most of this intensive session looking at how to design systems with asynchronous message passing between processes that do not share any memory.

Definitely watch the video and follow the progress of this project!


Saturday, August 1st, 2015

Lasp: A Language for Distributed, Eventually Consistent Computations by Christopher S. Meiklejohn and Peter Van Roy.

From the webpage:

Why Lasp?

Lasp is a new programming model designed to simplify large scale, fault-tolerant, distributed programming. Lasp is being developed as part of the SyncFree European research project. It leverages ideas from distributed dataflow extended with convergent replicated data types, or CRDTs. This supports computations where not all participants are online together at a given moment. The initial design supports synchronization-free programming by combining CRDTs together with primitives for composing them inspired by functional programming. This lets us write long-lived fault-tolerant distributed applications, including ones with nonmonotonic behavior, in a functional paradigm. The initial prototype is implemented as an Erlang library built on top of the Riak Core distributed systems infrastructure.
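
If "convergent replicated data types" is new to you, a toy example may help. The sketch below is not Lasp and does not use its API. It is a minimal Python version of a grow-only counter (G-Counter), one of the simplest state-based CRDTs, with hypothetical class and method names. Each replica only increments its own slot, and merging takes the per-replica maximum, so replicas converge no matter how often or in what order they exchange state.

```python
class GCounter:
    """Grow-only counter: a minimal state-based CRDT sketch."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments seen from that replica

    def increment(self, n=1):
        # A replica only ever touches its own entry.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # which is what lets replicas sync without coordination.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


# Two replicas work offline, then exchange state in both directions.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
print(a.value(), b.value())
```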


Other resources include:

Lasp-dev, the mailing list for Lasp developers.

Lasp at Github.

I was reminded to post about Lasp by this post from Christopher Meiklejohn:

This post is a continuation of my first post about leaving Basho Technologies after almost 4 years.

It has been quite a long time in the making, but I’m finally happy to announce that I am the recipient of a Erasmus Mundus fellowship in their Joint Doctorate in Distribute Computing program. I will be pursuing a full-time Ph.D., with my thesis devoted to developing the Lasp programming language for distributed computing with the goals of simplifying deterministic, distributed, edge computation.

Starting in February 2016, I will be moving to Belgium to begin my first year of studies at the Université catholique de Louvain supervised by Peter Van Roy followed by a second year in Lisbon at IST supervised by Luís Rodrigues.

If you like this article, please consider supporting my writing on gittip.

Looks like exciting developments are ahead for Lasp!

Congratulations to Christopher Meiklejohn!

Erlang/OTP 18.0 has been released

Tuesday, June 30th, 2015

Erlang/OTP 18.0 has been released by Henrik.

From the post:

Erlang/OTP 18.0 is a new major release with new features, quite a few (characteristics) improvements, as well as some incompatibilities.
A non functional but major change this release is the change of license to APL 2.0 (Apache Public License).

Some highlights of the release are:

  • Starting from 18.0 Erlang/OTP is released under the APL 2.0 (Apache Public License)
  • erts: The time functionality has been extended. This includes a new API for time, as well as “time warp” modes which alters the behavior when system time changes. You are strongly encouraged to use the new API instead of the old API based on erlang:now/0. erlang:now/0 has been deprecated since it is a scalability bottleneck. For more information see the Time and Time Correction chapter of the ERTS User’s Guide.
  • erts: Beside the API changes and time warp modes a lot of scalability and performance improvements regarding time management has been made. Examples are:

    • scheduler specific timer wheels,
    • scheduler specific BIF timer management,
    • parallel retrieval of monotonic time and system time on OS:es that support it.
  • erts: The previously introduced “eager check I/O” feature is now enabled by default.
  • erts/compiler: enhanced support for maps. Big maps new uses a HAMT (Hash Array Mapped Trie) representation internally which makes them more efficient. There is now also support for variables as map keys.
  • dialyzer: The -dialyzer() attribute can be used for suppressing warnings in a module by specifying functions or warning options. It can also be used for requesting warnings in a module.
  • ssl: Remove default support for SSL-3.0 and added padding check for TLS-1.0 due to the Poodle vulnerability.
  • ssl: Remove default support for RC4 cipher suites, as they are consider too weak.
  • stdlib: Allow maps for supervisor flags and child specs
  • stdlib: New functions in ets:

    • take/2. Works the same as ets:delete/2 but also returns the deleted object(s).
    • ets:update_counter/4 with a default object as

You can find the Release Notes with more detailed info at

A major holiday approaches in the United States (July 4th): a time when budget-puffing terror alerts are issued, fatal automobile accidents surge, and driving-while-intoxicated arrests jump, the usual marks of a US holiday.

If you spend some time with Erlang/OTP 18, you can greet your co-workers who survive the long weekend, albeit with frayed nerves from long proximity to family members and hangovers to boot, with some new tricks.

Structure and Interpretation of Computer Programs (LFE Edition)

Thursday, February 26th, 2015

Structure and Interpretation of Computer Programs (LFE Edition)

From the webpage:

This Gitbook (available here) is a work in progress, converting the MIT classic Structure and Interpretation of Computer Programs to Lisp Flavored Erlang. We are forever indebted to Harold Abelson, Gerald Jay Sussman, and Julie Sussman for their labor of love and intelligence. Needless to say, our gratitude also extends to the MIT press for their generosity in licensing this work as Creative Commons.


This is a huge project, and we can use your help! Got an idea? Found a bug? Let us know!.

Writing, or re-writing if you are transposing a CS classic into another language, is far harder than most people imagine. Re-writing is probably even more difficult than writing the original, because your range of creativity is bound by the organization and themes of the underlying text.

I may have some cycles to donate to proofreading. Anyone else?

NkBASE distributed database (Erlang)

Tuesday, February 24th, 2015

NkBASE distributed database (Erlang)

From the webpage:

NkBASE is a distributed, highly available key-value database designed to be integrated into Erlang applications based on riak_core. It is one of the core pieces of the upcoming Nekso’s Software Defined Data Center Platform, NetComposer.

NkBASE uses a no-master, share-nothing architecture, where no node has any special role. It is able to store multiple copies of each object to achive high availabity and to distribute the load evenly among the cluster. Nodes can be added and removed on the fly. It shows low latency, and it is very easy to use.

NkBASE has some special features, like been able to work simultaneously as a eventually consistent database using Dotted Version Vectors, a strong consistent database and a eventually consistent, self-convergent database using CRDTs called dmaps. It has also a flexible and easy to use query language that (under some circunstances) can be very efficient, and has powerful support for auto-expiration of objects.

The minimum recommended cluster size for NkBASE is three nodes, but it can work from a single node to hundreds of them. However, NkBASE is not designed for very high load or huge data (you really should use the excellent Riak and Riak Enterprise for that), but as an in-system, flexible and easy to use database, useful in multiple scenarios like configuration, sessions, cluster coordination, catalogue search, temporary data, cache, field completions, etc. In the future, NetComposer will be able to start and manage multiple kinds of services, including databases like a full-blown Riak.

NkBASE has a clean code base, and can be used as a starting point to learn how to build a distributed Erlang system on top of riak_core, and to test new backends or replication mechanisms. NkBASE would have been impossible without the incredible work from Basho, the makers of Riak: riak_core, riak_dt and riak_ensemble.

Several things caught my attention about NkBASE.

That it is written in Erlang was the first thing.

That it is based on riak_core was the second thing.

But the thing that sealed its appearance here was:

NkBASE is not designed for very high load or huge data (you really should use the excellent Riak and Riak Enterprise for that)


A software description that doesn’t read like Topper in Dilbert?


See the GitHub page for all the details but this looks promising, for the right range of applications.

Scientific Computing on the Erlang VM

Tuesday, January 6th, 2015

Scientific Computing on the Erlang VM by Duncan McGreggor.

From the post:

This tutorial brings in the New Year by introducing the Erlang/LFE scientific computing library lsci – a ports wrapper of NumPy and SciPy (among others) for the Erlang ecosystem. The topic of the tutorial is polynomial curve-fitting for a given data set. Additionally, this post further demonstrates py usage, the previously discussed Erlang/LFE library for running Python code from the Erlang VM.


The content of this post was taken from a similar tutorial done by the same author for the Python Lisp Hy in an IPython notebook. It, in turn, was completely inspired by the Clojure Incantor tutorial on the same subject, by David Edgar Liebke.

This content is also available in the lsci examples directory.


The lsci library (pronounced “Elsie”) provides access to the fast numerical processing libraries that have become so popular in the scientific computing community. lsci is written in LFE but can be used just as easily from Erlang.

Just in case Erlang was among your New Year’s Resolutions. 😉

Well, that’s not the only reason. You are going to encounter data processing that was performed in systems or languages that are strange to you. Assuming access to the data and a sufficient explanation of what was done, you need to be able to verify analysis in a language comfortable to you.

There isn’t now nor is there likely to be a shortage of languages and applications for data processing. Apologies to the various evangelists who dream of world domination for their favorite. Unless and until that happy day for someone arrives, the rest of us need to survive in a multilingual and multi-application space.

Which means having the necessary tools for data analysis/verification in your favorite tool suite counts for a lot. It is the difference in taking someone’s word for analysis and verifying the analysis for yourself. There is a world of difference between those two positions.

Stuff Goes Bad: Erlang in Anger

Wednesday, September 17th, 2014

Stuff Goes Bad: Erlang in Anger by Fred Hébert.

From the webpage:

This book intends to be a little guide about how to be the Erlang medic in a time of war. It is first and foremost a collection of tips and tricks to help understand where failures come from, and a dictionary of different code snippets and practices that helped developers debug production systems that were built in Erlang.

From the introduction:

This book is not for beginners. There is a gap left between most tutorials, books, training sessions, and actually being able to operate, diagnose, and debug running systems once they’ve made it to production. There’s a fumbling phase implicit to a programmer’s learning of a new language and environment where they just have to figure how to get out of the guidelines and step into the real world, with the community that goes with it.

This book assumes that the reader is proficient in basic Erlang and the OTP framework. Erlang/OTP features are explained as I see fit — usually when I consider them tricky — and it is expected that a reader who feels confused by usual Erlang/OTP material will have an idea of where to look for explanations if necessary.

What is not necessarily assumed is that the reader knows how to debug Erlang software, dive into an existing code base, diagnose issues, or has an idea of the best practices about deploying Erlang in a production environment. (footnote numbers omitted)

With exercises no less.

Reminds me of a book I had some years ago on causing and then debugging Solr core dumps. 😉 I don’t think it was ever a best seller but it was a fun read.

Great title by the way.

I first saw this in a tweet by Chris Meiklejohn.

Erlang/OTP [New Homepage]

Monday, June 16th, 2014

Erlang/OTP [New Homepage]

I saw a tweet advising that the Erlang/OTP homepage had been re-written.

This shot from the Wayback Machine, dated October 11, 2011, Erlang/OTP homepage 2011, is how I remember the old homepage.

Today, the page seems a bit deep to me but includes details like the top three reasons to use Erlang/OTP for a cluster system (C/S):

  • Cost: cheaper to use an open source C/S than write or rent one
  • Speed To Market: quicker to use an C/S than write one
  • Availability and Reliability: Erlang/OTP systems have been measured at 99.9999999% uptime (31ms a year downtime) (emphasis added)
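
The “31ms a year” figure is easy to check. Here is a small Python sketch of the arithmetic, assuming a 365-day year. The function name is my own, and Decimal is used so that nine nines do not get lost in floating-point rounding.

```python
from decimal import Decimal


def downtime_ms_per_year(uptime_percent, days_per_year=365):
    """Yearly downtime, in milliseconds, implied by an uptime percentage.

    uptime_percent is passed as a string so Decimal sees it exactly.
    """
    seconds_per_year = Decimal(days_per_year) * 24 * 60 * 60
    downtime_fraction = 1 - Decimal(uptime_percent) / 100
    return seconds_per_year * downtime_fraction * 1000


print(downtime_ms_per_year("99.9999999"))  # nine nines: about 31.5 ms
print(downtime_ms_per_year("99.999"))      # five nines, for comparison
```

So the homepage rounds down slightly: nine nines works out to roughly 31.5 milliseconds per year, while the more commonly quoted five nines allows a little over five minutes.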

That would be a good question to ask at the next big data conference: What is the measured reliability of system X?

Functional Geekery

Friday, May 30th, 2014

Functional Geekery by Steve Proctor.

I stumbled across episode 9 of Functional Geekery (a podcast) in Clojure Weekly, May 29th, 2014 and was interested to hear the earlier podcasts.

It’s only nine other episodes and not a deep blog history but still, I thought it would be nice to have a single listing of all the episodes.

Do be aware that each episode has a rich set of links to materials mentioned/discussed in each podcast.

If you enjoy these podcasts, do be sure to encourage others to listen to them and encourage Steve to continue with his excellent work.

  • Episode 1 – Robert C. Martin

    In this episode I talk with Robert C. Martin, better known as Uncle Bob. We run the gamut from Structure and Interpretation of Computer Programs, introducing children to programming, TDD and the REPL, compatibility of Functional Programming and Object Oriented Programming

  • Episode 2 – Craig Andera

    In this episode I talk with fellow podcaster Craig Andera. We talk about working in Clojure, ClojureScript and Datomic, as well as making the transition to functional programming from C#, and working in Clojure on Windows. I also get him to give some recommendations on things he learned from guests on his podcast, The Cognicast.

  • Episode 3 – Fogus

    In this episode I talk with Fogus, author of The Joy of Clojure and Functional JavaScript. We cover his history with functional languages, working with JavaScript in a functional style, and digging into the history of software development.

  • Episode 4 – Zach Kessin

    In this episode I talk with fellow podcaster Zach Kessin. We cover his background in software development and podcasting, the background of Erlang, process recovery, testing tools, as well as profiling live running systems in Erlang.

  • Episode 5 – Colin Jones

    In this episode I talk with Colin Jones, software craftsman at 8th Light. We cover Colin’s work on the Clojure Koans, making the transition from Ruby to Clojure, how functional programming affects the way he does object oriented design now, and his venture into learning Haskell.

  • Episode 6 – Reid Draper

    In this episode I talk with Reid Draper. We cover Reid’s intro to functional programming through Haskell, working in Erlang, distributed systems, and property testing; including his property testing tool simple-check, which has since made it into a Clojure contrib project as test.check.

  • Episode 7 – Angela Harms and Jason Felice on avi

    In this episode I talk with Angela Harms and Jason Felice about avi. We talk about the motivation of a vi implementation written in Clojure, the road map of where avi might used, and expressivity of code.

  • Functional Geekery Episode 08 – Jessica Kerr

    In this episode I talk with Jessica Kerr. In this episode we talk bringing functional programming concepts to object oriented languages; her experience in Scala, using the actor model, and property testing; and much more!

  • Functional Geekery Episode 9 – William E. Byrd

    In this episode I talk with William Byrd. We talk about miniKanren and the differences between functional, logic and relational programming. We also cover the idea of thinking at higher levels of abstractions, and comparisons of relational programming to topics such as SQL, property testing, and code contracts.

  • Functional Geekery Episode 10 – Paul Holser

    In this episode I talk with Paul Holser. We start out by talking about his junit-quickcheck project, being a life long learner and exploring ideas about computation from other languages, and what Java 8 is looking like in with the support of closures and lambdas.


Lisp Flavored Erlang

Saturday, May 24th, 2014

Lisp Flavored Erlang: These are your father’s parentheses. Elegant weapons, for a more …civilized age.

From the homepage:


LFE has many origins, depending upon whether you’re looking at Lisp (and here), Erlang, or LFE-proper. The LFE community of contributors embraces all of these and more.

From the original release message:

I have finally released LFE, Lisp Flavoured Erlang, which is a lisp syntax front-end to the Erlang compiler. Code produced with it is compatible with “normal” Erlang code. The is an LFE-mode for Emacs and the lfe-mode.el file is include in the distribution… (Robert Virding)

I haven’t looked up the numbers but I am sure that LFE is, in the terminology of academia, one of the less often taught languages. However, it sounds deeply interesting as we all march towards scalable concurrent processing.


Saturday, May 24th, 2014


Almost two hundred (195 as of May 24, 2014) links gathered in the following groups:

  • API Clients
  • Blogs
  • Books
  • Community
  • Database clients
  • Debugging and profiling
  • Documentation
  • Documentation tools
  • Editors and IDEs
  • Erlang for beginners
  • Erlang Internals
  • Erlang interviews and resources
  • Erlang – more advanced topics
  • Exercises
  • Http clients
  • Json
  • Load testing tools
  • Loggers
  • Network
  • Other languages on top of the Erlang VM
  • Package managers
  • Podcasts
  • Projects using Erlang
  • Style guide and Erlang Enhancement Proposals
  • Testing Frameworks
  • Videos
  • Utils
  • War diaries
  • Web frameworks
  • Web servers

This should supply you with plenty of beach reading. 😉

Getting functional with Erlang

Wednesday, May 21st, 2014

Getting functional with Erlang by Mark Nijhof.

From the webpage:

This book will get you started writing Erlang applications right from the get go. After some initial chapters introducing the language syntax and basic language features we will dive straight into building Erlang applications. While writing actual code you will discover and learn more about the different Erlang and OTP features. Each application we create is geared towards a different use-case, exposing the different mechanics of Erlang. 

I want this to become the book I would have read myself, simple and to the point. Something to help you get functional with Erlang quickly. I imagine you; with one hand holding your e-reader while typing code with the other hand.

I have made a broad assumption: Because only smart people would want to learn Erlang (that is you), that you are then also smart enough to find your way to all the language specifics when needed. So this book is not meant as a complete reference guide for Erlang. But it will teach you enough to give you a running start.

When you have reached the end of this book you will be able to build a full blown Erlang application and release it into production. You will understand the core Erlang features like; pattern matching, message passing, working with processes, and hot code swapping.

I haven’t bought a copy, but that is a reflection on my book budget and not Mark’s book.

Take a look and pass this along to others. Mark is using a publishing model that merits encouragement.

Erlang OTP 17.0 Released!

Wednesday, April 9th, 2014

Erlang OTP 17.0 has been released

From the news release:

Erlang/OTP 17.0 is a new major release with new features, characteristics improvements, as well as some minor incompatibilities. See the README file and the documentation for more details.

Among other things, the README file reports:


The default encoding of Erlang files has been changed from ISO-8859-1 to UTF-8.

The encoding of XML files has also been changed to UTF-8.

A reminder that supporting UTF-8 as UTF-8 is greatly preferred.

Is Parallel Programming Hard,…

Tuesday, March 25th, 2014

Is Parallel Programming Hard, And, If So, What Can You Do About It? by Paul E. McKenney.

From Chapter 1 How To Use This Book:

The purpose of this book is to help you understand how to program shared-memory parallel machines without risking your sanity.[1] By describing the algorithms and designs that have worked well in the past, we hope to help you avoid at least some of the pitfalls that have beset parallel-programming projects. But you should think of this book as a foundation on which to build, rather than as a completed cathedral. Your mission, if you choose to accept, is to help make further progress in the exciting field of parallel programming—progress that should in time render this book obsolete. Parallel programming is not as hard as some say, and we hope that this book makes your parallel-programming projects easier and more fun.

In short, where parallel programming once focused on science, research, and grand-challenge projects, it is quickly becoming an engineering discipline. We therefore examine the specific tasks required for parallel programming and describe how they may be most effectively handled. In some surprisingly common special cases, they can even be automated.

This book is written in the hope that presenting the engineering discipline underlying successful parallel-programming projects will free a new generation of parallel hackers from the need to slowly and painstakingly reinvent old wheels, enabling them to instead focus their energy and creativity on new frontiers. We sincerely hope that parallel programming brings you at least as much fun, excitement, and challenge that it has brought to us!

I should not have been surprised by:

16.4 Functional Programming for Parallelism

When I took my first-ever functional-programming class in the early 1980s, the professor asserted that the side-effect-free functional-programming style was well-suited to trivial parallelization and analysis. Thirty years later, this assertion remains, but mainstream production use of parallel functional languages is minimal, a state of affairs that might well stem from this professor’s additional assertion that programs should neither maintain state nor do I/O. There is niche use of functional languages such as Erlang, and multithreaded support has been added to several other functional languages, but mainstream production usage remains the province of procedural languages such as C, C++, Java, and FORTRAN (usually augmented with OpenMP or MPI).

The state of software vulnerability is testimony enough to the predominance of C, C++, and Java.

I’m not real sure I would characterize Erlang as a “niche” language. Niche languages aren’t often found running telecommunications networks, or at least that is my impression.

I would take McKenney’s comments as a challenge to use functional languages such as Clojure and Erlang to make in-roads into mainstream production.

While you use this work to learn the procedural approach to parallelism, you can be building contrasts to a functional one.

I first saw this in Nat Torkington’s Four short links: 13 March 2014.

Monitoring Real-Time Bidding at AdRoll

Friday, March 7th, 2014

Monitoring Real-Time Bidding at AdRoll by Brian Troutwine.

From the description:

This is the talk I gave at Erlang Factory SF Bay Area 2014. In it I discussed the instrumentation by default approach taken in the AdRoll real-time bidding team, discuss the technical details of the libraries we use and lessons learned to adapt your organization to deal with the onslaught of data from instrumentation.

The problem domain:

  • Low latency ( < 100ms per transaction )
  • Firm real-time system
  • Highly concurrent ( > 30 billion transactions per day )
  • Global, 24/7 operation

(emphasis in original)

They are not doing semantic processing subject to those requirements. 😉

But, that’s ok. If needed, you can assign semantics to the data and its containers separately.

A very impressive use of Erlang.

CQRS with Erlang

Monday, March 3rd, 2014

CQRS with Erlang by Bryan Hunter.


Bryan Hunter introduces CQRS and one of its implementations done in Erlang, outlining the areas where Erlang shines.

You will probably enjoy this presentation more after reading: Introduction to CQRS by Kanasz Robert, which reads in part:

CQRS means Command Query Responsibility Segregation. Many people think that CQRS is an entire architecture, but they are wrong. CQRS is just a small pattern. This pattern was first introduced by Greg Young and Udi Dahan. They took inspiration from a pattern called Command Query Separation which was defined by Bertrand Meyer in his book “Object Oriented Software Construction”. The main idea behind CQS is: “A method should either change state of an object, or return a result, but not both. In other words, asking the question should not change the answer. More formally, methods should return a value only if they are referentially transparent and hence possess no side effects.” (Wikipedia) Because of this we can divide a methods into two sets:

  • Commands – change the state of an object or entire system (sometimes called as modifiers or mutators).
  • Queries – return results and do not change the state of an object.

In a real situation it is pretty simple to tell which is which. The queries will declare return type, and commands will return void. This pattern is broadly applicable and it makes reasoning about objects easier. On the other hand, CQRS is applicable only on specific problems.
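As a minimal sketch of the command/query split described above: commands mutate state and return nothing, while queries return a value and leave state untouched. The `Account` class and its method names are hypothetical illustrations, not taken from the presentation or its demo code.

```python
class Account:
    """Hypothetical example object; not from the presentation's demo code."""

    def __init__(self, balance: int = 0) -> None:
        self._balance = balance

    # Command: changes state, returns nothing.
    def deposit(self, amount: int) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._balance += amount

    # Query: returns a result, changes nothing.
    def balance(self) -> int:
        return self._balance


account = Account()
account.deposit(50)         # command
first = account.balance()   # query
second = account.balance()  # asking again does not change the answer
```

Because `balance()` has no side effects, it can be called any number of times without altering what the next call returns, which is the "asking the question should not change the answer" rule in the quote.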

Demo Code for the presentation.

RICON West 2013 Videos Posted!

Friday, January 17th, 2014

RICON West 2013 Videos Posted!

Rather than streaming the entire two (2) days, you can now view individual videos from RICON West 2013!

By author:

By title:

  • Bad As I Wanna Be: Coordination and Consistency in Distributed Databases (Bailis) – RICON West 2013
  • Bringing Consistency to Riak (Part 2) (Joseph Blomstedt) – RICON West 2013
  • Building Next Generation Weather Data Distribution and On-demand Forecast Systems Using Riak (Raja Selvaraj)
  • Controlled Epidemics: Riak's New Gossip Protocol and Metadata Store (Jordan West) – RICON West 2013
  • CRDTs: An Update (or maybe just a PUT) (Sam Elliott) – RICON West 2013
  • CRDTs in Production (Jeremy Ong) – RICON West 2013
  • Denormalize This! Riak at State Farm (Richard Simon and Richard Berglund) – RICON West 2013
  • Distributed Systems Archeology (Michael Bernstein) – RICON West 2013
  • Distributing Work Across Clusters: Adventures With Riak Pipe (Susan Potter) – RICON West 2013
  • Dynamic Dynamos: Comparing Riak and Cassandra (Jason Brown) – RICON West 2013
  • LVars: lattice-based data structures for deterministic parallelism (Lindsey Kuper) – RICON West 2013
  • Maximum Viable Product (Justin Sheehy) – RICON West 2013
  • More Than Just Data: Using Riak Core to Manage Distributed Services (O'Connell) – RICON West 2013
  • Practicalities of Productionizing Distributed Systems (Jeff Hodges) – RICON West 2013
  • The Raft Consensus Algorithm (Diego Ongaro) – RICON West 2013
  • Riak Search 2.0 (Eric Redmond) – RICON West 2013
  • Riak Security; Locking the Distributed Chicken Coop (Andrew Thompson) – RICON West 2013
  • RICON West 2013 Lightning Talks
  • Seagate Kinetic Open Storage: Innovation to Enable Scale Out Storage (Hughes) – RICON West 2013
  • The Tail at Scale: Achieving Rapid Response Times in Large Online Services (Dean) – RICON West 2013
  • Timely Dataflow in Naiad (Derek Murray) – RICON West 2013
  • Troubleshooting a Distributed Database in Production (Shoffstall and Voiselle) – RICON West 2013
  • Yuki: Functional Data Structures for Riak (Ryland Degnan) – RICON West 2013

Enjoy!


    Saturday, November 30th, 2013

    RELEASE A High-Level Paradigm for Reliable Large-Scale Server Software.

    From the webpage:

    RELEASE is an EU FP7 STREP (287510) project that aims to scale the radical concurrency-oriented programming paradigm to build reliable general-purpose software, such as server-based systems, on massively parallel machines. The trend-setting language we will use is Erlang/OTP which has concurrency and robustness designed in. Currently Erlang/OTP has inherently scalable computation and reliability models, but in practice scalability is constrained by aspects of the language and virtual machine. Moreover existing profiling and debugging tools don’t scale.

    I found the project after following a lead to:

    The Design of Scalable Distributed Erlang N. Chechina, P. Trinder, A. Ghaffari, R. Green, K. Lundin, and R. Virding. Symposium on Implementation and Application of Functional Languages 2012 (IFL’12), Oxford, UK, 2012 (Submitted).


    The multicore revolution means that the number of cores in commodity machines is growing exponentially. Many expect 100,000 core clouds (or platforms) to become commonplace, and the best predictions are that core failures on such an architecture will become relatively common, perhaps one hour mean time between core failures. The RELEASE project aims to scale Erlang to build reliable general-purpose software, such as server-based systems, on massively parallel machines. In this paper we present a design of Scalable Distributed (SD) Erlang — an extension of the Distributed Erlang functional programming language for reliable scalability. The design focuses on three aspects of Erlang scalability: scaling the number of Erlang nodes by eliminating transitive connections and introducing scalable groups (s_groups); managing process placement in the scaled networks by introducing semi-explicit process placement; and preserving Erlang reliability model.

    You might also want to read Simon St. Laurent’s Distributed resilience with functional programming, an interview with Steve Vinoski.

    Erlang Handbook

    Wednesday, November 27th, 2013

    Erlang Handbook: A concise reference for Erlang

    From the webpage:

    Originally written by Bjarne Däcker and later revised by Robert Virding, the Erlang Handbook is a summary of the language features and the runtime system. It is aimed at people with some programming experience, serving as a quick introduction to the Erlang domain.

    Erlang Handbook (current release, pdf)

    The handbook is just that, a handbook. At forty-six pages, it is a highly useful but also highly condensed view of Erlang.

    I have been reminded of Erlang twice this week already.

    The first time was by The Distributed Complexity of Large-scale Graph Processing research paper with its emphasis on message passing between graph nodes as a processing model.

    The other reminder was Jans Aasman’s How to Use Graph Databases… [Topic Maps as Graph++?].

    Jans was extolling the use of graphs to manage data about telecom customers, with an emphasis on “near real-time.”

    Something kept nagging at me when I was watching the video but it was only afterwards that I remembered Ericsson’s development and use of Erlang for exactly that use case.

    By way of excuse, I was watching Jans’ video at the end of a long day. 😉

    Suggestions on where I can look for anyone using Erlang-based message passing for distributed processing of graphs?

    With a truthful description like this one:

    Erlang is a programming language used to build massively scalable soft real-time systems with requirements on high availability. Some of its uses are in telecoms, banking, e-commerce, computer telephony and instant messaging. Erlang’s runtime system has built-in support for concurrency, distribution and fault tolerance. (from

    are there any contraindications for Erlang?

    Erlang – Concurrent Language for Concurrent World

    Sunday, October 27th, 2013

    Erlang – Concurrent Language for Concurrent World by Zvi Avraham.

    If you need to forward a “why Erlang” to a programmer, this set of slides should be near the top of your list.

    It includes this quote from Joe Armstrong:

    “The world is concurrent… I could not drive the car, if I did not understand concurrency…”

    Which makes me wonder: Do all the drivers I have seen in Atlanta understand concurrency?

    That would really surprise me. 😉

    The point should be that systems should be concurrent by their very nature, like the world around us.

    Users should object when systems exhibit sequential behavior.

    Systems that run forever self-heal and scale (Scaling Topic Maps?)

    Monday, August 19th, 2013

    Systems that run forever self-heal and scale by Joe Armstrong.

    From the description:

    Joe Armstrong outlines the architectural principles needed for building scalable fault-tolerant systems built from small isolated parallel components which communicate through well-defined protocols.

    Great visuals on the difference between imperative programming and concurrent programming.

    About half of the data transmission from smart phones uses Erlang.

    A very high level view of the architectural principles for building scalable fault-tolerant systems.

    All of Joe’s talk is important but for today I want to focus on his first principle for scalable fault-tolerant systems: isolation.


    Joe enumerates the benefits of isolation of processes as follows:

    Isolation enables:

    • Fault tolerance
    • Scalability
    • Reliability
    • Testability
    • Comprehensibility
    • Code Upgrade

    Are you aware of any topic map engine that uses multiple, isolated processes for merging topics?

    Not threads, but processes.

    Threads managed by an operating system scheduler are not really parallel processes, whatever their appearance to the casual user. Erlang processes, on the other hand, do run in parallel, and when more processes are required, you simply add more hardware.

    We could take a clue from Scalable SPARQL Querying of Large RDF Graphs by Jiewen Huang, Daniel J. Abadi, and Kun Ren, partitioning parts of a topic map into different data stores and querying each store for a part of any query.

    But that is adapting data to a sequential process: not a bad solution, but one that you will have to repeat as data or queries change and evolve. Pseudo-parallelism.

    One advantage of a concurrent process approach on immutable topics, associations, occurrences (see Working with Immutable Data by Saša Jurić) would be that different processes could be applying different merging tests to the same set of topics, associations, occurrences.

    Or the speed of your answer might depend on whether you have sent a query over a “free” interface, which is supported by a few processes, or over a subscription interface, which has dozens if not hundreds of processes at your disposal.

    The speed and comprehensiveness of a topic map answer to any query might be an economic model for a public topic map service.

    If all you wanted to know about Anthony Weiner was: “Vote NO!” that could be free.

    If you wanted pics, vics and all, that could be a different price.

    Purely Functional Photoshop [Functional Topic Maps?]

    Sunday, August 11th, 2013

    Purely Functional Photoshop by James Hague.

    From the post:

    One of the first things you learn about Photoshop—or any similarly styled image editor—is to use layers for everything. Don’t modify existing images if you can help it. If you have a photo of a house and want to do some virtual landscaping, put each tree in its own layer. Want to add some text labels? More layers.

    The reason is straightforward: you’re keeping your options open. You can change the image without overwriting pixels in a destructive way. If you need to save out a version of the image without labels, just hide that layer first. Maybe it’s better if the labels are slightly translucent? Don’t change the text; set the opacity of the layer.

    This stuff about non-destructive operations sounds like something from a functional programming tutorial. It’s easy to imagine how all this layer manipulation could look behind the scenes. Here’s a list of layers using Erlang notation:

    A great illustration of one aspect of functional programming using something quite familiar, Photoshop.

    Imagine a set of topics and associations, in one of the standard topic map syntaxes, prior to any merging rules being applied.

    Wouldn’t applying merging rules as layers provide greater flexibility to explore what merging rules work best for a data set?
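To make the question concrete, here is a rough sketch of merging rules as layers, assuming topics are immutable records and each layer carries one merging test. `Topic`, `Layer`, `merge_view`, and the sample data are all hypothetical names invented for illustration; nothing here comes from a topic map standard or an existing engine.

```python
from typing import Callable, NamedTuple


class Topic(NamedTuple):
    """Hypothetical immutable topic; real topics carry far more."""
    id: str
    name: str
    email: str


class Layer(NamedTuple):
    """A merging rule that can be shown or hidden, like an image layer."""
    label: str
    same_subject: Callable[[Topic, Topic], bool]
    visible: bool = True


def merge_view(topics: tuple[Topic, ...], layers: list[Layer]) -> list[frozenset[str]]:
    """Group topic ids that any visible layer says represent the same subject.

    The base topics are never modified; hiding a layer only changes the view.
    """
    by_id = {t.id: t for t in topics}
    groups: list[set[str]] = [{t.id} for t in topics]
    active = [layer for layer in layers if layer.visible]
    merged = True
    while merged:  # repeat until no two groups can be merged
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if any(layer.same_subject(by_id[a], by_id[b])
                       for layer in active
                       for a in groups[i]
                       for b in groups[j]):
                    groups[i] |= groups.pop(j)
                    merged = True
                    break
            if merged:
                break
    return [frozenset(g) for g in groups]


topics = (
    Topic("t1", "Joe Armstrong", "joe@example.org"),
    Topic("t2", "J. Armstrong", "joe@example.org"),
    Topic("t3", "joe armstrong", "ja@example.com"),
)
by_email = Layer("same email", lambda a, b: a.email == b.email)
by_name = Layer("same name, ignoring case", lambda a, b: a.name.lower() == b.name.lower())

with_both = merge_view(topics, [by_email, by_name])
email_only = merge_view(topics, [by_email, by_name._replace(visible=False)])
```

Hiding the name layer splits `t3` back out without touching the underlying topics, which is the non-destructive behavior the Photoshop analogy suggests.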

    And wouldn’t opacity of topics and associations, to particular users, be a useful security measure?

    Am I wrong in thinking the equivalent of layers would be a good next step for topic maps?

    RICON East 2013 [videos, slides, resources]

    Monday, July 15th, 2013

    RICON East 2013 [videos, slides, resources]

    I have sorted (by author) and included the abstracts for the RICON East presentations. The RICON East webpage has links to blog entries about the conference.


    Brian Akins, Large Scale Data Service as a Service
    Slides | Video

    Turner Broadcasting hosts several large sites that need to serve “data” to millions of clients over HTTP. A couple of years ago, we started building a generic service to solve this and to retire several legacy systems. We will discuss the general architecture, the growing pains, and why we decided to use Riak. We will also share some implementation details and the use of the service for a few large internet events.

    Neil Conway, Bloom: Big Systems from Small Programs
    Slides | Video

    Distributed systems are ubiquitous, but distributed programs remain stubbornly hard to write. While many distributed algorithms can be concisely described, implementing them requires large amounts of code–often, the essence of the algorithm is obscured by low-level concerns like exception handling, task scheduling, and message serialization. This results in programs that are hard to write and even harder to maintain. Can we do better?

    Bloom is a new programming language we’ve developed at UC Berkeley that takes two important steps towards improving distributed systems development. First, Bloom programs are designed to be declarative and concise, aided by a new philosophy for reasoning about state and time. Second, Bloom can analyze distributed programs for their consistency requirements and either certify that eventual consistency is sufficient, or identify program locations where stronger consistency guarantees are needed. In this talk, I’ll introduce the language, and also suggest how lessons from Bloom can be adopted in other distributed programming stacks.

    Sean Cribbs, Just Open a Socket – Connecting Applications to Distributed Systems
    Slides | Video

    Client-server programming is a discipline as old as computer networks and well-known. Just connect socket to the server and send some bytes back and forth, right?

    Au contraire! Building reliable, robust client libraries and applications is actually quite difficult, and exposes a lot of classic distributed and concurrent programming problems. From understanding and manipulating the TCP/IP network stack, to multiplexing connections across worker threads, to handling partial failures, to juggling protocols and encodings, there are many different angles one must cover.

    In this talk, we’ll discuss how Basho has addressed these problems and others in our client libraries and server-side interfaces for Riak, and how being a good client means being a participant in the distributed system, rather than just a spectator.

    Reid Draper, Advancing Riak CS
    Slides | Video

    Riak CS has come a long way since it was first released in 2012, and then open sourced in March 2013. We’ll take a look at some of the features and improvements in the recently released Riak CS 1.3.0, and planned for the future, like better integration with CloudStack and OpenStack. Next, we’ll go over some of the Riak CS guts that deployers should understand in order to successfully deploy, monitor and scale Riak CS.

    Camille Fournier, ZooKeeper for the Skeptical Architect
    Slides | Video

    ZooKeeper is everywhere these days. It’s a core component of the Hadoop ecosystem. It provides the glue that enables high availability for systems like Redis and Solr. Your favorite startup probably uses it internally. But as every good skeptic knows, just because something is popular doesn’t mean you should use it. In this talk I will go over the core uses of ZooKeeper in the wild and why it is suited to these use cases. I will also talk about systems that don’t use ZooKeeper and why that can be the right decision. Finally I will discuss the common challenges of running ZooKeeper as a service and things to look out for when architecting a deployment.

    Sathish Gaddipati, Building a Weather Data Services Platform on Riak
    Slides | Video

    In this talk Sathish will discuss the size, complexity and use cases surrounding weather data services and analytics, which will entail an overview of the architecture of such systems and the role of Riak in these patterns.

    Sunny Gleason, Riak Techniques for Advanced Web & Mobile Application Development
    Slides | Video

    In recent years, there have been tremendous advances in high-performance, high-availability data storage for scalable web and mobile application development. Often times, these NoSQL solutions are portrayed as sacrificing the crispness and rapid application development features of relational database alternatives. In this presentation, we show the amazing things that are possible using a variety of techniques to apply Riak’s advanced features such as map-reduce, search, and secondary indexes. We review each feature in the context of a demanding real-world Ruby & Javascript “Pinterest clone” application with advanced features such as real-time updates via Websocket, comment feeds, content quarantining, permissions, search and social graph modeling. We pay specific attention to explaining the ‘why’ of these Riak techniques for high-performance, high availability applications, not just the ‘how’.

    Andy Gross, Lessons Learned and Questions Raised from Building Distributed Systems
    Slides | Video

    Shawn Gravelle and Sam Townsend, High Availability with Riak and PostgreSQL
    Slides | Videos

    This talk will cover work to build out an internal cloud offering using Riak and PostgreSQL as a data layer, architectural decisions made to achieve high availability, and lessons learned along the way.

    Rich Hickey, Using Datomic with Riak

    Rich Hickey, the author of Clojure and designer of Datomic, is a software developer with over 20 years of experience in various domains. Rich has worked on scheduling systems, broadcast automation, audio analysis and fingerprinting, database design, yield management, exit poll systems, and machine listening, in a variety of languages.

    James Hughes, Revolution in Storage
    Slides | Video

    The trends of technology are rocking the storage industry. Fundamental changes in basic technology, combined with massive scale, new paradigms, and fundamental economics leads to predictions of a new storage programming paradigm. The growth of low cost/GB disk is continuing with technologies such as Shingled Magnetic Recording. Flash and RAM are continuing to scale with roadmaps, some argue, down to atom scale. These technologies do not come without a cost. It is time to reevaluate the interface that we use to all kinds of storage, RAM, Flash and Disk. The discussion starts with the unique economics of storage (as compared to processing and networking), discusses technology changes, posits a set of open questions and ends with predictions of fundamental shifts across the entire storage hierarchy.

    Kyle Kingsbury, Call Me Maybe: Carly Rae Jepsen and the Perils of Network Partitions
    Slides | Code | Video

    Network partitions are real, but their practical consequences on complex applications are poorly understood. I want to talk about some of the neat ways I’ve found to lose important data, the challenge of building systems which are reliable under partitions, and what it means for you, an application developer.

    Hilary Mason, Realtime Systems for Social Data Analysis
    Slides | Video

    It’s one thing to have a lot of data, and another to make it useful. This talk explores the interplay between infrastructure, algorithms, and data necessary to design robust systems that produce useful and measurable insights for realtime data products. We’ll walk through several examples and discuss the design metaphors that bitly uses to rapidly develop these kinds of systems.

    Michajlo Matijkiw, Firefighting Riak at Scale
    Slides | Video

    Managing a business critical Riak instance in an enterprise environment takes careful planning, coordination, and the willingness to accept that no matter how much you plan, Murphy’s law will always win. At CIM we’ve been running Riak in production for nearly 3 years, and over those years we’ve seen our fair share of failures, both expected and unexpected. From disk melt downs to solar flares we’ve managed to recover and maintain 100% uptime with no customer impact. I’ll talk about some of these failures, how we dealt with them, and how we managed to keep our clients completely unaware.

    Neha Narula, Why Is My Cache So Dumb? Smarter Caching with Pequod
    Slides | Video

    Pequod is a key/value cache we’re developing at MIT and Harvard that automatically updates the cache to keep data fresh. Pequod exploits a common pattern in these computations: different kinds of cached data are often related to each other by transformations equivalent to simple joins, filters, and aggregations. Pequod allows applications to pre-declare these transformations with a new abstraction, the cache join. Pequod then automatically applies the transformations and tracks relationships to materialize data and keep the cache up to date, and in many cases improves performance by reducing client/cacheserver communication. Sound like a database? We use abstractions from databases like joins and materialized views, while still maintaining the performance of an in-memory key/value cache.

    In this talk, I’ll describe the challenges caching solves, the problems that still exist, and how tools like Pequod can make the space better.

    Alex Payne, Nobody ever got fired for picking Java: evaluating emerging programming languages for business-critical systems
    Slides | Video

    When setting out to build greenfield systems, engineers today have a broader choice of programming language than ever before. Over the past decade, language development has accelerated dramatically thanks to mature runtimes like the JVM and CLR, not to mention the prevalence of near-universal targets for cross-compilation like JavaScript. With strong technological foundations to build on and an active open source community, modern languages can evolve from rough hobbyist projects into capable tools in a stunningly short period of time. With so many strong contenders emerging every day, how do you decide what language to bet your business on? We’ll explore the landscape of new languages and provide a decision-making framework you can use to narrow down your choices.

    Theo Schlossnagle and Robert Treat, How Do You Eat An Elephant?
    Slides | Video

    When OmniTI first set out to build a next generation monitoring system, we turned to one of our most trusted tools for data management; Postgres. While this worked well for developing the initial Open Source application, as we continued to grow the Circonus public monitoring service, we eventually ran into scaling issues. This talk will cover some of the changes we made to make the original Postgres system work better, talk about some of the other systems we evaluated, and discuss the eventual solution to our problem; building our own time series database. Of course, that’s only half the story. We’ll also go into how we swapped out these backend data storage pieces in our production environment, all the while capturing and reporting on millions of metrics, without downtime or customer interruption.

    Dr. Margo Seltzer, Automatically Scalable Computation
    Slides | Video

    As our computational infrastructure races gracefully forward into increasingly parallel multi-core and blade-based systems, our ability to easily produce software that can successfully exploit such systems continues to stumble. For years, we’ve fantasized about the world in which we’d write simple, sequential programs, add magic sauce, and suddenly have scalable, parallel executions. We’re not there. We’re not even close. I’ll present trajectory-based execution, a radical, potentially crazy, approach for achieving automatic scalability. To date, we’ve achieved surprisingly good speedup in limited domains, but the potential is tantalizingly enormous.

    Chris Tilt, Riak Enterprise Revisited
    Slides | Video

    Riak Enterprise has undergone an overhaul since its 1.2 days, mostly around Multi-DataCenter replication. We’ll talk about the “Brave New World” of replication in depth, how it manages concurrent TCP/IP connections, Realtime Sync, and the technology preview of Active Anti-Entropy Fullsync. Finally, we’ll peek over the horizon at new features such as chaining of Realtime sync messages across multiple clusters.

    Sam Townsend, High Availability with Riak and PostgreSQL
    Slides | Video

    Mark Wunsch, Scaling Happiness Horizontally
    Slides | Video

    This talk will discuss how Gilt has grown its technology organization to optimize for engineer autonomy and happiness and how that optimization has affected its software. Conway’s Law states that an organization that designs systems will inevitably produce systems that are copies of the communication structures of the organization. This talk will work its way between both the (gnarly) technical details of Gilt’s application architecture (something we internally call “LOSA”) and the Gilt Tech organization structure. I’ll discuss the technical challenges we came up against, and how these often pointed out areas of contention in the organization. I’ll discuss quorums, failover, and latency in the context of building a distributed, decentralized, peer-to-peer technical organization.

    Matthew Von-Maszewski, Optimizing LevelDB for Performance and Scale
    Slides | Video

    LevelDB is a flexible key-value store written by Google and open sourced in August 2011. LevelDB provides an ordered mapping of binary keys to binary values. Various companies and individuals utilize LevelDB on cell phones and servers alike. The problem, however, is it does not run optimally on either as shipped.

    This presentation outlines the basic internal mechanisms of LevelDB and then proceeds to discuss the tuning opportunities in the source code for each mechanism. This talk will draw heavily from our experiences optimizing LevelDB for use in Riak, which is handy for running sufficiently large clusters.

    Ryan Zezeski, Yokozuna: Distributed Search You Don’t Think About
    Slides | Video

    Allowing users to run arbitrary and complex searches against your data is a feature required by most consumer facing applications. For example, the ability to get ranked results based on free text search and subsequently drill down on that data based on secondary attributes is at the heart of any good online retail shop. Not only must your application support complex queries such as “doggy treats in a 2 mile radius, broken down by popularity” but it must also return in hundreds of milliseconds or less to keep users happy. This is what systems like Solr are built for. But what happens when the index is too big to fit on a single node? What happens when replication is needed for availability? How do you give correct answers when the index is partitioned across several nodes? These are the problems of distributed search. These are some of the problems Yokozuna solves for you without making you think about it.

    In this talk Ryan will explain what search is, why it matters, what problems distributed search brings to the table, and how Yokozuna solves them. Yokozuna provides distributed and available search while appearing to be a single-node Solr instance. This is very powerful for developers and ops professionals.

    I first saw this in a tweet by Alex Popescu.

    PS: If more videos go up and I miss it, please ping me. Thanks!

    Sherlock’s Last Case

    Sunday, July 14th, 2013

    Sherlock’s Last Case by Joe Armstrong.

    Joe states the Sherlock problem as: given one X and millions of Yi’s, “Which Yi is ‘nearer’ to X?”

    For some measure of “nearer,” or as we prefer, similarity.

    One solution is given in Programming Erlang: Software for a Concurrent World, 2nd ed., 2013, by Joe Armstrong.

    Joe describes two possibly better solutions in this lecture.

    Great lecture even if he omits a fundamental weakness in TF-IDF.

    From the Wikipedia entry:

    Suppose we have a set of English text documents and wish to determine which document is most relevant to the query “the brown cow”. A simple way to start out is by eliminating documents that do not contain all three words “the”, “brown”, and “cow”, but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency.

    However, because the term “the” is so common, this will tend to incorrectly emphasize documents which happen to use the word “the” more frequently, without giving enough weight to the more meaningful terms “brown” and “cow”. The term “the” is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words “brown” and “cow”. Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

    For example, TF-IDF would not find a document with “the brown heifer,” for a query of “the brown cow.”

    TF-IDF does not account for relationships between terms, such as synonymy or polysemy.
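A small sketch shows the weakness in numbers. It assumes one common TF-IDF variant (term count divided by document length, times the log of inverse document frequency); the function name `tf_idf_scores` and the four toy documents are made up for illustration and are not from Joe's lecture or the Wikipedia entry.

```python
import math
from collections import Counter


def tf_idf_scores(query: str, documents: list[str]) -> list[float]:
    """Score each document against the query with a plain TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue  # a term the document lacks contributes nothing
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            score += tf * idf
        scores.append(score)
    return scores


docs = [
    "the brown cow ate the grass",
    "the brown heifer ate the grass",
    "the red barn",
    "a white fence",
]
scores = tf_idf_scores("the brown cow", docs)
```

The heifer document scores below the cow document by exactly the weight of the term “cow”. It earns credit only for “the” and “brown”, because exact term matching has no way of knowing that a heifer is a cow.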

    Juan Ramos states as much in describing the limitations of TF-IDF in: Using TF-IDF to Determine Word Relevance in Document Queries:

    Despite its strength, TF-IDF has its limitations. In terms of synonyms, notice that TF-IDF does not make the jump to the relationship between words. Going back to (Berger & Lafferty, 1999), if the user wanted to find information about, say, the word ‘priest’, TF-IDF would not consider documents that might be relevant to the query but instead use the word ‘reverend’. In our own experiment, TF-IDF could not equate the word ‘drug’ with its plural ‘drugs’, categorizing each instead as separate words and slightly decreasing the word’s w_d value. For large document collections, this could present an escalating problem.

    Ramos cites Information Retrieval as Statistical Translation by Adam Berger and John Lafferty to support his comments on synonymy or polysemy.

    Berger and Lafferty treat synonymy and polysemy, issues that TF-IDF misses, as statistical translation issues:

    Ultimately document retrieval systems must be sophisticated enough to handle polysemy and synonymy, to know for instance that pontiff and pope are related terms. The field of statistical translation concerns itself with how to mine large text databases to automatically discover such semantic relations. Brown et al. [3, 4] showed, for instance, how a system can learn to associate French terms with their English translations, given only a collection of bilingual French/English sentences. We shall demonstrate how, in a similar fashion, an IR system can, from a collection of documents, automatically learn which terms are related and exploit these relations to better find and rank the documents it returns to the user.

    Merging powered by the results of statistical translation?

    The Berger and Lafferty paper is more than a decade old so I will be running the research forward.

    Riak 1.4 – More Install Notes on Ubuntu 12.04 (precise)

    Friday, July 12th, 2013

    Following up on yesterday’s post on installing Riak 1.4 with some minor nits.

    Open File Limits

    The Open Files Limit leaves the reader dangling with:

    However, what most needs to be changed is the per-user open files limit. This requires editing /etc/security/limits.conf, which you’ll need superuser access to change. If you installed Riak or Riak Search from a binary package, add lines for the riak user like so, substituting your desired hard and soft limits:

    (next paragraph)


    riak soft nofile 65536
    riak hard nofile 65536

    Use tab-separated values in /etc/security/limits.conf.

    The same page also suggests an open file value of 50384 if you are starting Riak with init scripts. I don’t know the reason for the difference but 50384 occurs only once in Linux examples so while it may work, I am starting with the higher value.

    Performance Tuning

    I followed the directions at Linux Performance Tuning, but suggest you also add:

    # Added by
    # Network tuning parameters for Riak 1.4
    # As per:

    both here and for your changes to limits.conf.

    That puts others on notice of the reason for the settings and points to documentation.

    Enter the same type of note for your setting of the noatime flag in /etc/fstab (under Mounts and Scheduler in Linux Performance Tuning).

    On reboot, check your settings with:

    ulimit -a
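If you would rather check the open files limit from a script, here is a small sketch using Python's standard `resource` module (Unix only). The `WANTED` constant and the `is_enough` helper are my own names, and 65536 is simply the value from the limits.conf lines above. Run it as the riak user, since limits are per-user.

```python
import resource

# RLIMIT_NOFILE is the per-process limit on open file descriptors,
# the same value `ulimit -n` reports for the current shell.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

WANTED = 65536  # the value used in the limits.conf lines above


def is_enough(limit: int, wanted: int = WANTED) -> bool:
    """True if the limit is unlimited or at least the wanted value."""
    return limit == resource.RLIM_INFINITY or limit >= wanted


print(f"soft={soft} hard={hard} "
      f"soft ok: {is_enough(soft)} hard ok: {is_enough(hard)}")
```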

    I was going to do the Riak Fast Track today but got distracted with configuration issues with Ruby, RVM, KDE and the viewer for Riak docs.

    Look for Fast Track notes over the weekend.