Archive for the ‘Scala’ Category

Optimizing Hash-Array Mapped Tries…

Saturday, November 28th, 2015

Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections by Adrian Colyer.

Adrian’s review of Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections by Steinforder & Vinju, 2015, starts this way:

You’d think that the collection classes in modern JVM-based languages would be highly efficient at this point in time – and indeed they are. But the wonderful thing is that there always seems to be room for improvement. Today’s paper examines immutable collections on the JVM – in particular, in Scala and Clojure – and highlights a new CHAMPion data structure that offers 1.3-6.7x faster iteration, and 3-25.4x faster equality checking.

CHAMP stands for Compressed Hash-Array Mapped Prefix-tree.

The use of immutable collections is on the rise…

Immutable collections are a specific area most relevant to functional/object-oriented programming such as practiced by Scala and Clojure programmers. With the advance of functional language constructs in Java 8 and functional APIs such as the stream processing API, immutable collections become more relevant to Java as well. Immutability for collections has a number of benefits: it implies referential transparency without giving up on sharing data; it satisfies safety requirements for having co-variant sub-types; it allows to safely share data in presence of concurrency.

Both Scala and Clojure use a Hash-Array Mapped Trie (HAMT) data structure for immutable collections. The HAMT data structure was originally developed by Bagwell in C/C++. It becomes less efficient when ported to the JVM due to the lack of control over memory layout and the extra indirection caused by arrays also being objects. This paper is all about the quest for an efficient JVM-based derivative of HAMTs.

Fine-tuning data structures for cache locality usually improves their runtime performance. However, HAMTs inherently feature many memory indirections due to their tree-based nature, notably when compared to array-based data structures such as hashtables. Therefore HAMTs presents an optimization challenge on the JVM. Our goal is to optimize HAMT-based data structures such that they become a strong competitor of their optimized array-based counterparts in terms of speed and memory footprints.

Adrian had me at: “a new CHAMPion data structure that offers 1.3-6.7x faster iteration, and 3-25.4x faster equality checking.”

If you want experience with the proposed data structures, the authors have implemented them in the Rascal Metaprogramming Language.

I first saw this in a tweet by Atabey Kaygun

NLP and Scala Resources

Wednesday, October 7th, 2015

Natural Language Processing and Scala Tutorials by Jason Baldridge.

An impressive collection of resources but in particular, the seventeen (17) Scala tutorials.

Unfortunately, given the state of search and indexing it isn’t possible to easily dedupe the content of these materials against others you may have already found.

Companion to “Functional Programming in Scala”

Sunday, February 22nd, 2015

A companion booklet to “Functional Programming in Scala” by Rúnar Óli Bjarnason.

From the webpage:

This full colour syntax-highlighted booklet comprises all the chapter notes, hints, solutions to exercises, addenda, and errata for the book “Functional Programming in Scala” by Paul Chiusano and Runar Bjarnason. This material is freely available online, but is compiled here as a convenient companion to the book itself.

If you talk about supporting alternative forms of publishing, here is your chance to support an alternative form of publishing, financially.

Authors are going to gravitate to models that sustain their ability to write.

It is up to you what model that will be.

Scala DataTable

Monday, February 9th, 2015

Scala DataTable by Martin Cooper.

From the webpage:


Scala DataTable is a lightweight, in-memory table structure written in Scala. The implementation is entirely immutable. Modifying any part of the table, adding or removing columns, rows, or individual field values will create and return a new structure, leaving the old one completely untouched. This is quite efficient due to structural sharing.

Features :

  • Fully immutable implementation.
  • All changes use structural sharing for performance.
  • Table columns can be added, inserted, updated and removed.
  • Rows can be added, inserted, updated and removed.
  • Individual cell values can be updated.
  • Any inserts, updates or deletes keep the original structure and data completely unchanged.
  • Internal type checks and bounds checks to ensure data integrity.
  • RowData object allowing typed or untyped data access.
  • Full filtering and searching on row data.
  • Single and multi column quick sorting.
  • DataViews to store sets of filtered / sorted data.

If you are curious about immutable data structures and want to start with something familiar, this is your day!

See the Github page for example code and other details.

29 GIFs Only ScalaCheck Witches Will Understand

Monday, January 19th, 2015

29 GIFs Only ScalaCheck Witches Will Understand by Kelsey Gilmore-Innis.

From the post:

Because your attention span. Stew O’Connor and I recently gave a talk on ScalaCheck, the property-based testing library for Scala. You can watch the video, or absorb it here in the Internet’s Truest Form. Here are 29 GIFs you have to be a total ScalaCheck witch to get:

1. ScalaCheck is black magick…


I’ll confess, I don’t get some of the images. But they are interesting enough that I am willing to correct that deficiency!

BTW, I think I had a pipe that looked like this one a very, very long time ago. I don’t remember the smoke being pink. 😉

Scala eXchange 2014 (videos)

Saturday, December 13th, 2014

Scala eXchange 2014 Videos are online! Thanks to the super cool folks at Skills Matter for making them available!

As usual, I have sorted the videos by author. I am not sure about using “scala” as a keyword at a Scala conference but suspect it was to permit searching in a database with videos from other conferences.

If you watch these with ear buds while others are watching sporting events, remember to keep the sound down enough that you can hear curses or cheers from others in the room. Mimic their sentiments and no one will be any wiser, except you for having watched these videos. 😉

PS: I could have used a web scraper to obtain the data but found manual extraction to be a good way to practice regexes in Emacs.

Activator Template of the Month: Atomic Scala

Tuesday, September 9th, 2014

Activator Template of the Month: Atomic Scala by Dick Wall.

From the post:

As readers of the Typesafe newsletter may know, every month we promote an Activator template that embodies qualities we look for in tutorials and topics. This month’s template is a return to first principles of Activator as an experimentation and discovery tool. While the selection may be below your level, gentle reader, I will bet as an existing Scala developer, at least someone has asked you at some point how to go about learning Scala from the ground up, or even better, how to go about learning to program.

The Atomic Scala Examples activator template, authored by Bruce Eckel (one of the book authors) is a companion to the book but is useful in its own right (i.e. while I recommend the book, you don’t have to buy it to find the template useful). The template takes each of the examples in the book and provides them in an executable and easily runnable form. If you want to help someone to learn to program, and even better to do so using Scala, here’s your template. You can also download the first 100 pages of the book for free if you want to as well.

Whether this is a technique that works for you or your student(s) won’t be known unless you try.


I first saw this in a tweet by TypeSafe.

Summingbird:… [VLDB 2014]

Monday, August 4th, 2014

Summingbird: A Framework for Integrating Batch and Online MapReduce Computations by Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin.


Summingbird is an open-source domain-specifi c language implemented in Scala and designed to integrate online and batch MapReduce computations in a single framework. Summingbird programs are written using data flow abstractions such as sources, sinks, and stores, and can run on diff erent execution platforms: Hadoop for batch processing (via Scalding/Cascading) and Storm for online processing. Different execution modes require di fferent bindings for the data flow abstractions (e.g., HDFS files or message queues for the source) but do not require any changes to the program logic. Furthermore, Summingbird can operate in a hybrid processing mode that transparently integrates batch and online results to efficiently generate up-to-date aggregations over long time spans. The language was designed to improve developer productivity and address pain points in building analytics solutions at Twitter where often, the same code needs to be written twice (once for batch processing and again for online processing) and indefi nitely maintained in parallel. Our key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. This means that Summingbird imposes constraints on the types of aggregations that can be performed, although in practice we have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.

Heavy sledding but deeply interesting work. Particularly about “…integrating batch and online processing in a seamless fashion.”

I first saw this in a tweet by Jimmy Lin.

Functional Geekery

Friday, May 30th, 2014

Functional Geekery by Steve Proctor.

I stumbled across episode 9 of Functional Geekery (a podcast) in Clojure Weekly, May 29th, 2014 and was interested to hear the earlier podcasts.

It’s only nine other episodes and not a deep blog history but still, I thought it would be nice to have a single listing of all the episodes.

Do be aware that each episode has a rich set of links to materials mentioned/discussed in each podcast.

If you enjoy these podcasts, do be sure to encourage others to listen to them and encourage Steve to continue with his excellent work.

  • Episode 1 – Robert C. Martin

    In this episode I talk with Robert C. Martin, better known as Uncle Bob. We run the gamut from Structure and Interpretation of Computer Programs, introducing children to programming, TDD and the REPL, compatibility of Functional Programming and Object Oriented Programming

  • Episode 2 – Craig Andera

    In this episode I talk with fellow podcaster Craig Andera. We talk about working in Clojure, ClojureScript and Datomic, as well as making the transition to functional programming from C#, and working in Clojure on Windows. I also get him to give some recommendations on things he learned from guests on his podcast, The Cognicast.

  • Episode 3 – Fogus

    In this episode I talk with Fogus, author of The Joy of Clojure and Functional JavaScript. We cover his history with functional languages, working with JavaScript in a functional style, and digging into the history of software development.

  • Episode 4 – Zach Kessin

    In this episode I talk with fellow podcaster Zach Kessin. We cover his background in software development and podcasting, the background of Erlang, process recovery, testing tools, as well as profiling live running systems in Erlang.

  • Episode 5 – Colin Jones

    In this episode I talk with Colin Jones, software craftsman at 8th Light. We cover Colin’s work on the Clojure Koans, making the transition from Ruby to Clojure, how functional programming affects the way he does object oriented design now, and his venture into learning Haskell.

  • Episode 6 – Reid Draper

    In this episode I talk with Reid Draper. We cover Reid’s intro to functional programming through Haskell, working in Erlang, distributed systems, and property testing; including his property testing tool simple-check, which has since made it into a Clojure contrib project as test.check.

  • Episode 7 – Angela Harms and Jason Felice on avi

    In this episode I talk with Angela Harms and Jason Felice about avi. We talk about the motivation of a vi implementation written in Clojure, the road map of where avi might used, and expressivity of code.

  • Functional Geekery Episode 08 – Jessica Kerr

    In this episode I talk with Jessica Kerr. In this episode we talk bringing functional programming concepts to object oriented languages; her experience in Scala, using the actor model, and property testing; and much more!

  • Functional Geekery Episode 9 – William E. Byrd

    In this episode I talk with William Byrd. We talk about miniKanren and the differences between functional, logic and relational programming. We also cover the idea of thinking at higher levels of abstractions, and comparisons of relational programming to topics such as SQL, property testing, and code contracts.

  • Functional Geekery Episode 10 – Paul Holser

    In this episode I talk with Paul Holser. We start out by talking about his junit-quickcheck project, being a life long learner and exploring ideas about computation from other languages, and what Java 8 is looking like in with the support of closures and lambdas.


Scala eXchange 2013 (screencasts)

Friday, May 23rd, 2014

Scala eXchange 2013

From the webpage:

Join us at the third Annual Scala eXchange 2013 for 2 days of learning Scala skills! Meet the amazing Bill Venners and gain an understanding of the trade-offs between implicit conversions and parameters and how to take advantage of implicit parameters in your own designs. Or join Viktor Klang’s talk to learn strategies for the recovery and healing of your systems, when things go FUBAR. Find out about Lift from David Pollak, or find out about Adept, the new dependency management system for Scala in Fredrik Ekholdt’s talk. Find out about the road to Akka Cluster, and beyond in Jonas Boner’s keynote or about the new design of theMacro-based Scala Parallel Collections with Alex Prokopec! Featuring 2 days of talks over 3 tracks, The Scala eXchange will bring the world’s top Scala experts and many of the creators of Scala stack technologies together with Europe’s Scala community to learn and share skills, exchange ideas and meet like minded people. Don’t miss it!

There are forty-eight (48) screencasts from Scala eXchange 2013 posted for your viewing pleasure.

I can’t think of a better selling point for Scala eXchange 2014 than the screencasts from 2013.

NLTK-like Wordnet Interface in Scala

Wednesday, April 16th, 2014

NLTK-like Wordnet Interface in Scala by Sujit Pal.

From the post:

I recently figured out how to setup the Java WordNet Library (JWNL) for something I needed to do at work. Prior to this, I have been largely unsuccessful at figuring out how to access Wordnet from Java, unless you count my one attempt to use the Java Wordnet Interface (JWI) described here. I think there are two main reason for this. First, I just didn’t try hard enough, since I could get by before this without having to hook up Wordnet from Java. The second reason was the over-supply of libraries (JWNL, JWI, RiTa, JAWS, WS4j, etc), each of which annoyingly stops short of being full-featured in one or more significant ways.

The one Wordnet interface that I know that doesn’t suffer from missing features comes with the Natural Language ToolKit (NLTK) library (written in Python). I have used it in the past to access Wordnet for data pre-processing tasks. In this particular case, I needed to call it at runtime from within a Java application, so I finally bit the bullet and chose a library to integrate into my application – I chose JWNL based on seeing it being mentioned in the Taming Text book (and used in the code samples). I also used code snippets from Daniel Shiffman’s Wordnet page to learn about the JWNL API.

After I had successfully integrated JWNL, I figured it would be cool (and useful) if I could build an interface (in Scala) that looked like the NLTK Wordnet interface. Plus, this would also teach me how to use JWNL beyond the basic stuff I needed for my webapp. My list of functions were driven by the examples from the Wordnet section (2.5) from the NLTK book and the examples from the NLTK Wordnet Howto. My Scala class implements most of the functions mentioned on these two pages. The following session will give you an idea of the coverage – even though it looks a Python interactive session, it was generated by my JUnit test. I do render the Synset and Word (Lemma) objects using custom format() methods to preserve the illusion (and to make the output readable), but if you look carefully, you will notice the rendering of List() is Scala’s and not Python’s.

NLTK is amazing in its own right and creating a Scala interface will give you an excuse to learn Scala. That’s a win-win situation!

GraphChi-DB [src released]

Tuesday, April 15th, 2014


From the webpage:

GraphChi-DB is a scalable, embedded, single-computer online graph database that can also execute similar large-scale graph computation as GraphChi. it has been developed by Aapo Kyrola as part of his Ph.D. thesis.

GraphChi-DB is written in Scala, with some Java code. Generally, you need to know Scala quite well to be able to use it.

IMPORTANT: GraphChi-DB is early release, research code. It is buggy, it has awful API, and it is provided with no guarantees. DO NOT USE IT FOR ANYTHING IMPORTANT.

GraphChi-DB source code arrives!


Dotty open-sourced

Friday, February 28th, 2014

Dotty open-sourced by Martin Odersky.

From the post:

A couple of days ago we open sourced the Dotty, a research platform for new new language concepts and compiler technologies for Scala.

The idea is to provide a platform where new ideas can be tried out without the stringent backwards compatibility constraints of the regular Scala releases. At the same time this is no “castle-in-the-sky” project. We will look only at technologies that have a very good chance of being beneficial to Scala and its ecosystem.

My goal is that the focus of our research and development efforts lies squarely on simplification. In my opinion, Scala has been very successful in its original goal of unifying OOP and FP. But to get there is has acquired some features that in retrospect turned out to be inessential for the main goal, even if they are useful in some applications. XML literals come to mind, as do existential types. In Dotty we try to identify a much smaller set of core features and will then represent other features by encodings into that core.

Right now, there’s a (very early) compiler frontend for a subset of Scala. We’ll work on fleshing this out and testing it against more sources (the only large source it was tested on so far is the dotty compiler itself). We’ll also work on adding transformation and backend phases to make this into a full compiler.

Are you interested in functional programming and adventurous?

If so, this is your stop. 😉

Introduction to Computational Linguistics (Scala too!)

Saturday, February 1st, 2014

Introduction to Computational Linguistics by Jason Baldridge.

From the webpage:

Advances in computational linguistics have not only led to industrial applications of language technology; they can also provide useful tools for linguistic investigations of large online collections of text and speech, or for the validation of linguistic theories.

Introduction to Computational Linguistics introduces the most important data structures and algorithmic techniques underlying computational linguistics: regular expressions and finite-state methods, categorial grammars and parsing, feature structures and unification, meaning representations and compositional semantics. The linguistic levels covered are morphology, syntax, and semantics. While the focus is on the symbolic basis underlying computational linguistics, a high-level overview of statistical techniques in computational linguistics will also be given. We will apply the techniques in actual programming exercises, using the programming language Scala. Practical programming techniques, tips and tricks, including version control systems, will also be discussed.

Jason has created a page of links, which includes a twelve part tutorial on Scala:

If you want to walk through the course on your own, see the schedule.


12 Free eBooks on Scala

Saturday, January 25th, 2014

12 Free eBooks on Scala by Atithya Amaresh.

If you are missing any of these, now is the time to grab a copy:

  1. Functional Programming in Scala
  2. Play for Scala
  3. Scala Cookbook
  4. Lift Cookbook
  5. Scala in Action
  6. Testing in Scala
  7. Programming Scala by Venkat Subramaniam
  8. Programming Scala by Dean Wampler, Alex Payne
  9. Software Performance and Scalability
  10. Scalability Rules
  11. Lift in Action
  12. Scala in Depth


…Scala and Breeze for statistical computing

Friday, January 3rd, 2014

Brief introduction to Scala and Breeze for statistical computing by Darren Wilkinson.

From the post:

In the previous post I outlined why I think Scala is a good language for statistical computing and data science. In this post I want to give a quick taste of Scala and the Breeze numerical library to whet the appetite of the uninitiated. This post certainly won’t provide enough material to get started using Scala in anger – but I’ll try and provide a few pointers along the way. It also won’t be very interesting to anyone who knows Scala – I’m not introducing any of the very cool Scala stuff here – I think that some of the most powerful and interesting Scala language features can be a bit frightening for new users.

To reproduce the examples, you need to install Scala and Breeze. This isn’t very tricky, but I don’t want to get bogged down with a detailed walk-through here – I want to concentrate on the Scala language and Breeze library. You just need to install a recent version of Java, then Scala, and then Breeze. You might also want SBT and/or the ScalaIDE, though neither of these are necessary. Then you need to run the Scala REPL with the Breeze library in the classpath. There are several ways one can do this. The most obvious is to just run scala with the path to Breeze manually specified (or specified in an environment variable). Alternatively, you could run a console from an sbt session with a Breeze dependency (which is what I actually did for this post), or you could use a Scala Worksheet from inside a ScalaIDE project with a Breeze dependency.

It will help if you have an interest in or background with statistics as Darren introduces you to using Scala and the Breeze.

Breeze is described as:

Breeze is a library for numerical processing, machine learning, and natural language processing. Its primary focus is on being generic, clean, and powerful without sacrificing (much) efficiency. Breeze is the merger of the ScalaNLP and Scalala projects, because one of the original maintainers is unable to continue development.

so you are likely to encounter it in several different contexts.

I experience the move from “imperative” to “functional” programming being similar to moving from normalized to non-normalized data.

Normalized data, done by design prior to processing, makes some tasks easier for a CPU. Non-normalized data omits the normalization task (a burden on human operators) and puts that task on a CPU, if and when desired.

Decreasing the burden on people and increasing the burden on CPUs doesn’t trouble me.



Tuesday, December 31st, 2013

Augur: a Modeling Language for Data-Parallel Probabilistic Inference by Jean-Baptiste Tristan,


It is time-consuming and error-prone to implement inference procedures for each new probabilistic model. Probabilistic programming addresses this problem by allowing a user to specify the model and having a compiler automatically generate an inference procedure for it. For this approach to be practical, it is important to generate inference code that has reasonable performance. In this paper, we present a probabilistic programming language and compiler for Bayesian networks designed to make effective use of data-parallel architectures such as GPUs. Our language is fully integrated within the Scala programming language and benefits from tools such as IDE support, type-checking, and code completion. We show that the compiler can generate data-parallel inference code scalable to thousands of GPU cores by making use of the conditional independence relationships in the Bayesian network.

A very good paper but the authors should highlight the caveat in the introduction:

We claim that many MCMC inference algorithms are highly data-parallel (Hillis & Steele, 1986; Blelloch, 1996) if we take advantage of the conditional independence relationships of the input model (e.g. the assumption of i.i.d. data makes the likelihood independent across data points).

(Where i.i.d. = Independent and identically distributed random variables.)

That assumption does allow for parallel processing, but users should be cautious about accepting assumptions about data.

The algorithms will still work, even if your assumptions about the data are incorrect.

But the answer you get may not be as useful as you would like.

I first saw this in a tweet by Stefano Bertolo.

Scala as a platform…

Monday, December 30th, 2013

Scala as a platform for statistical computing and data science by Darren Wilkinson

From the post:

There has been a lot of discussion on-line recently about languages for data analysis, statistical computing, and data science more generally. I don’t really want to go into the detail of why I believe that all of the common choices are fundamentally and unfixably flawed – language wars are so unseemly. Instead I want to explain why I’ve been using the Scala programming language recently and why, despite being far from perfect, I personally consider it to be a good language to form a platform for efficient and scalable statistical computing. Obviously, language choice is to some extent a personal preference, implicitly taking into account subjective trade-offs between features different individuals consider to be important. So I’ll start by listing some language/library/ecosystem features that I think are important, and then explain why.

A feature wish list

It should:

  • be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks
  • be free, open-source and platform independent
  • be fast and efficient
  • have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra
  • have a strong type system, and be statically typed with good compile-time type checking and type safety
  • have reasonable type inference
  • have a REPL for interactive use
  • have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)
  • have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design
  • allow imperative programming for those (rare) occasions where it makes sense
  • be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications

The not-very-surprising punch-line is that Scala ticks all of those boxes and that I don’t know of any other languages that do. But before expanding on the above, it is worth noting a couple of (perhaps surprising) omissions. For example:

  • have excellent data viz capability built-in
  • have vast numbers of statistical routines in the standard library

Darren reviews Scala on each of these points.

Although he still uses R and Python, Darren has hopes for future development of Scala into a full featured data mining platform.

Perhaps his checklist will contribute the requirements needed to make that one of the futures of Scala.

I first saw this in Christophe Lalanne’s A bag of tweets / December 2013.

Graph for Scala

Monday, December 23rd, 2013

Graph for Scala

From the webpage:

Welcome to scalax.collection.Graph

Graph for Scala provides basic graph functionality that seamlessly fits into the Scala standard collections library. Like members of scala.collection, graph instances are in-memory containers that expose a rich, user-friendly interface.

Graph for Scala also has ready-to-go implementations of JSON-Import/Export and Dot-Export. Database emulation and distributed graph processing are due to be supported.

Backed by the Scala core team, Graph for Scala started in 2011 as an open source project in the EPFL Scala incubator space on Assembla. Meanwhile it is also hosted on Github.

Want to take it for a spin? Grab the latest release to get started, then visit the Core User Guide (Warning: Broken Link)to learn more!

If you follow the “Core” option under “Users Guides” on the top menu bar, you will find: Core User Guide: Introduction, which reads in part:

Why Use Graph for Scala?

The most important reasons why Graph for Scala speeds up your development are:

  • Simplicity: Creating, manipulating and querying Graph is intuitive.
  • Consistency: Graph for Scala seamlessly maintains a consistent state of nodes and edges including prevention of duplicates, intelligent addition and removal.
  • Conformity: As a regular collection class, Graph has the same “look and feel” as other members of the Scala collection framework. Whenever appropriate, result types are Scala collection types themselves.
  • Flexibility: All kinds of graphs including mixed graphs, multi-graphs and hypergraphs are supported.
  • Functional Style: Graph for Scala facilitates a concise, functional style of utilizing graph functionality, including traversals, not seen in Java-based libraries.
  • Extendibility: You can easily customize Graph for Scala to reflect the needs of you application retaining all benefits of Graph.
  • Documentation: Ideal progress curve through adequate documentation.

Look and see!

You will find support for hyperedges, directed hyperedges, edges and directed edges.

Further documentation covers exporting to Dot, moving data into and out of JSON, and constraining graphs.

Of Algebirds, Monoids, Monads, …

Tuesday, December 3rd, 2013

Of Algebirds, Monoids, Monads, and Other Bestiary for Large-Scale Data Analytics by Michael G. Noll.

From the post:

Have you ever asked yourself what monoids and monads are, and particularly why they seem to be so attractive in the field of large-scale data processing? Twitter recently open-sourced Algebird, which provides you with a JVM library to work with such algebraic data structures. Algebird is already being used in Big Data tools such as Scalding and SummingBird, which means you can use Algebird as a mechanism to plug your own data structures – e.g. Bloom filters, HyperLogLog – directly into large-scale data processing platforms such as Hadoop and Storm. In this post I will show you how to get started with Algebird, introduce you to monoids and monads, and address the question why you get interested in those in the first place.

Goal of this article

The main goal of this is article is to spark your curiosity and motivation for Algebird and the concepts of monoid, monads, and category theory in general. In other words, I want to address the questions “What’s the big deal? Why should I care? And how can these theoretical concepts help me in my daily work?”

You can call this a “blog post” but I rarely see blog posts with a table of contents! 😉

The post should come with a warning: May require substantial time to read, digest, understand.

Just so you know, I was hooked by this paragraph early on:

So let me use a different example because adding Int values is indeed trivial. Imagine that you are working on large-scale data analytics that make heavy use of Bloom filters. Your applications are based on highly-parallel tools such as Hadoop or Storm, and they create and work with many such Bloom filters in parallel. Now the money question is: How do you combine or add two Bloom filters in an easy way?

Are you motivated?

I first saw this in a tweet by CompSciFact.

Akka at Conspire

Monday, November 4th, 2013

Akka at Conspire

From the post:

Ryan Tanner has posted a really good series of blogs on how and why they are using Akka, and especially how to design your application to make good use of clustering and routers. Akka provides solid tools but you still need to think where to point that shiny hammer, and Ryan has a solid story to tell:

  1. How We Built Our Backend on Akka and Scala
  2. Why We Like Actors
  3. Making Your Akka Life Easier
  4. Don’t Fall Into Our Anti-Pattern Traps
  5. The Importance of Pulling

PS: And no, we don’t mind anyone using our code, not even if it was contributed by Derek Wyatt (honorary team member) 🙂

Unlike the Peyton Place IT tragedies in Washington, this is a software tale that ends well.


Principles of Reactive Programming [4th of November]

Saturday, November 2nd, 2013

Principles of Reactive Programming [4th of November]

Just in case U.S. government intercepts either prevented you from getting the news or erased data from your calendar, just a reminder that Principles of Reactive Programming starts next Monday and runs for seven (7) weeks.

Even though I am signed up for another course, I am tempted to add this one. Unlikely as two courses is a bit much at one time.

But will be watching the lectures later to prepare for the next time.

Series: The Neophyte’s Guide to Scala

Monday, October 28th, 2013

Series: The Neophyte’s Guide to Scala

From the post:

Daniel Westheide (@kaffeecoder on twitter) has written a wonderful series of blog posts about Scala, including Akka towards the end. The individual articles are:

This series was published between Nov 21, 2012 and Apr 3, 2013 and Daniel has aggregated all content including an EPUB download here. Big kudos, way to go!


Entity Discovery using Mahout CollocDriver

Saturday, October 26th, 2013

Entity Discovery using Mahout CollocDriver by Sujit Pal.

From the post:

I spent most of last week trying out various approaches to extract “interesting” phrases from a collection of articles. The objective was to identify candidate concepts that could be added to our taxonomy. There are various approaches, ranging from simple NGram frequencies, to algorithms such as RAKE (Rapid Automatic Keyword Extraction), to rescoring NGrams using Log Likelihood or Chi-squared measures. In this post, I describe how I used Mahout’s CollocDriver (which uses the Log Likelihood measure) to find interesting phrases from a small corpus of about 200 articles.

The articles were in various formats (PDF, DOC, HTML), and I used Apache Tika to parse them into text (yes, I finally found the opportunity to learn Tika :-)). Tika provides parsers for many common formats, so all we have to do was to hook them up to produce text from the various file formats. Here is my code:

Think of this as winnowing the chaff that your human experts would otherwise read.

A possible next step would be to decorate the candidate “interesting” phrases with additional information before being viewed by your expert(s).

JavaZone 2013

Sunday, September 29th, 2013

JavaZone 2013 (videos)

The JavaZone tweet I saw earlier today said five (5) lost videos had been found so all one hundred and forty-nine (149) videos are up for viewing!

I should have saved this one for the holidays but at one or two a day, you may be done by the holidays! 😉

Hands-On Category Theory

Sunday, September 29th, 2013

Hands-On Category Theory by James Earl Douglas.

From the webpage:

James explores category theory concepts in Scala REPL. This is the first version of the video, we have a separate screen capture and will publish a merged version later, with a reference here.

I don’t often see “hands-on” and “category theory” in the same sentence, much less in a presentation title. 😉

An interesting illustration of category theory being used in day to day programming.

See: Hands-On Category Theory at Github for notes on the presentation.

Perhaps this will pique your interest in category theory!

Principles of Reactive Programming [Nov. 2013]

Monday, September 16th, 2013

Principles of Reactive Programming by Martin Odersky, Erik Meijer and Roland Kuhn.

From the webpage:

This is a follow-on for the Coursera class “Principles of Functional Programming in Scala”, which so far had more than 100’000 inscriptions over two iterations of the course, with some of the highest completion rates of any massive open online course worldwide.

The aim of the second course is to teach the principles of reactive programming. Reactive programming is an emerging discipline which combines concurrency and event-based and asynchronous systems. It is essential for writing any kind of web-service or distributed system and is also at the core of many high-performance concurrent systems. Reactive programming can be seen as a natural extension of higher-order functional programming to concurrent systems that deal with distributed state by coordinating and orchestrating asynchronous data streams exchanged by actors.

In this course you will discover key elements for writing reactive programs in a composable way. You will find out how to apply these building blocks in the construction of event-driven systems that are scalable and resilient.

The course is hands on; most units introduce short programs that serve as illustrations of important concepts and invite you to play with them, modifying and improving them. The course is complemented by a series of assignments, which are also programming projects.

Starts November 4, 2013 and last for seven weeks.

See the webpage for the syllabus, requirements, etc.

Twitter4j and Scala

Monday, July 29th, 2013

Using twitter4j with Scala to access streaming tweets by Jason Baldridge.

From the introduction:

My previous post provided a walk-through for using the Twitter streaming API from the command line, but tweets can be more flexibly obtained and processed using an API for accessing Twitter using your programming language of choice. In this tutorial, I walk-through basic setup and some simple uses of the twitter4j library with Scala. Much of what I show here should be useful for those using other JVM languages like Clojure and Java. If you haven’t gone through the previous tutorial, have a look now before going on as this tutorial covers much of the same material but using twitter4j rather than HTTP requests.

I’ll introduce code, bit by bit, for accessing the Twitter data in different ways. If you get lost with what should go where, all of the code necessary to run the commands is available in this github gist, so you can compare to that as you move through the tutorial.

Update: The tutorial is set up to take you from nothing to being able to obtain tweets in various ways, but you can also get all the relevant code by looking at the twitter4j-tutorial repository. For this tutorial, the tag is v0.1.0, and you can also download a tarball of that version.

Using Twitter4j with Scala to perform user actions by Jason Baldridge.

From the introduction:

My previous post showed how to use Twitter4j in Scala to access Twitter streams. This post shows how to control a Twitter user’s actions using Twitter4j. The primary purpose of this functionality is perhaps to create interfaces for Twitter like TweetDeck, but it can also be used to create bots that take automated actions on Twitter (one bot I’m playing around with is @tshrdlu, using the code in this tutorial and the code in the tshrdlu repository).

This post will only cover a small portion of the things you can do, but they are some of the more common things and I include a couple of simple but interesting use cases. Once you have these things in place, it is straightforward to figure out how to use the Twitter4j API docs (and Stack Overflow) to do the rest.

Jason continues his tutorial on accessing/processing Twitter streams using Twitter4j and Scala.

Since Twitter has enough status for royal baby names, your data should feel no shame being on Twitter. 😉

Not to mention tweeted IRIs can inform readers of content in excess of one hundred and forty (140) characters in length.

ScalaDays 2013 Presentations

Tuesday, June 25th, 2013

ScalaDays 2013 Presentations

Great collection of videos from ScalaDays 2013.

I haven’t had the time to create a better listing but wanted to pass the videos along for your enjoyment.

Building Distributed Systems With The Right Tools:…

Monday, June 24th, 2013

Building Distributed Systems With The Right Tools: Akka Looks Promising

From the post:

Modern day developers are building complex applications that span multiple machines. As a result, availability, scalability, and fault tolerance are important considerations that must be addressed if we are to successfully meet the needs of the business.

As developers building distributed systems, then, being aware of concepts and tools that help in dealing with these considerations is not just important – but allows us to make a significant difference to the success of the projects we work on.

One emerging tool is Akka and it’s clustering facilities. Shortly I’ll show a few concepts to get your mind thinking about where you could apply tools like Akka, but I’ll also show a few code samples to emphasise that these benefits are very accessible.

Code sample for this post is on github.

Why Should I Care About Akka?

Let’s start with a problem… We’re building a holidays aggregration and disitribution platform. What our system does is fetch data from 200 different package providers, and distribute it to over 50 clients via ftp. This is a continuous process.

Competition in this market is fierce and clients want holidays and upto date availability in their systems as fast as possible – there’s a lot of money to be made on last-minute deals, and a lot of money to be lost in selling holiday’s that have already been sold elsewhere.

One key feature then is that the system needs to always be running – it needs high availability. Another important feature is performance – if this is to be maintained as the system grows with new providers and clients it needs to be scalable.

Just think to yourself now, how would you achieve this with the technologies you currently work with? I can’t think of too many things in the .NET world that would guide me towards highly-available, scalable applications, out of the box. There would be a lot of home-rolled infrastructure, and careful designing for scalability I suspect.

Akka Wants to Help You Solve These Problems ‘Easily’

Using Akka you don’t call methods – you send messages. This is because the programming model makes the assumption that you are building distributed, asynchronous applications. It’s just a bonus if a message gets sent and handled on the same machine.

This arises from the fact that the framework is engineered, fundamentally, to guide you into creating highly-available, scalable, fault-tolerant distributed applications…. There is no home-rolled infrastructure (you can add small bits and pieces if you need to).

Instead, with Akka you mostly focus on business logic as message flows. Check out the docs or pick up a book if you want to learn about the enabling concepts like supervision.

If you are contemplating a distributed topic map application, Akka should be of interest.

Work flow could result in different locations reflecting different topic map content.