Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 24, 2013

Building Distributed Systems With The Right Tools:…

Filed under: Akka,Scala,Topic Map Software — Patrick Durusau @ 2:57 pm

Building Distributed Systems With The Right Tools: Akka Looks Promising

From the post:

Modern day developers are building complex applications that span multiple machines. As a result, availability, scalability, and fault tolerance are important considerations that must be addressed if we are to successfully meet the needs of the business.

As developers building distributed systems, then, being aware of concepts and tools that help in dealing with these considerations is not just important – but allows us to make a significant difference to the success of the projects we work on.

One emerging tool is Akka and its clustering facilities. Shortly I’ll show a few concepts to get your mind thinking about where you could apply tools like Akka, but I’ll also show a few code samples to emphasise that these benefits are very accessible.

Code sample for this post is on github.

Why Should I Care About Akka?

Let’s start with a problem… We’re building a holidays aggregation and distribution platform. What our system does is fetch data from 200 different package providers, and distribute it to over 50 clients via ftp. This is a continuous process.

Competition in this market is fierce and clients want holidays and up-to-date availability in their systems as fast as possible – there’s a lot of money to be made on last-minute deals, and a lot of money to be lost in selling holidays that have already been sold elsewhere.

One key feature then is that the system needs to always be running – it needs high availability. Another important feature is performance – if this is to be maintained as the system grows with new providers and clients it needs to be scalable.

Just think to yourself now, how would you achieve this with the technologies you currently work with? I can’t think of too many things in the .NET world that would guide me towards highly-available, scalable applications, out of the box. There would be a lot of home-rolled infrastructure, and careful designing for scalability I suspect.

Akka Wants to Help You Solve These Problems ‘Easily’

Using Akka you don’t call methods – you send messages. This is because the programming model makes the assumption that you are building distributed, asynchronous applications. It’s just a bonus if a message gets sent and handled on the same machine.

This arises from the fact that the framework is engineered, fundamentally, to guide you into creating highly-available, scalable, fault-tolerant distributed applications…. There is no home-rolled infrastructure (you can add small bits and pieces if you need to).

Instead, with Akka you mostly focus on business logic as message flows. Check out the docs or pick up a book if you want to learn about the enabling concepts like supervision.

If you are contemplating a distributed topic map application, Akka should be of interest.

Work flow could result in different locations reflecting different topic map content.
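To make "you don't call methods, you send messages" concrete, here is a minimal sketch in plain Scala. It is not Akka's API: ToyActor, send, stop and the two message types are names I made up, and a single-threaded executor stands in for an actor's mailbox.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ListBuffer

object ToyActorDemo {
  // Messages are plain immutable values, so nothing in the sender's code
  // depends on where the receiver lives.
  sealed trait Message
  final case class FetchPackages(provider: String) extends Message
  final case class Distribute(client: String) extends Message

  // A toy "actor" (hypothetical, not Akka): the executor's queue is the
  // mailbox, so messages are handled one at a time in sending order.
  final class ToyActor(handle: Message => Unit) {
    private val mailbox = Executors.newSingleThreadExecutor()

    // Fire-and-forget: returns immediately, no result is handed back.
    def send(msg: Message): Unit =
      mailbox.execute(new Runnable { def run(): Unit = handle(msg) })

    // Stop accepting messages and wait for the queued ones to finish.
    def stop(): Boolean = {
      mailbox.shutdown()
      mailbox.awaitTermination(5, TimeUnit.SECONDS)
    }
  }

  def run(): List[String] = {
    val log = ListBuffer.empty[String]
    def record(s: String): Unit = { log.synchronized { log += s }; () }

    val worker = new ToyActor({
      case FetchPackages(provider) => record("fetched " + provider)
      case Distribute(client)      => record("sent to " + client)
    })
    worker.send(FetchPackages("provider-1"))
    worker.send(Distribute("client-1"))
    worker.stop()
    log.synchronized { log.toList }
  }
}
```

The caller never sees a return value from send. Supervision, clustering and remote delivery, the parts that make Akka interesting, are exactly what this sketch leaves out.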

June 5, 2013

Entity recognition with Scala and…

Filed under: Entity Resolution,Natural Language Processing,Scala,Stanford NLP — Patrick Durusau @ 4:05 pm

Entity recognition with Scala and Stanford NLP Named Entity Recognizer by Gary Sieling.

From the post:

The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it’s fairly good at finding nouns, but not always at identifying the type of each noun.

In this example, the entities I’d like to see are different – companies, law firms, lawyers, etc, but this test is good enough. The default examples provided let you choose different sets of things that can be recognized: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. The process of extracting PDF data and processing it takes about five seconds.

For this text, selecting different options sometimes led to the classifier picking different options for a noun – one time it’s a person, another time it’s an organization, etc. One improvement might be to run several classifiers and to allow them to vote. This classifier also loses words sometimes – if a subject is listed with a first, middle, and last name, it sometimes picks just two words. I’ve noticed similar issues with company names.


The voting on entity recognition made me curious about interactive entity resolution where a user has a voice.
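A rough sketch of the voting idea, assuming each classifier can be treated as a function from a mention to a label. The Classifier type and vote function are my own names, not part of Stanford NLP, and the alphabetical tie-break is an arbitrary choice.

```scala
object EntityVote {
  // A classifier maps a mention to a label such as "PERSON" or "ORGANIZATION".
  // (Hypothetical stand-in for a real recognizer.)
  type Classifier = String => String

  // Return the label chosen by the most classifiers. Ties go to the label
  // that sorts first alphabetically, so the result is deterministic.
  def vote(classifiers: Seq[Classifier], mention: String): String = {
    require(classifiers.nonEmpty, "need at least one classifier")
    val counts = classifiers
      .map(c => c(mention))
      .groupBy(identity)
      .map { case (label, hits) => (label, hits.size) }
    counts.toSeq.sortBy { case (label, n) => (-n, label) }.head._1
  }
}
```

An interactive version would let the user cast the deciding vote when the classifiers split.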

See the next post.

May 31, 2013

Scala 2013 Overview

Filed under: Functional Programming,Programming,Scala — Patrick Durusau @ 10:09 am

Scala 2013 Overview by Sagie Davidovich.

An impressive set of slides on Scala.

Work through all of them and you won’t be a Scala expert but you will be well on your way.

I first saw this at Nice Scala Tutorial by Danny Bickson.

May 19, 2013

Scala Resources & Community links for the newcomer

Filed under: Functional Programming,Scala — Patrick Durusau @ 3:14 pm

Scala Resources & Community links for the newcomer by Raúl Raja.

From the post:

During the last couple of months I have been asked a few times among colleagues and friends how to get started with Scala. People come to Scala from diverse backgrounds such as…

  • Java folks looking for a better Java or just tired of waiting for Java features other modern languages such as C# already offer.
  • Ruby, PHP, and programmers that come from a scripting background looking for type safety.
  • People trying to bridge the best of both OOP and Functional paradigms.

Scala is a vast language full of features with a very technical community. Don’t let your first step discourage you as you don’t need to know everything about Scala to become productive quickly. People in the mailing list will often talk about some crazy shit you don’t need to know just yet. Monads, Monoids, Combinators, Macros and things you may not even know how to pronounce,… Seriously guys as you start to learn about it it’s gonna blow your mind. It’s gonna take some time to digest all the info but it sure is worth it.

Here are a few resources / steps that may help you get started, focused on its community and not so much on the technical details of downloading and running your first Scala “hello world”.

More than a collection to bookmark for “someday,” this is a collection of resources to start following today.

I haven’t looked at all the references but from the ones I checked, I don’t think you will be disappointed.

May 1, 2013

5 Best Free Scala Books

Filed under: Programming,Scala — Patrick Durusau @ 2:58 pm

5 Best Free Scala Books

From the post:

Scala is a modern, object-functional, multi-paradigm, Java-based programming and scripting language that is released under the BSD license. It blends the functional and object-oriented programming models. Scala introduces several innovative language constructs. It improves on Java’s support for object-oriented programming by traits, which are stackable and cannot have constructor parameters. It also offers closures, a feature that dynamic languages like Python and Ruby have adopted.

Scala is particularly useful for building cloud-based/deliverable Software as a Service (SaaS) online applications, and is also proficient to develop traditional, imperative code.

Scala helps programmers write tighter code. It uses a number of techniques to cut down on unnecessary syntax, which helps to make code succinct. Typically, code sizes are reduced by a factor of 2 or 3 compared to an equivalent Java application.

The focus of this article is to select the finest Scala books which are available to download for free. So get reading and learning.


Programming in Scala by Martin Odersky, Lex Spoon, and Bill Venners.

Programming Scala by Dean Wampler, Alex Payne.

Exploring Lift by Derek Chen-Becker, Marius Danciu and Tyler Weir.

Scala by Example by Martin Odersky.

A Scala Tutorial for Java Programmers by Michel Schinz and Philipp Haller.

I first saw this at: DZone.
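The quoted description mentions stackable traits and closures. A minimal sketch of both in plain Scala; every name here (Greeter, Shouting, Polite, prefixer) is invented for illustration.

```scala
object StackableTraits {
  trait Greeter {
    def greet(name: String): String = "Hello, " + name
  }

  // Each trait wraps whatever implementation sits before it in the stack.
  trait Shouting extends Greeter {
    override def greet(name: String): String = super.greet(name).toUpperCase
  }

  trait Polite extends Greeter {
    override def greet(name: String): String =
      super.greet(name) + ", pleased to meet you"
  }

  class BasicGreeter extends Greeter

  // The rightmost trait runs first and reaches the others through super,
  // so the order of mixing changes the result.
  val shoutThenPolite: Greeter = new BasicGreeter with Shouting with Polite
  val politeThenShout: Greeter = new BasicGreeter with Polite with Shouting

  // A closure: the returned function captures the local value `prefix`.
  def prefixer(prefix: String): String => String = s => prefix + s
}
```

Calling greet("Ada") on the first value upper-cases only the greeting, while the second upper-cases the whole sentence.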

April 29, 2013

scalingpipe – …

Filed under: LingPipe,Linguistics,Scala — Patrick Durusau @ 2:07 pm

scalingpipe – porting LingPipe tutorial examples to Scala by Sujit Pal.

From the post:

Recently, I was tasked with evaluating LingPipe for use in our NLP processing pipeline. I have looked at LingPipe before, but have generally kept away from it because of its licensing – while it is quite friendly to individual developers such as myself (as long as I share the results of my work, I can use LingPipe without any royalties), a lot of the stuff I do is motivated by problems at work, and LingPipe based solutions are only practical when the company is open to the licensing costs involved.

So anyway, in an attempt to kill two birds with one stone, I decided to work with the LingPipe tutorial, but with Scala. I figured that would allow me to pick up the LingPipe API as well as give me some additional experience in Scala coding. I looked around to see if anybody had done something similar and I came upon the scalingpipe project on GitHub where Alexy Khrabov had started with porting the Interesting Phrases tutorial example.

Now there’s a clever idea!

Achieves a deep understanding of the LingPipe API and Scala experience.

Not to mention having useful results for other users.

April 14, 2013

A walk-through for the Twitter streaming API

Filed under: Scala,Tweets — Patrick Durusau @ 2:42 pm

A walk-through for the Twitter streaming API by Jason Baldridge.

From the post:

Analyzing tweets is all the rage, and if you are new to the game you want to know how to get them programmatically. There are many ways to do this, but a great start is to use the Twitter streaming API, a RESTful service that allows you to pull tweets in real time based on criteria you specify. For most people, this will mean having access to the spritzer, which provides only a very small percentage of all the tweets going through Twitter at any given moment. For access to more, you need to have a special relationship with Twitter or pay Twitter or an affiliate like Gnip.

This post provides a basic walk-through for using the Twitter streaming API. You can get all of this based on the documentation provided by Twitter, but this will be slightly easier going for those new to such services. (This post is mainly geared for the first phase of the course project for students in my Applied Natural Language Processing class this semester.)

You need to have a Twitter account to do this walk-through, so obtain one now if you don’t have one already.

Basics of obtaining tweets from the Twitter stream.

I mention it as an active data source that may find its way into your topic map.

April 5, 2013


Filed under: Data Structures,Scala — Patrick Durusau @ 2:56 pm


From the Saddle webpage:

Saddle is a data manipulation library for Scala that provides array-backed, indexed, one- and two-dimensional data structures that are judiciously specialized on JVM primitives to avoid the overhead of boxing and unboxing.

Saddle offers vectorized numerical calculations, automatic alignment of data along indices, robustness to missing (N/A) values, and facilities for I/O.

Saddle draws inspiration from several sources, among them the R programming language & statistical environment, the numpy and pandas Python libraries, and the Scala collections library.

I have heard some one- and two-dimensional data structures can be quite useful. 😉

Something to play with over the weekend.
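For a feel of what "automatic alignment of data along indices" and "robustness to missing values" mean, here is a toy sketch. It is emphatically not Saddle's API: Series is my own Map-backed stand-in (Saddle is array-backed), and None plays the role of N/A.

```scala
object AlignedSeries {
  // A toy indexed series: the Map keys are the index.
  // (Hypothetical type for illustration, not Saddle's Series.)
  final case class Series(data: Map[String, Double]) {
    // Add two series over the union of both indices. A key missing on
    // either side yields None instead of failing.
    def +(other: Series): Map[String, Option[Double]] = {
      val keys = data.keySet ++ other.data.keySet
      keys.map { k =>
        val sum = for (a <- data.get(k); b <- other.data.get(k)) yield a + b
        k -> sum
      }.toMap
    }
  }
}
```

Adding a series indexed by (mon, tue) to one indexed by (tue, wed) gives a value only for tue; mon and wed come back as missing.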


March 20, 2013

“Functional Programming for…Big Data”

Filed under: BigData,Cascading,Cascalog,Clojure,Functional Programming,Scala,Scalding — Patrick Durusau @ 3:27 pm

“Functional Programming for optimization problems in Big Data” by Paco Nathan.

Interesting slide deck, even if it doesn’t start with high drama. 😉


  1. Data Science
  2. Functional Programming
  3. Workflow Abstraction
  4. Typical Use Cases
  5. Open Data Example

The reading list mentioned in these slides makes a nice self-review course in data science.

The Open Data Example is for Palo Alto but you can substitute a city with open data closer to home.

Applied Natural Language Processing

Filed under: Natural Language Processing,Scala — Patrick Durusau @ 5:53 am

Applied Natural Language Processing by Jason Baldridge.


This class will provide instruction on applying algorithms in natural language processing and machine learning for experimentation and for real world tasks, including clustering, classification, part-of-speech tagging, named entity recognition, topic modeling, and more. The approach will be practical and hands-on: for example, students will program common classifiers from the ground up, use existing toolkits such as OpenNLP, Chalk, StanfordNLP, Mallet, and Breeze, construct NLP pipelines with UIMA, and get some initial experience with distributed computation with Hadoop and Spark. Guidance will also be given on software engineering, including build tools, git, and testing. It is assumed that students are already familiar with machine learning and/or computational linguistics and that they already are competent programmers. The programming language used in the course will be Scala; no explicit instruction will be given in Scala programming, but resources and assistance will be provided for those new to the language.

From the syllabus:

The foremost goal of this course is to provide practical exposure to the core techniques and applications of natural language processing. By the end, students will understand the motivations for and capabilities of several core natural language processing and machine learning algorithms and techniques used in text analysis, including:

  • regular expressions
  • vector space models
  • clustering
  • classification
  • deduplication
  • n-gram language models
  • topic models
  • part-of-speech tagging
  • named entity recognition
  • PageRank
  • label propagation
  • dependency parsing

We will show, on a few chosen topics, how natural language processing builds on and uses the fundamental data structures and algorithms presented in this course. In particular, we will discuss:

  • authorship attribution
  • language identification
  • spam detection
  • sentiment analysis
  • influence
  • information extraction
  • geolocation

Students will learn to write non-trivial programs for natural language processing that take advantage of existing open source toolkits. The course will involve significant guidance and instruction in software engineering practices and principles, including:

  • functional programming
  • distributed version control systems (git)
  • build systems
  • unit testing
  • distributed computing (Hadoop)

The course will help prepare students both for jobs in the industry and for doing original research that involves natural language processing.

A great start to one aspect of being a “data scientist.”

I encountered this course via the Nak (Scala library for NLP) project. Version 1.1.1 was just released and I saw a tweet from Jason Baldridge on the same.

The course materials have exercises and a rich set of links to other resources.

You may also enjoy:

Jason Baldridge’s homepage.

Bcomposes (Jason’s blog).
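As a taste of one syllabus item, n-gram language models, here is a minimal bigram sketch in plain Scala. It is my own illustration, not course material: the tokenizer is deliberately crude (ASCII letters only) and the estimate is unsmoothed maximum likelihood.

```scala
object Bigrams {
  // Lower-case the text and split on anything that is not a letter.
  def tokens(text: String): Vector[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toVector

  // Count adjacent word pairs.
  def counts(text: String): Map[(String, String), Int] =
    tokens(text).sliding(2)
      .collect { case Vector(a, b) => (a, b) }
      .toList
      .groupBy(identity)
      .map { case (pair, hits) => pair -> hits.size }

  // Maximum-likelihood estimate of P(next | prev), with no smoothing:
  // count(prev, next) divided by the count of all pairs starting with prev.
  def prob(text: String, prev: String, next: String): Double = {
    val c = counts(text)
    val total = c.collect { case ((p, _), n) if p == prev => n }.sum
    if (total == 0) 0.0 else c.getOrElse((prev, next), 0).toDouble / total
  }
}
```

On "the cat sat on the mat", "the" is followed once by "cat" and once by "mat", so each continuation gets probability one half.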

March 10, 2013

Programming Isn’t Math

Filed under: Algebird,Scala,Scalding,Tweets — Patrick Durusau @ 8:41 pm

Programming Isn’t Math by Oscar Boykin.

From the description:

Functional programming has a rich history of drawing from mathematical theory, yet in this highly entertaining talk from the Northeast Scala Symposium, Twitter data scientist Oscar Boykin makes the case that programming is distinct from mathematics. This distinction is healthy and does not mean we can’t leverage many results and concepts from mathematics.

As examples, Oscar will discuss some recent work — algebird, bijection, scalding — and show cases where mathematical purity was both helpful and harmful to developing products at Twitter.

The phrase “…highly entertaining…” may be an understatement.

The type of presentation where you want to start reading new material during the presentation but you are afraid of missing the next gold nugget!

Definitely one to start the week on!
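For readers wondering what kind of mathematics gets borrowed, a monoid is the usual first example: an associative combine operation plus an identity element. The sketch below is a hand-rolled illustration in plain Scala and should not be read as Algebird's actual code; the names are assumptions chosen for the example.

```scala
object MonoidSketch {
  // A monoid: an identity element and an associative way to combine values.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  implicit val intAddition: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def plus(x: Int, y: Int): Int = x + y
  }

  // Maps form a monoid whenever their values do: merge the keys and
  // combine the values that share a key.
  implicit def mapMonoid[K, V](implicit v: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      def zero: Map[K, V] = Map.empty
      def plus(x: Map[K, V], y: Map[K, V]): Map[K, V] =
        y.foldLeft(x) { case (acc, (k, value)) =>
          acc.updated(k, v.plus(acc.getOrElse(k, v.zero), value))
        }
    }

  // Because plus is associative, partial results computed on separate
  // machines can be merged in any grouping.
  def sumAll[A](xs: Seq[A])(implicit m: Monoid[A]): A =
    xs.foldLeft(m.zero)(m.plus)
}
```

The same sumAll adds integers or merges per-key counts, which is the sort of reuse that makes the algebra worth borrowing.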

February 14, 2013

Neo4j and Gatling Sitting in a Tree, Performance T-E-S-T-ING

Filed under: Gatling,Neo4j,Scala — Patrick Durusau @ 7:16 pm

Neo4j and Gatling Sitting in a Tree, Performance T-E-S-T-ING by Max De Marzi.

From the post:

I was introduced to the open-source performance testing tool Gatling a few months ago by Dustin Barnes and fell in love with it. It has an easy to use DSL, and even though I don’t know a lick of Scala, I was able to figure out how to use it. It creates pretty awesome graphics and takes care of a lot of work for you behind the scenes. They have great documentation and a pretty active google group where newbies and questions are welcomed.

It requires you to have Scala installed, but once you do all you need to do is create your tests and use a command line to execute it. I’ll show you how to do a few basic things, like test that you have everything working, then we’ll create nodes and relationships, and then query those nodes.

You did run performance tests on your semantic application. Yes?

February 2, 2013

Semantic Search for Scala – Post 1

Filed under: Programming,Scala,Semantics — Patrick Durusau @ 3:08 pm

Semantic Search for Scala – Post 1 by Mads Hartmann Jensen.

From the post:

The goal of the project is to create a semantic search engine for Scala, in the form of a library, and integrate it with the Scala IDE plugin for Eclipse. Part of the solution will be to index all aspects of a Scala code, that is:

  • Definitions of the usual Scala elements: classes, traits, objects, methods, fields, etc.
  • References to the above elements. Some more challenging cases to consider are self-types, type-aliases, code injected by the compiler, and implicits.

With this information the library should be able to

  • Find all occurrences of any type of Scala element
  • Create a call-hierarchy, that is, list all in- and outgoing method invocations, for any Scala method.
  • Create a type-hierarchy, i.e. list all super- and subclasses, of a specific type (I won’t necessarily find time to implement this during my thesis but nothing is stopping me from working on the project even after I hand in the report)
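The call-hierarchy item in the list above can be sketched with a very simple index. This is only my guess at the shape of the problem, not the project's design: CallIndex is a hypothetical map from a method name to the names of the methods it invokes.

```scala
object CallHierarchy {
  // Hypothetical index: for each method name, the methods it invokes.
  type CallIndex = Map[String, Set[String]]

  // Outgoing invocations: what does `method` call?
  def outgoing(index: CallIndex, method: String): Set[String] =
    index.getOrElse(method, Set.empty[String])

  // Incoming invocations: which methods call `method`?
  def incoming(index: CallIndex, method: String): Set[String] =
    index.collect {
      case (caller, callees) if callees.contains(method) => caller
    }.toSet
}
```

A real index would key on resolved symbols instead of bare names, which is where the self-types, aliases and implicits mentioned above make things hard.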

Mads is working on his master’s thesis and Typesafe has agreed to collaborate with him.

For a longer description of the project (or to comment), see: Features and Trees

If you have suggestions on semantic search for programming languages, please contact Mads on Twitter, @Mads_Hartmann.

January 26, 2013

The Neophyte’s Guide to Scala Part [n]…

Filed under: Functional Programming,Scala — Patrick Durusau @ 1:41 pm

Daniel Westheide has a series of posts introducing Scala to Neophytes.

As of today:

The Neophyte’s Guide to Scala Part 1: Extractors

The Neophyte’s Guide to Scala Part 2: Extracting Sequences

The Neophyte’s Guide to Scala Part 3: Patterns Everywhere

The Neophyte’s Guide to Scala Part 4: Pattern Matching Anonymous Functions

The Neophyte’s Guide to Scala Part 5: The Option type

The Neophyte’s Guide to Scala Part 6: Error handling with Try

The Neophyte’s Guide to Scala Part 7: The Either type

The Neophyte’s Guide to Scala Part 8: Welcome to the Future

The Neophyte’s Guide to Scala Part 9: Promises and Futures in practice

The Neophyte’s Guide to Scala Part 10: Staying DRY with higher-order functions
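To give a flavor of several of these topics at once (extractors, pattern matching, Option, Try and Either), here is a small standard-library sketch. The examples are mine and are not taken from Daniel's posts.

```scala
import scala.util.{Failure, Success, Try}

object NeophyteSketch {
  // An extractor is any object with an unapply method.
  object Email {
    def unapply(s: String): Option[(String, String)] = s.split("@") match {
      case Array(user, domain) if user.nonEmpty && domain.nonEmpty =>
        Some((user, domain))
      case _ => None
    }
  }

  // Pattern matching with the extractor; Option models "maybe no answer".
  def domainOf(address: String): Option[String] = address match {
    case Email(_, domain) => Some(domain)
    case _                => None
  }

  // Try captures an exception as a value instead of throwing it.
  def parsePort(s: String): Try[Int] = Try(s.trim.toInt)

  // Either carries a description of what went wrong on the Left.
  def portOrError(s: String): Either[String, Int] = parsePort(s) match {
    case Success(p) if p > 0 && p < 65536 => Right(p)
    case Success(p)                       => Left("out of range: " + p)
    case Failure(_)                       => Left("not a number: " + s)
  }
}
```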

Apologies for not seeing this sooner.

Makes a nice starting place for the 25th March 2013 Functional Programming Principles in Scala class by Martin Odersky.

I first saw this at Chris Cundill’s This week in #Scala (26/01/2013).

Functional Programming Principles in Scala

Filed under: Functional Programming,Programming,Scala — Patrick Durusau @ 1:41 pm

Functional Programming Principles in Scala by Martin Odersky.

March 25th 2013 (7 weeks long)

From the webpage:

This course introduces the cornerstones of functional programming using the Scala programming language. Functional programming has become more and more popular in recent years because it promotes code that’s safe, concise, and elegant. Furthermore, functional programming makes it easier to write parallel code for today’s and tomorrow’s multiprocessors by replacing mutable variables and loops with powerful ways to define and compose functions.

Scala is a language that fuses functional and object-oriented programming in a practical package. It interoperates seamlessly with Java and its tools. Scala is now used in a rapidly increasing number of open source projects and companies. It provides the core infrastructure for sites such as Twitter, LinkedIn, Foursquare, Tumblr, and Klout.

In this course you will discover the elements of the functional programming style and learn how to apply them usefully in your daily programming tasks. You will also develop a solid foundation for reasoning about functional programs, by touching upon proofs of invariants and the tracing of execution symbolically.

The course is hands on; most units introduce short programs that serve as illustrations of important concepts and invite you to play with them, modifying and improving them. The course is complemented by a series of assignments, most of which are also programming projects.

In case you missed it last time.

I first saw this at Chris Cundill’s This week in #Scala (26/01/2013).
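The course description talks about "replacing mutable variables and loops with powerful ways to define and compose functions." A small before-and-after sketch of what that looks like; the example is mine, not from the course.

```scala
object LoopsVsFunctions {
  // Imperative version: a mutable accumulator updated inside a loop.
  def sumEvenSquaresLoop(xs: Seq[Int]): Int = {
    var total = 0
    for (x <- xs) if (x % 2 == 0) total += x * x
    total
  }

  // Functional version: no mutation, just composed transformations.
  def sumEvenSquares(xs: Seq[Int]): Int =
    xs.filter(_ % 2 == 0).map(x => x * x).sum

  // Functions are values and compose into larger functions.
  val square: Int => Int = x => x * x
  val increment: Int => Int = _ + 1
  val squareThenIncrement: Int => Int = square.andThen(increment)
}
```

Both versions give the same answer. The second has no shared state to protect, which is the property the parallelism argument rests on.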

January 17, 2013

…Functional Programming and Scala

Filed under: Functional Programming,Scala — Patrick Durusau @ 7:25 pm

Resources for Getting Started With Functional Programming and Scala by Kelsey Innis.

From the post:

This is the “secret slide” from my recent talk Learning Functional Programming without Growing a Neckbeard, with links to the sources I used to put the talk together and some suggestions for ways to get started writing Scala code.

The “…without growing a neckbeard” merits mention even if you are not interested in functional programming and topic maps.

Nice list of resources.

Don’t miss the presentation!

I first saw this at This week in #Scala (11/01/2013) by Chris Cundill.

January 7, 2013

Scala Cheatsheet

Filed under: Programming,Scala — Patrick Durusau @ 9:23 am

Scala Cheatsheet by Brendan O’Connor.

Quick reference to Scala syntax.

Also includes examples of bad practice, labeled as such.

I first saw this at This week in #Scala (04/01/2013) by Chris Cundill.

November 28, 2012

Computational Finance with Map-Reduce in Scala [Since Quants Have Funding]

Filed under: Finance Services,MapReduce,Scala — Patrick Durusau @ 5:48 am

Computational Finance with Map-Reduce in Scala by Ron Coleman, Udaya Ghattamaneni, Mark Logan, and Alan Labouseur. (PDF)

Assuming the computations performed by quants are semantically homogeneous (a big assumption), the sources of their data and application of the outcomes, are not.

The clients of quants aren’t interested in you humming “…it’s a big world after all…,” etc. They are interested in furtherance of their financial operations.

Using topic maps to make an already effective tool more effective, is the most likely way to capture their interest. (Short of taking hostages.)

I first saw this in a tweet by Data Science London.

November 25, 2012

Graham’s Guide to Learning Scala

Filed under: Programming,Scala — Patrick Durusau @ 11:21 am

Graham’s Guide to Learning Scala by Graham Lee.

From the post:

It’s a pretty widely-accepted view that, as a programmer, learning new languages is a Good Idea™ . Most people with more than one language under their belt would say that learning new languages broadens your mind in ways that will positively affect the way you work, even if you never use that language again.

With the Christmas holidays coming up and many people likely to take some time off work, this end of the year presents a great opportunity to take some time out from your week-to-week programming grind and do some learning.

With that in mind, I present “Graham’s Guide to Learning Scala”. There are many, many resources on the web for learning about Scala. In fact, I think there’s probably too many! It would be quite easy to start in the wrong place and quickly get discouraged.

So this is not yet another resource to add to the pile. Rather, this is a guided course through what I believe are some of the best resources for learning Scala, and in an order that I think will help a complete newbie pick it up quickly but without feeling overwhelmed.

And, best of all, it has 9 Steps!

As Graham says, the holidays are coming up.

One way to avoid nosey family members, ravenous cousins and in-laws, almost off-key (you would have to know the key to be off-key) singing, is to spend some quality time with your laptop.

Graham offers a good selection of resources to fill a week, either now or at some other down time of the year.

November 4, 2012

Intro to Scalding by @posco and @argyris [video lecture]

Filed under: BigData,Scala,Scalding,Tweets — Patrick Durusau @ 3:40 pm

Intro to Scalding by @posco and @argyris by Marti Hearst.

From the post:

On Thursday we learned about an alternative language for analyzing big data: Scalding. It’s built on Scala and is used extensively by the Twitter Revenue group. Oscar Boykin presented a lecture that he and Argyris Zymnis put together for us:

(video – see Marti’s post)

Because scalding is built on the functional programming language Scala, it has an advantage over Pig in that you can have the equivalent of user-defined functions directly in your code. See the lecture notes for more details. Be sure to watch the video to get all the details especially since Oscar managed to make us all laugh throughout his lecture. Thanks guys!

Another great lecture from Marti’s class, “Analyzing Big Data with Twitter.”

When the revenue department of a business, at least a successful business, starts using a technology, it’s time to take notice.

August 28, 2012

Atomic Scala

Filed under: Scala — Patrick Durusau @ 4:00 pm

Atomic Scala by Bruce Eckel and Dianne Marsh.

From the webpage:

Atomic Scala is meant to be your first Scala book, not your last. We show you enough to become familiar and comfortable with the language — competent, but not expert. You’ll be able to write useful Scala code, but you won’t necessarily be able to read all the Scala code you encounter.

When you’re done, you’ll be ready for more complex Scala books, several of which we recommend at the end of this book.

The first 25% of the book is available for download.

Take a peek at the “about” page if the author names sound familiar. 😉

I first saw this at Christopher Lalanne’s A bag of tweets / August 2012.

August 20, 2012


Filed under: Scala,ScalaNLP — Patrick Durusau @ 7:21 am


From the ScalaNLP homepage:

ScalaNLP is a suite of machine learning and numerical computing libraries.

ScalaNLP is the umbrella project for Breeze and Epic. Breeze is a set of libraries for machine learning and numerical computing. Epic (coming soon) is a high-performance statistical parser.

From the about page:

Breeze is a suite of Scala libraries for numerical processing, machine learning, and natural language processing. Its primary focus is on being generic, clean, and powerful without sacrificing (much) efficiency.

The library currently consists of several parts:

  • breeze-math: Linear algebra and numerics routines
  • breeze-process: Libraries for processing text and managing data pipelines.
  • breeze-learn: Machine Learning, Statistics, and Optimization.

Possible future releases:

  • breeze-viz: Visualization and plotting
  • breeze-fst: Finite state toolkit

Breeze is the merger of the ScalaNLP and Scalala projects, because one of the original maintainers is unable to continue development. The Scalala parts are largely rewritten.

Epic is a high-performance statistical parser written in Scala. It uses Expectation Propagation to build complex models without suffering the exponential runtimes one would get in a naive model. Epic is nearly state-of-the-art on the standard benchmark dataset in Natural Language Processing. We will be releasing Epic soon.

In case you are interested in project history, Scalala source.

A fairly new community so drop by and say hello.

August 12, 2012

Scalding for the Impatient

Filed under: Cascading,Scala,Scalding,TF-IDF — Patrick Durusau @ 5:39 pm

Scalding for the Impatient by Sujit Pal.

From the post:

A few weeks ago, I wrote about Pig, a DSL that allows you to specify a data processing flow in terms of PigLatin operations, and results in a sequence of Map-Reduce jobs on the backend. Cascading is similar to Pig, except that it provides a (functional) Java API to specify a data processing flow. One obvious advantage is that everything can now be in a single language (no more having to worry about UDF integration issues). But there are others as well, as detailed here and here.

Cascading is well documented, and there is also a very entertaining series of articles titled Cascading for the Impatient that builds up a Cascading application to calculate TF-IDF of terms in a (small) corpus. The objective is to showcase the features one would need to get up and running quickly with Cascading.

Scalding is a Scala DSL built on top of Cascading. As you would expect, Cascading code is an order of magnitude shorter than equivalent Map-Reduce code. But because Java is not a functional language, implementing functional constructs leads to some verbosity in Cascading that is eliminated in Scalding, leading to even shorter and more readable code.

I was looking for something to try my newly acquired Scala skills on, so I hit upon the idea of building up a similar application to calculate TF-IDF for terms in a corpus. The table below summarizes the progression of the Cascading for the Impatient series. I’ve provided links to the original articles for the theory (which is very nicely explained there) and links to the source codes for both the Cascading and Scalding versions.

A very nice side by side comparison and likely to make you interested in Scalding.
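If TF-IDF is new to you, the calculation itself fits in a few lines of plain Scala. This sketch is neither the Cascading nor the Scalding version from Sujit's post; it runs in memory on ordinary collections and uses one common variant of the formula (term frequency normalized by document length, natural-log IDF), which is an assumption on my part.

```scala
object TfIdf {
  def tokens(doc: String): Seq[String] =
    doc.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // tf  = occurrences of the term in the document / document length
  // idf = ln(number of documents / number of documents containing the term)
  def score(term: String, doc: String, corpus: Seq[String]): Double = {
    val words = tokens(doc)
    val tf =
      if (words.isEmpty) 0.0
      else words.count(_ == term).toDouble / words.size
    val docsWithTerm = corpus.count(d => tokens(d).contains(term))
    val idf =
      if (docsWithTerm == 0) 0.0
      else math.log(corpus.size.toDouble / docsWithTerm)
    tf * idf
  }
}
```

A term that appears in every document scores zero however often it occurs, which is the point of the IDF factor.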

August 4, 2012


Filed under: MapReduce,Scala,Scalding — Patrick Durusau @ 3:07 pm

Scalding: Powerful & Concise MapReduce Programming


Scala is a functional programming language on the JVM. Hadoop uses a functional programming model to represent large-scale distributed computation. Scala is thus a very natural match for Hadoop.

In this presentation to the San Francisco Scala User Group, Dr. Oscar Boykin and Dr. Argyris Zymnis from Twitter give us some insight on Scalding DSL and provide some example jobs for common use cases.

Twitter uses Scalding for data analysis and machine learning, particularly in cases where we need more than sql-like queries on the logs, for instance fitting models and matrix processing. It scales beautifully from simple, grep-like jobs all the way up to jobs with hundreds of map-reduce pairs.

The Alice example failed (counted the different forms of Alice differently). I am reading a regex book so that may have made the problem more obvious.

Lesson: Test code/examples before presentation. 😉
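I am assuming the mismatch came from case and attached punctuation. If so, normalizing tokens before counting is enough to fold the variants together. The sketch below is plain Scala on an in-memory string, not a Scalding job, and the names are mine.

```scala
object WordCount {
  // Lower-case each token and strip everything except ASCII letters, so
  // that "Alice", "alice," and "(Alice)!" all count as the same word.
  // Possessives such as "Alice's" would still need a rule of their own.
  def normalize(token: String): String =
    token.toLowerCase.replaceAll("[^a-z]", "")

  def count(text: String): Map[String, Int] =
    text.split("\\s+")
      .map(normalize)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, hits) => word -> hits.length }
}
```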

See the Github repository:

Both Scalding and the presentation are worth your time.

July 11, 2012


Filed under: Games,Programming,Scala — Patrick Durusau @ 2:25 pm

Scalatron: Learn Scala with a programming game

From the homepage:

Scalatron is a free, open-source programming game in which bots, written in Scala, compete in a virtual arena for energy and survival. You can play by yourself against the computer or organize a tournament with friends. Scalatron may be the quickest and most entertaining way to become productive in Scala. – For updates, follow @scalatron on Twitter.

Entertaining and works right out of the “box.”

Well, almost. Remember the HBase port 8080 conflict? From the Scalatron documentation:

java -jar Scalatron.jar -help

Displays far more command line options than will be meaningful at first.

For the HBase 8080 issue, you need:

java -jar Scalatron.jar -port int

or in my case:

java -jar Scalatron.jar -port 9000

Caution, on startup it will ask to make Google Chrome your default browser. Good that it asks but annoying. Why not leave the user with whatever default browser they already prefer?

Anyway, it starts up, asks you to create a user account (in a browser window), and lets you set the Administrator password.

Scalatron window opens up and I can tell this could be real addictive, in or out of ISO WG meetings. 😉

Scala resources mentioned in the Scalatron Tutorial document:

Other Resources

It’s a bit close to the metal to use as a model for a topic map “game.”

But I like the idea of “bots” (read teams) competing against each other, except for the construction of a topic map.

Just sketching some rough ideas but assuming some asynchronous means of communication, say tweets, emails, IRC chat, a simple syntax (CTM anyone?), basic automated functions and scoring, that should be doable, even if not on a “web” scale. 😉

By “basic automated functions” I mean more than simply parsing syntax for addition to a topic map. I would include, for example, the submission of DOIs, which would be resolved against a vendor or library catalog, with the automatic production of additional topics, associations, etc. Repetitive entry of information by graduate students only proves they are skillful copyists.

Assuming some teams will discover the same information as others, some timing mechanism and awarding of “credit” for topics/associations/occurrences added to the map would be needed.
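One way the timing and credit idea could work, sketched in Python. Everything here is hypothetical: the `Submission` record, the point values, and the rule that the earliest team gets full credit while later teams get partial credit are assumptions for illustration, not a worked-out design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Submission:
    team: str
    item: str         # hypothetical key for a topic, association or occurrence
    timestamp: float  # seconds since the contest started

def award_credit(submissions, first_points=3, repeat_points=1):
    """Full credit to the earliest team for each item, partial credit to later teams."""
    credited = {}  # item -> teams already credited for it
    scores = {}
    for sub in sorted(submissions, key=lambda s: s.timestamp):
        teams = credited.setdefault(sub.item, set())
        if sub.team in teams:
            continue  # a team is credited at most once per item
        points = first_points if not teams else repeat_points
        teams.add(sub.team)
        scores[sub.team] = scores.get(sub.team, 0) + points
    return scores

subs = [
    Submission("red", "topic:paris", 5.0),
    Submission("blue", "topic:paris", 2.0),
    Submission("red", "topic:rome", 7.0),
    Submission("blue", "topic:paris", 9.0),  # duplicate, ignored
]
leaderboard = award_credit(subs)
```

Sorting by timestamp before scoring means submissions can arrive over any asynchronous channel and in any order, as long as each carries a trustworthy time.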

Not to mention the usual stuff of contests, leader board, regular updating of the map, along with graph display, etc.

Something to think about. As I tell my daughter, life is too important to be taken seriously. Perhaps the same is true about topic maps.

Forwarded by Jack Park. (Who is not responsible for my musings on the same.)

June 2, 2012

High-Performance Domain-Specific Languages using Delite

Filed under: Delite,DSL,Machine Learning,Parallel Programming,Scala — Patrick Durusau @ 12:50 pm

High-Performance Domain-Specific Languages using Delite


This tutorial is an introduction to developing domain specific languages (DSLs) for productivity and performance using Delite. Delite is a Scala infrastructure that simplifies the process of implementing DSLs for parallel computation. The goal of this tutorial is to equip attendees with the knowledge and tools to develop DSLs that can dramatically improve the experience of using high performance computation in important scientific and engineering domains. In the first half of the day we will focus on example DSLs that provide both high-productivity and performance. In the second half of the day we will focus on understanding the infrastructure for implementing DSLs in Scala and developing techniques for defining good DSLs.

The graph manipulation language Green-Marl is one of the subjects of this tutorial.

This resource should be located and “boosted” by a search engine tuned to my preferences.

Skipping breaks, etc., you will find:

  • Introduction To High Performance DSLs (Kunle Olukotun)
  • OptiML: A DSL for Machine Learning (Arvind Sujeeth)
  • Liszt: A DSL for solving mesh-based PDEs (Zach Devito)
  • Green-Marl: A DSL for efficient Graph Analysis (Sungpack Hong)
  • Scala Tutorial (Hassan Chafi)
  • Delite DSL Infrastructure Overview (Kevin Brown)
  • High Performance DSL Implementation Using Delite (Arvind Sujeeth)
  • Future Directions in DSL Research (Hassan Chafi)

Compare your desktop computer to the MANIAC 1 (calculations for the first hydrogen bomb).

What have you invented/discovered lately?

April 14, 2012

Procedural Reflection in Programming Languages Volume 1

Filed under: Lisp,Reflection,Scala — Patrick Durusau @ 6:28 pm

Procedural Reflection in Programming Languages Volume 1

Brian Cantwell Smith’s dissertation that is the base document for reflection in programming languages.


We show how a computational system can be constructed to “reason”, effectively and consequentially, about its own inferential processes. The analysis proceeds in two parts. First, we consider the general question of computational semantics, rejecting traditional approaches, and arguing that the declarative and procedural aspects of computational symbols (what they stand for, and what behaviour they engender) should be analysed independently, in order that they may be coherently related. Second, we investigate self-referential behaviour in computational processes, and show how to embed an effective procedural model of a computational calculus within that calculus (a model not unlike a meta-circular interpreter, but connected to the fundamental operations of the machine in such a way as to provide, at any point in a computation, fully articulated descriptions of the state of that computation, for inspection and possible modification). In terms of the theories that result from these investigations, we present a general architecture for procedurally reflective processes, able to shift smoothly between dealing with a given subject domain, and dealing with their own reasoning processes over that domain.

An instance of the general solution is worked out in the context of an applicative language. Specifically, we present three successive dialects of LISP: 1-LISP, a distillation of current practice, for comparison purposes; 2-LISP, a dialect constructed in terms of our rationalised semantics, in which the concept of elevation is rejected in favour of independent notions of simplification and reference, and in which the respective categories of notation, structure, semantics, and behaviour are strictly aligned; and 3-LISP, an extension of 2-LISP endowed with reflective powers. (Warning: Hand copied from an image PDF. Typing errors may have occurred.)

I think reflection as it is described here is very close to Newcomb’s notion of composite subject identities, which are themselves composed of composite subject identities.

Has me wondering what a general purpose identification language with reflection would look like?

Martin Odersky: Reflection and Compilers

Filed under: Compilers,Reflection,Scala — Patrick Durusau @ 6:27 pm

Martin Odersky: Reflection and Compilers

From the description:

Reflection and compilers do tantalizingly similar things. Yet, in mainstream, statically typed languages the two have been only loosely coupled, and generally share very little code. In this talk I explore what happens if one sets out to overcome their separation.

The first half of the talk addresses the challenge of how reflection libraries can share core data structures and algorithms with the language’s compiler without having compiler internals leaking into the standard library API. It turns out that a component system based on abstract types and path-dependent types is a good tool to solve this challenge. I’ll explain how the “multiple cake pattern” can be fruitfully applied to expose the right kind of information.

The second half of the talk explores what one can do when strong, mirror-based reflection is a standard tool. In particular, the compiler itself can use reflection, leading to a particular system of low-level macros that rewrite syntax trees. One core property of these macros is that they can express staging, by rewriting a tree at one stage to code that produces the same tree at the next stage. Staging lets us implement type reification and general LINQ-like functionality. What’s more, staging can also be applied to the macro system itself, with the consequence that a simple low-level macro system can produce a high-level hygienic one, without any extra effort from the language or compiler.

Ignore the comments about the quality of the sound and video. It looks like substantial improvements have been made or I am less sensitive to those issues. Give it a try and see what you think.

Strikes me as being very close to Newcomb’s thoughts on subject identity being composed of other subject identities.

Such that you could have subject representatives that “merge” together and then themselves form the basis for merging other subject representatives.
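A rough Python sketch of that idea, under a strong simplifying assumption: each subject representative is just a set of identifiers, and two representatives merge when they share at least one. The function name and sample identifiers are invented; real merging rules are richer than set intersection.

```python
def merge_representatives(representatives):
    """Merge identifier sets that overlap, directly or through earlier merges."""
    merged = []  # invariant: the sets in this list are pairwise disjoint
    for rep in representatives:
        current = set(rep)
        untouched = []
        for other in merged:
            if current & other:
                current |= other  # absorb an overlapping representative
            else:
                untouched.append(other)
        merged = untouched + [current]
    return merged

# {"a","b"} and {"c","d"} share nothing until {"b","c"} arrives and joins them.
result = merge_representatives([{"a", "b"}, {"c", "d"}, {"b", "c"}, {"e"}])
```

The third representative merges with the first, and the merged result then becomes the basis for absorbing the second, which is the cascading behaviour described above.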

Suggestions of literature on reflection, its issues and implementations? (Donated books welcome as well. Contact for physical delivery address.)

April 1, 2012

Neo4j Spring Data & Scala

Filed under: Neo4j,Scala,Spring Data — Patrick Durusau @ 7:10 pm

Neo4j Spring Data & Scala by Jan Machacek.

From the post:

Spring Data is an excellent tool that generates implementations of repositories using the naming conventions similar to the convention used in the dynamic language runtimes such as Grails and Ruby on Rails. In this post, I am going to show you how to use Spring Data in your Scala code.

In this post, we will construct a trivial application that uses Spring Data Neo4j to persist simple User objects. The only difference is that we’ll use Scala throughout and highlight some of the sticky points of Spring Data in Scala.

The post seeks to illustrate that Spring remains relevant, even after the advent of Scala.

It does that but code adoption, like application of security patches, is a mixed bag. Some people are using (read advocating) the latest releases, some people are using useful (read stable) software and still others are using older (read unsupported) software. You are likely to find Neo4j in one or more of those environments. Documentation for any and/or all of them would promote usage of Neo4j.

March 4, 2012


Filed under: Content Management System (CMS),Neo4j,Scala — Patrick Durusau @ 7:18 pm


A hosted “content management system,” aka, a hosted website solution. Based on Scala and Neo4j.

I suspect that Scala and Neo4j make it easier for the system developers to offer a hosted website solution.

I am not sure that in a hosted solution the average web developer will notice the difference.

Still, unless you want a “custom” domain name, the service is “free” with some restrictions.

Would be interested if you can tell that it is Scala and Neo4j powering the usual services?

From “Under the hood” is a next-generation content management system. In this post we will look how this system works and how this setup can benefit our users.

Unlike most other content management systems, is entirely built in the programming language Scala, which means it runs on the rock-solid and highly performant Java Virtual Machine.

Scala offers us a highly powerful programming model, greatly cutting back the amount of software we had to write, while its powerful type system reduces the number of potential coding errors.

Another unique feature of is the use of the Neo4j database engine.

Nearly all content management systems in use today store their information in a Relational Database Management System (RDBMS), a proven technology ubiquitous around the ICT spectrum.

Relational Database Management Systems are very useful and have become extremely robust through decades of improvements, but they are not very well suited for highly connected data.

The world-wide-web is highly connected and in our search for the right technology for our software, we decided a different approach towards storage of data was needed.

Neo4j ended up being the preferred solution for our storage needs. This database engine is based upon the model of the property-graph. Where an RDBMS stores information in tables, Neo4j stores information as nodes and relationships, where both can contain properties.

The data model of the property-graph is extremely simple, so it’s easy to reason about.

There were two main advantages to a graph-database for us. First of all, relationships are explicitly stored in the database. This makes navigating over complex networked data possible while maintaining a reasonable performance. Secondly, a graph database does not require a schema.
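To make the property-graph model concrete, here is a toy in-memory version in Python. It is not the Neo4j API; the class and method names are made up, and it only illustrates the three points in the excerpt: nodes and relationships both carry properties, relationships are stored explicitly, and no schema is declared up front.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: Node
    kind: str
    end: Node
    properties: dict = field(default_factory=dict)

class PropertyGraph:
    """Toy property graph: explicit relationships, no schema."""

    def __init__(self):
        self.nodes = []
        self.relationships = []

    def add_node(self, **properties):
        node = Node(properties)
        self.nodes.append(node)
        return node

    def relate(self, start, kind, end, **properties):
        rel = Relationship(start, kind, end, properties)
        self.relationships.append(rel)
        return rel

    def neighbours(self, node, kind):
        # Follow stored relationships instead of joining tables.
        return [r.end for r in self.relationships
                if r.start is node and r.kind == kind]

graph = PropertyGraph()
home = graph.add_node(title="Home")
about = graph.add_node(title="About", draft=True)  # extra property, no schema change
link = graph.relate(home, "LINKS_TO", about, label="menu")
linked_titles = [n.properties["title"] for n in graph.neighbours(home, "LINKS_TO")]
```

The linear scan in `neighbours` is only for brevity; a real graph database keeps relationships attached to their nodes so that traversal does not depend on the total size of the graph.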
