Scala as a platform for statistical computing and data science by Darren Wilkinson
From the post:
There has been a lot of discussion on-line recently about languages for data analysis, statistical computing, and data science more generally. I don’t really want to go into the detail of why I believe that all of the common choices are fundamentally and unfixably flawed – language wars are so unseemly. Instead I want to explain why I’ve been using the Scala programming language recently and why, despite being far from perfect, I personally consider it to be a good language to form a platform for efficient and scalable statistical computing. Obviously, language choice is to some extent a personal preference, implicitly taking into account subjective trade-offs between features different individuals consider to be important. So I’ll start by listing some language/library/ecosystem features that I think are important, and then explain why.
A feature wish list
It should:
- be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks
- be free, open-source and platform independent
- be fast and efficient
- have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra
- have a strong type system, and be statically typed with good compile-time type checking and type safety
- have reasonable type inference
- have a REPL for interactive use
- have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)
- have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design
- allow imperative programming for those (rare) occasions where it makes sense
- be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications
The not-very-surprising punch-line is that Scala ticks all of those boxes and that I don’t know of any other languages that do. But before expanding on the above, it is worth noting a couple of (perhaps surprising) omissions. For example:
- have excellent data viz capability built-in
- have vast numbers of statistical routines in the standard library
Darren reviews Scala on each of these points.
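To make a few of the wish-list items concrete (static typing with type inference, immutable data structures, a functional style), here is a minimal sketch of my own. It is not taken from Darren's post. The names `WishListSketch`, `Summary`, and `summarise` are hypothetical, and it uses only the Scala standard library rather than any scientific computing library.

```scala
import scala.util.Random

object WishListSketch {
  // Immutable case class: fields cannot be reassigned after construction.
  final case class Summary(n: Int, mean: Double, variance: Double)

  // Pure function over an immutable Vector: no mutation, no explicit loops.
  // The types of the local values (Int, Double) are inferred by the compiler.
  def summarise(xs: Vector[Double]): Summary = {
    require(xs.size > 1, "need at least two observations")
    val n = xs.size
    val mean = xs.sum / n
    // Unbiased sample variance (divides by n - 1).
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    Summary(n, mean, variance)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42) // fixed seed so repeated runs draw the same values
    val draws = Vector.fill(10000)(rng.nextGaussian())
    println(summarise(draws))
  }
}
```

Pasting the object into the Scala REPL and calling `WishListSketch.summarise(Vector(1.0, 2.0, 3.0))` is a quick way to try it interactively, which touches on the REPL item in the list as well.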
Although he still uses R and Python, Darren has hopes for the future development of Scala into a full-featured data mining platform.
Perhaps his checklist will help define the requirements needed to make that one of Scala's possible futures.
I first saw this in Christophe Lalanne’s A bag of tweets / December 2013.