Oryx 2: Lambda architecture on Spark for real-time large scale machine learning

From the overview:

This is a redesign of the Oryx project as “Oryx 2.0”. The primary design goals are:

1. A more reusable platform for lambda-architecture-style designs, with batch, speed and serving layers

2. Make each layer usable independently

3. Fuller support for common machine learning needs

  • Test/train set split and evaluation
  • Parallel model build
  • Hyper-parameter selection

4. Use newer technologies like Spark and Streaming in order to simplify:

  • Remove separate in-core implementations for scale-down
  • Remove custom data transport implementation in favor of message queues like Apache Kafka
  • Use a ‘real’ streaming framework instead of reimplementing a simple one
  • Remove complex MapReduce-based implementations in favor of Apache Spark-based implementations

5. Support more input (i.e. not just CSV)
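To make goal 3 concrete, here is a minimal sketch of what a test/train split, a parallel model build, and hyper-parameter selection look like when wired together. This is not Oryx code and does not use its API: the function names (`split_train_test`, `build_model`, `evaluate`, `select_hyperparameter`), the toy one-weight model, and the thread pool standing in for a Spark cluster are all assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_train_test(data, test_fraction, seed=0):
    """Shuffle a copy of the data and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def build_model(train, lam):
    """Fit y = w * x with an L2 penalty lam (toy stand-in for a real model build)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)


def evaluate(w, test):
    """Mean squared error on the held-out set; lower is better."""
    return sum((y - w * x) ** 2 for x, y in test) / len(test)


def select_hyperparameter(data, candidates, test_fraction=0.2):
    """Build one model per candidate in parallel and keep the best-scoring one."""
    train, test = split_train_test(data, test_fraction)
    # Hypothetical stand-in for a distributed build: one thread per candidate.
    with ThreadPoolExecutor() as pool:
        models = list(pool.map(lambda lam: build_model(train, lam), candidates))
    scored = [(evaluate(w, test), lam, w) for lam, w in zip(candidates, models)]
    return min(scored)  # smallest held-out error wins


# Noise-free toy data with true weight 3.0.
data = [(x, 3.0 * x) for x in range(1, 51)]
best_error, best_lam, best_w = select_hyperparameter(data, [0.0, 10.0, 1000.0])
```

On this noise-free data the unregularized candidate should win, since any penalty only pulls the weight away from the true value. In a real system, each candidate build would be a full training job and the evaluation metric would depend on the model family.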

The initial import was three days ago, so now is the time to get involved if you want to be in on the beginning!

