Oryx 2: Lambda architecture on Spark for real-time large scale machine learning
From the overview:
This is a redesign of the Oryx project as “Oryx 2.0”. The primary design goals are:
1. A more reusable platform for lambda-architecture-style designs, with batch, speed and serving layers
2. Make each layer usable independently
3. Fuller support for common machine learning needs
- Test/train set split and evaluation
- Parallel model build
- Hyper-parameter selection
4. Use newer technologies like Spark and Spark Streaming in order to simplify:
- Remove separate in-core implementations for scale-down
- Remove custom data transport implementation in favor of message queues like Apache Kafka
- Use a ‘real’ streaming framework instead of reimplementing a simple one
- Remove complex MapReduce-based implementations in favor of Apache Spark-based implementations
5. Support more input (i.e. not just CSV)
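To make goal 3 concrete, here is a minimal sketch of a test/train split, a parallel model build, and hyper-parameter selection working together. It is not Oryx 2 code and does not use Oryx or Spark APIs: every function name is hypothetical, the "model" is a toy one-weight regression, and a local thread pool stands in for the distributed parallelism that Spark would provide.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_train_test(data, test_fraction, seed):
    """Shuffle a copy of the data and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def build_model(train, shrinkage):
    """Toy model: fit y = w * x by regularised least squares.

    `shrinkage` is the hyper-parameter; larger values pull w toward 0.
    """
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + shrinkage)


def evaluate(model, test):
    """Mean squared error on the held-out test set (lower is better)."""
    return sum((y - model * x) ** 2 for x, y in test) / len(test)


def select_hyperparameter(data, candidates, test_fraction=0.2, seed=0):
    """Build one model per candidate in parallel and keep the best one.

    Returns (error, candidate, model) for the lowest test error.
    """
    train, test = split_train_test(data, test_fraction, seed)

    def build_and_score(candidate):
        model = build_model(train, candidate)
        return evaluate(model, test), candidate, model

    # Threads are only an illustration of building models concurrently.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(build_and_score, candidates))
    return min(results, key=lambda result: result[0])


if __name__ == "__main__":
    # Noise-free data with true weight 3.0, so no shrinkage should win.
    data = [(x, 3.0 * x) for x in range(1, 51)]
    error, best, model = select_hyperparameter(data, [0.0, 10.0, 1000.0])
    print(best, model, error)
```

The point of the sketch is the shape of the workflow: candidate hyper-parameter values are independent of one another, so their models can be built and evaluated concurrently against the same held-out split, and the best-scoring one is kept.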
The initial import was only three days ago, so now is the time to get involved if you want to be in on it from the beginning!