Microsoft open sources Distributed Machine Learning Toolkit for more efficient big data research by George Thomas Jr.
From the post:
Researchers at the Microsoft Asia research lab this week made the Microsoft Distributed Machine Learning Toolkit openly available to the developer community.
The toolkit, available now on GitHub, is designed for distributed machine learning — using multiple computers in parallel to solve a complex problem. It contains a parameter server-based programming framework, which makes machine learning tasks on big data highly scalable, efficient and flexible. It also contains two distributed machine learning algorithms, which can be used to train the fastest and largest topic model and the largest word-embedding model in the world.
The toolkit offers rich and easy-to-use APIs that lower the barrier to distributed machine learning, so researchers and developers can focus on the core machine learning tasks of handling data, building models and training.
The toolkit is unique, the researchers said, because it pairs system innovations with machine learning advances. With it, developers can tackle big-data, big-model machine learning problems much faster and with smaller clusters of computers than previously required.
For example, using the toolkit one can train a topic model with one million topics and a 20-million-word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million-word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. That workload would previously have required thousands of machines.
…
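The parameter server framework mentioned above is worth a quick illustration. Below is a minimal, self-contained Python sketch of the pattern: workers pull the current model from a shared server, compute updates on their own shard of the data, and push the updates back. To be clear, the class and method names here are mine for illustration only, not the actual DMTK API.

```python
# Minimal sketch of the parameter-server pattern (illustrative only;
# these names are NOT the DMTK API).
import threading
from collections import defaultdict

class ParameterServer:
    """Holds the global model; workers pull parameters and push updates."""
    def __init__(self):
        self._params = defaultdict(float)
        self._lock = threading.Lock()

    def pull(self, keys):
        # Return a snapshot of the requested parameters.
        with self._lock:
            return {k: self._params[k] for k in keys}

    def push(self, grads, lr=0.1):
        # Apply a gradient-style update to the shared model.
        with self._lock:
            for k, g in grads.items():
                self._params[k] -= lr * g

def worker(server, shard):
    # Each worker trains independently on its own shard of the data.
    for example in shard:
        params = server.pull(example.keys())
        # Toy "gradient": pull each weight toward the observed value.
        grads = {k: params[k] - v for k, v in example.items()}
        server.push(grads)

server = ParameterServer()
shards = [[{"w0": 1.0}] * 50, [{"w0": 2.0}] * 50]  # two toy data shards
threads = [threading.Thread(target=worker, args=(server, s)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.pull(["w0"]))  # w0 settles between the two shards' targets
```

The real system adds much more (asynchronous communication, a distributed model store, scheduling), but the pull/compute/push loop is the basic contract that lets the model grow beyond any single machine's memory.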
This has been a banner week for machine learning!
On November 9th, Google open sourced TensorFlow.
On November 12th, Single Artificial Neuron Taught to Recognize Hundreds of Patterns (why neurons have thousands of synapses) was published.
On November 12th, Microsoft open sources its Distributed Machine Learning Toolkit.
Not every week is like that for machine learning, but it is impressive when that many major stories drop in a single week!
I do like the line from the Microsoft announcement:
For example, using the toolkit one can train a topic model with one million topics and a 20-million-word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million-word vocabulary, on a web document collection with 200 billion tokens utilizing *a cluster of just 24 machines*. (emphasis added)
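A bit of back-of-the-envelope arithmetic (mine, not Microsoft's) shows why that sentence is striking: a 1000-dimensional embedding over a 20-million-word vocabulary is roughly 20 billion parameters.

```python
# Rough parameter counts for the models quoted above (my arithmetic,
# not from the announcement; assumes dense float32 storage).
vocab = 20_000_000

# Word embedding: one 1000-dimensional vector per vocabulary word.
embed_params = vocab * 1000            # 2e10 parameters
embed_gb = embed_params * 4 / 1e9      # float32 = 4 bytes each
print(f"embedding: {embed_params:.1e} params, ~{embed_gb:.0f} GB dense")
# embedding: 2.0e+10 params, ~80 GB dense -- a few GB per machine across 24

# Topic model: a vocabulary x topics matrix, tractable only because
# the counts are extremely sparse in practice.
topics = 1_000_000
print(f"topic model: {vocab * topics:.1e} entries if stored densely")
```

Even spread across 24 machines, that is several gigabytes of model per node, before counting the 200 billion training tokens.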
Prices are falling all the time, and a 24-machine cluster should be within the reach of most startups, if not most individuals, now. Next year? Possibly within the reach of a large number of individuals.
What are your machine learning plans for 2016?
More DMTK information.