Docker and Jupyter [Advantages over VMware or VirtualBox?]

How to setup a data science environment in minutes using Docker and Jupyter by Vik Paruchuri.

From the post:

Configuring a data science environment can be a pain. Dealing with inconsistent package versions, having to dive through obscure error messages, and having to wait hours for packages to compile can be frustrating. This makes it hard to get started with data science in the first place, and is a completely arbitrary barrier to entry.

The past few years have seen the rise of technologies that help with this by creating isolated environments. We’ll be exploring one in particular, Docker. Docker makes it fast and easy to create new data science environments, and use tools such as Jupyter notebooks to explore your data.

With Docker, we can download an image file that contains a set of packages and data science tools. We can then boot up a data science environment using this image within seconds, without the need to manually install packages or wait around. This environment is called a Docker container. Containers eliminate configuration problems – when you start a Docker container, it has a known good state, and all the packages work properly.
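
The download-then-boot workflow in that last paragraph comes down to two Docker CLI calls: `docker pull` and `docker run`. Below is a minimal Python sketch that only assembles those command lines and does not execute anything. The image name `jupyter/scipy-notebook`, the port 8888, the container directory `/home/jovyan/work`, and the helper function names are my assumptions for illustration; the passage quoted above does not name an image.

```python
import shlex

# Assumption: this image name stands in for whichever image you choose;
# the quoted passage does not name one.
IMAGE = "jupyter/scipy-notebook"


def docker_pull_command(image=IMAGE):
    """Build the command that downloads the image (a one-time step)."""
    return ["docker", "pull", image]


def docker_run_command(image=IMAGE, host_port=8888, container_port=8888,
                       host_dir=None, container_dir="/home/jovyan/work"):
    """Build the command that starts a disposable container from the image.

    --rm removes the container on exit, -p publishes the notebook port,
    and -v (optional) mounts a host folder so notebooks outlive the container.
    The default container_dir is an assumption tied to the example image.
    """
    cmd = ["docker", "run", "--rm", "-p", f"{host_port}:{container_port}"]
    if host_dir is not None:
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so this works without Docker.
    # "/tmp/notebooks" is a hypothetical host folder.
    for cmd in (docker_pull_command(),
                docker_run_command(host_dir="/tmp/notebooks")):
        print(shlex.join(cmd))
```

Passing either list to `subprocess.run` would actually execute it, assuming Docker is installed and running.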

A nice walk-through on installing Docker and running Jupyter in a container. I do wonder about the advantages claimed over VMware and VirtualBox:


Although virtual machines enable Linux development to take place on Windows, for example, they have some downsides. Virtual machines take a long time to boot up, they require significant system resources, and it’s hard to create a virtual machine from an image, install some packages, and then create another image. Linux containers solve this problem by enabling multiple isolated environments to run on a single machine. Think of containers as a faster, easier way to get started with virtual machines.

I have never noticed long boot times on VirtualBox, and “require significant system resources” is too vague to evaluate.

As far as “it’s hard to create a virtual machine from an image, install some packages, and then create another image,” I thought the point of the post was to facilitate quick access to a data science environment?

In that case, I would download an image of my choosing, import it into VirtualBox and then fire it up. How hard is that?
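
For comparison, the import-and-boot route can also be done from the command line with `VBoxManage import` and `VBoxManage startvm`. Here is a rough Python sketch that only builds those command lines without running them. The appliance file name `solr-appliance.ova`, the VM name `solr-demo`, and the helper function names are hypothetical, chosen purely for illustration.

```python
def vbox_import_command(appliance_path, vm_name):
    """Build the command that imports a downloaded .ova/.ovf appliance.

    --vsys 0 selects the first (usually only) virtual system in the
    appliance, and --vmname sets the name the VM is registered under.
    """
    return ["VBoxManage", "import", appliance_path,
            "--vsys", "0", "--vmname", vm_name]


def vbox_start_command(vm_name, headless=False):
    """Build the command that boots the imported VM."""
    vm_type = "headless" if headless else "gui"
    return ["VBoxManage", "startvm", vm_name, "--type", vm_type]


if __name__ == "__main__":
    # Hypothetical file and VM names; substitute the image you downloaded.
    for cmd in (vbox_import_command("solr-appliance.ova", "solr-demo"),
                vbox_start_command("solr-demo")):
        print(" ".join(cmd))
```

Two commands on either side, which is roughly the point of my question.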

There are pre-configured images with Solr, Solr plus web search engines, and a host of other options.

For more details, visit VirtualBox.org and for a stunning group of “appliances” see VirtualBoxImages.com.

You can use VMs with Docker, so it isn’t strictly an either/or choice.

I first saw this in a tweet by Data Science Renee.


Update: Data Science Renee encountered numerous issues trying to follow this install on Windows 7 Professional 64-bit, using VirtualBox 5.0.10 r104061. You can read more about her travails here: Trouble setting up default, maybe caused by virtualbox. After 2 nights of effort, she succeeded! Great!

The error turned out (apparently) to be in VirtualBox, or at least upgrading to a test version of VirtualBox fixed the problem. I know, I was surprised too. My assumption was that it was Windows. 😉
