Archive for the ‘Virtual Machines’ Category

XQilla-2.3.2 – Tooling up for 2016 (Part 2) (XQuery)

Friday, January 1st, 2016

As I promised yesterday, a solution to the XQilla-2.3.2 installation problem!

The solution: use a virtual machine to install the latest version of Ubuntu (15.10), which has the libraries required to install XQilla!

I use VirtualBox from Oracle but people also use VMware.

Virtual boxes come in all manner of configurations, so you are likely to spend some time loading Linux headers and the like to compile software.

The advantage of a virtual box is that I don’t risk damaging my working setup by doing something dumb or careless out of fatigue. If I have to blow away the entire virtual machine, it takes only a few minutes to download another one.

Well, on any day other than New Year’s Day, as I found out today. I don’t know if people were streaming that many football games or streaming live “acts” of some sort, but the Net was very slow today.

Introducing XQuery to humanists, librarians and reporters using a VM with the usual XQuery suspects pre-loaded would be very cool!

Great way to distribute xqueries* and shell scripts that run them for immediate results.

If you have any thoughts about what such a VM should contain, etc., drop me an email patrick@durusau.net or leave a comment. Thanks!

PS: XQueries returned approximately 26K “hits,” and xquerys returned approximately 1,700 “hits.” Usage favors the plural as “xqueries” so that is what I am following. At the start of a sentence, XQueries?

PPS: I could have written this without the woes of failed downloads, missing header files, etc. but I wanted to know for myself that Ubuntu (15.10) with all the appropriate header files would in fact compile XQilla-2.3.2.

You may need this line to get all the headers:

apt-get install dkms build-essential linux-headers-generic

Not to mention that I would update everything before trying to compile software. Hard to say how long your VM has been on the shelf.
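
A minimal sketch of that order of operations (update first, then pull in the headers). The `run` helper and the `RUN` switch are my own additions for illustration: by default the script only prints the commands, and it executes them (as root) when `RUN=1` is set.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless RUN=1, in which case run it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Refresh package lists, upgrade what is installed, then install the
# build tools and kernel headers named in the post.
prep_build_env() {
    run apt-get update &&
    run apt-get -y upgrade &&
    run apt-get -y install dkms build-essential linux-headers-generic
}

prep_build_env
```

Run it once as-is to see what it would do, then again with `RUN=1` inside the VM.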

Docker and Jupyter [Advantages over VMware or VirtualBox?]

Saturday, November 28th, 2015

How to setup a data science environment in minutes using Docker and Jupyter by Vik Paruchuri.

From the post:

Configuring a data science environment can be a pain. Dealing with inconsistent package versions, having to dive through obscure error messages, and having to wait hours for packages to compile can be frustrating. This makes it hard to get started with data science in the first place, and is a completely arbitrary barrier to entry.

The past few years have seen the rise of technologies that help with this by creating isolated environments. We’ll be exploring one in particular, Docker. Docker makes it fast and easy to create new data science environments, and use tools such as Jupyter notebooks to explore your data.

With Docker, we can download an image file that contains a set of packages and data science tools. We can then boot up a data science environment using this image within seconds, without the need to manually install packages or wait around. This environment is called a Docker container. Containers eliminate configuration problems – when you start a Docker container, it has a known good state, and all the packages work properly.
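
The download-then-boot flow described there might look roughly like the following. This is my sketch, not a command sequence from the post: the image name is a placeholder, the `run` helper and `RUN` switch are hypothetical, and I am assuming the image serves Jupyter on its default port, 8888.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless RUN=1, in which case run it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Placeholder image name (an assumption); substitute the image you trust.
IMAGE="${IMAGE:-example/datascience-notebook}"

# Download the image, then start a detached container with the
# notebook port published on the host.
start_notebook() {
    run docker pull "$IMAGE" &&
    run docker run -d -p 8888:8888 "$IMAGE"
}

start_notebook
```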

A nice walk-through on installing a Docker container and Jupyter. I do wonder about the advantages claimed over VMware and VirtualBox:


Although virtual machines enable Linux development to take place on Windows, for example, they have some downsides. Virtual machines take a long time to boot up, they require significant system resources, and it’s hard to create a virtual machine from an image, install some packages, and then create another image. Linux containers solve this problem by enabling multiple isolated environments to run on a single machine. Think of containers as a faster, easier way to get started with virtual machines.

I have never noticed long boot times on VirtualBox and “require significant system resources” is too vague to evaluate.

As for “it’s hard to create a virtual machine from an image, install some packages, and then create another image,” I thought the point of the post was to facilitate quick access to a data science environment?

In that case, I would download an image of my choosing, import it into VirtualBox and then fire it up. How hard is that?
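
From the command line, the import-and-start step can be sketched with `VBoxManage`. The appliance file name and VM name below are made up for illustration, and the `run` helper with its `RUN` switch is hypothetical: it prints the commands by default and executes them when `RUN=1` is set.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless RUN=1, in which case run it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Import a downloaded appliance (.ova/.ovf) under a chosen name,
# then boot it without opening a GUI window.
import_and_start() {
    ova="$1"
    vm="$2"
    run VBoxManage import "$ova" --vsys 0 --vmname "$vm" &&
    run VBoxManage startvm "$vm" --type headless
}

# Illustrative names only; use the appliance you actually downloaded.
import_and_start downloaded-appliance.ova datasci-vm
```

Drop `--type headless` if you want the usual VirtualBox window.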

There are pre-configured images with Solr, Solr plus web search engines, and a host of other options.

For more details, visit VirtualBox.org and for a stunning group of “appliances” see VirtualBoxImages.com.

You can use VMs with Docker so it isn’t strictly an either/or choice.

I first saw this in a tweet by Data Science Renee.


Update: Data Science Renee encountered numerous issues trying to follow this install on Windows 7 Professional 64-bit, using VirtualBox 5.0.10 r104061. You can read more about her travails here: Trouble setting up default, maybe caused by virtualbox. After 2 nights of effort, she succeeded! Great!

The error (apparently) turned out to be in VirtualBox, or at least upgrading to a test version of VirtualBox fixed the problem. I know, I was surprised too. My assumption was that it was Windows. 😉

Distributed Environments and VirtualBox

Thursday, May 15th, 2014

While writing about Distributed LIBLINEAR, I discovered two guides to creating distributed environments with VirtualBox.

I mention that fact in the other post but thought the use of VirtualBox to create distributed environments needed more visibility than a mention.

The guides are:

MPI LIBLINEAR – VirtualBox Guide

Spark LIBLINEAR – VirtualBox Guide

and you will need to refer to the original site: Distributed LIBLINEAR: Libraries for Large-scale Linear Classification on Distributed Environments for information on using those environments with “Distributed LIBLINEAR.”

VirtualBox brings both research on distributed systems and practical use of them within the reach of anyone with reasonable computing resources.

Please drop me a note if you are using VirtualBox to create distributed systems for topic map processing.

Distributed LIBLINEAR:

Thursday, May 15th, 2014

Distributed LIBLINEAR: Libraries for Large-scale Linear Classification on Distributed Environments

From the webpage:

MPI LIBLINEAR is an extension of LIBLINEAR on distributed environments. The usage and the data format are the same as LIBLINEAR. Currently only two solvers are supported:

  • L2-regularized logistic regression (LR)
  • L2-regularized L2-loss linear SVM

NOTICE: This extension can only run on Unix-like systems. (We test it on Ubuntu 13.10.) Python and Matlab interfaces are not supported.

Spark LIBLINEAR is a Spark implementation based on LIBLINEAR and integrated with Hadoop distributed file system. This package is developed using Scala. Currently it supports the same two solvers as MPI LIBLINEAR.

If you are unfamiliar with LIBLINEAR:

LIBLINEAR is a linear classifier for data with millions of instances and features. It supports

  • L2-regularized classifiers
    L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
  • L1-regularized classifiers (after version 1.4)
    L2-loss linear SVM and logistic regression (LR)
  • L2-regularized support vector regression (after version 1.9)
    L2-loss linear SVR and L1-loss linear SVR.

Main features of LIBLINEAR include

  • Same data format as LIBSVM, our general-purpose SVM solver, and also similar usage
  • Multi-class classification: 1) one-vs-the rest, 2) Crammer & Singer
  • Cross validation for model selection
  • Probability estimates (logistic regression only)
  • Weights for unbalanced data
  • MATLAB/Octave, Java, Python, Ruby interfaces
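
To make the data format and solver choices concrete, here is a small sketch for plain (single-machine) LIBLINEAR. It assumes the `train` binary from LIBLINEAR is built and on your PATH. The four-line data set is invented for illustration, and the `run` helper with its `RUN` switch is hypothetical: it prints the commands by default and executes them when `RUN=1` is set.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless RUN=1, in which case run it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Write a tiny, made-up training set in LIBSVM format:
#   <label> <index>:<value> ...   (1-based, increasing feature indices)
write_sample() {
    cat > "$1" <<'EOF'
+1 1:0.5 3:1.2
-1 2:0.8 3:0.1
+1 1:0.9 2:0.3
-1 3:0.7
EOF
}

sample="${TMPDIR:-/tmp}/liblinear_sample.txt"
write_sample "$sample"

# -s 0: L2-regularized logistic regression, with 2-fold cross validation (-v 2)
run train -s 0 -v 2 "$sample"
# -s 2: L2-regularized L2-loss linear SVM (primal); writes a model file
run train -s 2 "$sample" "$sample.model"
```

Solver types 0 and 2 correspond to the two solvers the distributed versions list as supported.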

You will also find instructions for creating distributed environments using VirtualBox for both MPI LIBLINEAR and Spark LIBLINEAR. I am going to post on that separately to draw attention to it.

The phrase “standalone computer” is rapidly becoming a misnomer. Forward-looking algorithm designers and power users will begin gaining experience with the new distributed “normal” at every opportunity.

I first saw this in a tweet by Reynold Xin.

One Hour Hadoop Cluster

Tuesday, April 9th, 2013

How to setup a Hadoop cluster in one hour using Ambari?

A guide to setting up a 3-node Hadoop cluster using Oracle’s VirtualBox and Apache Ambari.

HPC may not be the key to semantics but it can still be useful. 😉

Getting Started with VM Depot

Friday, January 11th, 2013

Getting Started with VM Depot by Doug Mahugh.

From the post:

Do you need to deploy a popular OSS package on a Windows Azure virtual machine, but don’t know where to start? Or do you have a favorite OSS configuration that you’d like to make available for others to deploy easily? If so, the new VM Depot community portal from Microsoft Open Technologies is just what you need. VM Depot is a community-driven catalog of preconfigured operating systems, applications, and development stacks that can easily be deployed on Windows Azure.

You can learn more about VM Depot in the announcement from Gianugo Rabellino over on Port 25 today. In this post, we’re going to cover the basics of how to use VM Depot, so that you can get started right away.

Doug outlines simple steps to get you rolling with the VM Depot.

Sounds a lot easier than trying to walk casual computer users through installation and configuration of software. I assume you could even load data onto the VMs.

Users just need to fire up the VM and they have the interface and data they want.

Sounds like a nice way to distribute topic map based information systems.

Virtual Machines

Wednesday, November 2nd, 2011

Virtual Machines

From the post:

One of the best resources about virtual machines (both high-level language VMs and system VMs) is Jim Smith’s and Ravi Nair’s book Virtual Machines: Versatile Platforms for Systems and Processes.

What functions would you optimize if you were writing a virtual machine?