Archive for the ‘K-Means Clustering’ Category

Clusters and DBScan

Tuesday, September 10th, 2013

Clusters and DBScan by Jesse Johnson.

From the post:

A few weeks ago, I mentioned the idea of a clustering algorithm, but here’s a recap of the idea: Often, a single data set will be made up of different groups of data points, each of which corresponds to a different type of point or a different phenomenon that generated the points. For example, in the classic iris data set, the coordinates of each data point are measurements taken from an iris flower. There are 150 data points, with 50 from each of three species. As one might expect, these data points form three (mostly) distinct groups, called clusters. For a general data set, if we know how many clusters there are and that each cluster is a simple shape like a Gaussian blob, we could determine the structure of the data set using something like K-means or a mixture model. However, in many cases the clusters that make up a data set do not have a simple structure, or we may not know how many there are. In these situations, we need a more flexible algorithm. (Note that K-means is often thought of as a clustering algorithm, but note I’m going to, since it assumes a particular structure for each cluster.)

Jesse has started a series of posts on clustering that you will find quite useful.

Particularly if you share my view that clustering is the semantic equivalent of “merging” in TMDM terms without the management of item identifiers.

In the final comment in parentheses, “Note that K-means…” is awkwardly worded. From later in the post you learn that Jesse doesn’t consider K-means to be a clustering algorithm at all.

Wikipedia on DBScan, which reports that scikit-learn includes a Python implementation of DBScan.
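If you want a feel for how density-based clustering differs from K-means, here is a minimal pure-Python sketch of the DBScan idea. It is not Jesse's code and not the scikit-learn implementation; the function names (`region_query`, `dbscan`), the parameter names (`eps`, `min_pts`) and the toy points are my own choices for illustration. Note that you never tell it how many clusters to find: clusters grow outward from "core" points that have at least `min_pts` neighbors within distance `eps`, and anything unreachable is reported as noise.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep growing
                queue.extend(j_neighbors)
    return labels


# Toy data (made up for this example): two tight groups and one stray point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

On this toy input the two groups come out as clusters 0 and 1 and the stray point is labelled -1. For real work, scikit-learn's `sklearn.cluster.DBSCAN` exposes the same two knobs as `eps` and `min_samples`.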

Kaggle Digit Recognizer: A K-means attempt

Wednesday, October 24th, 2012

Kaggle Digit Recognizer: A K-means attempt by Mark Needham.

From the post:

Over the past couple of months Jen and I have been playing around with the Kaggle Digit Recognizer problem – a ‘competition’ created to introduce people to Machine Learning.

The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is.

You are given an input file which contains multiple rows each containing 784 pixel values representing a 28×28 pixel image as well as a label indicating which number that image actually represents.

One of the algorithms that we tried out for this problem was a variation on the k-means clustering one whereby we took the values at each pixel location for each of the labels and came up with an average value for each pixel.
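To make the "average value for each pixel" idea concrete, here is a small sketch of how I read it: build one average image per label, then assign a new image to the label whose average is closest. This is my own illustration, not the code from the post. The function names (`train_centroids`, `classify`), the use of Euclidean distance, and the 4-pixel toy "images" (standing in for the 784 pixel values per row) are all assumptions. Because the labels are used to build the averages, this is closer to a nearest-centroid classifier than to K-means proper, which would have to discover the groups on its own.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def train_centroids(rows):
    """rows: iterable of (label, pixels). Returns {label: average pixel vector}."""
    sums = {}
    counts = defaultdict(int)
    for label, pixels in rows:
        if label not in sums:
            sums[label] = [0.0] * len(pixels)
        for k, value in enumerate(pixels):
            sums[label][k] += value  # running total for pixel position k
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(centroids, pixels):
    """Pick the label whose average image is nearest to the given pixels."""
    return min(centroids, key=lambda label: dist(centroids[label], pixels))


# Toy training rows (made up): label followed by 4 pixel values.
training = [
    (0, [0, 0, 255, 255]),
    (0, [0, 10, 245, 255]),
    (1, [255, 255, 0, 0]),
    (1, [245, 255, 10, 0]),
]
centroids = train_centroids(training)
print(centroids)
print(classify(centroids, [5, 0, 240, 250]))
print(classify(centroids, [250, 250, 0, 10]))
```

With real data you would feed in the rows of the training file and expect ten averages, one per digit.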

The results of machine learning are likely to be direct or indirect input into your topic maps.

Useful evaluation of that input will depend on your understanding of machine learning.