Clusters and DBScan by Jesse Johnson.
From the post:
A few weeks ago, I mentioned the idea of a clustering algorithm, but here’s a recap of the idea: Often, a single data set will be made up of different groups of data points, each of which corresponds to a different type of point or a different phenomenon that generated the points. For example, in the classic iris data set, the coordinates of each data point are measurements taken from an iris flower. There are 150 data points, with 50 from each of three species. As one might expect, these data points form three (mostly) distinct groups, called clusters. For a general data set, if we know how many clusters there are and that each cluster is a simple shape like a Gaussian blob, we could determine the structure of the data set using something like K-means or a mixture model. However, in many cases the clusters that make up a data set do not have a simple structure, or we may not know how many there are. In these situations, we need a more flexible algorithm. (Note that K-means is often thought of as a clustering algorithm, but note I’m going to, since it assumes a particular structure for each cluster.)
Jesse has started a series of post on clustering that you will find quite useful.
Particularly if you share my view that clustering is the semantic equivalent of “merging” in TMDM terms without the management of item identifiers.
In the final comment in parentheses, “Note that K-means…” is awkwardly worded. From later in the post you learn that Jesse doesn’t consider K-means to be a clustering algorithm at all.