Structuredness coefficient to find patterns and associations by Livan Alonso.
From the post:
The structuredness coefficient, let’s denote it as w, is not yet fully defined – we are working on this right now. You are welcome to help us come up with a great, robust, simple, easy-to-compute, easy-to-understand, easy-to-interpret metric. In a nutshell, we are working under the following framework:
- We have a data set with n points. For simplicity, let’s consider for now that these n points are n vectors (x, y) where x, y are real numbers.
- For each pair of points {(x,y), (x’,y’)} we compute a distance d between the two points. In a more general setting, it could be a proximity metric between two keywords.
- We order all the distances d and compute the distance distribution, based on these n points.
- Leaving-one-out: we remove one point at a time and compute the n new distance distributions, each based on n-1 points
- We compare the distribution computed on n points with the n distributions computed on n-1 points.
- We repeat this iteration, but this time with n-2, then n-3, n-4 points etc.
- You would assume that if there is no pattern, these distance distributions (for successive values of n) would have some kind of behavior uniquely characterizing the absence of structure, behavior that can be identified via simulations. Any deviation from this behavior would indicate the presence of a structure. And the pattern-free behavior would be independent of the underlying point distribution or domain – a very important point. All of this would have to be established or tested, of course.
- It would be interesting to test whether this metric can identify patterns such as fractal distribution / fractal dimension. Would it be able to detect patterns in time series?
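To make the first few steps of this framework concrete, here is a minimal Python sketch. It is not the authors' method, since the post itself says the coefficient is not yet defined. The function names (`pairwise_distances`, `ks_gap`, `leave_one_out_gaps`) are hypothetical. The choice of a Kolmogorov-Smirnov style gap (the largest difference between two empirical distribution functions) as the way to "compare" distributions is purely an assumption made for illustration. Only the first leave-one-out level (n-1 points) is covered.

```python
import bisect
import itertools
import math
import random


def pairwise_distances(points, dist=math.dist):
    """Sorted list of distances over all unordered pairs of points."""
    return sorted(dist(p, q) for p, q in itertools.combinations(points, 2))


def ks_gap(a, b):
    """Largest absolute difference between the empirical CDFs of two sorted samples.

    Assumption: the post does not say how distributions are compared;
    this Kolmogorov-Smirnov style gap is just one possible choice.
    """
    gap = 0.0
    for t in itertools.chain(a, b):
        fa = bisect.bisect_right(a, t) / len(a)
        fb = bisect.bisect_right(b, t) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap


def leave_one_out_gaps(points, dist=math.dist):
    """Compare the full distance distribution with each of the n leave-one-out ones."""
    full = pairwise_distances(points, dist)
    return [
        ks_gap(full, pairwise_distances(points[:i] + points[i + 1:], dist))
        for i in range(len(points))
    ]


# Illustrative data only: uniform random points versus a tight cluster
# with a single distant outlier.
rng = random.Random(0)
uniform = [(rng.random(), rng.random()) for _ in range(40)]
clustered = [(rng.gauss(0, 0.01), rng.gauss(0, 0.01)) for _ in range(39)]
clustered.append((5.0, 5.0))

print("uniform   max gap:", max(leave_one_out_gaps(uniform)))
print("clustered max gap:", max(leave_one_out_gaps(clustered)))
```

Because the distance function is a parameter, a proximity metric between keywords could be passed in place of `math.dist`. Whether the resulting gaps behave in a way that characterizes the absence of structure is exactly the open question the post raises; this sketch only shows the mechanics.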
Note that this type of structuredness coefficient makes no assumption on the shape of the underlying domains, where the n points are located. These domains could be smooth, bumpy, made up of lines, made up of dual points etc. They might not even be numeric domains at all (e.g. if the data consists of keywords).
Deeply interesting work, and I appreciate the acknowledgement that the “structuredness coefficient” isn’t fully defined.
I will be trying to develop more links to resources on this topic. Please chime in if you have some already.