I think there is hidden gold in the linked research paper. In the second to last paragraph, it says if you are willing to discard the trivial partition (each point on its own) from Richness, then you *can* have Scale-Invariance and Refinement-Consistency as well. To me this suggests an optimal (in some sense) class of algorithms, perhaps to be chosen from based on computational complexity, cross-validation, etc.
A cute and thought-provoking theorem for sure, but arguably none of the three given criteria for clustering are well motivated, so the result is much less relevant than usually claimed.
- scale-invariance: stretching data along some dimensions should not change clustering.
This is clearly not true: . . . (three well-spaced spots) may reasonably be seen as three clusters, whereas ||| (three nearby elongated bars) may not.
- richness: all groupings must be reachable.
Also not quite true: the two cases (1) all clusters are singleton points and (2) a single cluster contains all points mean the same thing: no useful cluster structure was found. So it is enough if one of these groupings is reachable, not both.
- consistency: increasing inter-cluster differences and decreasing intra-cluster differences should not change clustering.
Also not quite true: suppose we have 9 clusters:
. . .
. . .
. . .
Now move the points so that the columns get further apart; at some point we will get | | |, where 3 clusters are more reasonable.
Actually, scale-invariance only refers to scaling all dimensions by the same scalar (this is specified more clearly in the paper linked by the article, page 3). For arbitrary scaling of each coordinate you're correct, of course: it's impossible to have a clustering algorithm that is invariant under such transformations (e.g., the 6-point group ::: may look like either 2 or 3 clusters, depending on whether it's stretched horizontally or vertically).
As for your last two points, I believe I agree! It seems that in the counterexample you give for consistency, some notion of scale-invariance is implicitly assumed -- perhaps this connection plays some role in the theorem's proof (which I haven't read).
This reminds me a bit of Arrow's impossibility theorem for voting, which similarly has questionable premises.
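To make the ::: stretching example above concrete, here's a quick toy sketch (my own code, not from the paper; single-linkage and the 0.3 relative cut are arbitrary choices). Stretching along one axis changes the natural grouping, while rescaling everything by the same constant doesn't:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    # the ::: configuration: three columns of two points each
    points = np.array([(x, y) for x in (0.0, 1.0, 2.0) for y in (0.0, 1.0)])

    def n_clusters(pts, rel_cut=0.3):
        """Single-linkage, cutting the tree at rel_cut * (largest pairwise distance)."""
        d = pdist(pts)
        labels = fcluster(linkage(d, method="single"),
                          t=rel_cut * d.max(), criterion="distance")
        return len(set(labels))

    print(n_clusters(points * [10, 1]))  # stretched horizontally -> 3 (one cluster per column)
    print(n_clusters(points * [1, 10]))  # stretched vertically   -> 2 (one cluster per row)
    print(n_clusters(points), n_clusters(points * 7))  # uniform rescaling: same answer both times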
But in almost all cases that doesn't make any sense? Typically the data in different dimensions will have different "units". So there isn't any meaning in the scale in the first place. How could scaling by a single scalar be "more natural"?
If different components of the dataset have different units, I would argue that it is a prerequisite of clustering to first specify the relative importance of each particular unit (thereby putting all units on the same scale). Otherwise, there's no way the clustering algorithm could possibly know what to do in certain cases (such as the ::: example).
It's true that there is no intrinsic meaning to the scale, but you must specify at least a relative scale -- how you want to compare (or weigh) different units -- before you can meaningfully cluster the data. Clustering can only work on dimensionless data.
This is the point I think - there's no inherent meaning to the scaling factor(s) as far as overall structure is concerned (they're dimensionless, so the units thing isn't a problem), so the outcome of a clustering algorithm should not depend on it.
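For what it's worth, here's a small sketch of that prerequisite in practice (my own toy example; the kilograms-vs-grams framing, k-means, and standardization are just illustrative choices, not anything from the article). The same points clustered in different units come out differently, but agree again once each column is put on a fixed relative scale:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score
    from sklearn.preprocessing import StandardScaler

    # six points: a length-like feature (0, 1, 2) and a mass-like feature (0, 1)
    X_kg = np.array([(x, y) for x in (0.0, 1.0, 2.0) for y in (0.0, 1.0)])
    X_g = X_kg * [1, 1000]  # identical data, mass column now recorded in grams

    def labels(data):
        return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

    def standardize(data):
        return StandardScaler().fit_transform(data)

    # raw units: which axis dominates the 2-way split depends on the unit choice
    print(adjusted_rand_score(labels(X_kg), labels(X_g)))  # typically < 1
    # after standardizing each column, both versions give the same clustering
    print(adjusted_rand_score(labels(standardize(X_kg)), labels(standardize(X_g))))  # 1.0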
What if you first did PCA on your data?
You tell me? I'm not sure I understand the question.
Basically scale and rotate your data such that the variation in all dimensions becomes equal, so to speak.
Ah I see. As I understand it, a general linear map like that isn't what the linked paper means by "scale-invariance", so it wouldn't be considered a violation for a dataset and its PCA to be given different clusters by your clustering algorithm. It's only the dataset and its scaled-up or scaled-down counterparts (i.e. the metric is multiplied by a fixed non-zero constant) that are required to get the same clusters for scale-invariance to hold.
In fact the paper doesn't assume that your dataset is contained in a vector space at all. All you have to give a clustering algorithm (as they define it) is a set and a metric function on it.
(the paper if you don't have a link: https://www.cs.cornell.edu/home/kleinber/nips15.pdf)
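A tiny sketch of that narrower notion (my own code; the three blobs and single-linkage stopped at k clusters are arbitrary choices): the algorithm only ever sees a distance matrix, and multiplying that whole matrix by a constant leaves the merge order, and hence the clusters, unchanged -- whereas a per-coordinate stretch is a genuinely different metric:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    # three small blobs, but all the algorithm below ever sees is the metric
    pts = np.vstack([rng.normal(c, 0.1, size=(5, 2)) for c in ((0, 0), (3, 0), (0, 3))])
    d = pdist(pts)

    def k_clusters(dist_condensed, k=3):
        # single-linkage, stopped when k clusters remain -- a function of the metric alone
        return fcluster(linkage(dist_condensed, method="single"), t=k, criterion="maxclust")

    # scale-invariance in the paper's sense: multiply the whole metric by a constant
    assert (k_clusters(d) == k_clusters(42.0 * d)).all()
    # a per-coordinate stretch changes the metric non-uniformly; the result need not match
    print(k_clusters(pdist(pts * [1, 50])))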
Scale invariance is well motivated in the sense that perhaps you already normalized your data (usually to fit [-1, 1] or something) and you would be bummed to discover it fucked up your clusters
Or likewise: if you have physical data recorded in some units (say, meters), it would suck if the clusters changed if you had measured stuff in another unit instead
[dead]
It's an interesting idea, but I don't see it as a matter of tradeoffs, since "richness" on its own already sounds undecidable. I mean, dividing a set into "things having low Kolmogorov complexity" and "things having high Kolmogorov complexity" is definitely undecidable, so "any grouping that might make sense for your data" seems impossible even without any other requirements.
The "richness" definition also seemed hand wavey to me so I looked at the referenced paper. The actual definition of "richness" of an algorithm is that for any arbitrary partition P of your original data (singletons, one cluster, etc), there is a distance function on the data, which when used in the clustering algorithm, produces P.
I’ve never found a use case for clustering where if the goal is accurate prediction, some other technique doesn’t work better.
Now if the goal is a quick prototype or to get an intuitive sense of the structure of the data, then sure, it’s fine.
But of course you’re always sacrificing something desirable when you try to shoehorn data into a model that doesn’t fit.
I mean, clustering is for actually finding the structure of the data ... It is a collection of structure-discovering methods. If one doesn't know the probable structure, of course one starts with little knowledge.
Interesting that the author neglects to mention the Gap statistic[1] as a stopping criterion. It's a bit of a mouthful, but it does provide a metric for determining how many clusters the data indicates are present, as compared to a random distribution in the same dimensions.
[1] https://academic.oup.com/jrsssb/article-abstract/63/2/411/70...
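Roughly, the idea is to compare the within-cluster dispersion of the real data against the same quantity on uniform reference data drawn from the data's bounding box, and to prefer the k where the gap between the two is largest. A simplified sketch of that idea (my own code, using k-means inertia as the dispersion and skipping the standard-error rule from the paper):

    import numpy as np
    from sklearn.cluster import KMeans

    def log_wk(X, k, seed=0):
        """log of the within-cluster sum of squares under k-means with k clusters."""
        return np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)

    def gap(X, k, n_refs=10, seed=0):
        """Gap(k): mean reference log-dispersion minus the data's log-dispersion."""
        rng = np.random.default_rng(seed)
        lo, hi = X.min(axis=0), X.max(axis=0)
        refs = [log_wk(rng.uniform(lo, hi, size=X.shape), k) for _ in range(n_refs)]
        return np.mean(refs) - log_wk(X, k)

    # three well-separated blobs; the gap is typically largest near k = 3
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ((0, 0), (4, 0), (0, 4))])
    print([round(gap(X, k), 2) for k in range(1, 7)])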
Let S = {1, 2}. Every distance function is determined by d(1,2) = a, a >= 0. Define f(d) = {{1,2}} if a = 0 and f(d) = {{1},{2}} otherwise. Isn't this a clustering algorithm that is scale invariant, rich, and consistent?
Looks that way to me, yeah, though this is obviously a super simple case. It's clearly scale invariant, and there are only two partitions, which your algorithm hits, so it's rich. Consistency is trivially satisfied in both cases too.
I think I found the issue: the paper says the distance function is 0 iff the elements are equal, so for this example you cannot define d(1,2) to be 0. So it is not rich, since that would be the only way to get the partition {{1,2}}.
Oh, I see, it's not a true metric. That's fair enough, though I wonder if the result depends critically on that assumption.
(you can pass to equivalence classes to recover a true metric, and I didn't see anything obviously incompatible with that in the paper, but I admit I didn't look very deeply)
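A tiny sanity check of that point (my own sketch; it just samples a few positive distances rather than proving anything): once d(1,2) > 0 is required, the proposed f can only ever return the all-singletons partition, so {{1,2}} is unreachable and richness fails.

    # the proposed clustering function on S = {1, 2}, parameterised by a = d(1, 2)
    def f(a: float):
        return [{1, 2}] if a == 0 else [{1}, {2}]

    # sample some admissible distances (the paper requires a > 0 for distinct points)
    outputs = {frozenset(frozenset(block) for block in f(a)) for a in (0.1, 1.0, 10.0, 1e9)}
    print(outputs)  # only {{1}, {2}} ever shows up; {{1, 2}} is never produced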
Sounds like Richness is an index in SQL terms. Would you agree?