Providing intrinsic clustering quality indices #257

jaksle · 2023-07-11T15:21:46Z

Continuation of issue #253. It introduces 4 indices: Calinsky-Harabasz, Davies-Bouldin, Xie-Beni and Dunn. CH, DB, XB require that the space is linear, e.g. we can talk about average positions, k-means and fuzzy c-means work under the same assumptions. Only CH and XB are defined for fuzzy clustering. D is the most computationally complex, but is the most general, only distances between data points are required, as such works also for k-medoids.

Issues:

No documentation and testing. Please first check if my API fits Clustering.jl and I will add it. It may be also reasonable to introduce function qualityindex(method,args), maybe not even export the rest of the functions? I am not sure.
File mytest.jl is temporary testing unit for the current API, will be removed later.
These methods do not work for arrays which are not 1-idexed. However, the rest of Clustering.jl also does not. Should we add Base.require_one_based_indexing? Theoretically it can be made indexing independent, but it seems irritating to implement.

codecov-commenter · 2023-07-12T11:09:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1cade0d) 95.01% compared to head (6b446db) 95.40%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #257      +/-   ##
==========================================
+ Coverage   95.01%   95.40%   +0.39%     
==========================================
  Files          19       20       +1     
  Lines        1385     1503     +118     
==========================================
+ Hits         1316     1434     +118     
  Misses         69       69

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

alyst

Thanks for your contribution!
These metrics (I find the term index a bit confusing in this context) are indeed very useful, but the code needs some work.
I've indicated in the code suggestions that we need to export the single function clustering_quality().
I've just reviewed the first calinski_harabasz() method, but for the other ones please update the formatting, variable names and linear algebra code in the similar way.
Also please add the proper unit tests (you have them already, but they are in the wrong source file)

.vscode/settings.json

mytest.jl

src/Clustering.jl

src/qualityindices.jl

alyst · 2023-07-14T03:11:38Z

These methods do not work for arrays which are not 1-idexed. However, the rest of Clustering.jl also does not. Should we add Base.require_one_based_indexing? Theoretically it can be made indexing independent, but it seems irritating to implement.

Let's try to make the new code in Clustering.jl index base-agnostic. It should not be so "irritating": using eachindex()/eachslice() should address most of the issues (and potentially even simplify & speed-up the code).

jaksle · 2023-07-14T12:18:43Z

Thank you about the thoughtful comments, it is a chance to learn something new along the way.

Few thoughts about broad topics:

The name. I am not a big fan of "quality index" either, but it seems to be standard. It is used on Wikipedia and in the review articles I found. I don't have a strong opinion here.
1-based indexing. I was thinking about it a little before and the "irritating" (or maybe better "confusing") part was the assignments vector. Generally its elements can be any type I think? It is not clear to me currently how to use it in a universal way, but I will think about it.

alyst · 2023-07-14T16:47:25Z

The name. I am not a big fan of "quality index" either, but it seems to be standard. It is used on Wikipedia and in the review articles I found. I don't have a strong opinion here.

If there's a consensus to use index in the literature, let's call it index (quality_index kwarg in clustering_quality() function).
One advantage is that it disambiguates clustering quality index from distance metric.

I was thinking about it a little before and the "irritating" (or maybe better "confusing") part was the assignments vector. Generally its elements can be any type I think?

The elements should be of Integer type. The Clustreing.jl methods allow taking assignments::AbstractVector{<:Integer} that could be constructed by the user.
But you are right that most of the methods assume the elements have to be cluster indices from 1 to K without gaps (the ClusteringResult.assignments generated by the Clustering.jl clustering methods).

jaksle · 2023-07-18T16:10:31Z

Attempt at making clustering_quality indexing agnostic ended in a failure. The reason is Distances require 1-based indexing.

JoFeG · 2023-10-05T12:50:52Z

Hello everyone,

I'm currently working on intrinsic cluster validity indices and wanted to propose a new feature for the package, but I found this thread open that is more or less what I was thinking.

The development seems to have paused a couple of months ago; I would like to contribute.

Before diving in, I would like to know if help is indeed needed here?

jaksle · 2023-10-05T15:19:10Z

Hi! The current state is that the commit is fully functional and I am waiting for more detailed re-review by @alyst . You can check the files, few most popular quality measures are implemented there.

alyst · 2023-10-06T17:45:28Z

@jaksle oh, sorry, I have missed your updates. I'll try to review them in the next few days. I have recently merged #255, which, I think, put your PR into a conflicting state. Could you please resolve the conflicts?

jaksle · 2023-10-07T22:11:42Z

Conflict resolved. The code is mostly index-agnostic; I learned that Distances.jl will not work in the middle of doing it. Other than that, the documentation may be to verbose, but it should be easy to optionally trim down.

alyst

Thank you, there was a big progress since the last version!
I've left some comments/suggestions to the docs and the code.
I have not made these suggestions everywhere, but please use space after comma in the code and in the docs.

At the moment clustering_quality.jl tests are not enabled in the CI, so it is not clear whether the code runs and gives the correct results, and what is the coverage. Please include this file in the runtests.jl. By looking at the existing tests I would say that ideally we would also need the tests with non-0,1 weights and with varying fuzzyness.

If the CI results look ok, I think we are close to merging it after the final review.

docs/source/validate.md

src/clustering_quality.jl

test/clustering_quality.jl

src/clustering_quality.jl

alyst · 2023-10-09T21:16:17Z

test/clustering_quality.jl

+        @test_throws ArgumentError clustering_quality(zeros(2,2),zeros(2,2), [1, 2], quality_index = :calinski_harabasz)
+        @test_throws DimensionMismatch clustering_quality([1,2,3], zeros(2,2), quality_index = :dunn)
+    end
+


We should also test corner cases to make sure they are handled correctly (ideally, the correct value according to the quality index definition should be returned):

degenerated input data: 0 points (and 0 clusters)

trivial clustering: single cluster

0-dimensional data

Quality indices are generally undefined for single cluster. The reason is that the cluster spread is undefined.

Ok, silhouettes() actually throws an error for the degenerated clustering, so the other indices should follow that behaviour.

The other corner case that silhouettes() checks are empty clusters (gaps in assignments values).
See test/silhouette.jl.

If the cluster is empty than it's center is undefined, which means that the user gave nonsense arguments (or inconsistent number of assignments/centers, this is checked). Only silhouettes and Dunn index do not require them.

ideally we need to test that the behavior in such cases meets our expectations, because this will happen (bugs in the user code). I'd say: let's add the tests, see how clustering_quality handles it and then decide if we need to adjust the behavior.

.gitignore

Project.toml

test/clustering_quality.jl

src/clustering_quality.jl

test/clustering_quality.jl

src/clustering_quality.jl

alyst · 2023-10-29T22:08:07Z

src/clustering_quality.jl

+_inner_inertia(metric::SemiMetric, data::AbstractMatrix, centers::AbstractMatrix,
+               assignments::AbstractVector{<:Integer}) =


Suggested change

_inner_inertia(metric::SemiMetric, data::AbstractMatrix, centers::AbstractMatrix,

assignments::AbstractVector{<:Integer}) =

_inner_inertia(metric::SemiMetric, data::AbstractMatrix, centers::AbstractMatrix,

assignments::AbstractVector{<:Integer}, fuzziness::Nothing) =

You can calculate inner_inertia without using _gather_samples as you suggest, but colwise can be more efficient than calling chosen distance within the loop. This makes possible gains unclear and distance-dependent.

pairwise() is for sure more efficient when we are calculating pairwise distances (for certain metrics) between N and M points.
But in this case it's a distance between N points and 1 point (in _gather_samples() implementation).
Is it also getting significant performance boost?

For what I understand, possible gains are entirely metric-dependent. For the most popular one, in

using BenchmarkTools X = randn(1000,1000) C = randn(1000) @benchmark sum(colwise(Euclidean(),X,C)) @benchmark sum(Euclidean()(x,C) for x in eachcol(X))

on my machine I get colwise being faster by 60% and taking 10x less memory.

Thanks for checking. Could you please add your benchmark results here?
I guess the performance gain is due to the use of BLAS matrix multiplication for Euclidean/SqEuclidean (so for the other metrics the gain should be lower).
The memory inefficiency is a bit unfortunate, I'm not sure it's because of eachcol(X) vs view(X, :, i), but due to internal arrays allocation in the generic metric() code.
Anyway, thanks again for checking -- you can revert it to the old version (otherwise I will do it), just keep the signature of inner_inertia() the same and call gather_samples() inside it.

Sure, colwise is

BenchmarkTools.Trial: 10000 samples with 1 evaluation. Range (min … max): 132.700 μs … 4.223 ms ┊ GC (min … max): 0.00% … 96.37% Time (median): 137.900 μs ┊ GC (median): 0.00% Time (mean ± σ): 144.589 μs ± 47.773 μs ┊ GC (mean ± σ): 0.28% ± 0.96% ▃▇█▆▅▅▃▃▂▂▁▁▁ ▁ ▁ ▂ ██████████████████████████▇▇▆▆▆▇▆▆▆▆▆▅▃▅▅▄▁▄▄▁▄▅▃▁▄▄▃▁▃▄▄▁▅▅ █ 133 μs Histogram: log(frequency) by time 256 μs < Memory estimate: 7.95 KiB, allocs estimate: 2.

Euclidean()(x,C) is

BenchmarkTools.Trial: 10000 samples with 1 evaluation. Range (min … max): 197.400 μs … 3.395 ms ┊ GC (min … max): 0.00% … 87.95% Time (median): 203.200 μs ┊ GC (median): 0.00% Time (mean ± σ): 211.891 μs ± 122.278 μs ┊ GC (mean ± σ): 2.17% ± 3.57% ▃█▇▇▁ ▁▂▅█████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂ 197 μs Histogram: frequency by time 254 μs < Memory estimate: 78.17 KiB, allocs estimate: 3001.

I've checked view(X,:,i) quickly, it is even worse.

I've checked view(X,:,i) quickly, it is even worse.

Worse in terms of memory or performance?

On the same machine

BenchmarkTools.Trial: 10000 samples with 1 evaluation. Range (min … max): 325.000 μs … 2.099 ms ┊ GC (min … max): 0.00% … 69.74% Time (median): 339.600 μs ┊ GC (median): 0.00% Time (mean ± σ): 348.629 μs ± 85.286 μs ┊ GC (mean ± σ): 1.09% ± 3.81% ▁▆█▆▅▄▃▁▁▂▂▁ ▁ ▂ ▇▃▁█████████████████▇▇▇▇▆▆▅▅▅▅▅▆▅▅▅▄▅▅▁▅▄▃▃▃▄▅▄▁▁▃▃▁▄▃▃▄▄▁▁▄ █ 325 μs Histogram: log(frequency) by time 478 μs < Memory estimate: 116.66 KiB, allocs estimate: 5466.

src/clustering_quality.jl

alyst · 2023-10-29T22:42:04Z

@jaksle We have reached the point when the CI tests are passing, thank you!
If we look at the coverage report (the first PR comment), not all the lines are covered, though.
In particular, testing for incorrect :quality_index values, and some variants of the clustering_quality() signatures are never called.
Could you please add tests to cover those lines?

src/clustering_quality.jl

jaksle · 2023-10-31T11:06:24Z

@alyst You mentioned wanting to reshape documentation. It may be a good idea to divide "Evaluation & Validation" into 2 sections. Clustering quality and silhouettes are about intrinsic validation, the rest is about measuring similarity between clusterings, cross validation you could say.

alyst · 2023-10-31T16:53:15Z

You mentioned wanting to reshape documentation. It may be a good idea to divide "Evaluation & Validation" into 2 sections.

Yes, that's what I had in mind: separate intrinsic quality metrics and the comparison between the clusterings (cross-validation is one use case).
I am keeping it out of scope for this PR, because you have already done a lot (thank you!).

alyst · 2023-11-25T20:45:56Z

@jaksle I think we're almost done with the PR. But it would be nice to improve the coverage.
E.g. https://app.codecov.io/gh/JuliaStats/Clustering.jl/pull/257/blob/src/clustering_quality.jl#L89 and few other lines show that some parameter combinations are not tested.
Could you please add those tests? I think we would be ready to go then.

jaksle · 2023-11-25T22:54:03Z

@alyst Yes, of course. For :silhouettes I haven't tested them deliberately assuming that what is here is basically an additional API for silhouettes method tested elsewhere.

alyst · 2023-11-25T23:00:22Z

@jaksle Yes, I think no comprehensive tests are required for silhouettes.
It is just to check that clustering_quality(..., quality_index=:silhouettes) calculates the correct index.
If silhouettes() API changes, this test will help to identify that we need to update clustering_quality() code that calls it.

And there is also Dunn method that is not covered in that clustering_quality() method variant.

jaksle · 2023-11-28T19:56:58Z

I added testing of :silhouettes (and wasted an hour looking for mistake in my analytical formula -_-). I remember why I haven't tested :dunn in line 91, I think it is the same call as line 152.

alyst · 2024-01-15T23:39:35Z

@jaksle I've updated the tests to fix the coverage (also, I've enhanced the exception logic -- it should throw a correct error message if the index name is valid, but not compatible with input arguments). Plus, I've reshuffled the docs, so that clustering comparison and quality evaluation are the two big subsections. I think this PR is ready to be merged -- but let me know if you have some questions/concerns about my recent changes. And thank you for all your hard work!

jaksle · 2024-01-18T15:37:45Z

This looks nice. I think this is a very good clean-up. I also think the PR is ready to merge. Thank you too for all the corrections!

alyst requested changes Jul 14, 2023

View reviewed changes

jaksle requested a review from alyst July 18, 2023 16:05

alyst requested changes Oct 8, 2023

View reviewed changes

alyst reviewed Oct 9, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 9, 2023

View reviewed changes

alyst reviewed Oct 27, 2023

View reviewed changes

.gitignore Outdated Show resolved Hide resolved

alyst reviewed Oct 27, 2023

View reviewed changes

Project.toml Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

test/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

test/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 29, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 30, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 30, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 30, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 30, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 30, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 30, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 30, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 30, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 30, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst reviewed Oct 30, 2023

View reviewed changes

src/clustering_quality.jl Outdated Show resolved Hide resolved

alyst approved these changes Jan 15, 2024

View reviewed changes

add clustering_quality()

92c7e7f

docs: refactor clustering validation

6b446db

alyst merged commit b54a08d into JuliaStats:master Jan 18, 2024
7 checks passed

This was referenced Jan 23, 2024

Small erorrs in docs validate links #270

Closed

Providing additional intrinsic evaluation metrices #253

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Providing intrinsic clustering quality indices #257

Providing intrinsic clustering quality indices #257

jaksle commented Jul 11, 2023

codecov-commenter commented Jul 12, 2023 •

edited

Loading

alyst left a comment

alyst commented Jul 14, 2023

jaksle commented Jul 14, 2023

alyst commented Jul 14, 2023

jaksle commented Jul 18, 2023

JoFeG commented Oct 5, 2023

jaksle commented Oct 5, 2023

alyst commented Oct 6, 2023

jaksle commented Oct 7, 2023

alyst left a comment

alyst Oct 9, 2023

jaksle Oct 9, 2023

alyst Oct 9, 2023

jaksle Oct 10, 2023

alyst Oct 10, 2023

alyst Oct 29, 2023

jaksle Oct 29, 2023

alyst Oct 29, 2023 •

edited

Loading

jaksle Oct 29, 2023

alyst Oct 30, 2023

jaksle Oct 30, 2023

alyst Oct 30, 2023

jaksle Oct 30, 2023

alyst commented Oct 29, 2023

jaksle commented Oct 31, 2023

alyst commented Oct 31, 2023

alyst commented Nov 25, 2023

jaksle commented Nov 25, 2023

alyst commented Nov 25, 2023

jaksle commented Nov 28, 2023

alyst commented Jan 15, 2024

jaksle commented Jan 18, 2024

		_inner_inertia(metric::SemiMetric, data::AbstractMatrix, centers::AbstractMatrix,
		assignments::AbstractVector{<:Integer}) =

Providing intrinsic clustering quality indices #257

Providing intrinsic clustering quality indices #257

Conversation

jaksle commented Jul 11, 2023

codecov-commenter commented Jul 12, 2023 • edited Loading

Codecov Report

alyst left a comment

Choose a reason for hiding this comment

alyst commented Jul 14, 2023

jaksle commented Jul 14, 2023

alyst commented Jul 14, 2023

jaksle commented Jul 18, 2023

JoFeG commented Oct 5, 2023

jaksle commented Oct 5, 2023

alyst commented Oct 6, 2023

jaksle commented Oct 7, 2023

alyst left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alyst Oct 29, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alyst commented Oct 29, 2023

jaksle commented Oct 31, 2023

alyst commented Oct 31, 2023

alyst commented Nov 25, 2023

jaksle commented Nov 25, 2023

alyst commented Nov 25, 2023

jaksle commented Nov 28, 2023

alyst commented Jan 15, 2024

jaksle commented Jan 18, 2024

codecov-commenter commented Jul 12, 2023 •

edited

Loading

alyst Oct 29, 2023 •

edited

Loading