
Providing intrinsic clustering quality indices #257

Merged: 2 commits merged into JuliaStats:master on Jan 18, 2024
Conversation

@jaksle (Contributor) commented on Jul 11, 2023

Continuation of issue #253. It introduces 4 indices: Calinski-Harabasz, Davies-Bouldin, Xie-Beni and Dunn. CH, DB and XB require the space to be linear, i.e. we can talk about average positions; k-means and fuzzy c-means work under the same assumption. Only CH and XB are defined for fuzzy clustering. Dunn is the most computationally expensive, but also the most general: it only requires distances between data points, so it also works for k-medoids.
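
For orientation, the standard textbook forms of two of these indices (not taken from the PR code itself): the Calinski-Harabasz index for N points, K clusters with sizes n_k, cluster centers c_k and global center \bar{x}, and the Dunn index, which uses only pairwise distances:

$$
\mathrm{CH} = \frac{\sum_{k} n_k \,\lVert c_k - \bar{x} \rVert^2 / (K-1)}
                   {\sum_{k} \sum_{i \in C_k} \lVert x_i - c_k \rVert^2 / (N-K)},
\qquad
\mathrm{D} = \frac{\min_{k \neq l} \operatorname{dist}(C_k, C_l)}{\max_{k} \operatorname{diam}(C_k)}
$$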

Issues:

  • No documentation or tests yet. Please first check whether my API fits Clustering.jl and I will add them. It may also be reasonable to introduce a single function qualityindex(method, args), and perhaps not even export the rest of the functions? I am not sure.
  • The file mytest.jl is a temporary testing unit for the current API and will be removed later.
  • These methods do not work for arrays which are not 1-indexed. However, the rest of Clustering.jl does not either. Should we add Base.require_one_based_indexing? Theoretically it can be made indexing-independent, but it seems irritating to implement.

@codecov-commenter commented on Jul 12, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1cade0d) 95.01% compared to head (6b446db) 95.40%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #257      +/-   ##
==========================================
+ Coverage   95.01%   95.40%   +0.39%     
==========================================
  Files          19       20       +1     
  Lines        1385     1503     +118     
==========================================
+ Hits         1316     1434     +118     
  Misses         69       69              


@alyst (Member) left a comment


Thanks for your contribution!
These metrics (I find the term "index" a bit confusing in this context) are indeed very useful, but the code needs some work.
I've indicated in the code suggestions that we need to export the single function clustering_quality().
I've only reviewed the first calinski_harabasz() method; for the other ones, please update the formatting, variable names and linear algebra code in a similar way.
Also, please add proper unit tests (you have them already, but they are in the wrong source file).
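
For context, a rough sketch of how the single exported function could be called. The positional (data, centers, assignments) form and the index symbols follow the test calls quoted later in this thread; treat the exact signature as an assumption rather than the final API:

using Clustering

X = rand(2, 100)      # 100 two-dimensional points
R = kmeans(X, 3)      # cluster into 3 groups

# higher is better for Calinski-Harabasz, lower is better for Davies-Bouldin
clustering_quality(X, R.centers, R.assignments; quality_index = :calinski_harabasz)
clustering_quality(X, R.centers, R.assignments; quality_index = :davies_bouldin)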

@alyst (Member) commented on Jul 14, 2023

These methods do not work for arrays which are not 1-indexed. However, the rest of Clustering.jl does not either. Should we add Base.require_one_based_indexing? Theoretically it can be made indexing-independent, but it seems irritating to implement.

Let's try to make the new code in Clustering.jl index-base-agnostic. It should not be so "irritating": using eachindex()/eachslice() should address most of the issues (and potentially even simplify and speed up the code).
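
A minimal sketch of what index-base-agnostic iteration could look like (a hypothetical helper, not the PR code), assuming the cluster labels themselves are 1..K and every cluster is non-empty, as discussed later in the thread:

# zip() pairs each data point with its cluster label by iteration order,
# so neither `data` nor `assignments` needs to be 1-based.
function cluster_sizes_and_centers(data::AbstractMatrix, assignments::AbstractVector{<:Integer})
    K = maximum(assignments)
    centers = zeros(size(data, 1), K)
    counts = zeros(Int, K)
    for (j, x) in zip(assignments, eachslice(data, dims = 2))
        centers[:, j] .+= x
        counts[j] += 1
    end
    return counts, centers ./ counts'   # divide each column by its cluster size
end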

@jaksle (Contributor, Author) commented on Jul 14, 2023

Thank you for the thoughtful comments; it is a chance to learn something new along the way.

A few thoughts on the broader topics:

  • The name. I am not a big fan of "quality index" either, but it seems to be the standard term: it is used on Wikipedia and in the review articles I found. I don't have a strong opinion here.
  • 1-based indexing. I thought about it a little before, and the "irritating" (or perhaps better, "confusing") part is the assignments vector. Generally its elements can be of any type, I think? It is not currently clear to me how to use it in a universal way, but I will think about it.

@alyst (Member) commented on Jul 14, 2023

The name. I am not a big fan of "quality index" either, but it seems to be the standard term: it is used on Wikipedia and in the review articles I found. I don't have a strong opinion here.

If there's a consensus in the literature to use "index", let's call it index (a quality_index kwarg of the clustering_quality() function).
One advantage is that it disambiguates the clustering quality index from the distance metric.

I thought about it a little before, and the "irritating" (or perhaps better, "confusing") part is the assignments vector. Generally its elements can be of any type, I think?

The elements should be of an Integer type. The Clustering.jl methods accept assignments::AbstractVector{<:Integer} that could be constructed by the user.
But you are right that most of the methods assume the elements to be cluster indices from 1 to K without gaps (the ClusteringResult.assignments generated by the Clustering.jl clustering methods).
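
For example, a valid assignments vector for 5 points and K = 3 clusters uses every index from 1 to 3:

assignments = [1, 3, 2, 2, 1]    # valid: clusters 1..3, no gaps
# [1, 4, 2, 2, 1] would leave cluster 3 empty (a gap), which most methods do not expect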

@jaksle requested a review from alyst on July 18, 2023
@jaksle (Contributor, Author) commented on Jul 18, 2023

My attempt at making clustering_quality indexing-agnostic ended in failure: Distances.jl requires 1-based indexing.

@JoFeG commented on Oct 5, 2023

Hello everyone,

I'm currently working on intrinsic cluster validity indices and wanted to propose a new feature for the package, but then I found this open thread, which covers more or less what I had in mind.

The development seems to have paused a couple of months ago; I would like to contribute.

Before diving in, I'd like to know whether help is indeed needed here.

@jaksle (Contributor, Author) commented on Oct 5, 2023

Hi! The current state is that the commit is fully functional and I am waiting for a more detailed re-review by @alyst. You can check the files; a few of the most popular quality measures are implemented there.

@alyst (Member) commented on Oct 6, 2023

@jaksle oh, sorry, I missed your updates. I'll try to review them in the next few days. I have recently merged #255, which, I think, put your PR into a conflicting state. Could you please resolve the conflicts?

@jaksle (Contributor, Author) commented on Oct 7, 2023

Conflicts resolved. The code is now mostly index-agnostic; I only learned midway that Distances.jl will not work with other index bases. Other than that, the documentation may be too verbose, but it should be easy to trim it down if needed.

@alyst (Member) left a comment


Thank you, there has been big progress since the last version!
I've left some comments/suggestions on the docs and the code.
I have not made these suggestions everywhere, but please use a space after commas, both in the code and in the docs.

At the moment the clustering_quality.jl tests are not enabled in the CI, so it is not clear whether the code runs and gives the correct results, or what the coverage is. Please include this file in runtests.jl. Looking at the existing tests, I would say that ideally we also need tests with non-0/1 weights and with varying fuzziness.

If the CI results look OK, I think we are close to merging it after the final review.
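
A hypothetical shape of such a weighted test; the exact fuzzy clustering_quality signature is an assumption here and should be adapted to the merged API:

using Clustering, Test

X = rand(2, 60)
F = fuzzy_cmeans(X, 3, 2.0)        # 3 clusters, fuzziness 2.0

# assumed fuzzy call form: data, centers, weights, plus quality_index and fuzziness kwargs
@test isfinite(clustering_quality(X, F.centers, F.weights;
                                  quality_index = :calinski_harabasz, fuzziness = 2.0))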

test/clustering_quality.jl (excerpt under review):

@test_throws ArgumentError clustering_quality(zeros(2,2), zeros(2,2), [1, 2], quality_index = :calinski_harabasz)
@test_throws DimensionMismatch clustering_quality([1,2,3], zeros(2,2), quality_index = :dunn)
end

@alyst (Member):

We should also test corner cases to make sure they are handled correctly (ideally, the correct value according to the quality index definition should be returned):

  • degenerate input data: 0 points (and 0 clusters)
  • trivial clustering: a single cluster
  • 0-dimensional data

@jaksle (Contributor, Author):

Quality indices are generally undefined for a single cluster, because the cluster spread is then undefined.

@alyst (Member):

OK, silhouettes() actually throws an error for degenerate clusterings, so the other indices should follow that behaviour.

The other corner case that silhouettes() checks is empty clusters (gaps in the assignments values).
See test/silhouette.jl.

@jaksle (Contributor, Author):

If a cluster is empty, then its center is undefined, which means the user passed nonsensical arguments (or an inconsistent number of assignments/centers, which is checked). Only silhouettes and the Dunn index do not require centers.

@alyst (Member):

Ideally we need to test that the behaviour in such cases meets our expectations, because this will happen (bugs in user code). I'd say: let's add the tests, see how clustering_quality handles them, and then decide whether we need to adjust the behaviour.

Comment on lines 171 to 172:

_inner_inertia(metric::SemiMetric, data::AbstractMatrix, centers::AbstractMatrix,
               assignments::AbstractVector{<:Integer}) =

@alyst (Member):

Suggested change:

- _inner_inertia(metric::SemiMetric, data::AbstractMatrix, centers::AbstractMatrix,
-                assignments::AbstractVector{<:Integer}) =
+ _inner_inertia(metric::SemiMetric, data::AbstractMatrix, centers::AbstractMatrix,
+                assignments::AbstractVector{<:Integer}, fuzziness::Nothing) =
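
To illustrate the dispatch pattern behind this suggestion (hypothetical helper names and a generic callable metric, not the PR code):

# hard clustering: fuzziness === nothing selects the plain per-point distance sum
hard_or_fuzzy_inertia(metric, data, centers, assignments::AbstractVector{<:Integer}, ::Nothing) =
    sum(metric(view(data, :, i), view(centers, :, assignments[i])) for i in axes(data, 2))

# fuzzy clustering: membership weights raised to the fuzziness exponent weight each distance
hard_or_fuzzy_inertia(metric, data, centers, weights::AbstractMatrix, fuzziness::Real) =
    sum(weights[i, j]^fuzziness * metric(view(data, :, i), view(centers, :, j))
        for i in axes(data, 2), j in axes(centers, 2))

With this arrangement the caller passes nothing for hard clustering and a Real fuzziness for the fuzzy case, and multiple dispatch picks the right method without a separate code path.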

@jaksle (Contributor, Author):

You can compute the inner inertia without using _gather_samples, as you suggest, but colwise can be more efficient than calling the chosen distance inside a loop. This makes the possible gains unclear and distance-dependent.

@alyst (Member) commented on Oct 29, 2023:

pairwise() is for sure more efficient when calculating pairwise distances (for certain metrics) between N and M points.
But in this case it is the distance between N points and 1 point (in the _gather_samples() implementation).
Does it also get a significant performance boost?

@jaksle (Contributor, Author):

From what I understand, the possible gains are entirely metric-dependent. For the most popular one, in

using BenchmarkTools, Distances

X = randn(1000, 1000)
C = randn(1000)

@benchmark sum(colwise(Euclidean(), X, C))             # column-wise distances in one call
@benchmark sum(Euclidean()(x, C) for x in eachcol(X))  # per-column metric calls in a loop

on my machine colwise is faster by about 60% and uses 10x less memory.

@alyst (Member):

Thanks for checking. Could you please add your benchmark results here?
I guess the performance gain is due to the use of BLAS matrix multiplication for Euclidean/SqEuclidean (so for other metrics the gain should be lower).
The memory inefficiency is a bit unfortunate; I'm not sure whether it's because of eachcol(X) vs view(X, :, i) or due to internal array allocation in the generic metric() code.
Anyway, thanks again for checking -- you can revert it to the old version (otherwise I will do it), just keep the signature of inner_inertia() the same and call gather_samples() inside it.

@jaksle (Contributor, Author):

Sure, colwise is

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  132.700 μs …  4.223 ms  ┊ GC (min … max): 0.00% … 96.37%
 Time  (median):     137.900 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   144.589 μs ± 47.773 μs  ┊ GC (mean ± σ):  0.28% ±  0.96%

  ▃▇█▆▅▅▃▃▂▂▁▁▁    ▁    ▁                                      ▂
  ██████████████████████████▇▇▆▆▆▇▆▆▆▆▆▅▃▅▅▄▁▄▄▁▄▅▃▁▄▄▃▁▃▄▄▁▅▅ █
  133 μs        Histogram: log(frequency) by time       256 μs <

 Memory estimate: 7.95 KiB, allocs estimate: 2.

Euclidean()(x, C) is

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  197.400 μs …   3.395 ms  ┊ GC (min … max): 0.00% … 87.95%
 Time  (median):     203.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   211.891 μs ± 122.278 μs  ┊ GC (mean ± σ):  2.17% ±  3.57%

     ▃█▇▇▁
  ▁▂▅█████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  197 μs           Histogram: frequency by time          254 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 3001.

I've quickly checked view(X, :, i); it is even worse.

@alyst (Member):

I've quickly checked view(X, :, i); it is even worse.

Worse in terms of memory or performance?

@jaksle (Contributor, Author):

On the same machine:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  325.000 μs …  2.099 ms  ┊ GC (min … max): 0.00% … 69.74%
 Time  (median):     339.600 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   348.629 μs ± 85.286 μs  ┊ GC (mean ± σ):  1.09% ±  3.81%

     ▁▆█▆▅▄▃▁▁▂▂▁   ▁                                          ▂
  ▇▃▁█████████████████▇▇▇▇▆▆▅▅▅▅▅▆▅▅▅▄▅▅▁▅▄▃▃▃▄▅▄▁▁▃▃▁▄▃▃▄▄▁▁▄ █
  325 μs        Histogram: log(frequency) by time       478 μs <

 Memory estimate: 116.66 KiB, allocs estimate: 5466.

@alyst (Member) commented on Oct 29, 2023

@jaksle We have reached the point where the CI tests are passing, thank you!
However, if we look at the coverage report (the first PR comment), not all lines are covered.
In particular, the handling of incorrect :quality_index values and some variants of the clustering_quality() signature are never exercised.
Could you please add tests to cover those lines?

@jaksle (Contributor, Author) commented on Oct 31, 2023

@alyst You mentioned wanting to reshape the documentation. It may be a good idea to divide "Evaluation & Validation" into two sections: clustering quality and silhouettes are about intrinsic validation, while the rest is about measuring similarity between clusterings (cross-validation, you could say).

@alyst (Member) commented on Oct 31, 2023

You mentioned wanting to reshape the documentation. It may be a good idea to divide "Evaluation & Validation" into two sections.

Yes, that's what I had in mind: separating the intrinsic quality metrics from the comparison between clusterings (cross-validation is one use case).
I am keeping it out of scope for this PR, because you have already done a lot (thank you!).

@alyst (Member) commented on Nov 25, 2023

@jaksle I think we're almost done with the PR, but it would be nice to improve the coverage.
E.g. https://app.codecov.io/gh/JuliaStats/Clustering.jl/pull/257/blob/src/clustering_quality.jl#L89 and a few other lines show that some parameter combinations are not tested.
Could you please add those tests? I think we would be ready to go then.

@jaksle (Contributor, Author) commented on Nov 25, 2023

@alyst Yes, of course. I deliberately haven't tested :silhouettes, assuming that what is here is basically an additional API for the silhouettes method, which is tested elsewhere.

@alyst (Member) commented on Nov 25, 2023

@jaksle Yes, I think no comprehensive tests are required for silhouettes.
It is just to check that clustering_quality(..., quality_index=:silhouettes) calculates the correct index.
If the silhouettes() API changes, this test will help identify that the clustering_quality() code calling it needs to be updated.

There is also the Dunn method, which is not covered in that clustering_quality() method variant.

@jaksle (Contributor, Author) commented on Nov 28, 2023

I added tests for :silhouettes (and wasted an hour looking for a mistake in my analytical formula -_-). I now remember why I hadn't tested :dunn in line 91: I think it is the same call as in line 152.

@alyst (Member) commented on Jan 15, 2024

@jaksle I've updated the tests to fix the coverage (and also enhanced the exception logic -- it now throws the correct error message when the index name is valid but not compatible with the input arguments). Plus, I've reshuffled the docs so that clustering comparison and quality evaluation are the two big subsections. I think this PR is ready to be merged -- but let me know if you have any questions or concerns about my recent changes. And thank you for all your hard work!

@jaksle (Contributor, Author) commented on Jan 18, 2024

This looks nice; I think it's a very good clean-up. I also think the PR is ready to merge. Thank you, too, for all the corrections!

@alyst merged commit b54a08d into JuliaStats:master on Jan 18, 2024 (7 checks passed)