Providing intrinsic clustering quality indices #257

Merged
merged 2 commits into from
Jan 18, 2024
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,2 +1,5 @@
doc/build
Manifest.toml
*.swp
.vscode
docs/build/
8 changes: 4 additions & 4 deletions docs/source/index.md
@@ -3,7 +3,7 @@
*Clustering.jl* is a Julia package for data clustering. It covers the two
aspects of data clustering:

- [Clustering Algorithms](@ref clu_algo_basics), *e.g.* K-means, K-medoids, Affinity
propagation, and DBSCAN, etc.
- [Clustering Evaluation](@ref clu_validate), *e.g.* Silhouettes and variational
information.
- [Clustering Algorithms](@ref clu_algo_basics): K-means, K-medoids, Affinity
propagation, DBSCAN etc.
- [Clustering Comparison & Evaluation](@ref clu_validate): cross-tabulation, variational
and mutual information, intrinsic clustering quality indices, such as *silhouettes*, etc.
232 changes: 188 additions & 44 deletions docs/source/validate.md
@@ -1,10 +1,13 @@
# [Evaluation & Validation](@id clu_validate)

*Clustering.jl* package provides a number of methods to evaluate the results of
a clustering algorithm and/or to validate its correctness.
The *Clustering.jl* package provides a number of methods to compare different clusterings,
evaluate clustering quality, or validate the correctness of a clustering.

## Clustering comparison

## Cross tabulation
Methods to compare two clusterings and measure their similarity.

### Cross tabulation

[Cross tabulation](https://en.wikipedia.org/wiki/Contingency_table), or
*contingency matrix*, is a basis for many clustering quality measures.
@@ -13,11 +16,20 @@ It shows how similar are the two clusterings on a cluster level.
*Clustering.jl* extends `StatsBase.counts()` with methods that accept
[`ClusteringResult`](@ref) arguments:
```@docs
counts(a::ClusteringResult, b::ClusteringResult)
counts(::ClusteringResult, ::ClusteringResult)
```

### Confusion matrix

The [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)
for two clusterings is a 2×2 contingency table that counts
how often pairs of data points are in the same or in different clusters.

```@docs
confusion
```
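
As a rough illustration of what the 2×2 table counts, the sketch below tallies point pairs directly from two assignment vectors. The helper name `pair_confusion` and the row/column layout of the returned matrix are assumptions made for this example; they are not taken from the package's `confusion` implementation.

```julia
# Hypothetical helper (not part of Clustering.jl): classify every pair of points
# by whether the two points share a cluster in `a` and whether they share one in `b`.
function pair_confusion(a::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    length(a) == length(b) || throw(DimensionMismatch("assignment vectors differ in length"))
    n = length(a)
    C = zeros(Int, 2, 2)
    for i in 1:n-1, j in i+1:n
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        # Assumed layout: row 1 = together in `a`, row 2 = apart in `a`;
        # columns follow the same convention for `b`.
        C[same_a ? 1 : 2, same_b ? 1 : 2] += 1
    end
    return C
end

a = [1, 1, 2, 2]
b = [1, 1, 1, 2]
C = pair_confusion(a, b)
```

With this layout, the diagonal holds the pairs on which the two clusterings agree, so `(C[1, 1] + C[2, 2]) / sum(C)` is the plain (unadjusted) Rand index of the two clusterings.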

## Rand index
### Rand index

[Rand index](http://en.wikipedia.org/wiki/Rand_index) is a measure of
the similarity between the two data clusterings. From a mathematical
@@ -28,34 +40,7 @@ even when the original class labels are not used.
randindex
```


## Silhouettes

[Silhouettes](http://en.wikipedia.org/wiki/Silhouette_(clustering)) is
a method for evaluating the quality of clustering. Particularly, it provides a
quantitative way to measure how well each point lies within its cluster in
comparison to the other clusters.

The *Silhouette* value for the ``i``-th data point is:
```math
s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \ \text{where}
```
- ``a_i`` is the average distance from the ``i``-th point to the other points in
the same cluster ``z_i``,
- ``b_i ≝ \min_{k \ne z_i} b_{ik}``, where ``b_{ik}`` is the average distance
from the ``i``-th point to the points in the ``k``-th cluster.

Note that ``s_i \le 1``, and that ``s_i`` is close to ``1`` when the ``i``-th
point lies well within its own cluster. This property allows using
`mean(silhouettes(assignments, counts, X))` as a measure of clustering quality.
Higher values indicate better separation of clusters w.r.t. point distances.

```@docs
silhouettes
```


## Variation of Information
### Variation of Information

[Variation of information](http://en.wikipedia.org/wiki/Variation_of_information)
(also known as *shared information distance*) is a measure of the
@@ -64,11 +49,9 @@ information*, but it is a true metric, *i.e.* it is symmetric and satisfies
the triangle inequality.

```@docs
varinfo
Clustering.varinfo
```


## V-measure
### V-measure

*V*-measure can be used to compare the clustering results with the
existing class labels of data points or with the alternative clustering.
@@ -89,7 +72,7 @@ weight, and when ``\beta < 1`` it's *homogeneity*.
vmeasure
```

## Mutual information
### Mutual information

[Mutual information](https://en.wikipedia.org/wiki/Mutual_information)
quantifies the "amount of information" obtained about one random variable
@@ -100,17 +83,178 @@ the similarity of two different clusterings of a dataset.
mutualinfo
```

## Confusion matrix
## Clustering quality indices

Pair [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)
arising from two clusterings is a 2×2 contingency table representation of
the partition co-occurrence, see [`counts`](@ref).
[`clustering_quality()`](@ref clustering_quality) methods compute *intrinsic* clustering quality indices,
i.e. metrics that depend only on the clustering itself and do not use any external knowledge.
These metrics can be used to compare different clustering algorithms or to choose the optimal number of clusters.

| **quality index**                           | **`quality_index` option** | **clustering type** | **better quality** | **cluster centers** |
|:-------------------------------------------:|:--------------------------:|:-------------------:|:------------------:|:-------------------:|
| [Calinski-Harabasz](@ref calinsky_harabasz) | `:calinski_harabasz`       | hard/fuzzy          | *higher* values    | required            |
| [Xie-Beni](@ref xie_beni)                   | `:xie_beni`                | hard/fuzzy          | *lower* values     | required            |
| [Davies-Bouldin](@ref davis_bouldin)        | `:davies_bouldin`          | hard                | *lower* values     | required            |
| [Dunn](@ref dunn)                           | `:dunn`                    | hard                | *higher* values    | not required        |
| [silhouettes](@ref silhouettes_index)       | `:silhouettes`             | hard                | *higher* values    | not required        |

```@docs
confusion
clustering_quality
```

The clustering quality index definitions use the following notation:
- ``x_1, x_2, \ldots, x_n``: data points,
- ``C_1, C_2, \ldots, C_k``: clusters,
- ``c_j`` and ``c``: cluster centers and global dataset center,
- ``d``: a similarity (distance) function,
- ``w_{ij}``: weights measuring membership of a point ``x_i`` to a cluster ``C_j``,
- ``\alpha``: a fuzziness parameter.

### [Calinski-Harabasz index](@id calinsky_harabasz)

[*Calinski-Harabasz* index](https://en.wikipedia.org/wiki/Calinski%E2%80%93Harabasz_index) (option `:calinski_harabasz`)
measures the corrected ratio between the global inertia of the cluster centers and the summed internal inertias of the clusters:
```math
\frac{n-k}{k-1}\frac{\sum_{C_j}|C_j|d(c_j,c)}{\sum\limits_{C_j}\sum\limits_{x_i\in C_j} d(x_i,c_j)} \quad \text{and}\quad
\frac{n-k}{k-1} \frac{\sum\limits_{C_j}\left(\sum\limits_{x_i}w_{ij}^\alpha\right) d(c_j,c)}{\sum_{C_j} \sum_{x_i} w_{ij}^\alpha d(x_i,c_j)}
```
for hard and fuzzy (soft) clusterings, respectively.
*Higher* values indicate better quality.
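
The hard-clustering formula above can be sketched in plain Julia as follows. This is a minimal illustration, not the package implementation: the helper name `calinski_harabasz_hard` is hypothetical, ``d`` is assumed to be the squared Euclidean distance, and clusters are assumed to be labelled `1:k` with no empty cluster.

```julia
using Statistics

# Assumption: d is the squared Euclidean distance.
sqdist(x, y) = sum(abs2, x .- y)

# Hypothetical helper: X stores one data point per column,
# assignments[i] is the cluster label (1..k) of point i.
function calinski_harabasz_hard(X::AbstractMatrix, assignments::AbstractVector{<:Integer}; d = sqdist)
    n = size(X, 2)
    k = maximum(assignments)
    c = vec(mean(X, dims = 2))          # global dataset center
    outer = 0.0                         # sum over clusters of |C_j| * d(c_j, c)
    inner = 0.0                         # sum over clusters of within-cluster inertia
    for j in 1:k
        members = findall(==(j), assignments)
        cj = vec(mean(X[:, members], dims = 2))
        outer += length(members) * d(cj, c)
        inner += sum(d(X[:, i], cj) for i in members)
    end
    return (n - k) / (k - 1) * outer / inner
end

X = [0.0 2.0 10.0 12.0;
     0.0 0.0  0.0  0.0]
ch = calinski_harabasz_hard(X, [1, 1, 2, 2])
```

For these four points the centers are `(1, 0)` and `(11, 0)`, the global center is `(6, 0)`, so the numerator is `2*25 + 2*25 = 100`, the denominator is `4`, and the correction factor is `(4 - 2)/(2 - 1) = 2`.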

### [Xie-Beni index](@id xie_beni)

*Xie-Beni* index (option `:xie_beni`) measures the ratio between the summed inertia of the clusters
and the minimum distance between cluster centres:
```math
\frac{\sum_{C_j}\sum_{x_i\in C_j}d(x_i,c_j)}{n\min\limits_{c_{j_1}\neq c_{j_2}} d(c_{j_1},c_{j_2}) }
\quad \text{and}\quad
\frac{\sum_{C_j}\sum_{x_i} w_{ij}^\alpha d(x_i,c_j)}{n\min\limits_{c_{j_1}\neq c_{j_2}} d(c_{j_1},c_{j_2}) }
```
for hard and fuzzy (soft) clusterings, respectively.
*Lower* values indicate better quality.
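
A minimal sketch of the hard-clustering variant of the formula above, under stated assumptions: the helper name `xie_beni_hard` is hypothetical, ``d`` is taken to be the squared Euclidean distance, cluster centers are computed as per-cluster means, and clusters are labelled `1:k` with no empty cluster.

```julia
using Statistics

# Assumption: d is the squared Euclidean distance.
sqdist(x, y) = sum(abs2, x .- y)

# Hypothetical helper: X stores one data point per column.
function xie_beni_hard(X::AbstractMatrix, assignments::AbstractVector{<:Integer}; d = sqdist)
    n = size(X, 2)
    k = maximum(assignments)
    # cluster centers as means of the member points
    centers = [vec(mean(X[:, assignments .== j], dims = 2)) for j in 1:k]
    # summed inertia: distance of every point to the center of its own cluster
    inertia = sum(d(X[:, i], centers[assignments[i]]) for i in 1:n)
    # smallest distance between two distinct cluster centers
    min_sep = minimum(d(centers[p], centers[q]) for p in 1:k-1 for q in p+1:k)
    return inertia / (n * min_sep)
end

X = [0.0 2.0 10.0 12.0 30.0 34.0;
     0.0 0.0  0.0  0.0  0.0  0.0]
xb = xie_beni_hard(X, [1, 1, 2, 2, 3, 3])
```

Here the summed inertia is `1 + 1 + 1 + 1 + 4 + 4 = 12` and the closest pair of centers is `(1, 0)` and `(11, 0)` at squared distance `100`, which gives `12 / (6 * 100)`.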

### [Davies-Bouldin index](@id davis_bouldin)
[*Davies-Bouldin* index](https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index)
(option `:davies_bouldin`) measures the average cohesion based on the cluster diameters and the distances between cluster centers:
```math
\frac{1}{k}\sum_{C_{j_1}}\max_{c_{j_2}\neq c_{j_1}}\frac{S(C_{j_1})+S(C_{j_2})}{d(c_{j_1},c_{j_2})}
```
where
```math
S(C_j) = \frac{1}{|C_j|}\sum_{x_i\in C_j}d(x_i,c_j).
```
*Lower* values indicate better quality.
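
The two formulas above translate fairly directly into code. The following is only a sketch: the helper name `davies_bouldin_sketch` is hypothetical, ``d`` is assumed to be the plain Euclidean distance (the package may use a different default metric, so values need not match `clustering_quality`), and clusters are assumed to be labelled `1:k`, non-empty, with at least two clusters.

```julia
using Statistics

# Assumption: d is the (non-squared) Euclidean distance.
euclid(x, y) = sqrt(sum(abs2, x .- y))

# Hypothetical helper: X stores one data point per column.
function davies_bouldin_sketch(X::AbstractMatrix, assignments::AbstractVector{<:Integer}; d = euclid)
    k = maximum(assignments)
    members = [findall(==(j), assignments) for j in 1:k]
    centers = [vec(mean(X[:, m], dims = 2)) for m in members]
    # S(C_j): mean distance of the points of cluster j to its center
    S = [mean(d(X[:, i], centers[j]) for i in members[j]) for j in 1:k]
    # for every cluster, the worst (largest) ratio against any other cluster
    worst = [maximum((S[p] + S[q]) / d(centers[p], centers[q]) for q in 1:k if q != p)
             for p in 1:k]
    return mean(worst)
end

X = [0.0 2.0 10.0 12.0 30.0 34.0;
     0.0 0.0  0.0  0.0  0.0  0.0]
db = davies_bouldin_sketch(X, [1, 1, 2, 2, 3, 3])
```

In this example `S = [1, 1, 2]` and the center distances are `10`, `21` and `31`, so the per-cluster maxima are `2/10`, `2/10` and `3/21`.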

### [Dunn index](@id dunn)
[*Dunn* index](https://en.wikipedia.org/wiki/Dunn_index) (option `:dunn`)
measures the ratio between the smallest distance between points of different clusters and the maximum cluster diameter:
```math
\frac{\min\limits_{ C_{j_1}\neq C_{j_2}} \mathrm{dist}(C_{j_1},C_{j_2})}{\max\limits_{C_j}\mathrm{diam}(C_j)}
```
where
```math
\mathrm{dist}(C_{j_1},C_{j_2}) = \min\limits_{x_{i_1}\in C_{j_1},x_{i_2}\in C_{j_2}} d(x_{i_1},x_{i_2}),\quad \mathrm{diam}(C_j) = \max\limits_{x_{i_1},x_{i_2}\in C_j} d(x_{i_1},x_{i_2}).
```
It is a more computationally demanding quality index, but it can be used when the centres are not known. *Higher* values indicate better quality.
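
Because the definition only involves pairwise point distances, a direct implementation needs no cluster centers but visits all point pairs. The sketch below illustrates this; the helper name `dunn_index` is hypothetical, ``d`` is assumed to be the Euclidean distance, and at least one cluster is assumed to contain two or more points.

```julia
# Assumption: d is the (non-squared) Euclidean distance.
euclid(x, y) = sqrt(sum(abs2, x .- y))

# Hypothetical helper: X stores one data point per column.
function dunn_index(X::AbstractMatrix, assignments::AbstractVector{<:Integer}; d = euclid)
    n = size(X, 2)
    min_between = Inf    # smallest distance between points of different clusters
    max_within = 0.0     # largest distance between points of the same cluster (max diameter)
    for i in 1:n-1, j in i+1:n
        dij = d(view(X, :, i), view(X, :, j))
        if assignments[i] == assignments[j]
            max_within = max(max_within, dij)
        else
            min_between = min(min_between, dij)
        end
    end
    return min_between / max_within
end

X = [0.0 2.0 10.0 12.0 30.0 34.0;
     0.0 0.0  0.0  0.0  0.0  0.0]
dunn = dunn_index(X, [1, 1, 2, 2, 3, 3])
```

The loop runs over all ``n(n-1)/2`` pairs, which is the source of the extra cost compared with the center-based indices. In the example the closest points from different clusters are at distance `8` and the largest cluster diameter is `4`.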

### [Silhouettes](@id silhouettes_index)

[*Silhouettes* metric](http://en.wikipedia.org/wiki/Silhouette_(clustering)) quantifies the correctness of point-to-cluster assignment by
comparing the distance of the point to its cluster and to the other clusters.

The *Silhouette* value for the ``i``-th data point is:
```math
s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \ \text{where}
```
- ``a_i`` is the average distance from the ``i``-th point to the other points in
the *same* cluster ``z_i``,
- ``b_i ≝ \min_{k \ne z_i} b_{ik}``, where ``b_{ik}`` is the average distance
from the ``i``-th point to the points in the ``k``-th cluster.

Note that ``s_i \le 1``, and that ``s_i`` is close to ``1`` when the ``i``-th
point lies well within its own cluster. This property allows using the average silhouette value
`mean(silhouettes(assignments, counts, X))` as a measure of clustering quality;
it is also available via the [`clustering_quality(...; quality_index = :silhouettes)`](@ref clustering_quality) method.
Higher values indicate better separation of clusters w.r.t. point distances.

```@docs
silhouettes
```
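
To make the definitions of ``a_i`` and ``b_i`` concrete, here is a brute-force sketch that computes the silhouette values from scratch. It is not the package's `silhouettes` implementation: the helper name `silhouette_values` is hypothetical, ``d`` is assumed to be the Euclidean distance, clusters are assumed to be labelled `1:k` with at least two non-empty clusters, and points in singleton clusters are assigned ``s_i = 0`` by convention.

```julia
using Statistics

# Assumption: d is the (non-squared) Euclidean distance.
euclid(x, y) = sqrt(sum(abs2, x .- y))

# Hypothetical helper: X stores one data point per column.
function silhouette_values(X::AbstractMatrix, assignments::AbstractVector{<:Integer}; d = euclid)
    n = size(X, 2)
    k = maximum(assignments)
    s = zeros(n)
    for i in 1:n
        zi = assignments[i]
        # mean distance from point i to the points of each cluster, excluding i itself
        meandist = fill(Inf, k)
        for c in 1:k
            members = [j for j in 1:n if assignments[j] == c && j != i]
            isempty(members) && continue
            meandist[c] = mean(d(view(X, :, i), view(X, :, j)) for j in members)
        end
        a = meandist[zi]                                   # a_i: own cluster
        b = minimum(meandist[c] for c in 1:k if c != zi)   # b_i: nearest other cluster
        # assumed convention: a point alone in its cluster gets s_i = 0
        s[i] = isfinite(a) ? (b - a) / max(a, b) : 0.0
    end
    return s
end

X = [0.0 2.0 10.0 12.0;
     0.0 0.0  0.0  0.0]
s = silhouette_values(X, [1, 1, 2, 2])
avg = mean(s)
```

For the first point, ``a_1 = 2`` and ``b_1 = (10 + 12)/2 = 11``, so ``s_1 = 9/11``; the two inner points get ``7/9``. Averaging the vector gives the kind of summary value described above.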

[`clustering_quality(..., quality_index=:silhouettes)`](@ref clustering_quality)
provides the mean silhouette metric for the data points. Higher values indicate better quality.

## References
> Olatz Arbelaitz *et al.* (2013). *An extensive comparative study of cluster validity indices*. Pattern Recognition. 46(1): 243-256. [doi:10.1016/j.patcog.2012.07.021](https://doi.org/10.1016/j.patcog.2012.07.021)

> Aybüke Öztürk, Stéphane Lallich, Jérôme Darmont. (2018). *A Visual Quality Index for Fuzzy C-Means*. 14th International Conference on Artificial Intelligence Applications and Innovations (AIAI 2018). 546-555. [doi:10.1007/978-3-319-92007-8_46](https://doi.org/10.1007/978-3-319-92007-8_46)

### Examples

Example data with 3 true clusters:
```@example
using Plots, Clustering
X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
         [9., -5.] .+ 0.4 * randn(2, 5),
         [-4., -9.] .+ 1 * randn(2, 5))


scatter(view(X, 1, :), view(X, 2, :),
    label = "data points",
    xlabel = "x",
    ylabel = "y",
    legend = :right,
)
```

Hard clustering quality for K-means method with 2 to 5 clusters:

```@example
using Plots, Clustering
X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
         [9., -5.] .+ 0.4 * randn(2, 5),
         [-4., -9.] .+ 1 * randn(2, 5))

nclusters = 2:5
clusterings = kmeans.(Ref(X), nclusters)

plot((
    plot(nclusters,
         clustering_quality.(Ref(X), clusterings, quality_index = qidx),
         marker = :circle,
         title = ":$qidx", label = nothing,
    ) for qidx in [:silhouettes, :calinski_harabasz, :xie_beni, :davies_bouldin, :dunn])...,
    layout = (3, 2),
    xaxis = "N clusters",
    plot_title = "\"Hard\" clustering quality indices"
)
```

Fuzzy clustering quality for fuzzy C-means method with 2 to 5 clusters:
```@example
using Plots, Clustering
X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
         [9., -5.] .+ 0.4 * randn(2, 5),
         [-4., -9.] .+ 1 * randn(2, 5))

fuzziness = 2
fuzzy_nclusters = 2:5
fuzzy_clusterings = fuzzy_cmeans.(Ref(X), fuzzy_nclusters, fuzziness)

plot((
    plot(fuzzy_nclusters,
         clustering_quality.(Ref(X), fuzzy_clusterings,
                             fuzziness = fuzziness, quality_index = qidx),
         marker = :circle,
         title = ":$qidx", label = nothing,
    ) for qidx in [:calinski_harabasz, :xie_beni])...,
    layout = (2, 1),
    xaxis = "N clusters",
    plot_title = "\"Soft\" clustering quality indices"
)
```


## Other packages

* [ClusteringBenchmarks.jl](https://github.com/HolyLab/ClusteringBenchmarks.jl) provides
benchmark datasets and implements additional methods for evaluating clustering performance.
benchmark datasets and implements additional methods for evaluating clustering performance.
55 changes: 55 additions & 0 deletions examples/clustering_quality.jl
@@ -0,0 +1,55 @@
using Plots, Clustering

## test data with 3 clusters
X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
         [9., -5.] .+ 0.4 * randn(2, 5),
         [-4., -9.] .+ 1 * randn(2, 5))

## visualisation of the exemplary data
scatter(X[1,:], X[2,:],
    label = "data points",
    xlabel = "x",
    ylabel = "y",
    legend = :right,
)

nclusters = 2:5

## hard clustering quality
clusterings = kmeans.(Ref(X), nclusters)
hard_indices = [:silhouettes, :calinski_harabasz, :xie_beni, :davies_bouldin, :dunn]

kmeans_quality = Dict(
    qidx => clustering_quality.(Ref(X), clusterings, quality_index = qidx)
    for qidx in hard_indices)

plot((
    plot(nclusters, kmeans_quality[qidx],
         marker = :circle,
         title = qidx,
         label = nothing,
    ) for qidx in hard_indices)...,
    layout = (3, 2),
    xaxis = "N clusters",
    plot_title = "\"Hard\" clustering quality indices"
)

## soft clustering quality
fuzziness = 2
fuzzy_clusterings = fuzzy_cmeans.(Ref(X), nclusters, fuzziness)
soft_indices = [:calinski_harabasz, :xie_beni]

fuzzy_cmeans_quality = Dict(
    qidx => clustering_quality.(Ref(X), fuzzy_clusterings, fuzziness = fuzziness, quality_index = qidx)
    for qidx in soft_indices)

plot((
    plot(nclusters, fuzzy_cmeans_quality[qidx],
         marker = :circle,
         title = qidx,
         label = nothing,
    ) for qidx in soft_indices)...,
    layout = (2, 1),
    xaxis = "N clusters",
    plot_title = "\"Soft\" clustering quality indices"
)
8 changes: 7 additions & 1 deletion src/Clustering.jl
@@ -49,6 +49,9 @@ module Clustering
# silhouette
silhouettes,

# quality indices
clustering_quality,

# varinfo
varinfo,

@@ -70,6 +73,7 @@ module Clustering
# pair confusion matrix
confusion


## source files

include("utils.jl")
@@ -84,13 +88,15 @@ module Clustering

include("counts.jl")
include("cluster_distances.jl")

include("silhouette.jl")
include("clustering_quality.jl")

include("randindex.jl")
include("varinfo.jl")
include("vmeasure.jl")
include("mutualinfo.jl")
include("confusion.jl")

include("hclust.jl")

include("deprecate.jl")