Skip to content

Latest commit

 

History

History
120 lines (90 loc) · 3.42 KB

README.md

File metadata and controls

120 lines (90 loc) · 3.42 KB

PAM

About

Partioning around medoids, in Julia.

Usage

This package exports a single function:

pam(D::Array, k::Int)

D is a distance matrix and k is the number of groups desired.

Example

using Distances
using PAM
using RDatasets

iris = dataset("datasets", "iris")
X = Matrix(iris[:,1:4])
D = pairwise(Euclidean(), X, dims=1)
k = 3

results = pam(D,k)

# Output:

(medoids = [79, 8, 113], assignments = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2    3, 3, 1, 3, 3, 3, 3, 3, 3, 1])

To visualize:

using StatsPlots

scatter(
    iris[:,3],
    iris[:,4],
    legend=false,
    group=results.assignments,
    xlabel="Petal Length",
    ylabel="Petal Width",
    title="Iris Species"
)

iris

Cluster Quality Comparison

I've compared the results of the cluster quality from this implementation to the kmedoids function from Clustering.jl as well as the pam function from R's cluster package. To compare the quality of the clusters, clustering was performed 1,000 times on randomly-generated matrices of different sizes (m × n) and values of k. At each iteration, the mean silhouette score was computed and stored. Finally, the mean of all 1,000 mean silhouette scores was computed (higher values represent better separation of clusters):

1

2

3

4

R's cluster package provides the best results across all values tested while PAM.jl provides the second-best results. It should be noted, however, that the performance of Clustering.jl is far better than either the R package or PAM.jl but clustering quality is significantly worse at higher values of k with higher dimensionsal and larger data sets (at least when generated randomly).

The comparisons were generated with the following code:

using Clustering
using Distances
using RCall
using StatsBase
using StatsPlots
using PAM

trials = 1_000
pamjl = Vector{Float64}(undef,trials)
rpam = Vector{Float64}(undef,trials)
clusteringjl = Vector{Float64}(undef,trials)
m = 2
n = 10
clusters = 2

for i in 1:trials
    X = rand(m, n)
    D = pairwise(Euclidean(), X, dims=2)
    Y = X'
    k = clusters
    @rput Y
    @rput k
    R"
    library(cluster)
    r_results = pam(Y,2)
    r_assignments = r_results[3]
    "
    pamjl_assignments = pam(D,k).assignments
    rpam_assignments = @rget r_assignments
    clusteringjl_assignments = kmedoids(D,k).assignments
    pamjl[i] = mean(silhouettes(pamjl_assignments, D))
    rpam[i] = mean(silhouettes(rpam_assignments[:clustering], D))
    clusteringjl[i] = mean(silhouettes(clusteringjl_assignments, D))
end

bar(
    [mean(pamjl),mean(rpam),mean(clusteringjl)],
    fillalpha=0.8,
    color=:orange,
    ylabel="Mean of mean silhouette scores",
    legend=false,
    title="k = $clusters, X = $m × $n",
    xticks=(1:3, ["PAM.jl", "R clustering", "Clustering.jl"])
)