Topology is the branch of mathematics concerned with continuous objects (not sizes and shapes, but continuity): it studies how things are connected and where the gaps are. It explains how a shape can be continuously deformed into a new one without losing its core properties. [1]
Definition (homotopy): Let f, g : X ⟶ Y be maps. f is homotopic to g if there exists a map F : X×I ⟶ Y such that F(x, 0) = f(x) and F(x, 1) = g(x) for all points x ∈ X. The map F is called a homotopy from f to g, and we write f ≃ g.
More intuitively, if we think of the second parameter of F as "time", then F describes a "continuous deformation" of f into g: at time 0 we have the function f, at time 1 we have the function g.
A famous example is the deformation of a mug into a donut. Informally, two continuous functions from one topological space to another are called homotopic if one can be continuously deformed into the other; such a deformation is called a homotopy between the two functions. [1][2]
Topological data analysis (TDA) is an approach to the analysis of datasets using techniques from topology. It is usually combined with other forms of analysis, such as statistical or geometric approaches. [1]
A famous TDA algorithm for visualizing high-dimensional data is Mapper.
Mapper is a TDA algorithm that provides a generalized notion of coordinatization for high-dimensional datasets. Coordinatization can refer to a choice of real-valued coordinate functions on the data, or to other notions of geometric representation, such as the Reeb graph. [4]
The Mapper algorithm is explained in three sections:
- Math theory required for the Mapper algorithm;
- Mapper: construction of a simplicial complex (e.g. a graph) from a cover;
- Mapper: multiresolution motivation (level of detail).
Definition: A topology τ on a set X is a collection of subsets of X satisfying the following properties:
- The empty set ∅ and the space X are both sets in the topology;
- The union of any collection of sets in τ is contained in τ;
- The intersection of any finitely many sets in τ is also contained in τ.
All sets in the topology are called open sets. [3]
Examples (with visualization):
- τ = {∅, {1, 2, 3}}
- τ = {∅, {1}, {1, 2, 3}}
- τ = {∅, {1}, {2}, {1, 2}, {1, 2, 3}}
- τ = {∅, {1, 2}, {2, 3}, {2}, {1, 2, 3}}
Non-examples:
- τ = {∅, {2}, {3}, {1, 2, 3}} (missing: {2, 3} = {2} ∪ {3})
- τ = {∅, {1, 2}, {2, 3}, {1, 2, 3}} (missing: {2} = {1, 2} ∩ {2, 3})
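The axioms can be verified mechanically for finite examples like the ones above. A minimal sketch (the helper name `is_topology` is ours; for a finite collection, closure under pairwise unions and intersections implies closure under arbitrary unions and finite intersections):

```python
def is_topology(X, tau):
    """Check the topology axioms for a candidate collection tau on a finite set X."""
    tau = [frozenset(s) for s in tau]
    X = frozenset(X)
    # Axiom 1: the empty set and the whole space belong to tau.
    if frozenset() not in tau or X not in tau:
        return False
    # Axioms 2 and 3: pairwise closure under union and intersection.
    for a in tau:
        for b in tau:
            if a | b not in tau or a & b not in tau:
                return False
    return True

print(is_topology({1, 2, 3}, [set(), {2}, {3}, {1, 2, 3}]))  # False: {2, 3} is missing
```

Running it on the six collections above accepts the first four and rejects the two non-examples.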
Definition: A topological space is a pair (X, τ) where X is a set and τ is a topology on X. [3]
Definition: A cover C of a set X is a collection of sets whose union includes X. A cover is an open cover if all its members are open sets. [3]
Example: X is the unit circle and C is a set of circles whose union contains X.
Definition: The nerve of an open covering C is a simplicial complex N(C) built from the intersection pattern of the members of C.
Example: From the previous example we can form the nerve of the open covering. Note: The obtained simplicial complex approximates the initial space.
Definition: A partition of unity of a topological space X is a set R of continuous functions from X to [0, 1] such that for each point x ∈ X:
- there is a neighbourhood of x on which all but a finite number of the functions in R are 0;
- the sum of all the function values at x is 1.
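A minimal numeric illustration (a toy example of ours, not from the text): cover [0, 2] by U1 = [0, 1.2) and U2 = (0.8, 2], take a bump function vanishing outside each member, and normalize by their sum:

```python
def b1(x):
    """Bump supported on U1 = [0, 1.2)."""
    return max(0.0, 1.2 - x)

def b2(x):
    """Bump supported on U2 = (0.8, 2]."""
    return max(0.0, x - 0.8)

def partition_of_unity(x):
    # The sum is positive for every x in [0, 2] because U1 and U2 cover it,
    # so each normalized value lies in [0, 1] and the values sum to 1.
    s = b1(x) + b2(x)
    return b1(x) / s, b2(x) / s
```

At x = 0 only the first function is active, at x = 2 only the second, and in the overlap both contribute.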
Definition: Points v in the k-simplex correspond to ordered (k+1)-tuples of real numbers (the numbers are from the interval [0, 1] and sum up to 1). We can interpret these values as normalized masses. This coordinate system is called the barycentric coordinate system.
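A sketch of the mass interpretation (the helper name is ours): a barycentric tuple picks the point of the simplex that balances those masses at the vertices, i.e. the mass-weighted average:

```python
def barycentric_to_cartesian(bary, vertices):
    """Map barycentric coordinates to the weighted average of the vertices."""
    assert abs(sum(bary) - 1.0) < 1e-9 and all(b >= 0.0 for b in bary)
    x = sum(w * vx for w, (vx, vy) in zip(bary, vertices))
    y = sum(w * vy for w, (vx, vy) in zip(bary, vertices))
    return x, y

triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(barycentric_to_cartesian((1.0, 0.0, 0.0), triangle))  # (0.0, 0.0), a vertex
```

Equal masses (1/3, 1/3, 1/3) give the barycenter of the triangle.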
Example: Barycentric coordinates of the 2-simplex (triangle).
Two further notions are needed: connected components and path-connected components.
Definition: A topological space X is said to be disconnected if it is the union of two disjoint nonempty open sets. Otherwise, X is said to be connected. We can define the relation x ~ y if there exists a connected subset of X containing both of them. It can be shown that this relation is an equivalence relation. The equivalence classes of this relation are called connected components. [3]
Definition: The space X is said to be path-connected if for any two points x, y ∈ X there exists a continuous function f from the unit interval [0, 1] to X with f(0) = x and f(1) = y. (This function is called a path from x to y.) We can now define the relation x ~ y if x is connected to y by a path. It can be shown that this relation is an equivalence relation. The equivalence classes of this relation are called path-connected components.
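For finite point clouds, connectedness is usually approximated by linking points that lie closer than some ε and taking the components of the resulting graph. A minimal union-find sketch (names are ours):

```python
def epsilon_components(points, eps):
    """Components of the graph joining points at distance < eps."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of points closer than eps.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if dist2 < eps ** 2:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

print(epsilon_components([(0, 0), (0.1, 0), (5, 5)], eps=0.5))  # [[0, 1], [2]]
```

This ε-linkage is also the role the clustering step plays in the statistical Mapper below.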
Assume we have a finite covering U = {Ua | a ∈ A} of a space X, where A is an indexing set (image construction-1).
construction-1: Space X, defined by black border, with finite covering U = {U1, U2, U3, U4}
We can define the nerve of the covering U to be a simplicial complex N(U) (image construction-2) where:
- Vertices of N(U) are named by the index set A.
- A family {a0, ..., ak} forms a k-simplex in N(U) (the vertices of the simplex) if and only if Ua0 ∩ Ua1 ∩ ... ∩ Uak is non-empty. [3]
construction-2: The nerve of covering N(U) for U
Given a partition of unity {Φa: X ⟶ [0, 1] | a ∈ A} (∑a Φa(x) = 1), we can obtain a map from X to N(U):
- Let T(x) = {a ∈ A | x ∈ Ua} (the set of members of the cover that contain x).
- Let ρ: X ⟶ N(U), where ρ(x) is the point in the simplex spanned by the vertices a ∈ T(x) whose barycentric coordinates are (Φa1(x), Φa2(x), ..., Φak(x)), where a1, a2, ..., ak are the values in T(x). The continuous map ρ provides a kind of partial coordinatization of X using a k-simplex from N(U). [3]
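For a finite cover the nerve is directly computable: enumerate index subsets and keep those whose members intersect. A small sketch (the function name is ours):

```python
from itertools import combinations

def nerve(cover):
    """cover: dict mapping index a -> set of points U_a.
    Returns the simplices (tuples of indices) whose cover members
    have a non-empty common intersection."""
    indices = sorted(cover)
    simplices = []
    for k in range(1, len(indices) + 1):
        for combo in combinations(indices, k):
            if set.intersection(*(cover[a] for a in combo)):
                simplices.append(combo)
    return simplices

cover = {1: {"p", "q"}, 2: {"q", "r"}, 3: {"s"}}
print(nerve(cover))  # [(1,), (2,), (3,), (1, 2)]
```

Here U1 and U2 share the point "q", so the edge (1, 2) appears, while U3 stays an isolated vertex.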
We can form a finite covering V using a continuous map f: X ⟶ Z, where Z is a parameter space. Let the parameter space Z be equipped with a finite open covering C = {Cb | b ∈ B}, where B is an indexing set (image construction-3).
construction-3: Mapping of covering C to Z
Since f is continuous, the preimage of an open set is open. Hence, the sets Vb := f⁻¹(Cb) form a finite open covering of the space X (image construction-4). We can now decompose each Vb into path-connected components (Vb is a union of connected components). [3]
construction-4: Forming the covering using preimages under f
If we have two coverings U = {Ua | a ∈ A} and V = {Vb | b ∈ B}, then a map of coverings from U to V is a function f: A ⟶ B such that for all a ∈ A we have Ua ⊆ Vf(a). Hence, we have an induced mapping of simplicial complexes N(f): N(U) ⟶ N(V). [3]
Consequently, if we have a family of coverings Ui, i = 0, 1, ..., n, and maps of coverings fi : Ui → Ui+1 for each i, we obtain a diagram of simplicial complexes and simplicial maps N(U0) ⟶ N(U1) ⟶ ⋯ ⟶ N(Un). [3]
This means that as the resolution of the cover increases (members of the cover are decomposed into more, "smaller" members), the resulting simplicial complex becomes more detailed (vertices are correspondingly decomposed). In the case of graphs, they become more refined in the sense that more nodes are inserted along the edges. Example: [5]
A statistical version of Mapper is used for the implementation. The idea is to use some clustering algorithm to partition the space into a set of connected components. Assume we have N data points x ∈ X, a filter function (lens) f: X ⟶ R and an inter-point distance matrix (given explicitly, or implicitly by some metric). Example: image mapper-example: point cloud. Algorithm:
- Form a cover of the range I of the function f:
  - I = f(X);
  - We can form the cover by splitting I into a set S of overlapping intervals of equal length.
  - Note: There are two parameters here: the size n of the set S (or the interval length l) and the percentage overlap p between successive intervals.
  - Example:
    - X ⊆ [0,2]×[0,2]
    - f(x, y) = y ⇒ I ⊆ [0,2]
    - n = 4, p = 10%
    - S = {[0, 0.55], [0.45, 1.05], [0.95, 1.55], [1.45, 2]}
  - Example: image mapper-example: filter
- Calculate the preimage of each member of the cover S:
  - Xi = f⁻¹[Ii]
  - The set V = {Xi | i ∈ {1...n}} forms a cover of X.
  - Example: image mapper-example: covering
- Cluster each member of V:
  - The clustering algorithm is arbitrary, but it should have some desired characteristics, which are noted later.
  - Each cluster forms a vertex v_{i,j}, where Xi is the clustered member of the cover and C_{i,j} is the j-th cluster obtained from clustering Xi.
  - Example: image mapper-example: clustering
- Form a simplicial complex (or graph):
  - In the case of a simplicial complex: a set of vertices {v_{i1,j1}, v_{i2,j2}, ..., v_{ik,jk}} forms a k-simplex if C_{i1,j1} ∩ C_{i2,j2} ∩ ... ∩ C_{ik,jk} ≠ ∅.
  - In the special case of a graph: vertices v_{a,b} and v_{c,d} are connected if C_{a,b} ∩ C_{c,d} ≠ ∅.
  - Example: image mapper-example: TDA network
mapper-example: Applying mapper [[6]](#6)
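The steps above can be condensed into a minimal, self-contained sketch. This is our toy reading of the algorithm, not the project's implementation: a 1-dimensional filter, and single-linkage clustering with a distance threshold standing in for DBSCAN:

```python
import numpy as np

def mapper_graph(X, lens, n_intervals=4, overlap=0.1, eps=0.3):
    """Toy Mapper: overlapping-interval cover of the lens range,
    threshold single-linkage clustering of each preimage, and an
    edge between clusters that share a data point."""
    X, lens = np.asarray(X, float), np.asarray(lens, float)
    lo, hi = lens.min(), lens.max()
    length = (hi - lo) / n_intervals
    pad = length * overlap
    nodes = []  # each node is a frozenset of data indices
    for i in range(n_intervals):
        a, b = lo + i * length - pad, lo + (i + 1) * length + pad
        member = [j for j in range(len(X)) if a <= lens[j] <= b]
        # naive single linkage: union points closer than eps
        parent = {j: j for j in member}
        def find(j):
            while parent[j] != j:
                j = parent[j]
            return j
        for u in member:
            for v in member:
                if u < v and np.linalg.norm(X[u] - X[v]) < eps:
                    parent[find(u)] = find(v)
        clusters = {}
        for j in member:
            clusters.setdefault(find(j), set()).add(j)
        nodes.extend(frozenset(c) for c in clusters.values())
    edges = [(p, q) for p in range(len(nodes)) for q in range(p + 1, len(nodes))
             if nodes[p] & nodes[q]]
    return nodes, edges
```

For points sampled densely along a vertical segment with the y-coordinate as the lens, successive preimages overlap and the result is a path graph, as expected.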
Besides choosing a good filter function, finding a good clustering is another important challenge for a decent data visualization. The Mapper algorithm does not place any conditions on the clustering, but there are some desired characteristics: [4]
- Clustering should not be restricted to Euclidean space. We can give an inter-point distance matrix as input to the Mapper algorithm.
- The number of clusters should not be specified beforehand. We can use algorithms that do not require the number of clusters to be fixed in advance. Examples: DBSCAN, agglomerative (hierarchical) clustering with a distance threshold, etc.
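Both characteristics are available in scikit-learn's DBSCAN: it accepts a precomputed distance matrix and infers the number of clusters itself. A small illustration on toy data of our own:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Five 1-D points forming two tight groups.
points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
dist = np.abs(points - points.T)  # explicit inter-point distance matrix

# metric="precomputed" consumes the distance matrix directly, and no
# cluster count is given up front; DBSCAN discovers two clusters here.
labels = DBSCAN(eps=0.5, min_samples=1, metric="precomputed").fit_predict(dist)
print(labels)  # e.g. [0 0 0 1 1]
```

With `min_samples=1` every point is a core point, so the clusters are exactly the eps-connected groups and no point is labeled as noise.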
The Iris dataset is a well-known dataset consisting of three classes of flowers (*setosa*, *versicolor*, *virginica*) with 50 instances for each class. This dataset is associated with a classification (supervised) task in machine learning. [7]
- Features: `SepalLengthCm`, `SepalWidthCm`, `PetalLengthCm`, `PetalWidthCm`
- Target: `Species`
A custom Mapper algorithm prototype is used to test class separability via visualization. The size of a node is correlated with the number of instances that the node contains. Parameters:
- Filter function: projection on the first PCA component
- Clustering: DBSCAN
- Number of intervals: 10
- Overlap percentage: 50%
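The filter function can be reproduced in a couple of lines with scikit-learn, using its bundled copy of the Iris data (which we assume matches the dataset described above):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                               # 150 samples x 4 features
lens = PCA(n_components=1).fit_transform(X)[:, 0]  # projection on first PCA component
print(lens.shape)  # (150,)
```

The resulting 1-D `lens` array is what the interval cover is built over.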
In the image on the left we can see nodes colored by the mode of each cluster (node). We can see three "bigger" clusters: the *setosa* cluster, which is isolated from *versicolor* and *virginica*, and the *virginica* and *versicolor* clusters, which are not isolated from each other.
In the image on the right we can see the entropies of the nodes. We expect the classes *virginica* and *versicolor* to mix when a trained classification model is applied, while *setosa* should be easy for the model to learn to separate.
We could also make another plot after training the model to see the model's certainty, i.e. where the model is not sure which class an instance belongs to.
Code can be found in the `01_iris` directory. Required packages:
```
pandas==1.2.4
sklearn==0.0
matplotlib==3.4.1
networkx==2.5.1
scipy==1.6.1
```
Running the code:
```shell
cd 01_iris
python main.py  # or python3 main.py
```
Link to the Kaggle competition: Natural Language Processing with Disaster Tweets.
Each sample in the dataset has the following information:
- The `text` of a tweet
- A `keyword` from that tweet (although this may be blank!)
- The `location` the tweet was sent from (may also be blank)
Goal: predict whether a given tweet is about a real disaster or not. If so, predict 1; if not, predict 0. Note: This is the goal of the Kaggle competition, but not our goal here.
Data columns:
- `id` - a unique identifier for each tweet
- `text` - the text of the tweet
- `location` - the location the tweet was sent from (may be blank)
- `keyword` - a particular keyword from the tweet (may be blank)
- `target` - denotes whether a tweet is about a real disaster (1) or not (0)
Exploratory data analysis of the dataset can be found on this link.
Interactive visualization is generated using the KeplerMapper API. Results can be found on this link.
Result of the Mapper algorithm using GloVe word embeddings. Configuration:
- Filter function: projection on two T-SNE components
- Clustering: DBSCAN
- Number of intervals (cubes): 10x10
- Overlap percentage: 50%
Glove Embeddings (max), T-SNE, DBSCAN
When taking the average instead of the maximum of the GloVe word embeddings, we get a more uniform representation (from disasters to non-disasters) instead of flares for (mostly) the disasters. The base configuration (T-SNE, DBSCAN) is the same, but the parameters of these objects differ.
Glove Embeddings (mean), T-SNE, DBSCAN
The problem with Tf-Idf with the PCA projection is that most of the samples are collected in the middle (the middle nodes in the image). It is very hard to separate them using different parameters and algorithms. We have three disaster flares, *suicide bombing* (upper left), *wildfires* (left) and *sandstorm* (bottom), that are separated from the rest.
Tf-Idf, PCA, DBSCAN
The separated node on top represents the *sandstorm* disaster. The cluster of nodes on the right represents *suicide bombing*, and the bottom nodes on the left represent *wildfires*.
Tf-Idf, PCA, KMeans
The problem with cosine distance in this case is the small threshold window between 0.5 and 0.6 (cosine distance takes values from 0 to 1) in which the nodes go from sparse to collected. In the image below we can see the results with threshold 0.55.
Tf-Idf, PCA, DBSCAN with cosine distance
Code can be found in the `02_tweet_disaster` directory. Required packages:
```
pandas==1.2.4
sklearn==0.0
matplotlib==3.4.1
networkx==2.5.1
scipy==1.6.1
seaborn==0.11.1
wordcloud==1.8.1
kmapper==2.0.1
nltk==3.6.2
```
Running the code:
```shell
cd 02_tweet_disaster
python main.py  # or python3 main.py
```
[1] Intro to Applied Topological Data Analysis
[2] When is a coffee mug a donut? Topology explains it
[4] Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition - Gurjeet Singh, Facundo Mémoli and Gunnar Carlsson
[5] The Shape of an Image: A Study of Mapper on Images
[6] tmap: an integrative framework based on topological data analysis for population-scale microbiome stratification and association studies