MATF-NI-Mapper

Short introduction to Topology

Topology is the math concerning continuous objects (not sizes and shapes, but continuity). It is how things are connected and where the gaps are. It explains how a material's shape can be completely deformed into new one without losing its core properties. [1]

Definition (homotopy): Let f, g : X ⟶ Y be maps. f is homotopic to g if there exists a map F : X×I ⟶ Y such that F(x, 0) = f(x) and F(x, 1) = g(x) for all points x ∈ X. The map F is called a homotopy from f to g and we write it as f' F g. More intuitvely, if we think of the second parameter of F as “time”, then F describes a “continuous deformation” of f into g. At time 0 we have the function f, at time 1 we have the function g.

Famous example is the mug and donut homotopy. Informally, two continuous functions from one topological space to another are called homotopic if one can be continuously deformed into another. Such a deformation being called a homotopy between the two functions. [1][2]

Topological data analysis (TDA)

Topological data analysis (TDA) is an approach to the analysis of datasets using techniques from topology. It is usually combined with other forms of analysis such as statistical or geometric approaches. [1]

Famous TDA algorithm for visualization of high dimensional data is Mapper.

Mapper - Introduction

Mapper is TDA algorithm is used for generalized notion of coordinatization for high dimensional datasets. Coordinatization can refer to choice of real valued coordinate functions or data or other notions of geometric representation like reeb graph. [4]

Mapper - Topological background and motivation

The Mapper algorithm is explained in three sections:

Math theory required for mapper algorithm;
Mapper: Construction of simplical complex (e.g. graph) from cover;
Mapper: Multiresolution motivation (level of detail).

Definitions

Definition: A topology τ on a set X consists of subsets of X satisfying the following properties:

The empty set ∅ and the space X are both sets in the topology;
The union of any collection of sets in τ is contained in τ;
The intersection of any finitely many sets in τ is also contained in τ.

All sets in topology are open sets. [3]

Examples (with visualization):

τ = {∅, {1, 2, 3}}
τ = {∅, {1}, {1, 2, 3}}
τ = {∅, {1}, {2}, {1, 2}, {1, 2, 3}}
τ = {∅, {1, 2}, {2, 3}, {2}, {1, 2, 3}}
τ = {∅, {2}, {3}, {1, 2, 3}}, missing: {2, 3} = {2} ∪ {3}.
τ = {∅, {1, 2}, {2, 3}, {1, 2, 3}}, missing: {2} = {1, 2} ∩ {2, 3}.

Definition: A topological space is a pair (X, τ) where X is a set and τ a topology on a set X. [3]

Definition: Cover C of set X is collection of sets whose union includes X. Cover is open cover is all members are open sets. [3]

Example: X is unit circle and C is set of circles containing X.

Definition: Nerve of an open covering C is a construction of simplical complex N(C).

Example: From previous example we can form nerve of open covering. Note: Obtained simplical complex approximates initial space.

Definition: Partition of unity of topological space X is set of continuous functions R from X to [0, 1] where for each point x ∈ X:

there is a neighbourhood of x where all but finite number of functions of R are 0,
the sum of all the function values at x is 1.

Definition: Points v in the k-simplex correspond to set of ordered k-tuples of real numbers (numbers are from interval [0, 1] and they sum up to 1). We can intrepet these values as normalized masses. This coordinate system is called barycentric coordinate system.

Example: Barycentric coordinates of 2-simplex (triangle):

connected components and path connected components.

Definition: A topological space X is said to be disconnected if it is the union of two disjoint nonempty open sets. Otherwise, X is said to be connected. We can define relation x ~ y if there exists connected subset of X that is containing them. It can be shown that this relation is equivalence relation. Equilvalence classes of this relation are called connected components. [3]

Definition: The space X is said to be path-connected if for any two points x,y ∈ X there exists a continuous function f from the unit interval [0,1] to X with f(0) = x and f(1) = y. (This function is called a path from x to y.) We can now define relation x ~ y if x is path connected to y. It can be shown that this relation is equivalence relation. Equilvalence classes of this relation are called path-connected components.

Construction

Assume we have a finite covering U = {U_a | a ∈ A} of space X where A is an indexing set (image construction-1).

construction-1: Space X, defined by black border, with finite covering U = {U₁, U₂, U₃, U₄}

We can define nerve of the covering U to be a simplical complex N(U) (image constructin-2) where:

Vertices of N(U) are named by index set A.
Family {a₀, ..., a_k} forms a k-simplex in N(U) (vertices of simplex) if and only if U_a₀ ∩ U_a₁ ∩ ... ∩ U_{a_k} is a non-empty set. [3]

construction-2: The nerve of covering N(U) for U

With defined a partition of unity {Φa: X ⟶ [0, 1] | a ∈ A} (∑_α Φ_α(x)=1), we can obtain map from X to N(U):

Let T: X ⟶ A, T(x) = {a | x ∈ U_a} (set of members of covers that contain x).
Let ρ: X ⟶ N(U) where ρ(x) is point in simplex spanned by vertices a ∈ T(x) (spanned by k-simplex vertices) whose barycentric coordinates are (Φ_a₁(x), Φ_a₂(x), ..., Φ_{a_k}(x)) where a₁, a₂, ..., a_k are values from T(x). Continuous map ρ provides kind of partial coordination of X using a k-simplex from N(U). [3]

We can form a finite covering V with continuous map f: X ⟶ Z where Z is parameter space. Let the parameter space Z be equipped with a finite open covering C = {C_b | b ∈ B} where B is an indexing set (image construction-3).

construction-3: Mapping of covering C to Z

Let g be inverse map of f. Map g is continuous since f is continuous. Hence, the sets V_b := g(C_b) also form a finite open covering of space X (image construction-4). We can now decompose Ub into path connected components (Vb is union of connected components). [3]

construction-4: Forming covering using inverse function g

Multiresolution structure

If we have two coverings U = {U_a | a ∈ A} and V = {V_b | b ∈ B} then map of coverings from U to V is function f: A ⟶ B so that for all a ∈ A, we have U_a ⊆ V_f(a). Hence, we have induced mapping of simplical complexes N(f): N(U) ⟶ N(V). [3]

Consequently, if we have a family of coverings U_i, i = 0,1,...,n, and maps of coverings f_i : U_i → U_i+1 for each i, we obtain a diagram of simplicial complexes and simplicial maps: [3]

This means that when resolution of cover increases (members of cover are decomposed into more "smaller" members) the resulting "more detailed" simplical complex (vertices are consequently decomposed). In case of graphs, they are more refined in sense that there are more nodes inserted along the edges. Example: [5]

Mapper - Implementation

Statistical version of mapper is used for the implementation. Idea is to use some clustering algorithm to partition space into set of connected components. Assume we have N data points x ∈ X, filter function (lens) f: X ⟶ R and inter-point distance matrix (explicitly or implicitly given some metric). Example: image mapper-example: point cloud. Algorithm:

Form a cover for range I of function f:
- I = f(X);
- We can form the cover by splitting I into a set of intervals S of equal length which overlap.
- Note: We have two parameters here: Size of the set S (n) (or length of interval l) and percentage overlap between successive intervals (p).
- Example:
  - X ⊆ [0,2]x[0,2]
  - f(x, y) = y ⇒ I ⊆ [0,2]
  - n = 4, p = 10%
  - S = {[0, 0.55], [0.45, 1.05], [0.95, 1.55], [1.45, 2]}
- Example: image mapper-example: filter
Calcute the preimage for each member of cover I:
- X_i = f^-1[I_i]
- set V = {X_i | i ∈ {1...n}} forms cover of X
- Example: image mapper-example: covering
Cluster each member of V:
- Clustering algorithm is arbitrary, but it should have some desired characteristics which will be noted later.
- Each cluster forms a vertex v_{X_i, C_j} where X_i is clustered member of the cover and C_{i, j} is cluster obtained from clustering X_i.
- Example: image mapper-example: clustering
Form a simplical complex (or graph):
- In case of forming simplical complex: Set of vertices {v_{i₁, j₁}, v_{i₂, j₂}, ..., v_{i_k, j_k}} form k-simplex if C_{i₁, j₁} ∩ C_{i₂, j₂} ∩ ... ∩ C_{i_k, j_k} ≠ ∅
- In case of forming a graph (special case: Vertices v_{a, b} and v_{c, d} are connected if C_{a, b} ∩ C_{c, d} ≠ ∅
- Example: image mapper-example: TDA network

mapper-example: Applying mapper [[6]](#6)

Clustering desired characteristics

Besides choosing a good filter function, finding a good clustering is another important challenge for decent data visualization. The Mapper algorithm does not place any conditions on the clustering but there are some desired characteristics: [4]

Clustering should not be restricted to the euclidean space. We can give inter-point distance matrix as an input to mapper algorithm.
Number of clusters should not be specified beforehand. We can use algorithm that do not require number of clusters specified before applying clustering algorithm. Examples: DBSCAN, Agglomerative (hierarchical) clustering with distance threshold, etc...

Toy Example - Iris Dataset - 01_iris

Iris dataset is a well known dataset consisting of three classes of flowers (setosa, versicolor, virginica) with 50 instances for each class. This dataset is associated with classification (supervised) task in machine learning. [7]

Features: SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm
Target: Species

Custom mapper algorithm prototype is used to test class separability by using visualization. Size of the nodes is correlated to number of instances that node contains. Parameters:

Filter function: Projection on first PCA component
Clustering: DBSCAN
Number of intervals: 10
overlap percentage: 50%

In the image on left we can see nodes colored by mode of each cluster (node). We can see three "bigger" clusters:

setosa cluster which is isolated from versicolor and virginica
virginica and versicolor clusters which are not isolated from each other.

On the image on the right we can see entropies of the nodes and expect classes virginica and versicolor to mix when applying trained a classification model while setosa should be easily learned to be separated by the model.

We could also make another plot after training the model to see model certainty i.e. where model is not sure which class does the instance belong to.

Code

Code can be found in the 01_iris directory. Required packages:

pandas==1.2.4
sklearn==0.0
matplotlib==3.4.1
networkx==2.5.1
scipy==1.6.1

Running code:

cd 01_iris

python main.py (or python3 main.py)

Disaster Tweets Dataset - 02_tweet_disaster

Link to Kaggle competition: Natural Language Processing with Disaster Tweets

Each sample in the dataset has the following information:

The text of a tweet
A keyword from that tweet (although this may be blank!)
The location the tweet was sent from (may also be blank)

Goal: Predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0. Note: This is the goal of Kaggle competition but not our goal here.

Data: Columns

id - a unique identifier for each tweet
text - the text of the tweet
location - the location the tweet was sent from (may be blank)
keyword - a particular keyword from the tweet (may be blank)
target - denotes whether a tweet is about a real disaster (1) or not (0)

Exploratory Data Analysis

Exploratory data analysis od the dataset can be found on this link.

Results - Glove, T-SNE, DBSCAN

Interactive visualization is generated using KeplerMapper API. Results can be found on this link.

Result of The Mapper algorithm using Glove word embeddings. Configuration:

Filter function: Projection on two T-SNE components
Clustering: DBSCAN
Number of intervals (cubes): 10x10
overlap percentage: 50%

Glove Embeddings (max), T-SNE, DBSCAN

When taking average instead of maximum of the Glove word embeddings we get more uniform representation (from disaster to non-disasters) instead of the flares for (mostly) the disasters. Base of configurations (T-SNE, DBSCAN) but parameters for these objects are differ.

Glove Embeddings (mean), T-SNE, DBSCAN

Results - Tf-Idf, PCA, DBSCAN

Problem with Tf-Idf with PCA projection is that most of the samples are collected in the middle (the middle nodes in the image). It is very hard to separate them using different paramters and algorithms.

We have three disaster flares: suicide bombing (upper left), wildfires (left) and sandstorm (bottom), that are separated from the rest.

Tf-Idf, PCA, DBSCAN

Results - Tf-Idf, PCA, KMeans

Separated node on top represents sandstorm disaster. Cluster of nodes on the right represent suicide bombing and bottom nodes on the left represent wildifires.

Tf-Idf, PCA, KMeans

Results - Tf-Idf, PCA, DBSCAN with cosine distance

Problem with cosine distance in this is case is small threshold between 0.5 and 0.6 (cosine distance has values from 0 to 1) where are nodes are sparsed or collected. On the image below we can see results with threshold 0.55.

Tf-Idf, PCA, DBSCAN with cosine distance

Code

Code can be found in the 02_tweet_disaster directory. Required packages:

pandas==1.2.4
sklearn==0.0
matplotlib==3.4.1
networkx==2.5.1
scipy==1.6.1
seaborn==0.11.1
wordcloud==1.8.1
kmapper==2.0.1
nltk==3.6.2

Running code:

cd 02_tweet_disaster

python main.py (or python3 main.py)

Literature

[1] Intro to Applied Topological Data Analysis

[2] When is a coffee mug a donut? Topology explains it

[3] Introduction to Topology

[4] Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition - Gurjeet Singh, Facundo Mémoli and Gunnar Carlsson

[5] The Shape of an Image: A Study of Mapper on Images

[6] tmap: an integrative framework based on topological data analysis for population-scale microbiome stratification and association studies

[7] Iris dataset

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
01_iris		01_iris
02_tweet_disaster		02_tweet_disaster
data		data
docs		docs
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MATF-NI-Mapper

Short introduction to Topology

Topological data analysis (TDA)

Mapper - Introduction

Mapper - Topological background and motivation

Definitions

Construction

Multiresolution structure

Mapper - Implementation

Clustering desired characteristics

Toy Example - Iris Dataset - 01_iris

Code

Disaster Tweets Dataset - 02_tweet_disaster

Exploratory Data Analysis

Results - Glove, T-SNE, DBSCAN

Results - Tf-Idf, PCA, DBSCAN

Results - Tf-Idf, PCA, KMeans

Results - Tf-Idf, PCA, DBSCAN with cosine distance

Code

Literature

About

Releases

Packages

Languages

Robotmurlock/TDA-Mapper

Folders and files

Latest commit

History

Repository files navigation

MATF-NI-Mapper

Short introduction to Topology

Topological data analysis (TDA)

Mapper - Introduction

Mapper - Topological background and motivation

Definitions

Construction

Multiresolution structure

Mapper - Implementation

Clustering desired characteristics

Toy Example - Iris Dataset - 01_iris

Code

Disaster Tweets Dataset - 02_tweet_disaster

Exploratory Data Analysis

Results - Glove, T-SNE, DBSCAN

Results - Tf-Idf, PCA, DBSCAN

Results - Tf-Idf, PCA, KMeans

Results - Tf-Idf, PCA, DBSCAN with cosine distance

Code

Literature

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages