Skip to content

Implementation of TDA Mapper algorithm and application on Tweet Disaster dataset

Notifications You must be signed in to change notification settings

Robotmurlock/TDA-Mapper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

54 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MATF-NI-Mapper

Short introduction to Topology

Topology is the math concerning continuous objects (not sizes and shapes, but continuity). It is how things are connected and where the gaps are. It explains how a material's shape can be completely deformed into new one without losing its core properties. [1]

Definition (homotopy): Let f, g : X ⟶ Y be maps. f is homotopic to g if there exists a map F : X×I ⟶ Y such that F(x, 0) = f(x) and F(x, 1) = g(x) for all points x ∈ X. The map F is called a homotopy from f to g and we write it as f' F g. More intuitvely, if we think of the second parameter of F as “time”, then F describes a “continuous deformation” of f into g. At time 0 we have the function f, at time 1 we have the function g.

Famous example is the mug and donut homotopy. Informally, two continuous functions from one topological space to another are called homotopic if one can be continuously deformed into another. Such a deformation being called a homotopy between the two functions. [1][2]

iris

Topological data analysis (TDA)

Topological data analysis (TDA) is an approach to the analysis of datasets using techniques from topology. It is usually combined with other forms of analysis such as statistical or geometric approaches. [1]

Famous TDA algorithm for visualization of high dimensional data is Mapper.

Mapper - Introduction

Mapper is TDA algorithm is used for generalized notion of coordinatization for high dimensional datasets. Coordinatization can refer to choice of real valued coordinate functions or data or other notions of geometric representation like reeb graph. [4]

iris

Mapper - Topological background and motivation

The Mapper algorithm is explained in three sections:

  • Math theory required for mapper algorithm;
  • Mapper: Construction of simplical complex (e.g. graph) from cover;
  • Mapper: Multiresolution motivation (level of detail).

Definitions

Definition: A topology τ on a set X consists of subsets of X satisfying the following properties:

  1. The empty set ∅ and the space X are both sets in the topology;
  2. The union of any collection of sets in τ is contained in τ;
  3. The intersection of any finitely many sets in τ is also contained in τ.

All sets in topology are open sets. [3]

Examples (with visualization):

  1. τ = {∅, {1, 2, 3}}
  2. τ = {∅, {1}, {1, 2, 3}}
  3. τ = {∅, {1}, {2}, {1, 2}, {1, 2, 3}}
  4. τ = {∅, {1, 2}, {2, 3}, {2}, {1, 2, 3}}
  5. τ = {∅, {2}, {3}, {1, 2, 3}}, missing: {2, 3} = {2} ∪ {3}.
  6. τ = {∅, {1, 2}, {2, 3}, {1, 2, 3}}, missing: {2} = {1, 2} ∩ {2, 3}.

topology

Definition: A topological space is a pair (X, τ) where X is a set and τ a topology on a set X. [3]

Definition: Cover C of set X is collection of sets whose union includes X. Cover is open cover is all members are open sets. [3]

Example: X is unit circle and C is set of circles containing X.

cover

Definition: Nerve of an open covering C is a construction of simplical complex N(C).

Example: From previous example we can form nerve of open covering. Note: Obtained simplical complex approximates initial space.

nerve

Definition: Partition of unity of topological space X is set of continuous functions R from X to [0, 1] where for each point x ∈ X:

  • there is a neighbourhood of x where all but finite number of functions of R are 0,
  • the sum of all the function values at x is 1.

partition

Definition: Points v in the k-simplex correspond to set of ordered k-tuples of real numbers (numbers are from interval [0, 1] and they sum up to 1). We can intrepet these values as normalized masses. This coordinate system is called barycentric coordinate system.

Example: Barycentric coordinates of 2-simplex (triangle):

barycentric

  • connected components and path connected components.

Definition: A topological space X is said to be disconnected if it is the union of two disjoint nonempty open sets. Otherwise, X is said to be connected. We can define relation x ~ y if there exists connected subset of X that is containing them. It can be shown that this relation is equivalence relation. Equilvalence classes of this relation are called connected components. [3]

connectedness

Definition: The space X is said to be path-connected if for any two points x,y ∈ X there exists a continuous function f from the unit interval [0,1] to X with f(0) = x and f(1) = y. (This function is called a path from x to y.) We can now define relation x ~ y if x is path connected to y. It can be shown that this relation is equivalence relation. Equilvalence classes of this relation are called path-connected components.

path-connectedness

Construction

Assume we have a finite covering U = {Ua | a ∈ A} of space X where A is an indexing set (image construction-1).

construction-1

construction-1: Space X, defined by black border, with finite covering U = {U1, U2, U3, U4}

We can define nerve of the covering U to be a simplical complex N(U) (image constructin-2) where:

  • Vertices of N(U) are named by index set A.
  • Family {a0, ..., ak} forms a k-simplex in N(U) (vertices of simplex) if and only if Ua0 ∩ Ua1 ∩ ... ∩ Uak is a non-empty set. [3]

construction-2

construction-2: The nerve of covering N(U) for U

With defined a partition of unity {Φa: X ⟶ [0, 1] | a ∈ A} (∑α Φα(x)=1), we can obtain map from X to N(U):

  • Let T: X ⟶ A, T(x) = {a | x ∈ Ua} (set of members of covers that contain x).
  • Let ρ: X ⟶ N(U) where ρ(x) is point in simplex spanned by vertices a ∈ T(x) (spanned by k-simplex vertices) whose barycentric coordinates are (Φa1(x), Φa2(x), ..., Φak(x)) where a1, a2, ..., ak are values from T(x). Continuous map ρ provides kind of partial coordination of X using a k-simplex from N(U). [3]

We can form a finite covering V with continuous map f: X ⟶ Z where Z is parameter space. Let the parameter space Z be equipped with a finite open covering C = {Cb | b ∈ B} where B is an indexing set (image construction-3).

construction-3

construction-3: Mapping of covering C to Z

Let g be inverse map of f. Map g is continuous since f is continuous. Hence, the sets Vb := g(Cb) also form a finite open covering of space X (image construction-4). We can now decompose Ub into path connected components (Vb is union of connected components). [3]

construction-4

construction-4: Forming covering using inverse function g

Multiresolution structure

If we have two coverings U = {Ua | a ∈ A} and V = {Vb | b ∈ B} then map of coverings from U to V is function f: A ⟶ B so that for all a ∈ A, we have Ua ⊆ Vf(a). Hence, we have induced mapping of simplical complexes N(f): N(U) ⟶ N(V). [3]

Consequently, if we have a family of coverings Ui, i = 0,1,...,n, and maps of coverings fi : Ui → Ui+1 for each i, we obtain a diagram of simplicial complexes and simplicial maps: [3]

multiresolution

This means that when resolution of cover increases (members of cover are decomposed into more "smaller" members) the resulting "more detailed" simplical complex (vertices are consequently decomposed). In case of graphs, they are more refined in sense that there are more nodes inserted along the edges. Example: [5]

multiresolution-example

Mapper - Implementation

Statistical version of mapper is used for the implementation. Idea is to use some clustering algorithm to partition space into set of connected components. Assume we have N data points x ∈ X, filter function (lens) f: X ⟶ R and inter-point distance matrix (explicitly or implicitly given some metric). Example: image mapper-example: point cloud. Algorithm:

  1. Form a cover for range I of function f:
    • I = f(X);
    • We can form the cover by splitting I into a set of intervals S of equal length which overlap.
    • Note: We have two parameters here: Size of the set S (n) (or length of interval l) and percentage overlap between successive intervals (p).
    • Example:
      • X ⊆ [0,2]x[0,2]
      • f(x, y) = y ⇒ I ⊆ [0,2]
      • n = 4, p = 10%
      • S = {[0, 0.55], [0.45, 1.05], [0.95, 1.55], [1.45, 2]}
    • Example: image mapper-example: filter
  2. Calcute the preimage for each member of cover I:
    • Xi = f-1[Ii]
    • set V = {Xi | i ∈ {1...n}} forms cover of X
    • Example: image mapper-example: covering
  3. Cluster each member of V:
    • Clustering algorithm is arbitrary, but it should have some desired characteristics which will be noted later.
    • Each cluster forms a vertex vXi, Cj where Xi is clustered member of the cover and Ci, j is cluster obtained from clustering Xi.
    • Example: image mapper-example: clustering
  4. Form a simplical complex (or graph):
    • In case of forming simplical complex: Set of vertices {vi1, j1, vi2, j2, ..., vik, jk} form k-simplex if Ci1, j1 ∩ Ci2, j2 ∩ ... ∩ Cik, jk ≠ ∅
    • In case of forming a graph (special case: Vertices va, b and vc, d are connected if Ca, b ∩ Cc, d ≠ ∅
    • Example: image mapper-example: TDA network

mapper-overview

mapper-example: Applying mapper [[6]](#6)

Clustering desired characteristics

Besides choosing a good filter function, finding a good clustering is another important challenge for decent data visualization. The Mapper algorithm does not place any conditions on the clustering but there are some desired characteristics: [4]

  • Clustering should not be restricted to the euclidean space. We can give inter-point distance matrix as an input to mapper algorithm.
  • Number of clusters should not be specified beforehand. We can use algorithm that do not require number of clusters specified before applying clustering algorithm. Examples: DBSCAN, Agglomerative (hierarchical) clustering with distance threshold, etc...

Toy Example - Iris Dataset - 01_iris

Iris dataset is a well known dataset consisting of three classes of flowers (setosa, versicolor, virginica) with 50 instances for each class. This dataset is associated with classification (supervised) task in machine learning. [7]

  • Features: SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm
  • Target: Species

iris-flowers

Custom mapper algorithm prototype is used to test class separability by using visualization. Size of the nodes is correlated to number of instances that node contains. Parameters:

  • Filter function: Projection on first PCA component
  • Clustering: DBSCAN
  • Number of intervals: 10
  • overlap percentage: 50%

In the image on left we can see nodes colored by mode of each cluster (node). We can see three "bigger" clusters:

  • setosa cluster which is isolated from versicolor and virginica
  • virginica and versicolor clusters which are not isolated from each other.

On the image on the right we can see entropies of the nodes and expect classes virginica and versicolor to mix when applying trained a classification model while setosa should be easily learned to be separated by the model.

We could also make another plot after training the model to see model certainty i.e. where model is not sure which class does the instance belong to.

iris

Code

Code can be found in the 01_iris directory. Required packages:

pandas==1.2.4
sklearn==0.0
matplotlib==3.4.1
networkx==2.5.1
scipy==1.6.1

Running code:

cd 01_iris

python main.py (or python3 main.py)

Disaster Tweets Dataset - 02_tweet_disaster

Link to Kaggle competition: Natural Language Processing with Disaster Tweets

Each sample in the dataset has the following information:

  • The text of a tweet
  • A keyword from that tweet (although this may be blank!)
  • The location the tweet was sent from (may also be blank)

Goal: Predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0. Note: This is the goal of Kaggle competition but not our goal here.

Data: Columns

  • id - a unique identifier for each tweet
  • text - the text of the tweet
  • location - the location the tweet was sent from (may be blank)
  • keyword - a particular keyword from the tweet (may be blank)
  • target - denotes whether a tweet is about a real disaster (1) or not (0)

Exploratory Data Analysis

Exploratory data analysis od the dataset can be found on this link.

Results - Glove, T-SNE, DBSCAN

Interactive visualization is generated using KeplerMapper API. Results can be found on this link.

Result of The Mapper algorithm using Glove word embeddings. Configuration:

  • Filter function: Projection on two T-SNE components
  • Clustering: DBSCAN
  • Number of intervals (cubes): 10x10
  • overlap percentage: 50%

kmapper-result-maximum

Glove Embeddings (max), T-SNE, DBSCAN

When taking average instead of maximum of the Glove word embeddings we get more uniform representation (from disaster to non-disasters) instead of the flares for (mostly) the disasters. Base of configurations (T-SNE, DBSCAN) but parameters for these objects are differ.

kmapper-result-maximum

Glove Embeddings (mean), T-SNE, DBSCAN

Results - Tf-Idf, PCA, DBSCAN

Problem with Tf-Idf with PCA projection is that most of the samples are collected in the middle (the middle nodes in the image). It is very hard to separate them using different paramters and algorithms.

We have three disaster flares: suicide bombing (upper left), wildfires (left) and sandstorm (bottom), that are separated from the rest.

kmapper-result-maximum

Tf-Idf, PCA, DBSCAN

Results - Tf-Idf, PCA, KMeans

Separated node on top represents sandstorm disaster. Cluster of nodes on the right represent suicide bombing and bottom nodes on the left represent wildifires.

kmapper-result-maximum

Tf-Idf, PCA, KMeans

Results - Tf-Idf, PCA, DBSCAN with cosine distance

Problem with cosine distance in this is case is small threshold between 0.5 and 0.6 (cosine distance has values from 0 to 1) where are nodes are sparsed or collected. On the image below we can see results with threshold 0.55.

kmapper-result-maximum

Tf-Idf, PCA, DBSCAN with cosine distance

Code

Code can be found in the 02_tweet_disaster directory. Required packages:

pandas==1.2.4
sklearn==0.0
matplotlib==3.4.1
networkx==2.5.1
scipy==1.6.1
seaborn==0.11.1
wordcloud==1.8.1
kmapper==2.0.1
nltk==3.6.2

Running code:

cd 02_tweet_disaster

python main.py (or python3 main.py)

Literature

[1] Intro to Applied Topological Data Analysis

[2] When is a coffee mug a donut? Topology explains it

[3] Introduction to Topology

[4] Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition - Gurjeet Singh, Facundo Mémoli and Gunnar Carlsson

[5] The Shape of an Image: A Study of Mapper on Images

[6] tmap: an integrative framework based on topological data analysis for population-scale microbiome stratification and association studies

[7] Iris dataset

About

Implementation of TDA Mapper algorithm and application on Tweet Disaster dataset

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published