Report of QT clustering project

Carlota Carbajo Moral edited this page Nov 14, 2020 · 5 revisions

Welcome to the QT-Clustering wiki!

Introduction

Clustering is the process of grouping similar objects into categories. It is considered one of the most useful tasks in the data mining process for discovering new classifications and interesting patterns in the underlying data [1]. Clustering algorithms partition the data objects into subsets, usually called clusters, based on the similarity or dissimilarity among them. Objects grouped in the same cluster are more similar to each other than to objects belonging to a different cluster [1].

Clustering algorithms have proved to be an indispensable tool for the exploratory analysis of high-dimensional datasets. Many clustering methods have been proposed, and in 1998 Eisen et al. proposed for the first time the use of these grouping techniques on biological datasets [2]. Since then, clustering algorithms have become an essential part of biological data analysis, where they help classify huge amounts of data. This is especially true in gene expression analysis, where they are used to arrange genes according to the similarity of their expression patterns. Similarity among the data is assessed with a mathematical measure. One of the most common criteria is the Euclidean distance between the two n-dimensional vectors that represent a series of n measurements [3].
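To make the similarity criterion concrete, the following minimal sketch computes the Euclidean distance between expression profiles. The function name `euclidean_distance` and the three gene profiles are hypothetical and chosen only for illustration; they are not taken from the project code or from any real dataset.

```python
import math


def euclidean_distance(x, y):
    """Return the Euclidean distance between two n-dimensional vectors."""
    if len(x) != len(y):
        raise ValueError("both vectors must contain the same number of measurements")
    # Square root of the sum of squared differences, one term per measurement
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical expression profiles: three genes measured under four conditions
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [1.0, 2.0, 3.0, 6.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

d_ab = euclidean_distance(gene_a, gene_b)
d_ac = euclidean_distance(gene_a, gene_c)

# A smaller distance means more similar expression patterns
print(f"d(a, b) = {d_ab:.3f}")
print(f"d(a, c) = {d_ac:.3f}")
```

In this toy example gene_a is closer to gene_b than to gene_c, so a distance-based clustering method would be more likely to place gene_a and gene_b in the same cluster.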

The quality cluster algorithm (QT) is a clustering method designed by Heyer, Kruglyak & Yooseph in 1999 specifically for grouping gene expression patterns. The aim of their work was to generate clusters that could summarize genome-wide expression in the yeast Saccharomyces cerevisiae and to create biologically meaningful groups of genes by using an unsupervised method [4]. Grouping genes with similar expression levels is very useful in biological research, as it can reveal the function of previously uncharacterized genes and give information about regulatory systems. According to several studies, functionally related genes tend to be grouped together during clustering, and genes that are coexpressed across a variety of conditions may be regulated by a common regulatory system [4].

[1] https://pubmed.ncbi.nlm.nih.gov/19957146/

[2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5708128/

[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC24541/

[4] https://pubmed.ncbi.nlm.nih.gov/10568750/

Contribution

Theory

Algorithm design

Program design

Program manual

Runtime analysis

Conclusion