
Taxonomic Identity PREdiction: homework for Machine Learning Course 2021 (MSc Bioinformatics for Computational Genomics)


mariachiaragrieco/TIPre

TIPRE: Taxonomic Identity PREdiction

ML-taxonomic-identity-prediction

Homework for the Machine Learning Course 2021 (MSc Bioinformatics for Computational Genomics), held by Prof. Matteo Matteucci and Marco Cannici at Politecnico di Milano.

The notebook can be viewed on nbviewer.

Aim

The aim of this project is to investigate whether codon usage frequencies from different organisms can be used to assign each of them to one of 11 Kingdoms: archaea, bacteria, bacteriophage, plasmid, plant, invertebrate, vertebrate, mammal, rodent, primate and virus. The analysis is carried out using the clustering, classification and regression techniques learned during the course.
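
To make the term "codon usage frequencies" concrete, here is a minimal sketch of how such a 64-dimensional feature vector could be computed from a single coding sequence. It is an illustration only, not the code used in the notebook: the function name `codon_usage_frequencies` and the example sequence are hypothetical, and the sketch assumes the sequence is already in frame.

```python
from collections import Counter
from itertools import product

# All 64 possible codons over the DNA alphabet, in a fixed order.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_usage_frequencies(cds):
    """Return the relative frequency of each of the 64 codons in `cds`.

    Assumption: `cds` is in frame (its first base is the first base of a codon).
    Trailing bases that do not form a full codon are ignored, as are
    triplets containing characters other than A, C, G, T (or U).
    """
    seq = cds.upper().replace("U", "T")  # accept RNA input as well
    triplets = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    counts = Counter(t for t in triplets if t in CODON_SET)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in CODONS}
    return {codon: counts[codon] / total for codon in CODONS}


# Hypothetical toy sequence with four codons: ATG, GCC, GCC, TAA.
freqs = codon_usage_frequencies("ATGGCCGCCTAA")
print(freqs["GCC"])  # 0.5
print(freqs["ATG"])  # 0.25
```

A vector of this kind, one per entry, is the sort of input a clustering or classification model would then work with.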

Background

"The coding DNA of a genome describes the proteins of the organism in terms of 64 different codons that map to 21 different amino acids and a stop signal. Different organisms differ not only in the amino acid sequences of their proteins, but also in the extents in which they use the synonymous codons for different amino acids. The inherent redundancy of the genetic code allows the same amino acid to be specified by one to five different codons so that there are, in principle, many different nucleic acids to describe the primary structure of a given protein. Coding DNA sequences thus can carry information beyond that needed for encoding amino acid sequence. Thus, one may ask: is it possible to classify some properties of nucleic acids from the usages of different synonymous codons rather than, with much greater computational effort, from individual nucleotide sequences themselves?" — Khomtchouk, Bohdan B. "Codon usage bias levels predict taxonomic identity and genetic composition." bioRxiv (2020).

This data set enables a preliminary analysis on this topic.

Reference

Khomtchouk, Bohdan B. "Codon usage bias levels predict taxonomic identity and genetic composition." bioRxiv (2020).
