ML-taxonomic-identity-prediction
Homework for the Machine Learning course 2021 (MSc Bioinformatics for Computational Genomics), held by Prof. Matteo Matteucci and Marco Cannici at Politecnico di Milano.
The notebook can be viewed on nbviewer.
The aim of this project is to investigate whether codon usage frequencies from different organisms can be used to classify those organisms into 11 kingdoms: archaea, bacteria, bacteriophage, plasmid, plant, invertebrate, vertebrate, mammal, rodent, primate and virus. The analysis is carried out using the clustering, classification and regression techniques learned during the course.
"The coding DNA of a genome describes the proteins of the organism in terms of 64 different codons that map to 21 different amino acids and a stop signal. Different organisms differ not only in the amino acid sequences of their proteins, but also in the extents in which they use the synonymous codons for different amino acids. The inherent redundancy of the genetic code allows the same amino acid to be specified by one to five different codons so that there are, in principle, many different nucleic acids to describe the primary structure of a given protein. Coding DNA sequences thus can carry information beyond that needed for encoding amino acid sequence. Thus, one may ask: is it possible to classify some properties of nucleic acids from the usages of different synonymous codons rather than, with much greater computational effort, from individual nucleotide sequences themselves?" — Khomtchouk, Bohdan B. "Codon usage bias levels predict taxonomic identity and genetic composition." bioRxiv (2020).
This data set enables a preliminary analysis on this topic.
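To make the notion of codon usage frequencies concrete, here is a minimal sketch of how such a feature vector could be computed from a single coding sequence. It is an illustration only, not the procedure used to build the data set: the function name `codon_frequencies`, the DNA alphabet (A, C, G, T) and the example sequence are all assumptions made for this sketch.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order (assumption: DNA alphabet, not RNA).
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(cds):
    """Return the relative frequency of each of the 64 codons in a coding sequence.

    Hypothetical helper for illustration; it reads the sequence in frame
    from the first base and ignores codons with ambiguous bases.
    """
    seq = cds.upper().replace("U", "T")  # accept RNA input as well
    usable = len(seq) - len(seq) % 3     # drop a trailing incomplete codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS)
    if total == 0:
        raise ValueError("no complete, unambiguous codons found")
    return {c: counts[c] / total for c in CODONS}


# Made-up toy sequence: ATG GCT GCT GCC TAA
freqs = codon_frequencies("ATGGCTGCTGCCTAA")
```

Here GCT and GCC are synonymous codons for alanine, and the toy sequence uses GCT twice as often as GCC. Differences of this kind, aggregated over a whole genome, are the features a classifier would work with.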
Khomtchouk, Bohdan B. "Codon usage bias levels predict taxonomic identity and genetic composition." bioRxiv (2020)