-
Notifications
You must be signed in to change notification settings - Fork 25
de Bruijn Graph
The core data structure of McCortex is a de Bruijn Graph, which is a graph where nodes are kmers (words of length k
) and edges exist between all kmers which overlap by k-1
characters. Kmer size k
is fixed for any given graph (more on kmer size). The kmers in genome assembly use bases
(i.e. the DNA letters A,C,G,T).
An example with k=5
:
ACCAT -> CCATC
which has two nodes (ACCAT
,CCATC
) and one edge between them. Traversing this graph from left to right we assemble the sequence ACCATC
. From right to left we assemble the reverse complement GATGGT
.
Each node in the McCortex de Bruijn graph represents both the kmer and is reverse complement. We refer to this representation as the kmer-key (lexically lowest of the two). Nodes can then be traversed with an orientation FORWARD (the kmer-key) or REVERSE (the reverse complement of the kmer-key).
By only allowing odd values of parameter k
, a kmer can never be its own reverse complement. This is why --kmer-size parameter must always be odd.
There are various ways to represent a de Bruijn graph in memory, the most obvious being a hash table where the key is the kmer-key. More compact representations have been developed and present different pay offs in speed and required memory.
- wikipedia
- Nature review paper on using de Bruijn graphs in genome assembly