Graph theory is the most common representation in cheminformatics, and with
quantum mechanics, rule the informatics side of chemistry. The molecular graph
follow graph theory and defines atoms as molecules and bonds as edge between to
atoms. This is by far the only option, and the IBond
allows for more complex
representations, but we will focus on the molecular graph in this chapter.
If one is going to calculate graph properties, the first thing one often has to do, is to split ensure that one is looking at a fully connected graph. Since this is often in combination with ensuring fully connected graphs, the ConnectivityChecker is a welcome tool. It allows partitioning of the atoms and bonds in an IAtomContainer into molecules, organized into IAtomContainerSet:
ConnectivityCheckerDemo
Which gives:
ConnectivityCheckerDemo
The spanning tree of a graph, is subgraph with no cycles; that spans all atoms into a, still, fully connected graph:
SpanningTreeBondCount
which returns:
SpanningTreeBondCount
As a side effect, it also determines which bonds are ring bonds, and which are not:
SpanningTreeRingBonds
giving
SpanningTreeRingBonds
Warning: this section needs updating as there are better approaches in CDK 2.x.
Without providing a full review of all ring detection algorithms, this section discusses three algorithms for ring detection. The first two are alternative implementations of the Smallest Set of Smallest Rings (SSSR) concept. While argued to have limited usefulness if not harmful [OpenEyeSSSR], the CDK implements two algorithms: the Figueras algorithm [Q28837947], and an improved algorithm addressing limitations of the first [Q61780124]. The key limitation is that the resulting set is not unique. Depending on the initial conditions, alternative sets may result for the same structure, particularly for structures with bridgehead atoms. However, on the bright side, it does give some indication on the number of rings in a structure.
The third algorithm discussed is one to find all rings, even if they are merely combinations of smaller rings. For example, naphthalene contains has three rings.
The smallest set of smallest rings refers to a set of unique rings that, together, capture all ring atoms. Ring atoms may participate in multiple rings in this set, but each ring will have at least one ring atom not in other rings. But for each compound, the set itself may not be unique. For example, adamantane has multiple SSSRs. The trick of the SSSR algorithm is to find a smallest set of rings that cover all ring atoms.
The CDK implements two algorithms, both are found in the cdk.ringsearch
package. The
first is the algorithm developed by Figureas [Q28837947] for which the
FiguerasSSSRFinder class can be used:
FiguerasSSSR
Which tells us the number of smallest rings for azulene:
FiguerasSSSR
However, because of this algorithm's limitations in finding a correct SSSR set for some corner case structure, the following alternative method by Berger et al. [Q61780124] is recommended:
SSSR
Which calculates the same number of rings for this compound:
SSSR
If you are interesting in all possible rings, you can use the AllRingsFinder class, which implements an algorithm by Hanser et al. [Q28837943]:
FindAllRings
Which returns all three rings present in azulene:
FindAllRings
One should keep in mind that when more smallest rings are fused together, the number of all rings grows quickly. The AllRingsFinder class has three options to keep the calculation manageable for larger structures.
The first option is to search only for rings up to a certain size. For example, only rings of at most 5 atoms:
FindUptoFiveRings
For which there is only one in azulene:
FindUptoFiveRings
A second option is more like a trick to reduce the search space, but first isolating
the ring systems. There are findAllRingsInIsolatedRingSystem
alternatives for the
above two approaches.
The third option to limit the execution time is to set a time out. This ensures that a computation will return after a certain time rather than continuing indefinately:
FindAllRingsInLimitedTime
Chemical graphs have been very successfully used as representations of molecular
structures, but are not always to most suitable representation. For example,
for computation of graph properties often a matrix representation is used
as intermediate step. The CDK has predefined helper classes to calculate two
kind of graph matrices: the adjacency matrix and the distance matrix. Both
are found in the cdk.graph.matrix
package.
The adjacency matrix describes which atoms are connected via a covalent
bond. All matrix elements that link to bonded atoms are 1, and those matrix
elements for disconnected atoms are 0. In mathematical terms, the adjacency matrix A
is defined as:
Copy back in the equation. MathML?
The algorithm to calculate this matrix is implemented in the
AdjacencyMatrix class. The matrix is calculated with
the static getMatrix(IAtomContainer)
method:
AdjacencyMatrixCalc
This code outputs the matrix, resulting for ethanoic acid, with the atoms in the order C, C, O, and O, in:
AdjacencyMatrixCalc
The distance matrix describes the number of bonds on has to traverse to get from one atom to another. Therefore, it has zeros on the diagonal and non-zero values at all other locations. Matrix elements for neighboring atoms are 1 and others are larger. The CDK uses Floyd's algorithm to calculate this matrix [Q56170978], which is exposed via the TopologicalMatrix class:
DistanceMatrix
For the ethanoic acid used earlier, the resulting matrix looks like:
DistanceMatrix
Another important aspect of the chemical graph, is that the graph uniquely places atoms in the molecule. That is, the graphs allows us to uniquely identify, and therefore, number atoms in the molecule. This is an important aspect of cheminformatics, and the concept behind canonicalization, such as used to create canonical SMILES. The InChI library (see Chapter inchi) implements such an algorithm, and we can use it to assign unique integers to all atoms in a chemical graph.
Morgan published an algorithm in 1965 to assign numbers to vertices in the chemical graph [Q28837925]. The algorithm does not take into account the element symbols associated with those vertices, and it only based on the connectivity. Therefore, we see the same number of symmetry related atoms, even if they have different symbols. If we run:
MorganAtomNumbers
we see this output:
MorganAtomNumbers
The InChI library does not have a direct method to calculate atom numbers from Java, but the CDK can extract these from the auxiliary layer. These numbers are those listed in the bond layer, but to use these in the CDK molecule class, we need to mapping of the InChI atom numbers This method is made available via the InChINumbersTools class:
InChIAtomNumbers
which outputs:
InChIAtomNumbers
It is important to note that because these numbers are used in the connectivity layer, symmetry is broken in assignment of these numbers, allowing all atoms in, for example, benzene to still be uniquely identified:
InChIAtomNumbersBenzene
which outputs:
InChIAtomNumbersBenzene
The InChI atom numbers are shown in Figure inchiAtomNumbers.