Skip to content

Latest commit

 

History

History
226 lines (149 loc) · 8.76 KB

graph.i.md

File metadata and controls

226 lines (149 loc) · 8.76 KB

Graph Properties

Graph theory is the most common representation in cheminformatics, and with quantum mechanics, rule the informatics side of chemistry. The molecular graph follow graph theory and defines atoms as molecules and bonds as edge between to atoms. This is by far the only option, and the IBond allows for more complex representations, but we will focus on the molecular graph in this chapter.

Partitioning

If one is going to calculate graph properties, the first thing one often has to do, is to split ensure that one is looking at a fully connected graph. Since this is often in combination with ensuring fully connected graphs, the ConnectivityChecker is a welcome tool. It allows partitioning of the atoms and bonds in an IAtomContainer into molecules, organized into IAtomContainerSet:

ConnectivityCheckerDemo

Which gives:

ConnectivityCheckerDemo

Spanning Tree

The spanning tree of a graph, is subgraph with no cycles; that spans all atoms into a, still, fully connected graph:

SpanningTreeBondCount

which returns:

SpanningTreeBondCount

As a side effect, it also determines which bonds are ring bonds, and which are not:

SpanningTreeRingBonds

giving

SpanningTreeRingBonds

Ring counts

Warning: this section needs updating as there are better approaches in CDK 2.x.

Without providing a full review of all ring detection algorithms, this section discusses three algorithms for ring detection. The first two are alternative implementations of the Smallest Set of Smallest Rings (SSSR) concept. While argued to have limited usefulness if not harmful [OpenEyeSSSR], the CDK implements two algorithms: the Figueras algorithm [Q28837947], and an improved algorithm addressing limitations of the first [Q61780124]. The key limitation is that the resulting set is not unique. Depending on the initial conditions, alternative sets may result for the same structure, particularly for structures with bridgehead atoms. However, on the bright side, it does give some indication on the number of rings in a structure.

The third algorithm discussed is one to find all rings, even if they are merely combinations of smaller rings. For example, naphthalene contains has three rings.

Smallest Rings

The smallest set of smallest rings refers to a set of unique rings that, together, capture all ring atoms. Ring atoms may participate in multiple rings in this set, but each ring will have at least one ring atom not in other rings. But for each compound, the set itself may not be unique. For example, adamantane has multiple SSSRs. The trick of the SSSR algorithm is to find a smallest set of rings that cover all ring atoms.

The CDK implements two algorithms, both are found in the cdk.ringsearch package. The first is the algorithm developed by Figureas [Q28837947] for which the FiguerasSSSRFinder class can be used:

FiguerasSSSR

Which tells us the number of smallest rings for azulene:

FiguerasSSSR

However, because of this algorithm's limitations in finding a correct SSSR set for some corner case structure, the following alternative method by Berger et al. [Q61780124] is recommended:

SSSR

Which calculates the same number of rings for this compound:

SSSR

All Rings

If you are interesting in all possible rings, you can use the AllRingsFinder class, which implements an algorithm by Hanser et al. [Q28837943]:

FindAllRings

Which returns all three rings present in azulene:

FindAllRings

One should keep in mind that when more smallest rings are fused together, the number of all rings grows quickly. The AllRingsFinder class has three options to keep the calculation manageable for larger structures.

The first option is to search only for rings up to a certain size. For example, only rings of at most 5 atoms:

FindUptoFiveRings

For which there is only one in azulene:

FindUptoFiveRings

A second option is more like a trick to reduce the search space, but first isolating the ring systems. There are findAllRingsInIsolatedRingSystem alternatives for the above two approaches.

The third option to limit the execution time is to set a time out. This ensures that a computation will return after a certain time rather than continuing indefinately:

FindAllRingsInLimitedTime

Graph matrices

Chemical graphs have been very successfully used as representations of molecular structures, but are not always to most suitable representation. For example, for computation of graph properties often a matrix representation is used as intermediate step. The CDK has predefined helper classes to calculate two kind of graph matrices: the adjacency matrix and the distance matrix. Both are found in the cdk.graph.matrix package.

Adjacency matrix

The adjacency matrix describes which atoms are connected via a covalent bond. All matrix elements that link to bonded atoms are 1, and those matrix elements for disconnected atoms are 0. In mathematical terms, the adjacency matrix A is defined as:

Copy back in the equation. MathML?

The algorithm to calculate this matrix is implemented in the AdjacencyMatrix class. The matrix is calculated with the static getMatrix(IAtomContainer) method:

AdjacencyMatrixCalc

This code outputs the matrix, resulting for ethanoic acid, with the atoms in the order C, C, O, and O, in:

AdjacencyMatrixCalc

Distance matrix

The distance matrix describes the number of bonds on has to traverse to get from one atom to another. Therefore, it has zeros on the diagonal and non-zero values at all other locations. Matrix elements for neighboring atoms are 1 and others are larger. The CDK uses Floyd's algorithm to calculate this matrix [Q56170978], which is exposed via the TopologicalMatrix class:

DistanceMatrix

For the ethanoic acid used earlier, the resulting matrix looks like:

DistanceMatrix

Atom Numbers

Another important aspect of the chemical graph, is that the graph uniquely places atoms in the molecule. That is, the graphs allows us to uniquely identify, and therefore, number atoms in the molecule. This is an important aspect of cheminformatics, and the concept behind canonicalization, such as used to create canonical SMILES. The InChI library (see Chapter inchi) implements such an algorithm, and we can use it to assign unique integers to all atoms in a chemical graph.

Morgan Atom Numbers

Morgan published an algorithm in 1965 to assign numbers to vertices in the chemical graph [Q28837925]. The algorithm does not take into account the element symbols associated with those vertices, and it only based on the connectivity. Therefore, we see the same number of symmetry related atoms, even if they have different symbols. If we run:

MorganAtomNumbers

we see this output:

MorganAtomNumbers

InChI Atom Numbers
![](images/generated/RenderAdenineWithNumbers.png) ![](images/generated/RenderOxazole.png) ![](images/generated/InChIAtomNumbersBenzene.png)

The InChI library does not have a direct method to calculate atom numbers from Java, but the CDK can extract these from the auxiliary layer. These numbers are those listed in the bond layer, but to use these in the CDK molecule class, we need to mapping of the InChI atom numbers This method is made available via the InChINumbersTools class:

InChIAtomNumbers

which outputs:

InChIAtomNumbers

It is important to note that because these numbers are used in the connectivity layer, symmetry is broken in assignment of these numbers, allowing all atoms in, for example, benzene to still be uniquely identified:

InChIAtomNumbersBenzene

which outputs:

InChIAtomNumbersBenzene

The InChI atom numbers are shown in Figure inchiAtomNumbers.

References