Missing information is commonplace in chemical file formats and line notations. In many cases this information is implicit in the representation, but recovering it is not always easy and requires assumptions which may not be true. Examples of missing information are the lack of bonds in XYZ files and the removal of double bond locations for aromatic ring systems.
File formats often leave out information that some algorithms need, because that information is implied by the format. Element and isotope information is a key example. Typically, the element symbol is provided in the file, but not the mass number or the implied isotope. You would need to read the format specification to learn which properties are implied. The idea is that information about elements and isotopes is largely standardized by organizations such as IUPAC. Such default element and isotope properties are exposed in the CDK by the classes Elements and Isotopes.
The Elements class provides information about an element's atomic number, symbol, periodic table group and period, covalent radius, van der Waals radius, and Pauling electronegativity:
ElementsDemo
For example, for lithium this gives:
ElementsDemo
Similarly, there is the Isotopes class to help you look up isotope information. For example, you can get all isotopes for an element or just the major isotope (a full list of isotopes is available in Appendix B):
HydrogenIsotopes
For hydrogen this gives:
HydrogenIsotopes
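The idea behind a major-isotope lookup can be sketched without the CDK. The following minimal Python example is an illustration only: the `Isotope` class, the hand-entered hydrogen table, and the `major_isotope` helper are hypothetical names, not part of the CDK API. The major isotope is taken to be the one with the highest natural abundance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Isotope:
    symbol: str
    mass_number: int
    exact_mass: float  # unified atomic mass units (u)
    abundance: float   # natural abundance in percent


# Hand-entered, illustrative table for hydrogen only (assumption: rounded values).
HYDROGEN_ISOTOPES = [
    Isotope("H", 1, 1.007825032, 99.9885),
    Isotope("H", 2, 2.014101778, 0.0115),
    Isotope("H", 3, 3.016049278, 0.0),  # tritium: only trace amounts in nature
]


def major_isotope(isotopes):
    """Return the isotope with the highest natural abundance."""
    return max(isotopes, key=lambda iso: iso.abundance)


major = major_isotope(HYDROGEN_ISOTOPES)
print(major.mass_number, major.exact_mass)
```

A real isotope table covers all elements and is loaded from a curated data source rather than typed in by hand.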
This class is also used by the getMajorIsotopeMass method in the MolecularFormulaManipulator class to calculate the monoisotopic mass of a molecule:
MonoisotopicMass
The output for ethanol looks like:
MonoisotopicMass
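The arithmetic behind a monoisotopic mass is a weighted sum: each element count is multiplied by the exact mass of that element's major isotope. A minimal Python sketch, assuming a small hand-entered mass table (`MAJOR_ISOTOPE_MASS` and `monoisotopic_mass` are hypothetical names for this illustration, not CDK calls):

```python
# Exact masses (u) of the most abundant isotope per element; rounded,
# hand-entered values for illustration only.
MAJOR_ISOTOPE_MASS = {
    "C": 12.0,
    "H": 1.007825032,
    "N": 14.003074004,
    "O": 15.994914620,
}


def monoisotopic_mass(formula):
    """Sum major-isotope masses for a formula given as {element: count}."""
    return sum(MAJOR_ISOTOPE_MASS[el] * count for el, count in formula.items())


ethanol = {"C": 2, "H": 6, "O": 1}
mass = monoisotopic_mass(ethanol)
print(f"{mass:.4f}")  # prints 46.0419
```

Note that this differs from the average molecular weight, which uses abundance-weighted atomic masses instead.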
XYZ files do not have bond information, and may look like:
code/data/methane.xyz
Fortunately, we can reasonably assume that bonds have a certain length, and we have a reasonable idea of how many connections an atom can have at most. Then, using the 3D coordinates available from the XYZ file, an algorithm can deduce how the atoms must be bonded. The RebondTool does exactly that. It does so efficiently too, using a binary search tree, which allows it to scale to protein-sized molecules.
Now, the algorithm does need to know what reasonable bond lengths are, and for this we can use the Jmol list of covalent radii, and we configure the atoms accordingly:
CovalentRadii
which configures and prints the atoms' radii:
CovalentRadii
Then the RebondTool can be used to rebond the atoms:
RebondToolDemo
The number of bonds it found is reported in the last line:
RebondToolDemo
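The distance criterion behind rebonding can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the RebondTool implementation: it compares all atom pairs in a naive loop instead of using a search tree, it ignores the maximum number of connections per atom, the covalent radii are illustrative values rather than the Jmol list, and the methane coordinates are an idealized tetrahedron rather than the contents of the XYZ file above. The `tolerance` parameter is a hypothetical name.

```python
import math
from itertools import combinations

# Illustrative covalent radii in Angstrom (assumption: not the Jmol list).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def rebond(atoms, tolerance=0.4):
    """Bond two atoms when their distance is at most the sum of their
    covalent radii plus a tolerance. Naive O(n^2) pairwise comparison."""
    bonds = []
    for (i, (sym_i, pos_i)), (j, (sym_j, pos_j)) in combinations(enumerate(atoms), 2):
        max_dist = COVALENT_RADIUS[sym_i] + COVALENT_RADIUS[sym_j] + tolerance
        if math.dist(pos_i, pos_j) <= max_dist:
            bonds.append((i, j))
    return bonds


# Idealized tetrahedral methane with a C-H distance of 1.09 Angstrom.
d = 1.09 / math.sqrt(3)
methane = [
    ("C", (0.0, 0.0, 0.0)),
    ("H", (d, d, d)),
    ("H", (d, -d, -d)),
    ("H", (-d, d, -d)),
    ("H", (-d, -d, d)),
]

bonds = rebond(methane)
print(len(bonds), bonds)
```

For large structures the pairwise loop becomes the bottleneck, which is why a spatial search structure matters for protein-sized input.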
There are several reasons why bond orders are missing from an input structure. For example, you may be reading an XYZ file and have just performed a rebonding as outlined in the previous section. Or, you may be reading SMILES strings with aromatic organic subset atoms, such as c1ccccc1. Or, you may be reading an MDL molfile that uses the query bond order 4 to indicate an aromatic bond.
The latter two situations are, in fact, very common in cheminformatics. Before CDK 1.4.11 we had the DeduceBondSystemTool to find the location of double bonds in such delocalized electron systems. The 1.4.11 release introduced a new tool, the FixBondOrdersTool class, which does a better job and is faster too. Both classes only look for double bond positions in rings, but that covers many common use cases.
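Finding double bond positions in a delocalized ring amounts to pairing up neighboring atoms so that every atom that needs a double bond gets exactly one. The Python sketch below shows this with a simple backtracking search. It is an illustration of the general idea only and makes no claim about how either CDK class works internally. The function name and the `needs_double` set are hypothetical, and the caller has to decide which atoms need a double bond (in pyrrole, the N-H nitrogen does not).

```python
def assign_double_bonds(neighbors, needs_double):
    """Return sorted (i, j) double bonds so that each atom in needs_double
    takes part in exactly one, or None when no assignment exists."""
    matched = {}

    def solve():
        open_atoms = [a for a in sorted(needs_double) if a not in matched]
        if not open_atoms:
            return True  # every atom has its double bond
        atom = open_atoms[0]
        for nbr in neighbors[atom]:
            if nbr in needs_double and nbr not in matched:
                matched[atom] = nbr
                matched[nbr] = atom
                if solve():
                    return True
                # Dead end: undo this pairing and try the next neighbor.
                del matched[atom]
                del matched[nbr]
        return False

    if not solve():
        return None
    return sorted({tuple(sorted(pair)) for pair in matched.items()})


# Pyrrole ring as in c1cc[nH]c1: atoms 0, 1, 2 and 4 are carbons, atom 3 is N-H.
pyrrole = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
carbons = {0, 1, 2, 4}
print(assign_double_bonds(pyrrole, carbons))  # prints [(0, 4), (1, 2)]
```

The search first tries a double bond between atoms 0 and 1, finds that atom 2 is then left without a partner, and backtracks to the assignment shown.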
The FixBondOrdersTool requires atom types to be perceived already, which is done automatically when reading SMILES. For example, for pyrrole:
FixPyrroleBondOrders
This results in the image given in Figure pyrrole.
![](code/generated/FixPyrroleBondOrders.png)

The CDKHydrogenAdder class can be used to add missing hydrogens. The algorithm itself adds implicit hydrogens (see Section hydrogens), but we will see how these can be converted into explicit hydrogens. The hydrogen adding algorithm expects, however, that CDK atom types are already perceived (see Section atomtypePerception).
Hydrogens that are not vertices in the molecular graph are called implicit hydrogens. They are merely a property of the atom to which they are connected. If these values are not given, which is common in, for example, SMILES, they can be (re)calculated with:
MissingHydrogens
which reports:
MissingHydrogens
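A rough intuition for the calculation: the implicit hydrogen count fills whatever valence the existing bonds leave open. The Python sketch below uses a deliberately simplified default-valence table for neutral atoms. This is an assumption made for illustration. As noted above, the CDK derives the count from perceived atom types instead, and `DEFAULT_VALENCE` and `implicit_hydrogen_count` are hypothetical names.

```python
# Simplified default valences for neutral atoms (assumption for illustration).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogen_count(symbol, bond_orders):
    """Open valence left after subtracting the orders of existing bonds."""
    return max(0, DEFAULT_VALENCE[symbol] - sum(bond_orders))


print(implicit_hydrogen_count("C", []))      # methane carbon: prints 4
print(implicit_hydrogen_count("C", [1]))     # methyl carbon in ethanol: prints 3
print(implicit_hydrogen_count("O", [1]))     # hydroxyl oxygen: prints 1
print(implicit_hydrogen_count("C", [2, 1]))  # carbon with C=O and C-C: prints 1
```

Charged atoms, radicals, and elements with several common valences are exactly the cases where such a table falls short and atom typing is needed.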
These implicit hydrogens can be converted into explicit hydrogens using the following code:
ExplicitHydrogens
which reports for the running methane example:
ExplicitHydrogens
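Conceptually, the conversion turns each implicit hydrogen count into new atoms and bonds in the molecular graph. A minimal Python sketch on a plain list-based graph, where the function name and data layout are hypothetical and chosen only for this illustration:

```python
def to_explicit_hydrogens(symbols, bonds, implicit_h):
    """Append one H atom plus a single bond for every implicit hydrogen.
    Returns new (symbols, bonds, implicit_h); the inputs are not modified."""
    symbols = list(symbols)
    bonds = list(bonds)
    for heavy, count in enumerate(implicit_h):
        for _ in range(count):
            symbols.append("H")
            bonds.append((heavy, len(symbols) - 1))
    # After conversion no atom carries implicit hydrogens anymore.
    return symbols, bonds, [0] * len(symbols)


# Methane: one carbon, no bonds, four implicit hydrogens.
symbols, bonds, implicit_h = to_explicit_hydrogens(["C"], [], [4])
print(len(symbols), "atoms,", len(bonds), "bonds")  # prints 5 atoms, 4 bonds
```

The atom count grows from one to five for methane, which is the kind of change the reported output above reflects.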
Another bit of information missing from the input is often 2D coordinates. To generate 2D coordinates, the StructureDiagramGenerator can be used:
Layout
which will generate the coordinates, starting with an initial direction:
Layout
Mass spectrometry (MS) is a technology where the experiment yields monoisotopic masses for molecules. To analyze these further, it is common to convert them to molecular formulae. The MassToFormulaTool has functionality to determine these missing formulae. Miguel Rojas-Chertó developed this code for use in metabolomics [Q27134827]. Basic usage looks like:
MissingMF
This will create a long list of possible molecular formulae. It is important to realize that it only considers which molecular formulae are possible for the given mass. This means that it will include chemically unlikely molecular formulae:
MissingMF
This is overcome by setting restrictions. For example, we can restrict how many atoms of each element we allow in the matched formulae:
MissingMFRestrictions
Now the list looks more chemically sensible:
MissingMFRestrictions
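The effect of element restrictions can be illustrated with a brute-force search in Python: enumerate all element counts within the allowed ranges and keep those whose monoisotopic mass falls within a tolerance of the measured mass. This is a sketch under stated assumptions, not the MassToFormulaTool algorithm. The mass table is hand-entered, the target mass is the ethanol value used for illustration, and `formulas_for_mass`, `ranges`, and `tolerance` are hypothetical names.

```python
from itertools import product

# Major-isotope masses (u); rounded, hand-entered values for illustration.
MASS = {"C": 12.0, "H": 1.007825032, "N": 14.003074004, "O": 15.994914620}


def formulas_for_mass(target, ranges, tolerance=0.001):
    """Enumerate formulae whose monoisotopic mass is within tolerance of
    target. ranges maps element -> (min_count, max_count), inclusive."""
    elements = list(ranges)
    hits = []
    for counts in product(*(range(lo, hi + 1) for lo, hi in ranges.values())):
        mass = sum(MASS[el] * n for el, n in zip(elements, counts))
        if abs(mass - target) <= tolerance:
            hits.append("".join(f"{el}{n}" for el, n in zip(elements, counts) if n))
    return hits


# Allowed atom counts per element act as the restrictions.
ranges = {"C": (0, 5), "H": (0, 12), "N": (0, 3), "O": (0, 3)}
hits = formulas_for_mass(46.0419, ranges)
print(hits)
```

Widening the ranges or the tolerance makes the list longer, and a mass match alone says nothing about whether a formula corresponds to a stable molecule.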
Of course, this is a long way from actual chemical structures. An open source structure generator has been a long-standing holy grail, and the CDK-based MAYGEN addresses this gap [Q109827109], though Surge, also open source, is a good bit faster [Q113585012].