This repository contains code for downloading a set of crystal structures from the Crystallography Open Database (COD) (http://www.crystallography.net/cod/) and performing an energy minimization using the OpenMM (https://openmm.org/) molecular simulation toolkit and OpenFF (https://openforcefield.org/).
The project was written for Python 3.9. Refer to requirements.txt for the required packages.
- Download the desired structures from COD using the list of COD IDs in COD_ID_List.txt and the COD_import.py script. This downloads CIF files from COD and converts them to PDB format using pybel. The files are written to the data/CIF and data/PDB directories.
- Create a supercell from each PDB file using the PyMol supercell function. The supercell is made large enough that the periodic box dimensions are greater than 1.8 nm. This is done in the data/supercell_generation.py script, and the output is written to the data/PDB_supercell directory.
- Parameterize using OpenFF and build the MD simulation in OpenMM. Perform the energy minimization and record the initial and final energy values as well as the RMSD between the initial and final states.
  - main.py performs an energy minimization that includes custom functions to minimize the unit cell as well as the positions. This script also outputs the initial and final box vectors.
  - main_no_box_minimization.py performs energy minimization with respect to positions only, using the built-in OpenMM energy minimization.
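As a rough illustration of the supercell sizing rule in the second step, the sketch below computes how many copies of the unit cell are needed along each axis so that every edge exceeds 1.8 nm. The function name replication_counts is hypothetical and is not taken from data/supercell_generation.py. It also assumes that comparing cell edge lengths is sufficient, which is a simplification for strongly non-orthogonal cells.

```python
def replication_counts(cell_lengths_nm, min_length_nm=1.8):
    """Return the smallest number of unit-cell copies per axis such that
    each replicated edge is strictly longer than min_length_nm.

    cell_lengths_nm: the three unit-cell edge lengths (a, b, c) in nm.
    This is an illustrative helper, not the repository's implementation.
    """
    counts = []
    for length in cell_lengths_nm:
        if length <= 0:
            raise ValueError("cell edge lengths must be positive")
        copies = 1
        # Keep adding copies until the replicated edge exceeds the minimum.
        while copies * length <= min_length_nm:
            copies += 1
        counts.append(copies)
    return tuple(counts)
```

For example, a cell with edges of 0.5, 1.0 and 2.0 nm would need 4, 2 and 1 copies respectively. The three counts correspond to the per-axis repeat arguments that a supercell builder typically expects.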
Results are written to the data directory:
- minimzation_results.csv is a CSV file with the COD ID, the initial and final energies, and the RMSD between those states. If box vector minimization is performed, the initial and final box vectors are also included.
- minimization_results.pkl is a pickled file of the underlying Pandas DataFrame for the above data.
- rmsd_values.txt is a tab-separated text file that reports only the COD ID and RMSD values.
- The initial_states and final_states directories contain the .xml state files from OpenMM saved before and after energy minimization.
- The minimized_PDB_supercell directory contains the PDB files of the minimized supercell systems.
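For reference, the RMSD reported in these files measures how far the atoms moved between the initial and final states. The sketch below shows the plain definition (the square root of the mean squared atomic displacement) in pure Python. It assumes both states list the same atoms in the same order and applies no superposition or periodic unwrapping. Whether the scripts in this repository do either is not stated here, so treat this as an illustration of the formula only.

```python
import math


def rmsd(initial_positions, final_positions):
    """Root-mean-square deviation between two sets of 3D coordinates.

    Both arguments are sequences of (x, y, z) tuples in the same units,
    with atoms in the same order. No alignment is performed.
    """
    if len(initial_positions) != len(final_positions):
        raise ValueError("both states must contain the same number of atoms")
    if not initial_positions:
        raise ValueError("at least one atom is required")

    squared_sum = 0.0
    for start, end in zip(initial_positions, final_positions):
        # Squared displacement of a single atom.
        squared_sum += sum((s - e) ** 2 for s, e in zip(start, end))
    return math.sqrt(squared_sum / len(initial_positions))
```

The result is in the same length unit as the input coordinates.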
The examples folder contains scripts for a reduced subset of 10 COD IDs, with the expected outputs stored in the expected_data directory. For all steps, ensure the working directory is set to /examples.
- Run example_COD_import.py.
- Run example_supercell_generation.py.
- Run example_main.py.
- Compare the generated data in example_data to expected_data.
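The final comparison step can be done by eye, but energies and RMSD values may differ in the last few digits between machines, so a tolerance-based check can be more practical than an exact file diff. The following is a minimal sketch under that assumption. The helper names compare_csv and cells_match and the default tolerances are hypothetical choices and are not part of the repository.

```python
import csv
import math


def cells_match(left, right, rel_tol, abs_tol):
    """Compare two CSV cells numerically when possible, else as text."""
    try:
        return math.isclose(float(left), float(right),
                            rel_tol=rel_tol, abs_tol=abs_tol)
    except ValueError:
        # At least one cell is not a number (e.g. a header or a vector string).
        return left == right


def compare_csv(generated_path, expected_path, rel_tol=1e-6, abs_tol=1e-9):
    """Return a list of human-readable differences between two CSV files."""
    with open(generated_path, newline="") as handle:
        generated = list(csv.reader(handle))
    with open(expected_path, newline="") as handle:
        expected = list(csv.reader(handle))

    differences = []
    if len(generated) != len(expected):
        differences.append(
            f"row count differs: {len(generated)} != {len(expected)}")
    for row_number, (got, want) in enumerate(zip(generated, expected), start=1):
        if len(got) != len(want):
            differences.append(f"row {row_number}: column count differs")
            continue
        for column, (g, w) in enumerate(zip(got, want), start=1):
            if not cells_match(g, w, rel_tol, abs_tol):
                differences.append(
                    f"row {row_number}, column {column}: {g!r} != {w!r}")
    return differences
```

Calling compare_csv with the path of a generated CSV in example_data and its counterpart in expected_data returns an empty list when the two files agree within tolerance.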
COD_ID_List.txt has not been curated to remove entries with data issues (non-matching coordinates between the CIF and SMILES, and entries without SMILES strings). These entries currently produce errors that are logged in errors.log during execution.
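The log-and-continue behaviour described above can be sketched with the standard logging module. This is an illustration of the pattern only. The names build_error_logger and process_all, the logger name, and the log message format are assumptions and do not reflect how the repository's scripts are actually written.

```python
import logging


def build_error_logger(log_path="errors.log"):
    """Create a logger that appends error messages to log_path."""
    logger = logging.getLogger("cod_errors")  # hypothetical logger name
    logger.setLevel(logging.ERROR)
    if not logger.handlers:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


def process_all(cod_ids, process_one, logger):
    """Run process_one on every ID, logging failures instead of stopping.

    process_one is any callable that takes a single COD ID and raises an
    exception when the entry cannot be handled.
    """
    succeeded, failed = [], []
    for cod_id in cod_ids:
        try:
            process_one(cod_id)
        except Exception as error:
            # Record the failure and move on to the next entry.
            logger.error("COD ID %s failed: %s", cod_id, error)
            failed.append(cod_id)
        else:
            succeeded.append(cod_id)
    return succeeded, failed
```

The returned list of failed IDs could then be used to curate COD_ID_List.txt.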
The current energy minimization algorithm fails for many systems because the periodic box vectors are not in reduced form (http://docs.openmm.org/7.0.0/userguide/theory.html#periodic-boundary-conditions).
In rare cases, the energy minimization algorithm also tries to reduce the periodic box dimensions to less than 0.9 nm, which is half the non-bonded cutoff distance, and this causes an error.
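One possible direction for the reduced-form issue is to check the box vectors and reduce them before passing them to OpenMM. The sketch below follows the conditions and the three-step reduction described in the OpenMM user guide linked above. It uses plain tuples rather than OpenMM types, and it assumes the vectors are already in the triangular layout a = (ax, 0, 0), b = (bx, by, 0), c = (cx, cy, cz). The function names are hypothetical, and this has not been tested against the scripts in this repository.

```python
def is_reduced(a, b, c, tol=1e-9):
    """Check the reduced-form conditions on three periodic box vectors."""
    triangular = abs(a[1]) <= tol and abs(a[2]) <= tol and abs(b[2]) <= tol
    positive = a[0] > 0 and b[1] > 0 and c[2] > 0
    bounded = (a[0] >= 2 * abs(b[0]) - tol
               and a[0] >= 2 * abs(c[0]) - tol
               and b[1] >= 2 * abs(c[1]) - tol)
    return triangular and positive and bounded


def _subtract_multiple(vector, other, count):
    """Return vector - count * other, component by component."""
    return tuple(v - count * o for v, o in zip(vector, other))


def reduce_box_vectors(a, b, c):
    """Bring triangular box vectors into reduced form.

    Whole multiples of b are subtracted from c, then multiples of a from c,
    then multiples of a from b. The periodic lattice itself is unchanged.
    """
    c = _subtract_multiple(c, b, round(c[1] / b[1]))
    c = _subtract_multiple(c, a, round(c[0] / a[0]))
    b = _subtract_multiple(b, a, round(b[0] / a[0]))
    return a, b, c
```

Because only whole lattice vectors are subtracted, the reduced vectors describe the same periodic system as the originals.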