Molecule standardisation

This is the public version of a tool designed to provide a simple way of standardising molecules, for example as a prelude to molecular modelling exercises.

A Python module performs the complete standardisation procedure. In addition, the modules that implement the individual steps in this procedure may be accessed separately if required, for example as part of a custom standardisation pipeline.

The tool is open-source and is available from GitHub.

A slide-set describing some of the background to the project is available.

In summary, the general procedure for standardising a molecule (with the documentation for the appropriate module linked) is:

  • Break bonds to Group I or II metals [break_bonds]

  • Neutralise charges by adding/removing protons [neutralise]

  • Apply standardisation rules [rules]

  • Apply tautomerism rules [tautomerism]

  • Re-run neutralisation (in case any charges are exposed by rules)

  • Discard any salt/solvate components [unsalt]

  • Return standardised parent

The complete procedure is implemented by the standardise module; a bare-bones alternative workflow using the individual modules is shown here.
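The ordering of the steps listed above can be illustrated with a minimal sketch. This is not the project's own workflow or API: the function standardise_sketch and the helper traced are hypothetical names, and each step is assumed to be a plain callable that takes a molecule and returns a molecule. The real modules may have different call signatures.

```python
from typing import Callable, List, TypeVar

Mol = TypeVar("Mol")


def standardise_sketch(
    mol: Mol,
    break_bonds: Callable[[Mol], Mol],
    neutralise: Callable[[Mol], Mol],
    rules: Callable[[Mol], Mol],
    tautomerism: Callable[[Mol], Mol],
    unsalt: Callable[[Mol], Mol],
) -> Mol:
    """Apply the steps in the documented order (hypothetical signatures)."""
    # neutralise appears twice: applying the rules may expose new charges.
    steps: List[Callable[[Mol], Mol]] = [
        break_bonds,
        neutralise,
        rules,
        tautomerism,
        neutralise,
        unsalt,
    ]
    for step in steps:
        mol = step(mol)
    return mol


def traced(name: str, log: List[str]) -> Callable[[Mol], Mol]:
    """Hypothetical stand-in step: records its name, returns the input unchanged."""
    def step(mol: Mol) -> Mol:
        log.append(name)
        return mol
    return step
```

Passing stand-in steps built with traced makes the order of calls visible without needing any chemistry toolkit; in a custom pipeline, the stand-ins would be replaced by whatever callables wrap the individual modules.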

The documentation is contained in the project docs/ directory, and consists of a set of Jupyter Notebooks, which can be viewed (and run and edited) by starting a notebook server in that directory.

A simple command-line driver program, standardiser_mol.py, is available in the project bin/ directory. It takes SD or SMILES as input, and writes out two files: one containing those structures that have been successfully standardised, and one containing the structures for which the procedure has failed.
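The pass/fail split described above can be sketched as follows. This is an illustration only, not the code of standardiser_mol.py: the names partition_by_outcome and toy_standardise are hypothetical, and it is an assumption that a failed standardisation is signalled by an exception.

```python
from typing import Callable, Iterable, List, Tuple


def partition_by_outcome(
    records: Iterable[str],
    standardise: Callable[[str], str],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split records into standardised results and (input, reason) failures."""
    passed: List[str] = []
    failed: List[Tuple[str, str]] = []
    for record in records:
        try:
            passed.append(standardise(record))
        except Exception as err:  # assumption: failures are raised as exceptions
            # Keep the original input so it can be written to the failures file.
            failed.append((record, str(err)))
    return passed, failed


def toy_standardise(smiles: str) -> str:
    """Hypothetical stand-in: rejects empty input, otherwise strips whitespace."""
    if not smiles.strip():
        raise ValueError("empty structure")
    return smiles.strip()
```

Keeping the original input alongside the failure reason makes it straightforward to write the failed structures to their own file for later inspection.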

Acknowledgements

  • This work was funded by the IMI eTOX project.

  • The salt dictionary used is based on that used in the ChEMBL database; this was compiled by L.J. Bellis, A. Hersey and others, and was in turn based on that used in the USAN nomenclature.

  • Some of the standardisation rules were inspired by those used in the InChI software.

  • This project is built using the RDKit chemistry toolkit.

Licensing

This code is released under the Apache 2.0 license. Copyright [2020] is retained by the EMBL-EBI.
