Similarity Search, Tanimoto Coefficient and Fingerprint Generation

Several thresholds and measures are available for similarity searching. The higher the threshold, the closer the target structures are to the query structure. By default, the similarity search within SureChEMBL uses the Tanimoto coefficient (Tc) to calculate the degree of similarity between the query and the target structures. The Tanimoto coefficient takes two arguments:

The fingerprint of the query structure
The fingerprint of the target structure

A fingerprint comprises a list of predefined structure fragments or features found within a structure. Each feature that is present is represented as "on" by setting the corresponding bit to 1.

Tanimoto coefficient formula:

$$T_c(A, B) = \frac{N_{A \& B}}{N_A + N_B - N_{A \& B}}$$

N_A represents the number of "on" features (bits) in structure A.

N_B represents the number of "on" features (bits) in structure B.

N_{A&B} represents the number of "on" features (bits) common to both fingerprints A and B.
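
As a worked example of how the three counts combine, the sketch below computes the standard Tanimoto coefficient, Tc = NA&B / (NA + NB - NA&B), for two toy fingerprints. The fingerprints are written as sets of "on" bit positions, and the function name and the example values are illustrative assumptions, not part of SureChEMBL.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of "on" bit positions."""
    n_a = len(bits_a)             # NA: "on" bits in structure A
    n_b = len(bits_b)             # NB: "on" bits in structure B
    n_ab = len(bits_a & bits_b)   # NA&B: "on" bits common to A and B
    denominator = n_a + n_b - n_ab
    if denominator == 0:
        # Assumption made for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return n_ab / denominator


# Hypothetical fingerprints, chosen only to make the arithmetic easy to follow.
query = {1, 4, 7, 9}    # NA = 4
target = {1, 4, 8}      # NB = 3, and NA&B = 2 (bits 1 and 4)
score = tanimoto(query, target)   # 2 / (4 + 3 - 2) = 0.4
```

Identical fingerprints give a score of 1, and fingerprints with no "on" bits in common give 0.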

The hashed binary chemical fingerprint of a molecule is a fixed-length bit string (a sequence of "0" and "1" digits) that contains information on the structure.
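
Because a fingerprint is a fixed-length bit string, the Tanimoto coefficient can also be computed with bitwise operations. The following is a minimal sketch using short 8-bit strings invented for illustration; real fingerprints are longer, and the helper names are assumptions.

```python
def popcount(value):
    """Number of "1" bits in a non-negative integer."""
    return bin(value).count("1")


def tanimoto_bitstrings(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length strings of "0" and "1" digits."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = int(fp_a, 2)
    b = int(fp_b, 2)
    common = popcount(a & b)   # bits "on" in both fingerprints
    union = popcount(a | b)    # equals NA + NB - NA&B
    return common / union if union else 0.0


# Hypothetical 8-bit fingerprints.
fp_query = "10110010"    # 4 bits "on"
fp_target = "10100011"   # 4 bits "on", 3 of them shared with the query
score = tanimoto_bitstrings(fp_query, fp_target)   # 3 / (4 + 4 - 3) = 0.6
```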

Fingerprint type

SureChEMBL uses RDKit chemical fingerprints: 256-bit hashed Morgan fingerprints with a radius of 2.

The process of fingerprint generation is as follows:

All linear paths (linear patterns) consisting of bonds and atoms of a structure are detected, up to a given number of bonds.
Branching points at the end of each linear pattern are also detected.
All cycles (cyclic patterns) are detected.

Using a proprietary hashing method, a given number of bits in the bit string are set for each pattern. It is possible that the same bit is set by multiple patterns. This phenomenon is called bit collision. A few bit collisions in the fingerprint are tolerable, but too many may result in losing information in the fingerprint.
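
Bit collisions can be illustrated with a toy hashing scheme. The sketch below is not the proprietary hashing method mentioned above: it uses SHA-256 as a stand-in, sets a single bit per pattern as a simplification, and the pattern strings and function names are hypothetical.

```python
import hashlib


def pattern_to_bit(pattern, n_bits):
    """Map a pattern string to a bit position (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(pattern.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def hashed_fingerprint(patterns, n_bits):
    """Return (bit string, number of collisions) for the distinct patterns."""
    bits = [0] * n_bits
    collisions = 0
    for pattern in sorted(set(patterns)):
        position = pattern_to_bit(pattern, n_bits)
        if bits[position]:
            collisions += 1   # this bit was already set by another pattern
        bits[position] = 1
    return "".join(str(bit) for bit in bits), collisions


# Hypothetical linear and cyclic patterns of a small structure.
patterns = ["C-C", "C-O", "C=O", "C-C-O", "C-C=O", "c1ccccc1"]

# Six patterns cannot fit into four bits without at least two collisions.
fp_small, collisions_small = hashed_fingerprint(patterns, 4)

# A longer fingerprint makes collisions much less likely.
fp_large, collisions_large = hashed_fingerprint(patterns, 256)
```

Every collision means two different patterns are indistinguishable in the fingerprint, which is why short fingerprints lose information.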

FPSim2

SureChEMBL's similarity search uses FPSim2, a program developed by Eloy Felix to run fast compound similarity searches.

The FPSim2 file is available in each SureChEMBL bulk data directory (fpsim2_fingerprints.h5).

More background information is available in this blog post, and some documentation on the similarity search can be found on GitHub.
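
Conceptually, a threshold-based similarity search scores every target fingerprint against the query and keeps the targets at or above the threshold. The brute-force sketch below only illustrates that idea. It does not use FPSim2's API or reflect how FPSim2 achieves its speed, and the molecule names and 8-bit fingerprints are invented for the example.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints stored as integers."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def similarity_search(query, targets, threshold):
    """Return (name, score) pairs with score >= threshold, best match first."""
    scored = ((name, tanimoto(query, fp)) for name, fp in targets.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical target fingerprints.
targets = {
    "mol_1": 0b10110010,
    "mol_2": 0b10100011,
    "mol_3": 0b01001100,
}
query = 0b10110010

hits = similarity_search(query, targets, 0.5)
# mol_1 scores 1.0 and mol_2 scores 0.6; mol_3 shares no bits with the query.
```

Raising the threshold to 0.7 would leave only mol_1, which matches the rule that a higher threshold returns structures closer to the query.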

Reference websites

Some details presented within this documentation were kindly provided by, and are reproduced from, ChemAxon (www.chemaxon.com).

