
Similarity Search, Tanimoto Coefficient and Fingerprint Generation



A number of thresholds and measures are available for similarity searching. The higher the threshold, the closer the target structures are to the query structure. By default, the similarity search in SureChEMBL uses the Tanimoto coefficient (Tc) to calculate the degree of similarity between the query and the target structures. The Tanimoto coefficient takes two arguments:

  • The fingerprint of the query structure

  • The fingerprint of the target structure

A fingerprint consists of a list of predefined structural fragments or features found within a structure. Each feature that is present is marked as "on" by setting its bit to 1.

Tanimoto coefficient formula

$$T_c(A, B) = \frac{N_{A\&B}}{N_A + N_B - N_{A\&B}}$$

$N_A$ represents the number of "on" features (bits) in structure A.

$N_B$ represents the number of "on" features (bits) in structure B.

$N_{A\&B}$ represents the number of "on" features (bits) common to both fingerprints A and B.
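The Tanimoto coefficient is the number of "on" bits shared by both fingerprints divided by the number of bits that are "on" in at least one of them. The following is a minimal sketch of that calculation, assuming each fingerprint is held as a Python set of "on" bit positions. The function name `tanimoto` and the example bit positions are illustrative and are not part of SureChEMBL.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of "on" bit positions."""
    n_a = len(fp_a)           # "on" bits in structure A
    n_b = len(fp_b)           # "on" bits in structure B
    n_ab = len(fp_a & fp_b)   # "on" bits common to A and B
    union = n_a + n_b - n_ab  # bits that are "on" in A, in B, or in both
    # Two empty fingerprints share nothing to compare; 0.0 is an assumed convention.
    return n_ab / union if union else 0.0


# Hypothetical fingerprints: 4 and 5 "on" bits, 3 of them shared.
query = {1, 4, 7, 9}
target = {1, 4, 8, 9, 12}
print(tanimoto(query, target))  # 3 / (4 + 5 - 3) = 0.5
```

Identical fingerprints give 1.0 and fingerprints with no bits in common give 0.0, so a higher threshold admits only targets that share a larger fraction of their features with the query.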

The hashed binary chemical fingerprint of a molecule is a fixed-length bit string (a sequence of "0" and "1" digits) that contains information on the structure.

Fingerprint type

SureChEMBL uses RDKit chemical fingerprints: hashed Morgan fingerprints, 256 bits long, with a radius of 2.

The process of fingerprint generation is as follows:

  1. All linear paths (linear patterns) consisting of bonds and atoms of a structure are detected, up to a given number of bonds.

  2. Branching points at the end of each linear pattern are also detected.

  3. All cycles (cyclic patterns) are detected.

Using a proprietary hashing method, a given number of bits in the bit string are set for each pattern. The same bit may be set by more than one pattern. This phenomenon is called bit collision. A few bit collisions in a fingerprint are tolerable, but too many may result in a loss of information.
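The toy sketch below illustrates how detected patterns can be folded into a fixed-length bit string and how bit collisions arise. It assumes the patterns have already been enumerated as strings and uses SHA-256 as a stand-in hash. The function name `hashed_fingerprint`, the pattern strings and the choice of hash are all assumptions made for illustration; this is not the hashing used by RDKit or ChemAxon.

```python
import hashlib


def hashed_fingerprint(patterns, n_bits=256, bits_per_pattern=1):
    """Fold pattern strings into an n_bits-long fingerprint stored as an int.

    Returns (fingerprint, collisions), where collisions counts how often a
    pattern hashed to a bit that was already set.
    """
    fp = 0
    collisions = 0
    for pattern in sorted(set(patterns)):  # each distinct pattern once, fixed order
        for k in range(bits_per_pattern):
            digest = hashlib.sha256(f"{pattern}|{k}".encode()).digest()
            pos = int.from_bytes(digest[:8], "big") % n_bits
            if (fp >> pos) & 1:
                collisions += 1  # bit already "on": a bit collision
            fp |= 1 << pos
    return fp, collisions


# Hypothetical patterns standing in for detected paths, branches and rings.
patterns = ["C-C", "C-O", "C=O", "C-C-O", "c1ccccc1"]
fp, collisions = hashed_fingerprint(patterns, n_bits=256)
print(bin(fp).count("1"), "bits on,", collisions, "collisions")
```

Shrinking `n_bits` makes collisions more frequent, which is the loss of information described above: different patterns become indistinguishable once they share a bit.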

FPSim2

SureChEMBL's similarity search uses FPSim2, a program developed by Eloy Felix to run fast compound similarity searches.
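Conceptually, a similarity search at a given threshold returns every target whose Tanimoto coefficient with the query is at or above that threshold. The sketch below is a plain linear scan over fingerprints stored as Python integers. It is not FPSim2's API or implementation, and the name `similarity_search` and the compound identifiers are hypothetical.

```python
def popcount(x):
    """Number of "on" bits in an integer-encoded fingerprint."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto coefficient of two integer-encoded fingerprints."""
    common = popcount(a & b)
    union = popcount(a) + popcount(b) - common
    return common / union if union else 0.0


def similarity_search(query, targets, threshold):
    """Return (identifier, score) pairs with score >= threshold, best first."""
    hits = []
    for identifier, fp in targets.items():
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((identifier, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical 8-bit fingerprints, kept short for readability.
targets = {
    "cpd-1": 0b11110000,
    "cpd-2": 0b11100000,
    "cpd-3": 0b00001111,
}
query = 0b11110000
print(similarity_search(query, targets, threshold=0.7))
# [('cpd-1', 1.0), ('cpd-2', 0.75)]
```

Raising the threshold to 0.8 in this example leaves only `cpd-1`, which shows how a higher threshold restricts the hits to structures closer to the query.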

Reference websites

The FPSim2 file (fpsim2_fingerprints.h5) is available in each SureChEMBL bulk data directory.

More background information is available in this blog post, and some documentation on the similarity search can be found on GitHub.

Some details presented in this documentation were kindly provided by, and are reproduced from, ChemAxon (www.chemaxon.com).