# Similarity Search, Tanimoto Coefficient and Fingerprint Generation

Several similarity thresholds are available for similarity searching. The higher the threshold, the closer the target structures are to the query structure. By default, the similarity search within SureChEMBL uses the Tanimoto coefficient (Tc) to calculate the degree of similarity between the query and the target structures. The Tanimoto coefficient takes two arguments:

* The fingerprint of the query structure
* The fingerprint of the target structure

A fingerprint comprises a list of predefined structure fragments or features found within a structure. Each feature that is present is represented as “on” by a bit set to 1.

***Tanimoto coefficient formula***

<figure><img src="https://1396459327-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fty0JOfWwPnEbs5wW271w%2Fuploads%2FZ5DSGK0DMnTdaFLP4Cz3%2Fimage.png?alt=media&#x26;token=54517b6b-e095-4759-aee6-0eaf6b5c8046" alt=""><figcaption></figcaption></figure>

***NA*** represents the number of "on" features (bits) in structure A.

***NB*** represents the number of "on" features (bits) in structure B.

***NA\&B*** represents the number of "on" features (bits) common to both fingerprints A and B.
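As a minimal sketch of how these three counts combine, the example below assumes the standard definition of the Tanimoto coefficient, Tc = NA\&B / (NA + NB - NA\&B), and represents each fingerprint as a set of "on" bit positions. The function name `tanimoto` and the example bit positions are illustrative only and are not part of SureChEMBL.

```python
from typing import AbstractSet


def tanimoto(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of "on" bit positions."""
    n_a = len(fp_a)          # NA: "on" bits in structure A
    n_b = len(fp_b)          # NB: "on" bits in structure B
    n_ab = len(fp_a & fp_b)  # NA&B: "on" bits common to A and B
    denominator = n_a + n_b - n_ab
    if denominator == 0:
        # Two empty fingerprints: returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return n_ab / denominator


query = {1, 4, 7, 9}        # hypothetical query fingerprint
target = {1, 4, 8, 9, 12}   # hypothetical target fingerprint
print(tanimoto(query, target))  # 3 / (4 + 5 - 3) = 0.5
```

A value of 1.0 means the two fingerprints have identical "on" bits, and 0.0 means they share none.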

The ***hashed binary chemical fingerprint*** of a molecule is a fixed-length bit string (a sequence of "0" and "1" digits) that contains information on the structure.

***Fingerprint type***

SureChEMBL uses RDKit chemical fingerprints: hashed Morgan fingerprints that are 256 bits long and have a radius of 2.
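For illustration, a fingerprint with these two parameters can be generated with RDKit roughly as follows. This is a sketch only: the page states just the bit length and radius, so any other generator options SureChEMBL may use (for example chirality handling) are not reflected here, and the helper name `morgan_fp` and the example SMILES are assumptions made for this example.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Hashed Morgan fingerprint generator: radius 2, folded to 256 bits.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=256)


def morgan_fp(smiles: str):
    """Return a 256-bit Morgan fingerprint for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return generator.GetFingerprint(mol)


aspirin = morgan_fp("CC(=O)Oc1ccccc1C(=O)O")
salicylic_acid = morgan_fp("O=C(O)c1ccccc1O")

# Tanimoto coefficient between the two bit vectors.
similarity = DataStructs.TanimotoSimilarity(aspirin, salicylic_acid)
print(aspirin.GetNumBits(), aspirin.GetNumOnBits(), similarity)
```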

The process of fingerprint generation is as follows:

1. All linear paths (linear patterns) consisting of bonds and atoms of a structure are detected, up to a given number of bonds.
2. Branching points at the end of each linear pattern are also detected.
3. All cycles (cyclic patterns) are detected.

Using a proprietary hashing method, a given number of bits in the bit string are set for each pattern. The same bit may be set by multiple patterns; this is called a bit collision. A few bit collisions in the fingerprint are tolerable, but too many may result in a loss of information in the fingerprint.
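The effect of hashing patterns into a fixed-length bit string, and of bit collisions, can be illustrated with a toy example. The actual hashing method is proprietary and is not shown here; the sketch below substitutes SHA-256 from the Python standard library, and the function name `toy_fingerprint`, its parameters, and the pattern strings are all hypothetical.

```python
import hashlib


def toy_fingerprint(patterns, n_bits=256, bits_per_pattern=2):
    """Hash each distinct pattern to `bits_per_pattern` positions in a bit list.

    Returns the bit list and the number of times a hash landed on a bit
    that was already set (a bit collision).
    """
    bits = [0] * n_bits
    collisions = 0
    for pattern in sorted(set(patterns)):
        for k in range(bits_per_pattern):
            # SHA-256 is a stand-in for the real (proprietary) hashing method.
            digest = hashlib.sha256(f"{pattern}|{k}".encode()).digest()
            position = int.from_bytes(digest[:4], "big") % n_bits
            if bits[position]:
                collisions += 1
            bits[position] = 1
    return bits, collisions


patterns = ["C-C", "C-C-O", "C=O", "c1ccccc1"]  # made-up pattern labels
for size in (8, 256):
    bits, collisions = toy_fingerprint(patterns, n_bits=size)
    print(size, sum(bits), collisions)
```

Shorter bit strings leave fewer positions for the same number of patterns, so collisions become more likely as the fingerprint length decreases.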

***FPSim2***

SureChEMBL's similarity search uses FPSim2, a program developed by Eloy Felix to run fast compound similarity searches.

The FPSim2 fingerprint file (fpsim2\_fingerprints.h5) is available in each SureChEMBL [bulk data directory](https://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/bulk_data/).

More background information is available in this [blog post](http://chembl.blogspot.com/2019/01/fpsim2-simple-python3-molecular.html), and some documentation on the similarity search can be found on [GitHub](https://github.com/chembl/FPSim2).

**Reference websites**

Some details presented within this documentation were kindly provided by and reproduced from ChemAxon ([www.chemaxon.com](http://www.chemaxon.com/)).

