Drug and Compound Questions
Last updated
ChEMBL contains a large number of preclinical compounds with bioactivity data from experimental sources. However, ChEMBL also curates information for marketed drugs, and drugs that are progressing through the clinical development pipeline (‘clinical candidate drugs’).
In essence, although a preclinical compound must have associated bioactivity data, a drug or clinical candidate drug does not require experimental data for inclusion in ChEMBL. However, an approved drug could also be a clinical candidate drug and/or a preclinical compound and therefore it may also have additional attributes as shown in the table below.
ChEMBL contains information on drugs that have been approved for treatment of a specific disease / diagnosis (an indication) within a region of the world (e.g. FDA drugs are approved for use in the United States), and clinical candidate drugs that are being investigated for an indication during the clinical trials process.
The maximum phase of development for the compound across all indications is assigned a category called 'max_phase' (the value in brackets is used in the downloadable ChEMBL database in the 'molecule_dictionary' table):
Approved (4): A marketed drug e.g. AMINOPHYLLINE (CHEMBL1370561) is an FDA approved drug for treatment of asthma.
Phase 3 (3): A clinical candidate drug in Phase 3 Clinical Trials e.g. TEGOPRAZAN (CHEMBL4297583) is under clinical investigation for treatment of peptic ulcer at Phase 3, and also liver disease at Phase 1.
Phase 2 (2): A clinical candidate drug in Phase 2 Clinical Trials e.g. NEVANIMIBE HYDROCHLORIDE (CHEMBL542103) is under clinical investigation for treatment of Cushing syndrome at Phase 2. Note that this category also includes a small number of trials that are defined by ClinicalTrials.gov as "Phase 2/Phase 3". In addition, INN applications are assigned as Phase 2 because the INN guidance states “As a general guide, the development of a drug should progress up to the point of clinical trials (phase II) before an application is submitted to the INN Secretariat for name selection.”
Phase 1 (1): A clinical candidate drug in Phase 1 Clinical Trials e.g. SALCAPROZATE SODIUM (CHEMBL2107027) is under clinical investigation for treatment of diabetes mellitus. Note that this category also includes a small number of trials that are defined by ClinicalTrials.gov as "Phase 1/Phase 2". In addition, USAN applications are assigned as Phase 1 because the USAN guidance states “Firms usually apply for a USAN when the investigational therapy is in Phase I or Phase II trials”.
Early Phase 1 (0.5): A clinical candidate drug in Early Phase 1 Clinical Trials e.g. CITRULLINE MALATE (CHEMBL4297667) is under clinical investigation for coronary artery disease at Early Phase 1.
Unknown (-1): Clinical phase unknown for a drug or clinical candidate drug, i.e. where ChEMBL cannot assign a clinical phase. For example, NALIDIXATE SODIUM (CHEMBL1255939) is known to be a clinical candidate drug because it has a USAN name; however, ChEMBL has not been able to map a disease indication for this compound via its clinical trials pipeline and therefore its max_phase is assigned as Unknown. By contrast, the parent compound (NALIDIXIC ACID, CHEMBL5) is an approved drug (Phase 4) for treatment of bacterial disease.
Preclinical (NULL): preclinical compounds with bioactivity data e.g. CHEMBL6300 is a preclinical compound with bioactivity data that has been extracted from scientific literature. However, the sources of drug and clinical candidate drug information in ChEMBL do not show that this compound has reached clinical trials and therefore the max_phase is set to null.
By contrast, the 'max_phase_for_ind' field in the 'drug_indication' table in the downloadable ChEMBL database contains the maximum phase of development for the drug or clinical candidate drug for a specified indication. The numbering system remains identical to that described above for max_phase.
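As a minimal sketch, the numbering shared by max_phase and max_phase_for_ind can be written as a lookup. The function name max_phase_label is hypothetical (it is not part of ChEMBL), and None is used here to stand in for a database NULL.

```python
from typing import Optional

# Category names and values as listed in the max_phase description above.
MAX_PHASE_LABELS = {
    4.0: "Approved",
    3.0: "Phase 3",
    2.0: "Phase 2",
    1.0: "Phase 1",
    0.5: "Early Phase 1",
    -1.0: "Unknown",
}


def max_phase_label(max_phase: Optional[float]) -> str:
    """Return the category name for a max_phase value (hypothetical helper).

    None represents a NULL max_phase, i.e. a preclinical compound.
    """
    if max_phase is None:
        return "Preclinical"
    return MAX_PHASE_LABELS[float(max_phase)]
```

For example, max_phase_label(0.5) gives "Early Phase 1" and max_phase_label(None) gives "Preclinical".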
In the database, information on withdrawn drugs can be found in the drug_warning table. On the interface, these fields are available as filters for compounds/drugs. Further details can also be found in this Blog post.
The chirality flag shows whether a drug is dosed as a racemic mixture (0), a single stereoisomer (1) or an achiral molecule (2). For unchecked compounds the chirality flag is -1. This information is curated for drugs and clinical candidates.
The molfiles and images of a proportion of metal-containing compounds were removed from the ChEMBL interface and download set in ChEMBL_17. This was partly because some of these compounds have coordinated metal bonds. Because of InChI limitations, these coordinate bonds could not be used to generate a Standard InChI, our main indicator of compound uniqueness in ChEMBL, so it was decided to exclude the structures altogether.
The compound image on the interface was replaced with an icon that shows it is a metal-containing compound and the molfiles were removed from the download set on the FTP site. We will retain the molecular formula in both the download files and on the ChEMBL interface, so that the elemental make up of the compound is visible. This change does not affect the storage or display of the associated bioactivity data for these compounds.
All properties are calculated on the parent form of the molecule, i.e. after any salts have been removed. The exception is FULL_WT, which, where applicable, is the molecular weight of the salt, plus any hydrates present.
Properties are only calculated on single component compounds (not mixtures) with molecular weight<1000 and containing only atoms H, C, N, O, S, P, F, Cl, Br and I. The exception is MW_freebase which is calculated for all molecules.
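The eligibility rule above can be sketched as a simple check. This is an illustration only: the helper names are hypothetical, and reading elements from a plain molecular formula string (such as "C9H8O4") is an assumption made for the example, not how ChEMBL performs the check.

```python
import re

# Atoms permitted for property calculation, as listed above.
ALLOWED_ELEMENTS = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"}


def formula_elements(formula: str) -> set:
    """Extract element symbols from a simple molecular formula string."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def properties_calculated(formula: str, mol_weight: float, n_components: int = 1) -> bool:
    """Return True if the compound meets the stated criteria (hypothetical helper)."""
    return (
        n_components == 1                                 # single component, not a mixture
        and mol_weight < 1000                             # molecular weight below 1000
        and formula_elements(formula) <= ALLOWED_ELEMENTS  # only permitted atoms
    )
```

A formula containing any other element, for example a metal, fails the check regardless of molecular weight.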
Section 1
These calculations are performed using algorithms available in RDKit.
MW_freebase
Molecular weight of the parent form of the molecule
MW_Monoisotopic
The monoisotopic mass of the compound calculated as the sum of the masses of the most abundant isotopes in the compound.
AlogP
Calculated value for the lipophilicity of a molecule expressed as log (octanol/water partition coefficient). Method used for the calculation is as described in:
Prediction of Physicochemical Parameters by Atomic Contributions
Scott A. Wildman and Gordon M. Crippen, Journal of Chemical Information and Computer Sciences 1999 39 (5), 868-873. DOI: 10.1021/ci990307l
PSA
Polar surface area is calculated using the method of Ertl et al.:
Fast calculation of molecular polar surface area as a sum of fragment based contributions and its application to the prediction of drug transport properties, Ertl, P., Rohde, B., Selzer, P., J. Med. Chem. 2000, 43, 3714-3717.
HBA
Count of the number of Hydrogen Bond Acceptors. Based on matching these SMARTS patterns:
[$([O,S;H1;v2]-[!$(*=[O,N,P,S])]),$([O,S;H0;v2]),$([O,S;-]),$([N;v3;!$(N-*=!@[O,N,P,S])]),$([nH0,o,s;+0])]
HBD
Count of the number of Hydrogen Bond Donors. Based on matching these SMARTS patterns:
[$([N;!H0;v3]),$([N;!H0;+1;v4]),$([O,S;H1;+0]),$([n;H1;+0])]
HBA_Lipinski
Count of nitrogen and oxygen atoms in the molecule
HBD_Lipinski
Count of hydrogens attached to nitrogen or oxygen atoms
RTB
Number of rotatable bonds in the molecule. Based on matching this SMARTS pattern:
[!$(*#*)&!D1]-&!@[!$(*#*)&!D1]
Num_RO5_Violations
Number of properties defined in Lipinski’s Rule of 5 (RO5) that the compound fails. Conditions which violate the RO5 are: Molecular weight > 500, AlogP > 5, HBD > 5, HBA > 10
Num_Lipinski_RO5_Violations
As above, except that HBA_Lipinski and HBD_Lipinski are used instead of HBA and HBD
Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Deliv. Rev., 1997, 23, 3-25.
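The violation count can be sketched as below. The function name is hypothetical, and the acceptor threshold is taken here as more than 10, following the usual statement of the rule ("no more than 10 H-bond acceptors"); treat that boundary as an assumption of this example.

```python
def count_ro5_violations(mol_weight: float, alogp: float, hbd: int, hba: int) -> int:
    """Count Rule of 5 violations from precomputed properties (hypothetical helper)."""
    violations = [
        mol_weight > 500,  # molecular weight above 500
        alogp > 5,         # AlogP above 5
        hbd > 5,           # more than 5 hydrogen bond donors
        hba > 10,          # more than 10 hydrogen bond acceptors (assumed boundary)
    ]
    return sum(violations)
```

Passing HBA_Lipinski and HBD_Lipinski values instead of HBA and HBD gives the Num_Lipinski_RO5_Violations variant.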
RO3_Pass
Rule of 3 passes. It is suggested that compounds that pass all these criteria are more likely to be hits in fragment screening.
molecular weight <=300, number of hydrogen bond donors <=3, number of hydrogen bond acceptors <=3, AlogP <=3, RTB <=3, PSA<=60
A ‘Rule of Three’ for fragment-based lead discovery? Miles Congreve, Robin Carr, Chris Murray and Harren Jhoti. Drug Discovery Today, 2003, 8(19), 876-877
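As a sketch, the Rule of 3 criteria listed above translate into a single boolean check. The function name and argument order are hypothetical; the thresholds are those given in the text.

```python
def ro3_pass(mol_weight: float, hbd: int, hba: int, alogp: float, rtb: int, psa: float) -> bool:
    """Return True if all Rule of 3 criteria are met (hypothetical helper)."""
    return (
        mol_weight <= 300
        and hbd <= 3
        and hba <= 3
        and alogp <= 3
        and rtb <= 3
        and psa <= 60
    )
```

All six conditions must hold; a single failure means the compound does not pass.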
Aromatic_Rings
The number of aromatic rings in the molecule
Heavy_Atoms
The number of non-hydrogen atoms in the molecule
QED_Weighted
This is the quantitative estimate of drug-likeness as described in:
“Quantifying the chemical beauty of drugs”
G. Richard Bickerton, Gaia V. Paolini, Jeremy Besnard, Sorel Muresan and Andrew L. Hopkins. Nature Chemistry, 2012, 4, 90-98
The values range from 0 to 1, where 1 is the most drug-like and 0 the least drug-like.
Section 2
These properties are calculated using ChemAxon tools (ChEMBL_26 onwards).
CX_LogP
This is the calculated Octanol/Water Partition Coefficient.
CX_LogD
This is the calculated Octanol/Water Distribution Coefficient at pH7.4. This is defined as the ratio of concentrations of all molecular species (neutral & ionized) in octanol divided by the concentration of all species in aqueous media at the pH specified.
3. pKa
pKa is defined as -log10(Ka), where Ka is the acid dissociation constant for the equilibrium HA ⇌ H+ + A-: Ka = [H+][A-] / [HA]
CX_MOST_APKA
Acidic pKa is the pKa for the most acidic group of the molecule
CX_MOST_BPKA
Basic pKa is the pKa for the most basic group of the molecule
Molecular Species
An approximation of the species occurring at pH 7.4; it can be ACID, BASE, NEUTRAL or ZWITTERION.
These are assigned according to the following definitions:
Acid (A): ACD_MOST_ApKa < 6.5 and ACD_MOST_BpKa < 8.5
Base (B): ACD_MOST_ApKa > 6.5 and ACD_MOST_BpKa > 8.5
Neutral (N): ACD_MOST_ApKa > 6.5 and ACD_MOST_BpKa < 8.5
Zwitterion (ZW): ACD_MOST_ApKa < 6.5 and ACD_MOST_BpKa > 8.5
The molecular species is an approximation. It does not use absolute pKa values and considers only the most acidic and most basic pKa, although compounds may be polyprotic. The calculation of pKa is temperature-dependent; further details on the ChemAxon pKa calculations can be found here - https://docs.chemaxon.com/display/docs/calculators_pka-calculation.md
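The species definitions above can be sketched as a small classifier. The function and parameter names are hypothetical. Two behaviours are assumptions of this example rather than documented ChEMBL rules: a missing pKa (None) is treated as "no such group", and values exactly equal to 6.5 or 8.5 are treated as not acidic and not basic respectively.

```python
from typing import Optional


def molecular_species(most_apka: Optional[float], most_bpka: Optional[float]) -> str:
    """Classify the approximate species at pH 7.4 (hypothetical helper)."""
    # Assumption: a missing pKa means there is no ionisable group of that type.
    acidic = most_apka is not None and most_apka < 6.5
    basic = most_bpka is not None and most_bpka > 8.5
    if acidic and basic:
        return "ZWITTERION"
    if acidic:
        return "ACID"
    if basic:
        return "BASE"
    return "NEUTRAL"
```

For instance, a compound with a most acidic pKa of 4.0 and a most basic pKa of 9.5 is classified as ZWITTERION.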
The canonical SMILES are calculated using RDKit. The standard InChI is calculated using the command line InChI generator developed by the InChI Trust (http://www.inchi-trust.org/inchi/). The version of InChI used in ChEMBL is 1.06.
ChEMBL compounds can be found as alternative forms (e.g. salts, hydrates, isotopes). On the ChEMBL interface, we typically map mechanisms to the parent compound. In the mechanisms table, mechanisms are mapped to the approved drug form (may be a parent or salt). Alternative forms can be linked through the molecule_hierarchy table to retrieve a mechanism mapped to any member of a compound family.
More details can be found in this Blog post.
ChEMBL uses the concept of compound 'families' so that each molecule (typically a salt) has a parent compound, and all the compounds in the family can be considered to have the same bioactivity.
Alternative forms of a compound (e.g. different salts, isotopes) can be viewed on the interface and activity data from alternative forms can be included or excluded using the ‘Exclude/include alternative forms’ button (see screenshot). The molecule form API endpoint or molecule_hierarchy table can be used to gather more information about all the salts of a given parent or salt.
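As a minimal sketch of the molecule_hierarchy approach, the query below collects every member of a compound family from any one member. It runs against an in-memory SQLite table with invented toy rows; only the table name and the molregno and parent_molregno columns come from ChEMBL, and the function name is hypothetical.

```python
import sqlite3

# Toy stand-in for the ChEMBL molecule_hierarchy table (rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE molecule_hierarchy (molregno INTEGER PRIMARY KEY, parent_molregno INTEGER);
    INSERT INTO molecule_hierarchy VALUES (1, 1), (2, 1), (3, 1), (4, 4);
    """
)


def family_members(connection: sqlite3.Connection, molregno: int) -> list:
    """Return the molregnos that share a parent with the given molecule."""
    rows = connection.execute(
        """
        SELECT fam.molregno
        FROM molecule_hierarchy AS me
        JOIN molecule_hierarchy AS fam
          ON fam.parent_molregno = me.parent_molregno
        WHERE me.molregno = ?
        ORDER BY fam.molregno
        """,
        (molregno,),
    ).fetchall()
    return [row[0] for row in rows]
```

Here family_members(conn, 2) returns the parent (1) and both salts (2 and 3), whichever member is used as the starting point.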
This difference is due to the fact that some compounds are 'virtual parents': we do not have records of documents or activities related to the compound itself, but only for a salt of it (e.g. CHEMBL1207557).
UniChem (https://www.ebi.ac.uk/unichem/) can be used to map between 2 sets of identifiers. UniChem sources are listed here and the WholeSourceMapping Table is here. There is also REST API access to the UniChem webservices and an accompanying webinar.
Since ChEMBL version 26, the similarity search is run with FPSim2, using 2048 bit, radius 2 RDKit calculated Morgan fingerprints.
The FPSim2 file is available in the main release downloads folder (https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/) by clicking on the link for chembl_XX.h5.
We also have some background information in this Blog post - and some documentation on the similarity search can be found on GitHub.
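For orientation, fingerprint similarity of this kind is commonly scored with the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number set in either. The sketch below assumes that metric (the text above does not name it) and represents each fingerprint as a set of on-bit positions, which is a simplification for illustration and not how FPSim2 stores fingerprints.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen for this sketch when both fingerprints are empty
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)
```

Identical bit sets score 1.0, which corresponds to the 100% similarity reported for compounds that share a parent structure.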
The similarity searches are run against the parent structures. For example, running a similarity search using the aspirin SMILES (https://www.ebi.ac.uk/chembl/g/#similarity_search_results/CC(%3DO)Oc1ccccc1C(%3DO)O/95) will return 3 compounds with 100% similarity. The two additional compounds are salt forms of aspirin.
It’s possible to perform structure similarity searches using the ChEMBL interface or the API endpoint.
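An interface link like the aspirin example can be assembled from a SMILES string and a percentage threshold. The sketch below reproduces the URL pattern of that example; the function name is hypothetical, and the choice to leave parentheses unescaped while percent-encoding everything else is inferred from that single example rather than from documentation.

```python
from urllib.parse import quote


def similarity_search_url(smiles: str, threshold: int) -> str:
    """Build an interface similarity-search URL (pattern inferred from the example)."""
    # Parentheses are kept as-is; characters such as '=' are percent-encoded.
    encoded = quote(smiles, safe="()")
    return f"https://www.ebi.ac.uk/chembl/g/#similarity_search_results/{encoded}/{threshold}"
```

Calling similarity_search_url("CC(=O)Oc1ccccc1C(=O)O", 95) reproduces the aspirin link given above.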
Agrochemicals are classified within ChEMBL using the IRAC, FRAC and HRAC classifications.
It’s also possible to use keywords on the interface to search assay descriptions for herbicidal, fungicidal and/or insecticidal assays. Alternatively, bespoke SQL queries using regular expressions (e.g. regexp_like (description, '(h|H)erbicid(e|al) activity|Phytotoxic(|ity)|Antialgal activity')) can be constructed to search assay descriptions using a local version of the database. The target dictionary contains details of plant, insect and fungal targets within ChEMBL and can be used as a starting point to identify compounds with activity against these targets.
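The same regular expression can be tried outside the database before writing the SQL. The sketch below applies the pattern from the regexp_like example to assay description strings; the function name and the sample descriptions are invented for illustration.

```python
import re

# Pattern copied from the regexp_like example above; matching is case-sensitive.
HERBICIDE_PATTERN = re.compile(
    r"(h|H)erbicid(e|al) activity|Phytotoxic(|ity)|Antialgal activity"
)


def is_herbicidal_assay(description: str) -> bool:
    """Return True if an assay description matches the herbicide pattern."""
    return HERBICIDE_PATTERN.search(description) is not None


# Invented sample descriptions, not real ChEMBL assay records.
samples = [
    "Herbicidal activity against Amaranthus retroflexus",
    "Phytotoxicity in wheat seedlings",
    "Inhibition of human EGFR",
]
matches = [text for text in samples if is_herbicidal_assay(text)]
```

Only the first two sample descriptions match; the third is left out.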
The substructure search is challenging for low complexity molecules (and sometimes can’t be executed in a reasonable time). The substructure search works well with more complex molecules such as rifampin. We’re working on improving this feature but unfortunately don’t have a workaround for this at the moment.
In searches for compounds containing simple functional groups e.g. nitrile, another possibility is to mine the SMILES information.
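A rough sketch of mining SMILES for a nitrile is a plain text match on the triple-bond notation. This is a crude filter, not a substructure search: the function name is hypothetical, and less common but valid notations (for example a branch written as "C(#N)") would be missed.

```python
def looks_like_nitrile(smiles: str) -> bool:
    """Crude text check for a C#N triple bond in a SMILES string."""
    return "C#N" in smiles or "N#C" in smiles
```

Hits from a text filter like this are best confirmed with a cheminformatics toolkit before use.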
Unfortunately, if the InChIKey search does not return results, it means that we do not have that molecule registered in ChEMBL. However, if you have a SMILES or a molfile then it is possible to run a structure flexmatch search on our web interface.
The 'Draw a structure' search functionality will open an editor where you can paste a SMILES or a molfile, and create/edit a molecule for your search. There are 3 types of search: connectivity (similar to exact match but it does not consider stereochemistry, isotopes etc.), similarity, and substructure.
For InChIKeys that are not present in ChEMBL, UniChem can be used to search other chemistry databases where they might be present.
You can find more information on compound curation in this Blog post and on the structure standardiser here.
We would suggest downloading the full SDF from the FTP site rather than using the interface. The interface download may not be possible for extremely large files.
The ‘Rule of Five’ refers to the drug-like properties defined by Lipinski (oral drugs should have a molecular weight no greater than 500, no more than 5 H-bond donors, no more than 10 H-bond acceptors and a CLogP not greater than 5). A compound that complies with the Rule of Five has 0 or 1 violations of these rules.
The number of violations of the Rule of Five is recorded in the ChEMBL database (within the MOLECULE_DICTIONARY) and is also available as a filter when browsing compounds on the interface (see below).
Whether compounds are compliant with the Rule of Five is also displayed as an icon in the molecule features section of the compound report card. The Ro5 flag shows as True for compounds with 0 or 1 violations. The exception is metal-containing compounds where the Ro5 is not calculated and is recorded as null in the database and the Ro5 icon shows False on the ChEMBL interface.
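The flag logic described above can be sketched as follows. The function name is hypothetical, and None stands in for a NULL violation count in the database.

```python
from typing import Optional


def ro5_icon(num_ro5_violations: Optional[int]) -> bool:
    """Return the Ro5 flag as shown on the interface (hypothetical helper).

    True for 0 or 1 violations; False otherwise, including when the count
    is NULL (e.g. metal-containing compounds where Ro5 is not calculated).
    """
    return num_ro5_violations is not None and num_ro5_violations <= 1
```

Note that a False result therefore covers two different cases: two or more violations, and no calculated value at all.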
Virtual parents may arise when bioactivity data has been deposited for salts or hydrates of a drug-like compound but no data is recorded for the parent compound itself. The parent compound is registered in ChEMBL but is not directly linked to bioactivity data.
A virtual parent compound has an entry (i.e. its own row) in the MOLECULE_DICTIONARY but does NOT have an entry in its own right in the MOLECULE_HIERARCHY i.e. the virtual parent is present in the MOLECULE_HIERARCHY.parent_molregno field but does not have an entry in the MOLECULE_HIERARCHY.molregno field. Also, there is no entry for the virtual parent in the COMPOUND_RECORDS table.
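The definition above can be expressed as a query: a virtual parent appears in MOLECULE_HIERARCHY.parent_molregno but never in MOLECULE_HIERARCHY.molregno. The sketch below runs it against in-memory SQLite tables with invented toy rows and identifiers; only the table and column names are taken from ChEMBL.

```python
import sqlite3

# Toy tables mirroring the relevant ChEMBL columns (all rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY, chembl_id TEXT);
    CREATE TABLE molecule_hierarchy (molregno INTEGER PRIMARY KEY, parent_molregno INTEGER);
    INSERT INTO molecule_dictionary VALUES
        (1, 'TOY_PARENT_A'), (2, 'TOY_SALT_A'), (3, 'TOY_PARENT_B'), (4, 'TOY_SALT_B');
    -- Parent 1 has its own hierarchy row; parent 3 does not, so it is virtual.
    INSERT INTO molecule_hierarchy VALUES (1, 1), (2, 1), (4, 3);
    """
)

VIRTUAL_PARENT_SQL = """
SELECT DISTINCT md.chembl_id
FROM molecule_dictionary AS md
JOIN molecule_hierarchy AS mh ON mh.parent_molregno = md.molregno
WHERE md.molregno NOT IN (SELECT molregno FROM molecule_hierarchy)
ORDER BY md.chembl_id
"""

virtual_parents = [row[0] for row in conn.execute(VIRTUAL_PARENT_SQL)]
```

With the toy rows, only TOY_PARENT_B is returned, because it is referenced as a parent but has no hierarchy row of its own.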
We removed some structural alerts during the RDKit conversion for version 26. In summary, we kept datasets with a clear literature reference and removed poorly documented sets.
ChEMBL 34
Preclinical Compound (~2.4 million)
Defining feature for inclusion in ChEMBL: Must have bioactivity data.
Typical features in ChEMBL: Usually measured in an assay against a target. Usually comes from scientific literature or a deposited dataset.
Occasional or absent features in ChEMBL: Normally does not have a pref_name (unless also a drug). Sometimes has indication, safety warning or mechanism information (if also a drug).
Clinical Candidate Drug (~14 k)
Defining feature for inclusion in ChEMBL: Must come from a source of clinical candidate information (e.g. USAN, INN, ClinicalTrials).
Typical features in ChEMBL: Has a pref_name, which can be a recognisable drug name or a research code. May have indication and mechanism information.
Occasional or absent features in ChEMBL: Usually does not have bioactivity data. Does not have safety warning data.
Approved Drug (~4 k)
Defining feature for inclusion in ChEMBL: Must come from a source of approved drug information (e.g. FDA, EMA, WHO ATC).
Typical features in ChEMBL: Has a pref_name, usually a recognisable drug name. Normally has indication and mechanism information. May have safety warning information.
Occasional or absent features in ChEMBL: Often has bioactivity data (i.e. it is also a preclinical compound). May also be a clinical candidate drug.