Target Questions

Where can I find more details on the target prediction?

More details can be found in this blog post. Target predictions can be viewed on the ChEMBL interface, and predictions can be run using this Docker image.

What are the main target types in ChEMBL?

How can I find the binding site of my compound?

We annotate binding site information in two different ways:

1) For drugs and clinical candidates, where the target is a protein complex and we know which subunit(s) the drug binds to, we indicate this by including a site_id in the drug_mechanism table. This information can be viewed on the ChEMBL interface through the Drug Mechanisms view where the data is found in the Binding_site_name and Binding_site_comment fields.

2) For assay/bioactivity data, we have an algorithm that attempts to predict the most likely binding domain of the protein target (based on the domain structure of the protein and a list of all curated known small molecule binding domains). This algorithm is only applied to a subset of the >15 million bioactivity records (binding or functional assays with a single protein or protein complex target and dose-response measurements) and is not able to make an unambiguous prediction in all cases. There are around 650K activity measurements with a predicted domain in the predicted_binding_domains table.

How can I find the Pfam domains for my target?

Pfam-A domains can be found in the domains table and can be viewed in the Domain Cross References section of the target report card on the ChEMBL interface.

What does the target relationship_type flag mean?

The relationship_type flag describes the relationship between the target reported in the source document and the assigned target:

• D - Direct protein target assigned

• H - Homologous protein target assigned

• M - Molecular target other than protein assigned

• N - Non-molecular target assigned

• S - Subcellular target assigned

• U - Default value - Target has yet to be curated
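When working with exported data, the single-letter flags can be decoded with a small lookup. The following is a minimal Python sketch: the descriptions are copied from the list above, while the helper name `describe_relationship_type` is hypothetical and not part of any ChEMBL tool.

```python
# Descriptions copied from the relationship_type definitions in this FAQ.
RELATIONSHIP_TYPES = {
    "D": "Direct protein target assigned",
    "H": "Homologous protein target assigned",
    "M": "Molecular target other than protein assigned",
    "N": "Non-molecular target assigned",
    "S": "Subcellular target assigned",
    "U": "Default value - Target has yet to be curated",
}


def describe_relationship_type(flag: str) -> str:
    """Return the description for a relationship_type flag (hypothetical helper)."""
    key = flag.strip().upper()
    if key not in RELATIONSHIP_TYPES:
        raise ValueError(f"Unknown relationship_type flag: {flag!r}")
    return RELATIONSHIP_TYPES[key]


print(describe_relationship_type("H"))
```

Raising an error for unknown flags, rather than returning a default, makes unexpected values visible early when processing large exports.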

How can I find Entrez_IDs for my ChEMBL targets?

UniProt accessions are the main identifier for protein targets. These can be mapped to Entrez IDs using the UniProt mapping tool. ChEMBL protein targets may be a single protein (with a single accession) or a protein complex or protein family (with more than one associated accession).

My protein isn't in ChEMBL, how can I find similar proteins?

It’s possible to search ChEMBL for similar proteins using the EBI-wide BLAST search tool through the ChEMBL interface.

On the ChEMBL homepage, click on ‘Advanced Search’, select ‘Biological Sequence’ and then enter your target protein sequence. Similar proteins will be returned alongside the E-value.

We don’t currently have an API endpoint for the protein similarity searches but all target protein sequences are provided in the component_sequences table and can be extracted for further analyses.
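As one possible starting point for such analyses, sequences pulled from the component_sequences table can be written out in FASTA format for use with local similarity tools. This is a minimal Python sketch: it assumes you have already selected (accession, sequence) pairs from your own copy of the database, and both the `to_fasta` helper and the rows shown are hypothetical placeholders.

```python
from typing import Iterable, Tuple


def to_fasta(rows: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format (accession, sequence) pairs as FASTA text, wrapping at `width`."""
    records = []
    for accession, sequence in rows:
        if not sequence:
            # Skip components without a stored sequence.
            continue
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append(f">{accession}\n" + "\n".join(lines))
    return "\n".join(records) + ("\n" if records else "")


# Made-up rows for illustration only; these are not real accessions or sequences.
toy_rows = [("TOY_ACC_1", "MKTAYIAKQR"), ("TOY_ACC_2", "")]
print(to_fasta(toy_rows, width=5), end="")
```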

Can you provide more details on the target relationship mapping?

This is based on both the target_type and the component_type (to deal with RNA targets) and is shown below.

How does ChEMBL treat protein variants?

We map proteins to UniProt accessions, and parent and variant proteins have the same target ChEMBL_ID. Variants are also mapped to a variant_ID that distinguishes them from the wild-type protein (where the variant_ID is null).

Variants may contain single or multiple mutations, which include defined (e.g. single amino acid changes) or undefined (e.g. ‘deletion mutants’, mapped to variant_ID -1) changes. The variant_sequences table contains details of the mutated residues and the sequence.

Variant information is also available through the ChEMBL interface, where it can be viewed when browsing assays by adding additional columns (variant_sequence_accession and variant_sequence_mutation), and through web services (https://www.ebi.ac.uk/chembl/api/data/assay.json?variant_sequence__isnull=false). The variant_ID can be used to exclude or include protein variants when extracting bioactivity data.
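The following Python sketch shows one way to build the web services query quoted above and to separate wild-type from variant records once data has been extracted. The helper names and the sample rows are hypothetical. Using `variant_sequence__isnull=true` to invert the filter is an assumption based on the filter syntax shown, not something stated here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ASSAY_URL = "https://www.ebi.ac.uk/chembl/api/data/assay.json"


def variant_assay_url(has_variant: bool = True) -> str:
    """Build the assay query; has_variant=True reproduces the URL quoted above."""
    # Assumption: "true" would select assays without a variant sequence.
    params = {"variant_sequence__isnull": "false" if has_variant else "true"}
    return f"{ASSAY_URL}?{urlencode(params)}"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Download and decode one response (requires network access; not called here)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def split_by_variant(rows):
    """Split rows into (wild_type, variants); a variant_id of None means wild-type."""
    wild_type = [r for r in rows if r.get("variant_id") is None]
    variants = [r for r in rows if r.get("variant_id") is not None]
    return wild_type, variants


# Made-up rows for illustration only; -1 marks an undefined mutation.
toy_rows = [
    {"assay": "toy_assay_1", "variant_id": None},
    {"assay": "toy_assay_2", "variant_id": 42},
    {"assay": "toy_assay_3", "variant_id": -1},
]
wild_type, variants = split_by_variant(toy_rows)
print(variant_assay_url())
print(len(wild_type), "wild-type;", len(variants), "variant")
```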

Mutant targets may be associated with drug resistance or disease but may also include some engineered mutations (users need to review the assay description or references to interpret the data).

...and why is there no referential integrity between the variant_sequences table and component_sequences table?

The variant_sequences table is not linked to the component_sequences table. It’s possible that two assays, one describing a variant and one describing a wild-type protein, may refer to the same protein target (and share a ChEMBL target ID), but provide different accessions. We keep both accessions in case a slightly different sequence was used as the reference to generate the variant numbering in the corresponding assay.

How are isoforms treated by ChEMBL?

In ChEMBL, we map isoforms to the primary UniProt accession. Sometimes researchers test different isoforms in separate experiments and in these cases one assay is recorded per isoform. The isoform details are recorded in the assay descriptions.

For example:

Assay CHEMBL672502 Binding affinity against Dopamine receptor D2L

Assay CHEMBL670556 Binding Affinity of Compound of Dopamine receptor D2S

Is 'fuzzy' target matching available via the API?

Although ‘fuzzy’ matching is available through the UI, unfortunately we don’t have any fuzzy API support at the moment. However, there is the option to use regular expressions:

e.g. https://www.ebi.ac.uk/chembl/api/data/target?pref_name__iregex=cdk1|cdk2
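A minimal Python sketch of this approach is shown below. It builds the same query with the `|` character percent-encoded, which is the safer form when the URL is sent from a script, and previews the pattern locally with Python's `re` module. The helper names and sample names are made up for illustration, and the server's regular expression dialect may differ from Python's in some details.

```python
import re
from urllib.parse import urlencode

TARGET_URL = "https://www.ebi.ac.uk/chembl/api/data/target"


def target_regex_url(pattern: str) -> str:
    """Build a target query using the pref_name__iregex filter shown above."""
    return f"{TARGET_URL}?{urlencode({'pref_name__iregex': pattern})}"


def preview_matches(pattern: str, names):
    """Case-insensitive local preview of which names the pattern would match."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if rx.search(name)]


# Made-up names for illustration only.
toy_names = ["CDK1/cyclin B", "Thrombin", "cdk2 complex"]
print(target_regex_url("cdk1|cdk2"))
print(preview_matches("cdk1|cdk2", toy_names))
```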

My assay target is 'Unchecked', what does this mean?

We extract bioactivity data from a selected set of journals, patents and deposited data sets and include a range of assays such as cytotoxicity assays, antibacterial assays and protein inhibition assays. The extracted data is curated and mapped to ChEMBL targets where possible. However, curation is an ongoing process, so some assays may not yet be mapped to a target and appear as ‘Unchecked’. There are a number of reasons why the target of an assay may be ‘Unchecked’. For example, the assay may be a physicochemical assay (such as solubility determination) where there is no target, or a selectivity ratio where there isn’t a single target, or the target may be ambiguous, may not yet have been created or may not yet have been reviewed.

How can I extract gene symbols for my target?

ChEMBL targets are all associated with a unique target ChEMBL_ID. We use UniProt accessions as our primary identifier for protein targets. However, gene symbols can be obtained from the component_synonyms table:

select td.chembl_id as target_chembl_id, td.organism, td.tax_id, td.pref_name, cs.component_synonym
from chembl.target_dictionary td
join chembl.target_components tc on td.tid = tc.tid
join chembl.component_synonyms cs on tc.component_id = cs.component_id
where cs.syn_type = 'GENE_SYMBOL' -- you can include other synonym types if needed
and td.chembl_id = 'CHEMBL204'; -- Melanocortin receptor 1 as an example
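To try the query without a full ChEMBL installation, the sketch below runs an equivalent statement against a tiny in-memory SQLite database. It is only an illustration: the toy tables contain just the columns the query needs, the rows (including the ID `CHEMBL_TOY1`) are invented, the `chembl.` schema prefix is dropped because the toy database has no such schema, and the helper names are hypothetical.

```python
import sqlite3

# Same joins and filters as the query above, with parameters instead of literals.
QUERY = """
SELECT td.chembl_id AS target_chembl_id, td.organism, td.tax_id,
       td.pref_name, cs.component_synonym
FROM target_dictionary td
JOIN target_components tc ON td.tid = tc.tid
JOIN component_synonyms cs ON tc.component_id = cs.component_id
WHERE cs.syn_type = ?
  AND td.chembl_id = ?
ORDER BY cs.component_synonym
"""


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with invented rows (not real ChEMBL data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE target_dictionary (tid INTEGER PRIMARY KEY, chembl_id TEXT,
                                        organism TEXT, tax_id INTEGER, pref_name TEXT);
        CREATE TABLE target_components (tid INTEGER, component_id INTEGER);
        CREATE TABLE component_synonyms (component_id INTEGER,
                                         component_synonym TEXT, syn_type TEXT);
        INSERT INTO target_dictionary VALUES (1, 'CHEMBL_TOY1', 'Homo sapiens', 9606, 'Toy receptor');
        INSERT INTO target_components VALUES (1, 10);
        INSERT INTO component_synonyms VALUES (10, 'TOYR', 'GENE_SYMBOL');
        INSERT INTO component_synonyms VALUES (10, 'Toy receptor protein', 'TOY_OTHER_TYPE');
    """)
    return conn


def gene_symbols(conn, target_chembl_id, syn_type="GENE_SYMBOL"):
    """Return the synonyms of the given type for one target."""
    rows = conn.execute(QUERY, (syn_type, target_chembl_id)).fetchall()
    return [row[-1] for row in rows]


conn = build_toy_db()
print(gene_symbols(conn, "CHEMBL_TOY1"))
```

Passing the target ID and synonym type as query parameters avoids string formatting in SQL and makes it easy to loop over a list of targets.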

This web services example may also be useful: https://chembl.gitbook.io/chembl-interface-documentation/web-services/chembl-data-web-services#having-a-list-of-molecules-chembl-ids-in-a-csv-file-produce-another-csv-file-that-maps-every-compound-id-into-a-list-of-human-gene-names

What are the information sources for the protein classification?

The ChEMBL protein family classification is a bespoke classification, reflecting some of the preferred or widely used community classifications for different protein families. It does not aim to be comprehensive, but is focused around the main families that have been historically important for drug discovery. The classification is maintained manually, with new classes being added as/when needed (usually when a significant number of targets in a particular family appear in ChEMBL, or where they are of particular significance e.g., targets of approved drugs/candidates).

Curation of protein targets remains ongoing and we are in the process of reviewing/updating our protein classification. Any feedback on useful categories is welcome.

Here are some sources useful to our classification:

• Kinases - Manning et al. kinome paper (PMID: 12471243)

• GPCRs - Guide to Pharmacology/IUPHAR, GPCRdb

• Transporters - mostly from TCDB

• Epigenetic - domain-based classification from SGC Chromohub (many epigenetic proteins have multiple domains, so multiple classifications)

• Proteases - MEROPS

• Enzymes - EC number

• UniProt keywords may have been used to classify other groups, e.g. secreted proteins. UniProt is a comprehensive protein-specific resource and a useful starting point that also links out to other specialised protein databases (e.g. GPCR, TCDB).

