Field names and data types - basic submission

The following files are necessary for a minimum data deposition of simple data, such as those describing experiments where 'compound X has affinity (Ki) of Y uM in assay Z.' For most depositors, this data model is sufficient to capture their data. Such depositors need only deposit their data using the files in this section. These must all be tab-separated files.

Additional files required to describe more complex data are found in the Complex results sets section.

PLEASE NOTE: For ease of viewing the column headers have been formatted as rows. When submitting data to ChEMBL the formatting should be rotated by 90 degrees so that the fields are column headers and each row below is a data entry point.
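As a minimal sketch of the expected orientation, the Python snippet below writes a tab-separated file in which the field names form the header row and each following row is one data entry. The record values are invented placeholders, and the helper name `to_tsv` is hypothetical; only the field names come from this guide.

```python
import csv
import io

# Field names become the column headers (the first row of the file).
FIELDS = ["RIDX", "PUBMED_ID", "YEAR", "REF_TYPE", "TITLE", "DOI", "ABSTRACT", "AUTHORS"]

# Placeholder record, invented purely for illustration.
rows = [
    {
        "RIDX": "MyProject_2024",
        "PUBMED_ID": "",
        "YEAR": "2024",
        "REF_TYPE": "Dataset",
        "TITLE": "Example screening dataset",
        "DOI": "10.0000/example",
        "ABSTRACT": "Short description of the dataset.",
        "AUTHORS": "Example Organisation",
    }
]


def to_tsv(fields, records):
    """Return tab-separated text: one header row, then one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


tsv_text = to_tsv(FIELDS, rows)
print(tsv_text)
```

Writing to a real file instead of a string buffer only requires passing an open file handle to `csv.DictWriter`.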

REFERENCE.tsv

The REFERENCE file describes the provenance of the deposition, including the title, authors and abstract. It can contain the details of a published reference or of a paper that has been submitted but not yet published, or it can describe a dataset that is not associated with a publication.

The file should contain sufficient details for a user to locate your publication or project by reading the reference. For example, for data associated with a project, the ABSTRACT field could contain a short description of the dataset and a link to the project site.

  • RIDX, TITLE, YEAR, ABSTRACT, AUTHORS and REF_TYPE are mandatory; all other fields are optional.

  • For a deposited dataset, YEAR should be the year we received the file.

  • It is mandatory to have either a PMID or a DOI. If your data do not have one, please contact us before submission. We can provide a temporary ChEMBL DOI for datasets until an external DOI has been generated.

The TITLE, ABSTRACT and AUTHORS fields of the REFERENCE file should be populated for datasets as well as for publications. These can be brief (e.g. an organisation name can be provided for the AUTHORS field), and the abstract can be a simple summary of the experiments and their overall purpose.

| Header | Description | Existence | Data Type |
| --- | --- | --- | --- |
| RIDX | The RIDX set by the depositor - a primary key | Mandatory | Any character up to a length of 200. Should not start with 0 |
| PUBMED_ID | PubMed ID | Mandatory - Must be present, may be empty if there is a DOI | Any positive integer up to a length of 11 |
| JOURNAL_NAME | Journal name | Optional | Any character up to a length of 50 |
| YEAR | Year of publication | Mandatory | Any integer up to a length of 4 between 1900 and 2050 |
| VOLUME | The volume of the publication | Optional | Any character up to a length of 50 |
| ISSUE | The issue of the publication | Optional | Any character up to a length of 50 |
| FIRST_PAGE | The first page of the article | Optional | Any positive integer up to a length of 50 |
| LAST_PAGE | The last page of the article | Optional | Any positive integer up to a length of 50 |
| REF_TYPE | The type of reference (Publication, Patent, Dataset, Book) | Mandatory | One of Patent, Publication, Dataset or Book |
| TITLE | The title of the reference | Mandatory | Any character up to a length of 500 |
| DOI | The Digital Object Identifier | Mandatory - Must be present, may be empty if there is a PUBMED_ID | Any DOI up to a length of 200 |
| PATENT_ID | The Patent Identifier | Optional | Any Patent Identifier up to a length of 200 |
| ABSTRACT | The abstract of the article. For a dataset, include a description of the dataset here | Mandatory | A very large text field |
| AUTHORS | A list of the authors of the publication | Mandatory | Any character up to a length of 4000 |
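The REFERENCE rules above can be checked before submission. The following is a rough sketch, not the validation that ChEMBL itself runs: the function name `check_reference` and the example rows are hypothetical, and only the rules stated in this section (mandatory fields, PMID or DOI, RIDX format, YEAR range, REF_TYPE values) are covered.

```python
REF_TYPES = {"Patent", "Publication", "Dataset", "Book"}
MANDATORY = ["RIDX", "TITLE", "YEAR", "ABSTRACT", "AUTHORS", "REF_TYPE"]


def check_reference(row):
    """Return a list of problems found in one REFERENCE row (a dict of strings)."""
    problems = []
    for field in MANDATORY:
        if not row.get(field, "").strip():
            problems.append(f"{field} is mandatory")
    # The columns must exist, but only one of the two needs a value.
    if not (row.get("PUBMED_ID", "").strip() or row.get("DOI", "").strip()):
        problems.append("either PUBMED_ID or DOI must be populated")
    ridx = row.get("RIDX", "")
    if len(ridx) > 200 or ridx.startswith("0"):
        problems.append("RIDX must be at most 200 characters and not start with 0")
    year = row.get("YEAR", "").strip()
    if year and not (year.isdecimal() and 1900 <= int(year) <= 2050):
        problems.append("YEAR must be an integer between 1900 and 2050")
    ref_type = row.get("REF_TYPE", "").strip()
    if ref_type and ref_type not in REF_TYPES:
        problems.append("REF_TYPE must be one of Patent, Publication, Dataset or Book")
    return problems


# Invented example rows.
good = {
    "RIDX": "MyProject_2024", "PUBMED_ID": "", "DOI": "10.0000/example",
    "YEAR": "2024", "REF_TYPE": "Dataset", "TITLE": "Example dataset",
    "ABSTRACT": "Short description.", "AUTHORS": "Example Organisation",
}
bad = dict(good, YEAR="1850", DOI="")

print(check_reference(good))
print(check_reference(bad))
```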

COMPOUND_RECORD.tsv

COMPOUND_RECORD files provide links between a record ID and a compound ID. If a CIDX-RIDX combination does not yet exist in the COMPOUND_RECORDS table for this src_id, then one will be created. If one already exists, then this existing one will be updated with the COMPOUND_NAME, etc, present in the incoming file. The CIDX is then referenced in ACTIVITY files, to identify which compounds were used.

You must provide at least one of COMPOUND_NAME, COMPOUND_KEY or COMPOUND_SOURCE in order to make the compound searchable. RIDX is optional, but if included it must be an RIDX owned by the depositor.

| Header | Description | Existence | Data Type |
| --- | --- | --- | --- |
| CIDX | The CIDX set by the depositor - a primary key | Mandatory | Any character up to a length of 200 |
| RIDX | The RIDX cited by the depositor. MUST be owned by the same depositor | Optional | Any character up to a length of 200 |
| COMPOUND_KEY | The local synonym used for this compound in the RIDX referenced | Mandatory | Any character up to a length of 250 |
| COMPOUND_NAME | The name used for this compound in the RIDX referenced | Mandatory | Any character up to a length of 4000 |
| COMPOUND_SOURCE | The source of this compound in the RIDX referenced | Optional | Any character up to a length of 400 |
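A COMPOUND_RECORD row can be sanity-checked along the following lines. This is a sketch under stated assumptions: the function name `check_compound_record` is hypothetical, and it applies the "at least one of COMPOUND_NAME, COMPOUND_KEY or COMPOUND_SOURCE" rule from the text above. The table lists both COMPOUND_KEY and COMPOUND_NAME as mandatory, so a stricter check may be more appropriate for your deposition.

```python
MAX_LENGTH = {
    "CIDX": 200,
    "RIDX": 200,
    "COMPOUND_KEY": 250,
    "COMPOUND_NAME": 4000,
    "COMPOUND_SOURCE": 400,
}
SEARCHABLE = ("COMPOUND_NAME", "COMPOUND_KEY", "COMPOUND_SOURCE")


def check_compound_record(row):
    """Return a list of problems found in one COMPOUND_RECORD row (a dict of strings)."""
    problems = []
    if not row.get("CIDX", "").strip():
        problems.append("CIDX is mandatory")
    # At least one descriptive field is needed to make the compound searchable.
    if not any(row.get(field, "").strip() for field in SEARCHABLE):
        problems.append("provide at least one of COMPOUND_NAME, COMPOUND_KEY or COMPOUND_SOURCE")
    for field, limit in MAX_LENGTH.items():
        if len(row.get(field, "")) > limit:
            problems.append(f"{field} exceeds {limit} characters")
    return problems


# Invented example rows.
ok_row = {"CIDX": "CPD-1", "RIDX": "MyProject_2024", "COMPOUND_KEY": "1a", "COMPOUND_NAME": "aspirin"}
anonymous_row = {"CIDX": "CPD-2"}

print(check_compound_record(ok_row))
print(check_compound_record(anonymous_row))
```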

COMPOUND_CTAB.sdf

The COMPOUND_CTAB file (a Chemical Table file) defines the structure of a compound. It is possible to load data without a CTAB file. If you include a CTAB file or will be loading structure data later, the CIDX fields in the CTAB must match the CIDX IDs in the ACTIVITY.tsv file.

This must be in V2000 molfile format. The InChI binaries we currently use do not accept V3000 molfiles, and we need an InChI Key for every compound in ChEMBL, so V3000 structures cannot be accepted.

| Header | Description | Existence | Data type |
| --- | --- | --- | --- |
| CIDX | The CIDX established by the depositor in the COMPOUNDS file - a foreign key | Mandatory | Any character up to a length of 200 |
| CTAB | The CTAB (Connection table) assigned to this CIDX | Optional | A very large text field |
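Before submitting an SDF, it can help to confirm that every record is a V2000 molfile and to list the CIDX values it carries. The sketch below uses only the Python standard library. It assumes that the CIDX is stored as an SDF data item named `CIDX` (a line such as `> <CIDX>` followed by the value), which is an assumption about the file layout rather than something this guide specifies. The function names and the two sample records are likewise invented for illustration.

```python
def split_sdf(sdf_text):
    """Split SDF text into records (lists of lines) on the '$$$$' delimiter."""
    records, current = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        records.append(current)  # tolerate a missing final delimiter
    return records


def inspect_record(lines, tag="CIDX"):
    """Return (identifier, is_v2000) for one SDF record."""
    # The counts line is the fourth line of a molfile and ends with the version.
    counts_line = lines[3] if len(lines) > 3 else ""
    is_v2000 = counts_line.rstrip().endswith("V2000")
    identifier = None
    for index, line in enumerate(lines[:-1]):
        if line.startswith(">") and f"<{tag}>" in line:
            identifier = lines[index + 1].strip()
    return identifier, is_v2000


def check_sdf(sdf_text):
    """Return a list of (identifier, is_v2000) tuples, one per record."""
    return [inspect_record(record) for record in split_sdf(sdf_text)]


# Two invented records: the first is V2000, the second is V3000.
sample = "\n".join([
    "methane", "  example", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <CIDX>", "CPD-1", "",
    "$$$$",
    "methane", "  example", "",
    "  0  0  0     0  0            999 V3000",
    "M  V30 BEGIN CTAB",
    "M  V30 END CTAB",
    "M  END",
    "> <CIDX>", "CPD-2", "",
    "$$$$",
])

for cidx, is_v2000 in check_sdf(sample):
    print(cidx, "V2000" if is_v2000 else "not V2000")
```

The set of identifiers returned here can then be compared with the CIDX column of the ACTIVITY file.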

ASSAY.tsv

The ASSAY file provides a brief description of the assay, along with the target organism, tissue, cellular fraction etc.

  • RIDX is optional, but if included it must be an RIDX owned by the depositor.

  • The RIDX can be included in the current deposition files or can be one already loaded into ChEMBL. CRIDX should generally be identical to RIDX.

  • ASSAY_TAX_ID and TARGET_TAX_ID must be the NCBI Taxon ID for the organism, not the strain. The strain for the assay organism can be given in ASSAY_STRAIN.

  • ACT_ID is mandatory if providing an ACTIVITY_PROPERTIES or ACTIVITY_SUPPLEMENTARY record.

  • AIDX, ASSAY_DESCRIPTION and ASSAY_TYPE are all mandatory.

  • TARGET_TYPE must be one from the defined list.

  • Activity files must cite only existing DDIs.

An ASSAY record is a single instance of an assay, not an assay protocol record. If the same assay protocol is used in multiple datasets, it is still a new ASSAY in each dataset. Each assay can only have a single target; in the case of a panel assay, there will need to be one assay record per target.

| Header | Description | Existence | Data type |
| --- | --- | --- | --- |
| AIDX | The AIDX set by the depositor - a primary key | Mandatory | Any character up to a length of 200 |
| RIDX | The RIDX cited by the depositor. MUST be owned by the depositor | Optional | Any character up to a length of 200 |
| ASSAY_DESCRIPTION | A description of the assay | Mandatory | Any character up to a length of 4000 |
| ASSAY_TYPE | The type of the assay: B (Binding), F (Functional), A (ADMET), U (Unassigned), P (Physicochemical) or T (Toxicity) | Mandatory | Accepted Assay types (ADMET, A, Functional, F, Binding, B, Unassigned, U, Physiochemical, P, Toxicity, T) |
| ASSAY_ORGANISM | The assay organism | Optional (Mandatory if ASSAY_TAX_ID is populated) | Any character up to a length of 250 |
| ASSAY_STRAIN | The strain of the assay organism | Optional | Any character up to a length of 200 |
| ASSAY_TAX_ID | NCBI taxonomy ID for the assay organism | Optional | Any positive integer up to a length of 11 or '0' |
| ASSAY_SOURCE | The original source of the assay | Optional | Any character up to a length of 100 |
| ASSAY_TISSUE | The type of tissue used in the assay | Optional | Any character up to a length of 100 |
| ASSAY_CELL_TYPE | The cell line - see section below | Optional | Any character up to a length of 100 |
| ASSAY_SUBCELLULAR_FRACTION | The subcellular fraction used in the assay | Optional | Any character up to a length of 100 |
| TARGET_TYPE | The type of target | Optional | Any character up to a length of 25 (see the target type list) |
| TARGET_NAME | The name of the target | Optional | Any character up to a length of 400 |
| TARGET_ACCESSION | A UniProt accession for the target where the target is a protein. Otherwise to be left blank. | Optional | An integer or a UniProt ID up to a length of 255 |
| TARGET_ORGANISM | The target organism | Optional | Any character up to a length of 100 |
| TARGET_TAX_ID | The NCBI taxonomy ID of the target organism | Optional | Any positive integer up to a length of 11 or '0' |
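The ASSAY rules lend themselves to a simple pre-submission check. The sketch below is illustrative only: `check_assay` is a hypothetical helper, the accepted ASSAY_TYPE strings are copied as they appear in the table above, and the check against the defined TARGET_TYPE list is omitted because that list is not reproduced in this section.

```python
# Accepted values copied from the ASSAY_TYPE row of the table.
ASSAY_TYPES = {
    "ADMET", "A", "Functional", "F", "Binding", "B",
    "Unassigned", "U", "Physiochemical", "P", "Toxicity", "T",
}


def check_assay(row):
    """Return a list of problems found in one ASSAY row (a dict of strings)."""
    problems = []
    for field in ("AIDX", "ASSAY_DESCRIPTION", "ASSAY_TYPE"):
        if not row.get(field, "").strip():
            problems.append(f"{field} is mandatory")
    assay_type = row.get("ASSAY_TYPE", "").strip()
    if assay_type and assay_type not in ASSAY_TYPES:
        problems.append(f"ASSAY_TYPE {assay_type!r} is not an accepted assay type")
    for field in ("ASSAY_TAX_ID", "TARGET_TAX_ID"):
        tax_id = row.get(field, "").strip()
        # A positive integer or '0', at most 11 digits long.
        if tax_id and not (tax_id.isdecimal() and len(tax_id) <= 11):
            problems.append(f"{field} must be a non-negative integer of at most 11 digits")
    if row.get("ASSAY_TAX_ID", "").strip() and not row.get("ASSAY_ORGANISM", "").strip():
        problems.append("ASSAY_ORGANISM is mandatory when ASSAY_TAX_ID is populated")
    return problems


# Invented example rows.
good_assay = {
    "AIDX": "ASSAY-1", "ASSAY_DESCRIPTION": "Binding affinity at target Z",
    "ASSAY_TYPE": "B", "ASSAY_ORGANISM": "Homo sapiens", "ASSAY_TAX_ID": "9606",
}
bad_assay = {
    "AIDX": "ASSAY-2", "ASSAY_DESCRIPTION": "Cell viability",
    "ASSAY_TYPE": "X", "ASSAY_TAX_ID": "9606",
}

print(check_assay(good_assay))
print(check_assay(bad_assay))
```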

ASSAY_CELL_TYPE rules

The ASSAY_CELL_TYPE is either a cell-line name (e.g. MRC-5) or an endogenous cell type (e.g. fibroblasts). You can see this in an example cell report card from ChEMBL: https://www.ebi.ac.uk/chembl/cell_line_report_card/CHEMBL3308499/

For cells that are within ChEMBL, the cell report card cell name can be used to populate the 'ASSAY_CELL_TYPE' field, the CHEMBL_ID to populate the MC_CELL_LINE_CHEMBL_ID field and the Cellosaurus_ID can be used to populate the MC_CELLOSAURUS_ID field.

For cell lines that are not found within ChEMBL, Cellosaurus is a good resource. For endogenous cells please try to use the Experimental Factor Ontology (EFO) which can be searched on the Ontology Lookup Service. Otherwise, a depositor-defined name can be provided as the ASSAY_CELL_TYPE.

ASSAY_PARAM.tsv

The ASSAY_PARAM file describes the assay parameters. Examples include the concentration of compound used, the pH of the buffer, the instrument used for data collection and the timepoints of the experiment.

Thus the ASSAY_PARAM file can be used to assign a list of parameters to an AIDX, which must have either been previously deposited in ChEMBL, or be defined in the ASSAY file, or both.

It is possible to store multiple parameters for one assay. Depositors can set their own parameters, but should use the same description for a type of data every time it is entered; for example, you must not use both CONC and CONCENTRATION if they are both referring to the same sort of concentration data.

Either a numeric VALUE or a TEXT_VALUE can be given for a single parameter; if a numeric value is given then it must include a relation, for example =, < or >.

| Header | Description | Existence | Data Type |
| --- | --- | --- | --- |
| AIDX | The AIDX established by the depositor in the ASSAY file - a foreign key | Mandatory | Any character up to a length of 200 |
| TYPE | The type of parameter. Must be unique within an AIDX | Mandatory | Any character up to a length of 250 |
| RELATION | Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>') | Optional (Mandatory if VALUE is given) | Any relation symbol (=, >, <, ~, <=, >=, >>, <<) up to a length of 50 |
| VALUE | The numerical value of the parameter. | Optional | Any number (including decimals, negatives and scientific notation (e.g. 3×10^2)) |
| UNITS | The units of the parameter measurement | Optional | Any character up to a length of 100 |
| TEXT_VALUE | The text value of non-numerical values | Optional | Any character up to a length of 4000 |
| COMMENTS | A comment on the parameter. | Optional | Any character up to a length of 4000 |

This file describes the assay parameters. For example, here the first two records show the concentration of compound used.

  • AIDX and TYPE are both mandatory.

  • A VALUE requires an entry in the RELATION field. A TEXT_VALUE requires that RELATION is empty.

  • It is a many-to-one mapping, so you can store multiple parameters for one assay.

  • Depositors can set their own TYPE, but you need to use the same TYPE string for each form of data. For example, you may not use both CONC and CONCENTRATION as TYPEs if they are both referring to the same sort of concentration data.

  • AIDX must match an existing AIDX owned by the depositor.
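The ASSAY_PARAM rules listed above can be expressed as a short check over all rows of the file. This is a sketch rather than the loader's own logic: `check_assay_params` is a hypothetical name, and numeric values are tested with Python's `float()`, which accepts e-notation such as `3e2`. Whether other spellings of scientific notation are accepted by the loader is not covered here.

```python
RELATIONS = {"=", ">", "<", "~", "<=", ">=", ">>", "<<"}


def check_assay_params(rows):
    """Return a list of problems found across ASSAY_PARAM rows (dicts of strings)."""
    problems = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        aidx = row.get("AIDX", "").strip()
        ptype = row.get("TYPE", "").strip()
        relation = row.get("RELATION", "").strip()
        value = row.get("VALUE", "").strip()
        text_value = row.get("TEXT_VALUE", "").strip()

        if not aidx or not ptype:
            problems.append(f"row {number}: AIDX and TYPE are mandatory")
        # TYPE must be unique within an AIDX.
        if (aidx, ptype) in seen:
            problems.append(f"row {number}: TYPE {ptype!r} is repeated for AIDX {aidx!r}")
        seen.add((aidx, ptype))

        if value and text_value:
            problems.append(f"row {number}: give either VALUE or TEXT_VALUE, not both")
        elif value:
            try:
                float(value)
            except ValueError:
                problems.append(f"row {number}: VALUE {value!r} is not numeric")
            if relation not in RELATIONS:
                problems.append(f"row {number}: VALUE requires a RELATION")
        elif text_value and relation:
            problems.append(f"row {number}: TEXT_VALUE requires an empty RELATION")
    return problems


# Invented example rows.
params = [
    {"AIDX": "ASSAY-1", "TYPE": "CONC", "RELATION": "=", "VALUE": "10", "UNITS": "uM"},
    {"AIDX": "ASSAY-1", "TYPE": "CONC", "RELATION": "", "VALUE": "20", "UNITS": "uM"},
    {"AIDX": "ASSAY-1", "TYPE": "INSTRUMENT", "RELATION": "=", "TEXT_VALUE": "plate reader"},
]

for problem in check_assay_params(params):
    print(problem)
```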

ACTIVITY.tsv

The ACTIVITY file outlines the numerical or text value of the data arising from a compound (CIDX) used in an assay (AIDX). If a source is updated and includes the previously loaded ACTIVITY file, existing activities will need to be wiped before loading these supplementary files.

For more complex data types, discussed in Complex results sets, supplementary ACTIVITY_PROPERTIES, ACTIVITY_SUPPLEMENTARY and ACTIVITY_SUPP_MAP files can be used to append additional data to the activities using the activity ID (ACT_ID) and test occasion ID (TEOID) set in this file to link them.

  • CIDX, AIDX, ACT_ID, CRIDX, TYPE and ACTIVITY are mandatory.

  • A VALUE is numeric and requires an entry in the RELATION field.

  • TEXT_VALUEs should not have a RELATION sign.

  • ACTION_TYPE must be one from the defined list.

  • ACT_ID is mandatory if providing an ACTIVITY_PROPERTIES or ACTIVITY_SUPPLEMENTARY record that maps to a given line.

  • It is possible to load data without a CTAB file. If you include a CTAB file or will be loading structure data later, the CIDX fields in the CTAB must match the CIDX IDs here.

  • RIDX is optional, but if included it must be an RIDX owned by the depositor.

  • CRIDX should generally be identical to RIDX, unless you need to reference a separate paper for the activity and the compound.

  • SRC_ID_AIDX and SRC_ID_CIDX are used when depositing against other depositors' entities, as discussed in a later section.
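To illustrate how the RELATION, VALUE and TEXT_VALUE rules fit together, the sketch below splits a reported result such as `<10` into those three fields. It is a convenience helper under stated assumptions, not part of the deposition format: the name `split_result` is hypothetical, a bare number is assumed to imply the relation `=`, and anything that is not a plain number is placed in TEXT_VALUE with an empty RELATION. Ranges are not handled.

```python
import re

# Longer symbols are listed first so that '<=' is not read as '<'.
_RESULT = re.compile(
    r"^\s*(<=|>=|<<|>>|=|<|>|~)?\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def split_result(reported):
    """Split a reported result string into RELATION, VALUE and TEXT_VALUE."""
    match = _RESULT.match(reported)
    if match:
        # Assumption: a number reported without a symbol means '='.
        relation = match.group(1) or "="
        return {"RELATION": relation, "VALUE": match.group(2), "TEXT_VALUE": ""}
    # Non-numeric results carry no relation sign.
    return {"RELATION": "", "VALUE": "", "TEXT_VALUE": reported.strip()}


for reported in ["<10", "5.2", ">= 1e-3", "Not active"]:
    print(reported, "->", split_result(reported))
```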

| Header | Description | Existence | Data type |
| --- | --- | --- | --- |
| CIDX | The CIDX established by the depositor in the COMPOUNDS file - a foreign key | Mandatory | Any character up to a length of 200 |
| CRIDX | The RIDX to be associated with the CIDX in the creation of the compound record. Must belong to SRC_ID_CIDX. | Mandatory | Any character up to a length of 200 |
| SRC_ID_CIDX | The SRC_ID for the CIDX. If not specified, then value is assumed to be the SRC_ID for the depositor - this field can be used to reference compounds deposited by other depositors | Optional | Any positive integer up to a length of 4 |
| AIDX | The AIDX established by the depositor in the ASSAY file - a foreign key | Mandatory | Any character up to a length of 200 |
| SRC_ID_AIDX | The SRC_ID for the AIDX. If not specified, then value is assumed to be the SRC_ID for the depositor | Optional | Any positive integer up to a length of 4 |
| RIDX | The RIDX established by the depositor in the REFERENCE file - a foreign key | Deprecated. Not visible to end-users | Any character up to a length of 200 |
| TEXT_VALUE | The text value of non-numerical values. Do not use for value ranges. Provide these as upper and lower bounds, as VALUE fields. | Optional | Any character up to a length of 1000 |
| RELATION | Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>') | Optional (Mandatory if a VALUE is given) | Any relation symbol (=, >, <, ~, <=, >=, >>, <<) up to a length of 50 |
| VALUE | The numerical value of the activity measurement (see ACTIVITY_COMMENT for non-numerical values) | Optional | Any number (including decimals, negatives and scientific notation (e.g. 3×10^2)) |
| UPPER_VALUE | Where the activity is a range, this represents the highest value of the range (numerically), while the VALUE column represents the lower value | Optional | Any number (including decimals, negatives and scientific notation (e.g. 3×10^2)) |
| UNITS | The units of the measurement | Optional | Any character up to a length of 100 |
| SD_MINUS | Standard Deviation Lower limit | Optional | Any number (including decimals, negatives and scientific notation (e.g. 3×10^2)) |
| SD_PLUS | Standard Deviation Upper limit | Optional | Any number (including decimals, negatives and scientific notation (e.g. 3×10^2)) |
| ACTIVITY_COMMENT | A comment on the activity measurement. Non-numerical 'values' should be given here. Equivalent to 'TEXT_VALUE' field in many other tables. | Optional | Any character up to a length of 4000 |
| CRIDX_CHEMBLID | The CHEMBLID for the CRIDX. Must belong to the SRC_ID_CIDX | Optional | Any character up to a length of 200 |
| CRIDX_DOCID | The DOCID for the CRIDX. Must belong to the SRC_ID_CIDX | Optional | Any character up to a length of 200 |
| ACT_ID | A local ID used to relate records in ACTIVITY_PROPERTIES and Supplementary tables. A primary key in this table. Not required unless depositing such data. | Optional (Mandatory if there are ACTIVITY_PROPERTIES or ACTIVITY_SUPPLEMENTARY files) | VARCHAR(50). Must be unique for each activity line, even if they have identical Properties. |
| TEOID | Test Occasion ID, grouping together related Activity records. Usually used to group activities by time point and treatment. A primary key in this table. | Optional | Any integer up to a length of 11 |
| TYPE | The type of measurement | Mandatory | Any character up to a length of 250 |
| ACTION_TYPE | Specifies the effect of the drug on its target for protein-based assays. Must match one of the names in the ACTION_TYPE table | Optional | Any character up to a length of 50 - must be in the ACTION_TYPE table (see below) |
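Because ACTIVITY rows refer to compounds and assays by CIDX and AIDX, it is worth confirming that those identifiers resolve before submission. The sketch below only covers identifiers defined in the same deposition: rows that set SRC_ID_CIDX or SRC_ID_AIDX are skipped for that check, and identifiers already loaded into ChEMBL cannot be verified this way. The function name `check_links` and the example rows are hypothetical.

```python
def check_links(activities, compound_records, assays):
    """Return problems with CIDX/AIDX references and ACT_ID uniqueness."""
    known_cidx = {row["CIDX"] for row in compound_records}
    known_aidx = {row["AIDX"] for row in assays}
    seen_act_ids = set()
    problems = []
    for number, row in enumerate(activities, start=1):
        cidx = row.get("CIDX", "").strip()
        aidx = row.get("AIDX", "").strip()
        # Rows pointing at another depositor's entities cannot be checked locally.
        if not row.get("SRC_ID_CIDX", "").strip() and cidx not in known_cidx:
            problems.append(f"row {number}: CIDX {cidx!r} is not in COMPOUND_RECORD")
        if not row.get("SRC_ID_AIDX", "").strip() and aidx not in known_aidx:
            problems.append(f"row {number}: AIDX {aidx!r} is not in ASSAY")
        act_id = row.get("ACT_ID", "").strip()
        if act_id:
            if act_id in seen_act_ids:
                problems.append(f"row {number}: ACT_ID {act_id!r} is not unique")
            seen_act_ids.add(act_id)
    return problems


# Invented example rows.
compound_rows = [{"CIDX": "CPD-1"}]
assay_rows = [{"AIDX": "ASSAY-1"}]
activity_rows = [
    {"CIDX": "CPD-1", "AIDX": "ASSAY-1", "ACT_ID": "ACT-1"},
    {"CIDX": "CPD-9", "AIDX": "ASSAY-1", "ACT_ID": "ACT-1"},
    {"CIDX": "CPD-7", "AIDX": "ASSAY-1", "SRC_ID_CIDX": "12"},
]

for problem in check_links(activity_rows, compound_rows, assay_rows):
    print(problem)
```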

INFO.txt

An INFO file can be included in a deposition to contain any additional information you might want to include with the deposited data. The INFO file will be stored but not entered into the database.
