Field names and data types - basic submission
The following files are necessary for a minimum data deposition of simple data, such as those describing experiments where 'compound X has affinity (Ki) of Y uM in assay Z.' For most depositors, this data model is sufficient to capture their data. Such depositors need only deposit their data using the files in this section. These must all be tab-separated files.
Additional files required to describe more complex data are found in the Complex results sets section.
PLEASE NOTE: For ease of viewing the column headers have been formatted as rows. When submitting data to ChEMBL the formatting should be rotated by 90 degrees so that the fields are column headers and each row below is a data entry point.
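As an illustration of the layout described above, the following minimal sketch writes a REFERENCE file with the field names as a single tab-separated header line and one line per data entry. The column order simply follows the field table, and every row value (the RIDX, the DOI, the title) is an invented placeholder rather than a real record.

```python
import csv
import io

# Field names taken from the REFERENCE field table, in the order listed there.
REFERENCE_FIELDS = [
    "RIDX", "PUBMED_ID", "JOURNAL_NAME", "YEAR", "VOLUME", "ISSUE",
    "FIRST_PAGE", "LAST_PAGE", "REF_TYPE", "TITLE", "DOI", "PATENT_ID",
    "ABSTRACT", "AUTHORS", "CONTACT",
]


def write_tsv(rows, fieldnames, handle):
    """Write one header line, then one tab-separated line per data entry."""
    writer = csv.DictWriter(
        handle,
        fieldnames=fieldnames,
        delimiter="\t",
        lineterminator="\n",
        restval="",  # fields missing from a row are written as empty cells
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical dataset reference; all values are placeholders.
example_row = {
    "RIDX": "MyLab_Dataset_1",
    "YEAR": "2024",
    "REF_TYPE": "Dataset",
    "TITLE": "Example screening dataset",
    "DOI": "10.0000/example",  # placeholder, not a real DOI
    "ABSTRACT": "Short description of the dataset.",
    "AUTHORS": "Example Organisation",
}

buffer = io.StringIO()
write_tsv([example_row], REFERENCE_FIELDS, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead of a string buffer, pass a file opened with `open("REFERENCE.tsv", "w", newline="")` as the handle; the file name here is only an example.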
The REFERENCE file describes the provenance of the deposition, including the title, authors and abstract. It can contain the details of a published reference or of a paper submitted before publication, or it can describe a dataset that is not associated with a publication.
The file should contain sufficient details for a user to locate your publication or project by reading the reference. For example, for data associated with a project, the ABSTRACT field could contain a short description of the dataset and a link to the project site.
RIDX, TITLE, YEAR, ABSTRACT, AUTHORS and REF_TYPE are mandatory; all other fields are optional.
For a deposited dataset, YEAR should be the year we received the file.
It is mandatory to have either a PMID or DOI, if your data do not have one please contact us before submission. We can provide a temporary ChEMBL DOI for datasets until an external DOI has been generated.
The TITLE, ABSTRACT and AUTHORS fields of the REFERENCE file should be populated for datasets as well as publications. These can be brief (e.g. an organisation name can be provided for the AUTHORS field), and the abstract can be a simple summary of the experiments and their overall purpose. The reference should contain sufficient details for a user to locate your publication or project; for example, the ABSTRACT field could contain a short description of the dataset and a link to the project site.
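The REFERENCE rules above can be checked before submission with a few lines of code. This is a minimal sketch, not the validation ChEMBL itself runs: the function name `check_reference` and the message wording are invented for illustration, and rows are assumed to be dictionaries of strings keyed by field name.

```python
# Mandatory fields and permitted REF_TYPE values as stated in this section.
MANDATORY_FIELDS = ("RIDX", "TITLE", "YEAR", "ABSTRACT", "AUTHORS", "REF_TYPE")
REF_TYPES = {"Patent", "Publication", "Dataset", "Book"}


def check_reference(row):
    """Return a list of problems found in one REFERENCE row (empty if none)."""
    problems = []
    for field in MANDATORY_FIELDS:
        if not row.get(field, "").strip():
            problems.append(f"{field} is mandatory")
    # At least one of PUBMED_ID or DOI must be filled in.
    if not (row.get("PUBMED_ID", "").strip() or row.get("DOI", "").strip()):
        problems.append("either PUBMED_ID or DOI is required")
    ref_type = row.get("REF_TYPE", "").strip()
    if ref_type and ref_type not in REF_TYPES:
        problems.append(f"unknown REF_TYPE: {ref_type}")
    return problems


# Hypothetical dataset reference; all values are placeholders.
good_reference = {
    "RIDX": "MyLab_Dataset_1",
    "TITLE": "Example screening dataset",
    "YEAR": "2024",
    "ABSTRACT": "Short description and a link to the project site.",
    "AUTHORS": "Example Organisation",
    "REF_TYPE": "Dataset",
    "DOI": "10.0000/example",  # placeholder, not a real DOI
}

print(check_reference(good_reference))
print(check_reference(dict(good_reference, DOI="", TITLE=" ")))
```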
COMPOUND_RECORD files provide links between a record ID and a compound ID. If a CIDX-RIDX combination does not yet exist in the COMPOUND_RECORDS table for this src_id, then one will be created. If one already exists, then this existing one will be updated with the COMPOUND_NAME, etc, present in the incoming file. The CIDX is then referenced in ACTIVITY files, to identify which compounds were used.
You must provide at least one of COMPOUND_NAME, COMPOUND_KEY or COMPOUND_SOURCE in order to make the compound searchable. RIDX is optional, but if included it must be an RIDX owned by the depositor.
The COMPOUND_CTAB file (a Chemical Table file) defines the structure of a compound. It is possible to load data without a CTAB file. If you include a CTAB file or will be loading structure data later, the CIDX fields in the CTAB must match the CIDX IDs in the ACTIVITY.tsv file.
The CTAB must be in V2000 molfile format. The InChI binaries we currently use do not accept V3000 molfiles, and because we need an InChI Key for ChEMBL, we cannot accept V3000 molfiles.
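A quick local check can catch V3000 connection tables before they are deposited. The sketch below relies on the standard molfile layout, in which the fourth line (the counts line) ends with the version tag. The function name `molfile_version` and the two small example blocks are illustrative assumptions; this is not a full molfile parser and does not check the chemistry.

```python
def molfile_version(molblock):
    """Return 'V2000' or 'V3000' based on the counts line, or None if unclear."""
    lines = molblock.splitlines()
    if len(lines) < 4:
        return None
    counts_line = lines[3].rstrip()  # three header lines precede the counts line
    for version in ("V2000", "V3000"):
        if counts_line.endswith(version):
            return version
    return None


# A single-carbon V2000 block, used only as a small example.
v2000_block = "\n".join([
    "",
    "  example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

# Header of a V3000 block: the counts line carries the V3000 tag instead.
v3000_block = "\n".join([
    "",
    "  example",
    "",
    "  0  0  0     0  0            999 V3000",
    "M  V30 BEGIN CTAB",
])

print(molfile_version(v2000_block))
print(molfile_version(v3000_block))
```

Any CTAB for which this returns something other than `V2000` would need converting in a cheminformatics toolkit before it is deposited.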
The ASSAY file provides a brief description of the assay, along with the target organism, tissue, cellular fraction etc.
RIDX is optional, but if included it must be an RIDX owned by the depositor.
The RIDX can be included in the current deposition files or can be one already loaded into ChEMBL. CRIDX should generally be identical to RIDX.
ASSAY_TAX_ID and TARGET_TAX_ID must be the NCBI Taxon ID for the organism, not the strain. The strain for the assay organism can be given in ASSAY_STRAIN.
ACT_ID is mandatory if providing an ACTIVITY_PROPERTIES or ACTIVITY_SUPPLEMENTARY record
AIDX, ASSAY_DESCRIPTION and ASSAY_TYPE are all mandatory.
TARGET_TYPE must be one from the defined list.
Activity files must cite only existing DDIs.
An ASSAY record is a single instance of an assay, not an assay protocol record. If the same assay protocol is used in multiple datasets, it is still a new ASSAY in each dataset. Each assay can only have a single target; in the case of a panel assay, there will need to be one assay record per target.
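The one-target-per-assay rule means a panel assay has to be expanded into several ASSAY rows. The following sketch shows one way to do that. The AIDX naming scheme (a numeric suffix per target), the function name, and the example targets are all hypothetical, and the TARGET_TYPE value must be checked against the defined target type list.

```python
def panel_to_assay_rows(base_aidx, description, assay_type, targets):
    """Create one ASSAY row per target of a panel assay."""
    rows = []
    for number, target in enumerate(targets, start=1):
        rows.append({
            # Hypothetical scheme: append a running number to keep AIDX unique.
            "AIDX": f"{base_aidx}_{number}",
            "ASSAY_DESCRIPTION": f"{description} ({target['TARGET_NAME']})",
            "ASSAY_TYPE": assay_type,
            "TARGET_TYPE": target["TARGET_TYPE"],
            "TARGET_NAME": target["TARGET_NAME"],
            "TARGET_ACCESSION": target.get("TARGET_ACCESSION", ""),
        })
    return rows


# Illustrative two-target panel; TARGET_TYPE strings are assumptions.
panel_targets = [
    {"TARGET_NAME": "EGFR", "TARGET_TYPE": "SINGLE PROTEIN",
     "TARGET_ACCESSION": "P00533"},
    {"TARGET_NAME": "ERBB2", "TARGET_TYPE": "SINGLE PROTEIN",
     "TARGET_ACCESSION": "P04626"},
]

assay_rows = panel_to_assay_rows(
    "KinasePanel_2024", "Inhibition of kinase activity", "B", panel_targets
)
for row in assay_rows:
    print(row["AIDX"], row["TARGET_NAME"])
```

Each activity measured in the panel would then cite the AIDX of the row for its own target.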
The ASSAY_CELL_TYPE is either a cell-line name (e.g. MRC-5) or an endogenous cell type (e.g. fibroblasts). An example cell report card from ChEMBL shows this: https://www.ebi.ac.uk/chembl/cell_line_report_card/CHEMBL3308499/
For cells that are within ChEMBL, the cell report card cell name can be used to populate the 'ASSAY_CELL_TYPE' field, the CHEMBL_ID to populate the MC_CELL_LINE_CHEMBL_ID field and the Cellosaurus_ID can be used to populate the MC_CELLOSAURUS_ID field.
For cell lines that are not found within ChEMBL, Cellosaurus is a good resource. For endogenous cells please try to use the Experimental Factor Ontology (EFO) which can be searched on the Ontology Lookup Service. Otherwise, a depositor-defined name can be provided as the ASSAY_CELL_TYPE.
The ASSAY_PARAM file describes the assay parameters, for example the concentration of compound used, the pH of the buffer, the instrument used for data collection or the timepoints of the experiment.
Thus the ASSAY_PARAM file can be used to assign a list of parameters to an AIDX, which must have either been previously deposited in ChEMBL, or be defined in the ASSAY file, or both.
It is possible to store multiple parameters for one assay. Depositors can set their own parameters, but should use the same description for a type of data every time it is entered; for example, you must not use both CONC and CONCENTRATION if they are both referring to the same sort of concentration data.
Either a numeric VALUE or a TEXT_VALUE can be given for a single parameter; if a numeric value is given then it must include a relation, for example =, < or >.
This file describes the assay parameters. For example, here the first two records show the concentration of compound used.
AIDX and TYPE are both mandatory.
A VALUE requires an entry in the RELATION field. A TEXT_VALUE requires that RELATION is empty.
It is a many-to-one mapping, so you can store multiple parameters for one assay.
Depositors can set their own TYPE, but you need to use the same TYPE string for each form of data. For example, you may not use both CONC and CONCENTRATION as TYPEs if they are both referring to the same sort of concentration data.
AIDX must match an existing AIDX owned by the depositor.
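The ASSAY_PARAM rules above lend themselves to a simple pre-submission check. The sketch below is an assumption about how such a check could look, not the loader's own logic: `check_assay_params` and its messages are invented names, rows are dictionaries of strings, and the uniqueness test follows the field table's statement that TYPE must be unique within an AIDX.

```python
# Relation symbols listed as permitted in the field table.
PERMITTED_RELATIONS = {"=", ">", "<", "~", "<=", ">=", ">>", "<<"}


def check_assay_params(rows, known_aidx):
    """Return (row number, message) pairs for rows that break the rules."""
    problems = []
    seen = set()
    for line_no, row in enumerate(rows, start=1):
        aidx = row.get("AIDX", "").strip()
        param_type = row.get("TYPE", "").strip()
        relation = row.get("RELATION", "").strip()
        value = row.get("VALUE", "").strip()
        text_value = row.get("TEXT_VALUE", "").strip()

        if not aidx or not param_type:
            problems.append((line_no, "AIDX and TYPE are mandatory"))
        if aidx and aidx not in known_aidx:
            problems.append((line_no, "AIDX does not match a known assay"))
        if (aidx, param_type) in seen:
            problems.append((line_no, "TYPE is repeated within this AIDX"))
        seen.add((aidx, param_type))

        if value and text_value:
            problems.append((line_no, "give VALUE or TEXT_VALUE, not both"))
        if value:
            try:
                float(value)
            except ValueError:
                problems.append((line_no, "VALUE is not numeric"))
            if relation not in PERMITTED_RELATIONS:
                problems.append((line_no, "VALUE needs a permitted RELATION"))
        if text_value and relation:
            problems.append((line_no, "TEXT_VALUE requires an empty RELATION"))
    return problems


# Hypothetical parameter rows for one assay.
param_rows = [
    {"AIDX": "A1", "TYPE": "CONC", "RELATION": "=", "VALUE": "10", "UNITS": "uM"},
    {"AIDX": "A1", "TYPE": "INSTRUMENT", "TEXT_VALUE": "plate reader"},
    {"AIDX": "A1", "TYPE": "PH", "VALUE": "7.4"},                    # no RELATION
    {"AIDX": "A1", "TYPE": "CONC", "RELATION": "=", "VALUE": "20"},  # repeated TYPE
]

for problem in check_assay_params(param_rows, known_aidx={"A1"}):
    print(problem)
```

The `known_aidx` set would hold the depositor's AIDX values, both those in the current ASSAY file and those already loaded into ChEMBL.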
The ACTIVITY file outlines the numerical or text value of the data arising from a compound (CIDX) used in an assay (AIDX). If a source is updated and includes the previously loaded ACTIVITY file, existing activities will need to be wiped before loading these supplementary files.
For more complex data types, discussed in Complex results sets, supplementary ACTIVITY_PROPERTIES, ACTIVITY_SUPPLEMENTARY and ACTIVITY_SUPP_MAP files can be used to append additional data to the activities using the activity ID (ACT_ID) and test occasion ID (TEOID) set in this file to link them.
CIDX, AIDX, CRIDX and TYPE are mandatory.
A VALUE is numeric and requires an entry in the RELATION field.
TEXT_VALUEs should not have a RELATION sign.
ACTION_TYPE must be one from the defined list.
ACT_ID is mandatory if providing an ACTIVITY_PROPERTIES or ACTIVITY_SUPPLEMENTARY record that maps to a given line.
It is possible to load data without a CTAB file. If you include a CTAB file or will be loading structure data later, the CIDX fields in the CTAB must match the CIDX IDs here.
RIDX is optional, but if included it must be an RIDX owned by the depositor.
CRIDX should generally be identical to RIDX, unless you need to reference a separate paper for the activity and the compound.
SRC_ID_AIDX and SRC_ID_CIDX are used when depositing against other depositor’s entities, as discussed in a later section.
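Because ACTIVITY rows refer to compounds and assays by ID, it can help to cross-check the files against each other before depositing. The following is a minimal sketch under stated assumptions: the function name and messages are invented, rows are dictionaries of strings, and rows that set SRC_ID_CIDX or SRC_ID_AIDX are skipped for that check because they refer to another depositor's entities, which cannot be verified from local files.

```python
def check_activity_links(activities, own_cidx, own_aidx):
    """Return (row number, message) pairs for broken links or repeated ACT_IDs."""
    problems = []
    seen_act_ids = set()
    for line_no, row in enumerate(activities, start=1):
        cidx = row.get("CIDX", "")
        aidx = row.get("AIDX", "")
        # Only check IDs that are meant to be the depositor's own.
        if not row.get("SRC_ID_CIDX") and cidx not in own_cidx:
            problems.append((line_no, f"CIDX {cidx} not found in compound files"))
        if not row.get("SRC_ID_AIDX") and aidx not in own_aidx:
            problems.append((line_no, f"AIDX {aidx} not found in ASSAY file"))
        # ACT_ID must be unique for each activity line when it is used.
        act_id = row.get("ACT_ID", "")
        if act_id:
            if act_id in seen_act_ids:
                problems.append((line_no, f"ACT_ID {act_id} is repeated"))
            seen_act_ids.add(act_id)
    return problems


# Hypothetical activity rows.
activity_rows = [
    {"CIDX": "C1", "AIDX": "A1", "ACT_ID": "ACT1"},
    {"CIDX": "C2", "AIDX": "A1", "ACT_ID": "ACT1"},      # unknown CIDX, repeated ACT_ID
    {"CIDX": "C9", "SRC_ID_CIDX": "1", "AIDX": "A1"},    # another depositor's compound
]

for problem in check_activity_links(activity_rows, own_cidx={"C1"}, own_aidx={"A1"}):
    print(problem)
```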
An INFO file can be included in a deposition to contain any additional information you might want to include with the deposited data. The INFO file will be stored but not entered into the database.
REFERENCE file fields
Header | Description | Existence | Data Type
RIDX | The RIDX set by the depositor - a primary key | Mandatory | Any character up to a length of 200. Should not start with 0
PUBMED_ID | PubMed ID | Mandatory - must be present, may be empty if there is a DOI | Any positive integer up to a length of 11
JOURNAL_NAME | Journal name | Optional | Any character up to a length of 50
YEAR | Year of publication | Mandatory | Any integer up to a length of 4 between 1900 and 2050
VOLUME | The volume of the publication | Optional | Any character up to a length of 50
ISSUE | The issue of the publication | Optional | Any character up to a length of 50
FIRST_PAGE | The first page of the article | Mandatory if it is a Publication | Any positive integer up to a length of 50
LAST_PAGE | The last page of the article | Optional | Any positive integer up to a length of 50
REF_TYPE | The type of reference (Publication, Patent, Dataset, Book) | Mandatory | One of Patent, Publication, Dataset or Book
TITLE | The title of the reference | Mandatory | Any character up to a length of 500
DOI | The Digital Object Identifier | Mandatory - must be present, may be empty if there is a PUBMED_ID | Any DOI up to a length of 200
PATENT_ID | The Patent Identifier | Optional | Any Patent Identifier up to a length of 200
ABSTRACT | The abstract of the article. For a dataset, include a description of the dataset here | Mandatory | A very large text field
AUTHORS | A list of the authors of the publication | Mandatory | Any character up to a length of 4000
CONTACT | A contact profile of someone willing to be contacted about details of the dataset (see here) | Recommended | Any character up to a length of 200
COMPOUND_RECORD file fields
Header | Description | Existence | Data Type
CIDX | The CIDX set by the depositor - a primary key | Mandatory | Any character up to a length of 200
RIDX | The RIDX cited by the depositor. MUST be owned by the same depositor | Optional | Any character up to a length of 200
COMPOUND_KEY | The local synonym used for this compound in the RIDX referenced | Mandatory | Any character up to a length of 250
COMPOUND_NAME | The name used for this compound in the RIDX referenced | Mandatory | Any character up to a length of 4000
COMPOUND_SOURCE | The source of this compound in the RIDX referenced | Optional | Any character up to a length of 400
COMPOUND_CTAB file fields
Header | Description | Existence | Data Type
CIDX | The CIDX established by the depositor in the COMPOUND_RECORD file - a foreign key | Mandatory | Any character up to a length of 200
CTAB | The CTAB (Connection table) assigned to this CIDX | Optional | A very large text field
ASSAY file fields
Header | Description | Existence | Data Type
AIDX | The AIDX set by the depositor - a primary key | Mandatory | Any character up to a length of 200
RIDX | The RIDX cited by the depositor. MUST be owned by the depositor | Optional | Any character up to a length of 200
ASSAY_DESCRIPTION | A description of the assay | Mandatory | Any character up to a length of 4000
ASSAY_TYPE | The type of the assay: B (Binding), F (Functional), A (ADMET), U (Unassigned), P (Physicochemical) or T (Toxicity) | Mandatory | Accepted assay types (ADMET, A, Functional, F, Binding, B, Unassigned, U, Physiochemical, P, Toxicity, T)
ASSAY_ORGANISM | The assay organism | Optional (Mandatory if ASSAY_TAX_ID is populated) | Any character up to a length of 250
ASSAY_STRAIN | The strain of the assay organism | Optional | Any character up to a length of 200
ASSAY_TAX_ID | NCBI taxonomy ID for the assay organism | Optional | Any positive integer up to a length of 11 or '0'
ASSAY_SOURCE | The original source of the assay | Optional | Any character up to a length of 100
ASSAY_TISSUE | The type of tissue used in the assay | Optional | Any character up to a length of 100
ASSAY_CELL_TYPE | The cell line - see the notes on ASSAY_CELL_TYPE | Optional | Any character up to a length of 100
ASSAY_SUBCELLULAR_FRACTION | The subcellular fraction used in the assay | Optional | Any character up to a length of 100
TARGET_TYPE | The type of target | Optional | Any character up to a length of 25 (see the target type list)
TARGET_NAME | The name of the target | Optional | Any character up to a length of 400
TARGET_ACCESSION | A UniProt accession for the target where the target is a protein. Otherwise to be left blank | Optional | An integer or a UniProt ID up to a length of 255
TARGET_ORGANISM | The target organism | Optional | Any character up to a length of 100
TARGET_TAX_ID | The NCBI taxonomy ID of the target organism | Optional | Any positive integer up to a length of 11 or '0'
ASSAY_PARAM file fields
Header | Description | Existence | Data Type
AIDX | The AIDX established by the depositor in the ASSAY file - a foreign key | Mandatory | Any character up to a length of 200
TYPE | The type of parameter. Must be unique within an AIDX | Mandatory | Any character up to a length of 250
RELATION | Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>') | Optional (Mandatory if VALUE is given) | Any relation symbol (=, >, <, ~, <=, >=, >>, <<) up to a length of 50
VALUE | The numerical value of the parameter | Optional | Any number (including decimals, negatives and scientific notation (e.g. 3×10^2))
UNITS | The units of the parameter measurement | Optional | Any character up to a length of 100
TEXT_VALUE | The text value of non-numerical values | Optional | Any character up to a length of 4000
COMMENTS | A comment on the parameter | Optional | Any character up to a length of 4000
ACTIVITY file fields
Header | Description | Existence | Data Type
CIDX | The CIDX established by the depositor in the COMPOUND_RECORD file - a foreign key | Mandatory | Any character up to a length of 200
CRIDX | The RIDX to be associated with the CIDX in the creation of the compound record. Must belong to SRC_ID_CIDX | Mandatory | Any character up to a length of 200
SRC_ID_CIDX | The SRC_ID for the CIDX. If not specified, the value is assumed to be the SRC_ID for the depositor. This field can be used to reference compounds deposited by other depositors | Optional | Any positive integer up to a length of 4
AIDX | The AIDX established by the depositor in the ASSAY file - a foreign key | Mandatory | Any character up to a length of 200
SRC_ID_AIDX | The SRC_ID for the AIDX. If not specified, the value is assumed to be the SRC_ID for the depositor | Optional | Any positive integer up to a length of 4
RIDX | The RIDX established by the depositor in the REFERENCE file - a foreign key | Deprecated. Not visible to end-users | Any character up to a length of 200
TEXT_VALUE | The text value of non-numerical values. Do not use for value ranges. Provide these as upper and lower bounds, as VALUE fields | Optional | Any character up to a length of 1000
RELATION | Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>') | Optional (Mandatory if a VALUE is given) | Any relation symbol (=, >, <, ~, <=, >=, >>, <<) up to a length of 50
VALUE | The numerical value of the activity measurement (see ACTIVITY_COMMENT for non-numerical values) | Optional | Any number (including decimals, negatives and scientific notation (e.g. 3×10^2))
UPPER_VALUE | Where the activity is a range, this represents the highest value of the range (numerically), while the VALUE column represents the lower value | Optional | Any number (including decimals, negatives and scientific notation (e.g. 3×10^2))
UNITS | The units of the measurement | Optional | Any character up to a length of 100
SD_MINUS | Standard deviation lower limit | Optional | Any number (including decimals, negatives and scientific notation (e.g. 3×10^2))
SD_PLUS | Standard deviation upper limit | Optional | Any number (including decimals, negatives and scientific notation (e.g. 3×10^2))
ACTIVITY_COMMENT | A comment on the activity measurement. Non-numerical 'values' should be given here. Equivalent to the 'TEXT_VALUE' field in many other tables | Optional | Any character up to a length of 4000
CRIDX_CHEMBLID | The CHEMBLID for the CRIDX. Must belong to the SRC_ID_CIDX | Optional | Any character up to a length of 200
CRIDX_DOCID | The DOCID for the CRIDX. Must belong to the SRC_ID_CIDX | Optional | Any character up to a length of 200
ACT_ID | A local ID used to relate records in ACTIVITY_PROPERTIES and Supplementary tables. A primary key in this table. Not required unless depositing such data | Optional (Mandatory if there are ACTIVITY_PROPERTIES or ACTIVITY_SUPPLEMENTARY files) | VARCHAR(50). Must be unique for each activity line, even if they have identical properties
TEOID | Test Occasion ID, grouping together related Activity records. Usually used to group activities by time point and treatment. A primary key in this table | Optional | Any integer up to a length of 11
TYPE | The type of measurement | Mandatory | Any character up to a length of 250
ACTION_TYPE | Specifies the effect of the drug on its target for protein-based assays. Must match one of the names in the ACTION_TYPE table | Optional | Any character up to a length of 50 - must be in the ACTION_TYPE table (see below)
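The guidance on ranges in the ACTIVITY fields (lower bound in VALUE, upper bound in UPPER_VALUE, nothing in TEXT_VALUE) can be sketched as a small helper. The function name is invented, and the default RELATION of `=` is an assumption made only because a VALUE requires a RELATION; this section does not say which relation applies to ranges, so confirm it before relying on this.

```python
def range_to_activity_fields(lower, upper, units, relation="="):
    """Map a measured range onto VALUE (lower bound) and UPPER_VALUE (upper bound)."""
    if lower > upper:
        # Keep VALUE as the numerically lower bound, as the field table asks.
        lower, upper = upper, lower
    return {
        "RELATION": relation,  # assumed default; not specified for ranges here
        "VALUE": str(lower),
        "UPPER_VALUE": str(upper),
        "UNITS": units,
        "TEXT_VALUE": "",      # ranges are not given as text
    }


# Hypothetical range of 10 to 20 nM, passed in the wrong order on purpose.
fields = range_to_activity_fields(20, 10, "nM")
print(fields)
```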