Input File structure and requirements

This page describes all aspects of the requirements for loading BioActivity data, including the permitted file names and column headers, and the permitted content of the cells (or 'fields') within these files.
Files must be supplied as tab-separated text files with UTF-8 encoding (see the examples and the walkthrough slides here). Spreadsheets will not work with the loader.
Data deposition ChEMBL slides.pptx (3MB)
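As a rough sketch of the format requirement above, the following Python snippet serialises rows as tab-separated text and encodes the result as UTF-8 using only the standard library. The helper name `rows_to_tsv` and the example values are hypothetical and are not part of the loader.

```python
import csv
import io


def rows_to_tsv(header, rows):
    """Serialise rows as tab-separated text (hypothetical helper, not part of the loader)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative values only; the headers are taken from the REFERENCE table on this page.
text = rows_to_tsv(
    ["RIDX", "REF_TYPE", "YEAR"],
    [["my_dataset_1", "Dataset", "2020"]],
)
payload = text.encode("utf-8")  # these bytes could be written to REFERENCE.tsv
```

Writing the bytes with `open("REFERENCE.tsv", "wb")` would then give a file in the expected encoding, assuming the values themselves contain no tabs or line breaks.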

Simplified input data schema

Please note that, unless you are loading additional data to existing assays, RIDX is a mandatory field in any file that includes it. Without an RIDX, records cannot be linked to your references and the load will fail.

​

Rules

The 'rules' referred to are summarized in table form at the end of this page.
  • Rules are applied when validating the file.
  • Each rule is associated with a single ‘Penalty Score’ (PS) value, which can range from 0 to 9 inclusive.
  • The higher the score, the more serious the problem. Scores of 9 cause an automatic load failure.

Guidance

  • Consider limiting numeric data to a small number of decimal places so it is easily compared by users.
  • Leave values empty if they are null. If you use placeholders such as "-", "None" or "NULL", the data may load into the database while being invalid, or may be difficult to search for because these values are not recorded as nulls. The loader attempts to convert such values to nulls, but it cannot cover every possibility.
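One way to follow the null guidance before depositing is to blank out placeholder strings yourself. The sketch below is an assumption about how a depositor might pre-clean values; `PLACEHOLDERS` and `blank_if_placeholder` are hypothetical names, and the set only covers the placeholders mentioned above.

```python
# Placeholder strings named in the guidance above, compared case-insensitively.
# Extend this set to match your own data (assumption).
PLACEHOLDERS = {"-", "none", "null"}


def blank_if_placeholder(value):
    """Return an empty string for null-like placeholders, otherwise the stripped value."""
    stripped = value.strip()
    return "" if stripped.lower() in PLACEHOLDERS else stripped


cleaned = [blank_if_placeholder(v) for v in ["7.4", "-", " NULL ", "None", "-5"]]
```

Note that a genuine negative number such as "-5" is kept, because only the bare "-" is treated as a placeholder.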

Files that may be included in a ChEMBL deposition

Filenames
Existence
Level
Depositor Defined ID (DDID) defined by this file
Definition of Primary Key
All records in this file must be 'Foreign-Keyed' to...
ASSAY.tsv
Optional
primary
AIDX
('AIDX',)
-
ASSAY_PARAM.tsv
Optional
secondary
-
-
AIDX in ASSAY
COMPOUND_RECORD.tsv
Optional (But Mandatory if compound records do not exist for your compounds)
primary
CIDX
('CIDX', 'RIDX')
-
COMPOUND_CTAB.sdf
Optional
secondary
-
-
CIDX in COMPOUND_RECORD
REFERENCE.tsv
Optional
primary
RIDX
('RIDX',)
-
ACTIVITY.tsv
Optional
tertiary
-
-
-
ACTIVITY_PROPERTIES.tsv
Optional
not defined
-
-
-
ACTIVITY_SUPP.tsv
Optional
not defined
-
-
-
ACTIVITY_SUPP_MAP.tsv
Optional
not defined
-
-
-
INFO.txt
Irrelevant
not defined
-
-
-
Filenames: All filenames must have a 3-letter extension (e.g. '.txt', '.tsv', '.sdf').
Existence:
  • Irrelevant - May exist, but will be ignored if it does.
  • Optional - May exist, and will be used if it does.
  • Mandatory - Must exist, and the dataset will not load if it does not.
Some fields will only take certain valid identifiers, for example ASSAY.TEST_TYPE must be "in vitro", "in vivo", or "ex vivo". Valid datatypes are described in the tables and also summarised here.
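The foreign-key column in the file table above (for example, every AIDX in ASSAY_PARAM must exist in ASSAY) can be checked before submission. This is a minimal sketch under the assumption that each file has already been read into a list of dictionaries, for instance with `csv.DictReader(..., delimiter="\t")`; the function name and the example identifiers are hypothetical.

```python
def missing_foreign_keys(child_rows, parent_rows, key):
    """Return the sorted key values used in child_rows that are absent from parent_rows."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows if row[key] not in parent_keys})


# Illustrative rows standing in for ASSAY.tsv and ASSAY_PARAM.tsv.
assay_rows = [{"AIDX": "assay_1"}, {"AIDX": "assay_2"}]
assay_param_rows = [
    {"AIDX": "assay_1", "TYPE": "pH"},
    {"AIDX": "assay_9", "TYPE": "pH"},
]
orphans = missing_foreign_keys(assay_param_rows, assay_rows, "AIDX")
```

The same helper could be applied to CIDX values in COMPOUND_CTAB against COMPOUND_RECORD.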

Field names and data types for the deposited files

For further information on constraints and error scores for the loader, see this page. For most depositors, the info here should be sufficient.

ASSAY

An ASSAY record is a single instance of an assay, not an assay protocol record. If the same assay protocol is used in multiple datasets, it is still a new ASSAY in each dataset.
Header
Description
Existence
Field type
Additional constraints
AIDX
The AIDX cited by the depositor. A Primary Key defined header
Mandatory
Any character upto a length of 200
​
RIDX
The RIDX cited by the depositor. A Self-Referencing field. MUST be owned by depositor [ie: Not a PK Identifier, but a FK to an identifier owned by the depositor]
Optional
Any character upto a length of 200
​
ASSAY_DESCRIPTION
A description of the assay
Mandatory
Any character upto a length of 4000
​
ASSAY_TYPE
The type of the assay. B (Binding), F (Functional), A (ADMET), U (Unassigned), P (Physicochemical) or T (Toxicity)
Mandatory
Any character upto a length of 1
Accepted Assay types
(ADMET, A, Functional, F, Binding, B, Unassigned, U, Physiochemical, P, Toxicity, T)
[gp019]
ASSAY_ORGANISM
The assay organism
Optional (Mandatory if ASSAY_TAX_ID is populated)
Any character upto a length of 250
​
ASSAY_STRAIN
The strain of the assay organism
Optional
Any character upto a length of 200
​
ASSAY_TAX_ID
NCBI taxonomy ID for the assay organism
Optional
Any integer upto a length of 11
Positive integer or '0' (regex='^\d*$')
[gp003]
ASSAY_SOURCE
The original source of the assay
Optional
Any character upto a length of 100
​
ASSAY_TISSUE
The type of tissue used in the assay
Optional
Any character upto a length of 100
​
ASSAY_CELL_TYPE
The cell line
Optional
Any character upto a length of 100
​
ASSAY_SUBCELLULAR_FRACTION
The subcellular fraction used in the assay
Optional
Any character upto a length of 100
​
TARGET_TYPE
The type of target
Optional
Any character upto a length of 25
​Target type list​
TARGET_NAME
The name of the target
Optional
Any character upto a length of 400
​
TARGET_ACCESSION
The accession number of the target (eg: UniProt Acc, NCBI tax ID)
Optional
Any character upto a length of 255
An integer or a UniProt ID
[gp024]
TARGET_ORGANISM
The target organism
Optional
Any character upto a length of 100
​
TARGET_TAX_ID
The NCBI taxonomy ID of the target organism
Optional
Any integer upto a length of 11
Positive integer or '0' (regex='^\d*$')
[gp003]
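The ASSAY constraints above can be approximated locally before a file is submitted. The sketch below copies the gp019 and gp003 regexes from the rules table on this page and applies the dependency described for ASSAY_ORGANISM; the function name and messages are hypothetical, and the loader itself may behave differently.

```python
import re

# Patterns copied from the rules table on this page; gp019 is matched case-insensitively.
GP019_ASSAY_TYPE = re.compile(
    r"^(ADMET|A|Functional|F|Binding|B|Unassigned|U|Physiochemical|P|Toxicity|T)$",
    re.IGNORECASE,
)
GP003_TAX_ID = re.compile(r"^\d*$")
MANDATORY = ("AIDX", "ASSAY_DESCRIPTION", "ASSAY_TYPE")


def check_assay_row(row):
    """Return a list of problems found in one ASSAY row (a dict of header -> string)."""
    problems = [f"{name} is mandatory" for name in MANDATORY if not row.get(name)]
    if row.get("ASSAY_TYPE") and not GP019_ASSAY_TYPE.match(row["ASSAY_TYPE"]):
        problems.append("ASSAY_TYPE fails gp019")
    tax_id = row.get("ASSAY_TAX_ID", "")
    if not GP003_TAX_ID.match(tax_id):
        problems.append("ASSAY_TAX_ID fails gp003")
    # Dependency: a taxonomy ID should be accompanied by an organism name.
    if tax_id and not row.get("ASSAY_ORGANISM"):
        problems.append("ASSAY_ORGANISM expected when ASSAY_TAX_ID is given")
    return problems


base = {"AIDX": "assay_1", "ASSAY_DESCRIPTION": "Example binding assay", "ASSAY_TYPE": "B"}
ok_result = check_assay_row(base)
bad_result = check_assay_row(dict(base, ASSAY_TYPE="X", ASSAY_TAX_ID="9606"))
```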

ASSAY_PARAM

Header
Description
Existence
Field Type
Additional constraints
AIDX
The AIDX cited by the depositor. A Self-Referencing field. MUST be owned by depositor [ie: Not a PK Identifier, but a FK to an identifier owned by the depositor]
Mandatory
Any character upto a length of 200
​
TYPE
The type of parameter. Must be unique within an AIDX
Mandatory
Any character upto a length of 250
​
RELATION
Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>')
Optional (Mandatory if VALUE is given)
Any character up to a length of 50
Relation symbol
(=, >, <, ~, <=, >=, >>, <<)
[gp022]
VALUE
The numerical value of the parameter.
Optional
Any number (incl decimals, negatives and sci Notn)
Any Number. Decimal, Sci Notn, +/-
[gp005]
UNITS
The units of the parameter measurement
Optional
Any character upto a length of 100
​
TEXT_VALUE
The text value of non-numerical values
Optional
Any character upto a length of 4000
​
COMMENTS
A comment on the parameter.
Optional
Any character upto a length of 4000
​
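Two of the ASSAY_PARAM constraints above span more than one cell: TYPE must be unique within an AIDX, and RELATION is mandatory when a VALUE is given. The following sketch shows one possible pre-check, assuming rows are dictionaries keyed by header; the function name and messages are hypothetical.

```python
from collections import Counter

# Relation symbols listed as permitted in the table above.
ALLOWED_RELATIONS = {"=", ">", "<", "~", "<=", ">=", ">>", "<<"}


def check_assay_params(rows):
    """Return problems for a list of ASSAY_PARAM rows (dicts of header -> string)."""
    problems = []
    counts = Counter((row["AIDX"], row["TYPE"]) for row in rows)
    for (aidx, param_type), n in counts.items():
        if n > 1:
            problems.append(f"TYPE {param_type!r} repeated within AIDX {aidx!r}")
    for row in rows:
        relation = row.get("RELATION", "")
        if row.get("VALUE") and not relation:
            problems.append(f"RELATION missing for {row['AIDX']}/{row['TYPE']}")
        elif relation and relation not in ALLOWED_RELATIONS:
            problems.append(f"RELATION {relation!r} not permitted")
    return problems


good_rows = [{"AIDX": "assay_1", "TYPE": "pH", "RELATION": "=", "VALUE": "7.4"}]
bad_rows = [
    {"AIDX": "assay_1", "TYPE": "pH", "RELATION": "", "VALUE": "7.4"},
    {"AIDX": "assay_1", "TYPE": "pH", "RELATION": "=>", "VALUE": "7"},
]
good_problems = check_assay_params(good_rows)
bad_problems = check_assay_params(bad_rows)
```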

COMPOUND_RECORD

We need you to provide at least one of COMPOUND_NAME, COMPOUND_KEY or COMPOUND_SOURCE in order to make the compound searchable.
Header
Description
Existence
Field Type
Additional constraints
CIDX
The CIDX cited by the depositor. A Primary Key defined header
Mandatory
Any character upto a length of 200
​
RIDX
The RIDX cited by the depositor. A Self-Referencing field. MUST be owned by depositor [ie: Not a PK Identifier, but a FK to an identifier owned by the depositor]
Optional
Any character upto a length of 200
​
COMPOUND_KEY
The local synonym used for this CIDX in the RIDX quoted
Mandatory
Any character upto a length of 250
​
COMPOUND_NAME
The name used for this CIDX in the RIDX quoted
Mandatory
Any character upto a length of 4000
​
COMPOUND_SOURCE
The source of this CIDX in the RIDX quoted
Optional
Any character upto a length of 400
​

COMPOUND_CTAB

This must be in V2000 molfile format. The InChI binaries we use currently do not accept V3000 molfiles, and we need an InChI Key for ChEMBL, so V3000 molfiles cannot be accepted.
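A quick way to spot V3000 records before depositing is to inspect the counts line of each record, which is the fourth line of a molfile and ends with the version tag. The sketch below assumes a standard SDF layout with `$$$$` delimiter lines; the function names and the two empty example structures are hypothetical, and this is not how the loader itself checks the file.

```python
def molfile_version(record_lines):
    """Return 'V2000', 'V3000' or None from the counts line (4th line) of one record."""
    if len(record_lines) < 4:
        return None
    counts_line = record_lines[3]  # the three header lines come first
    for tag in ("V2000", "V3000"):
        if counts_line.rstrip().endswith(tag):
            return tag
    return None


def sdf_versions(sdf_text):
    """Split SDF text on '$$$$' delimiter lines and report each record's version tag."""
    versions, record = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            versions.append(molfile_version(record))
            record = []
        else:
            record.append(line)
    if any(line.strip() for line in record):  # final record without a delimiter
        versions.append(molfile_version(record))
    return versions


# Two deliberately empty structures (zero atoms), used only to show the check.
example_sdf = "\n".join([
    "cmpd_1", "  hypothetical", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END", "> <CIDX>", "cmpd_1", "", "$$$$",
    "cmpd_2", "  hypothetical", "",
    "  0  0  0     0  0            999 V3000",
    "M  V30 BEGIN CTAB", "M  V30 END CTAB", "M  END", "$$$$",
])
versions = sdf_versions(example_sdf)
```

Any record reported as 'V3000' would need converting to V2000 with a cheminformatics toolkit before deposition.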
Header
Description
Existence
Field type
Additional Constraints
CIDX
The CIDX cited by the depositor [but, note that an alternative header label can be set using the -C option]
Mandatory
Any character upto a length of 200
​
CTAB
The CTAB (Connection table) assigned to this CIDX
Optional
A very large text field
​

REFERENCE

The TITLE, ABSTRACT and AUTHORS fields of the DOCS table should also be populated for Datasets, as well as Publications.
These can be brief (i.e. an organisation name can be provided for the AUTHORS field) and the abstract can be a simple summary of the experiments and their overall purpose. It should contain sufficient detail for a user to locate your publication or project by reading the reference. For example, the ABSTRACT field could contain a short description of the dataset and a link to the project site.
Header
Description
Existence
Field Type
Additional Constraints
RIDX
The RIDX cited by the depositor. A Primary Key defined header
Mandatory
Any character upto a length of 200. Will warn if this starts with 0
​
PUBMED_ID
PubMed ID
Mandatory to have either PMID or DOI if one exists. If your data do not have one, contact us.
Any integer upto a length of 11
Positive integer (regex='^[1-9]\d*$')
[gp006]
JOURNAL_NAME
Journal name
Optional
Any character upto a length of 50
​
YEAR
Year of publication
Mandatory
Any integer upto a length of 4
1900 <= year <= 2049
[gp031]
VOLUME
The volume of the publication
Optional
Any character upto a length of 50
​
ISSUE
The issue of the publication
Optional
Any character upto a length of 50
​
FIRST_PAGE
The first page of the article
Optional
Any character upto a length of 50
Positive integer (regex='^[1-9]\d*$')
[gp006]
LAST_PAGE
The last page of the article
Optional
Any character upto a length of 50
Positive integer (regex='^[1-9]\d*$')
[gp006]
REF_TYPE
The type of reference (Publication, Patent, Dataset, Book)
Mandatory
Any character upto a length of 50
An accepted reference type [case ins]
(Patent, Publication, Dataset, Book)
[gp032]
TITLE
The title of the reference
Mandatory
Any character upto a length of 500
​
DOI
The Digital Object Identifier
Mandatory to have either PMID or DOI if one exists. If your data do not have one, contact us.
Any character upto a length of 200
A Digital Object Identifier (regex='^(10\.\d\d\d\d+\/.*)$')
[gp010]
PATENT_ID
The Patent Identifier
Optional
Any character upto a length of 200
A Patent Identifier (regex='^(WO|EP|US)\-?\d+.*$')
[gp011]
ABSTRACT
The abstract of the article. For a dataset, include a description of the dataset here.
Mandatory
A very large text field
​
AUTHORS
A list of the authors of the publication
Mandatory
Any character upto a length of 4000
​
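The REFERENCE constraints above can be tried out locally with the regexes quoted in the table. This is a sketch under the assumption that a row is a dictionary keyed by header; `check_reference_row`, the messages and the example DOI are hypothetical, and the "either PMID or DOI" requirement is not covered because it depends on whether an identifier exists.

```python
import re

# Regexes copied from the Additional Constraints column above.
GP006_POSITIVE_INT = re.compile(r"^[1-9]\d*$")
GP010_DOI = re.compile(r"^(10\.\d\d\d\d+\/.*)$")
GP031_YEAR = re.compile(r"^(19\d\d|20(0|1|2|3|4)\d)$")
GP032_REF_TYPE = re.compile(r"^(Patent|Publication|Dataset|Book)$", re.IGNORECASE)

MANDATORY = ("RIDX", "YEAR", "REF_TYPE", "TITLE", "ABSTRACT", "AUTHORS")
PATTERN_CHECKS = (
    ("PUBMED_ID", GP006_POSITIVE_INT),
    ("DOI", GP010_DOI),
    ("YEAR", GP031_YEAR),
    ("REF_TYPE", GP032_REF_TYPE),
)


def check_reference_row(row):
    """Return a list of problems found in one REFERENCE row (dict of header -> string)."""
    problems = [f"{name} is mandatory" for name in MANDATORY if not row.get(name)]
    for name, pattern in PATTERN_CHECKS:
        value = row.get(name, "")
        if value and not pattern.match(value):
            problems.append(f"{name} has an unexpected format")
    return problems


good_ref = {
    "RIDX": "my_dataset_1", "YEAR": "2020", "REF_TYPE": "dataset",
    "TITLE": "Example title", "ABSTRACT": "Example summary",
    "AUTHORS": "Example Organisation", "DOI": "10.1234/example",
}
bad_ref = dict(good_ref, YEAR="2050", DOI="doi:10.1234/example", PUBMED_ID="0123")
good_ref_problems = check_reference_row(good_ref)
bad_ref_problems = check_reference_row(bad_ref)
```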

ACTIVITY

Header
Description
Existence
Datatype rule
Additional Constraints
CIDX
The CIDX cited by the depositor
Mandatory
Any character upto a length of 200
​
CRIDX
The RIDX to be associated with the CIDX in the creation of the compound record. Must belong to SRC_ID_CIDX.
Mandatory
Any character upto a length of 200
​
SRC_ID_CIDX
The SRC_ID for the CIDX. If not specified, then value is assumed to be the SRC_ID for the depositor
Optional
Any integer upto a length of 4
Positive integer (regex='^[1-9]\d*$')
[gp006]
AIDX
The AIDX cited by the depositor
Mandatory
Any character upto a length of 200
​
SRC_ID_AIDX
The SRC_ID for the AIDX. If not specified, then value is assumed to be the SRC_ID for the depositor
Optional
Any integer upto a length of 4
Positive integer (regex='^[1-9]\d*$')
[gp006]
RIDX
The RIDX cited by the depositor. A Self-Referencing field. MUST be owned by depositor [ie: Not a PK Identifier, but a FK to an identifier owned by the depositor]
Deprecated. Not visible to end-users
Any character upto a length of 200
​
TEXT_VALUE
The text value of non-numerical values
Optional
Any character upto a length of 1000
​
RELATION
Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>')
Optional (Mandatory if a VALUE is given)
Any character upto a length of 50
Relation symbol
(=, >, <, ~, <=, >=, >>, <<)
[gp022]
VALUE
The numerical value of the activity measurement (see ACTIVITY_COMMENT for non-numerical values)
Optional
Any number (incl decimals, negatives and sci Notn)
Any Number. Decimal, Sci Notn, +/-
UPPER_VALUE
Where the activity is a range, this represents the highest value of the range (numerically), while the VALUE column represents the lower value
Optional
Any number (incl decimals, negatives and sci Notn)
​
UNITS
The units of the measurement
Optional
Any character upto a length of 100
​
SD_MINUS
Standard Deviation Lower limit
Optional
Any number (incl decimals, negatives and sci Notn)
​
SD_PLUS
Standard Deviation Upper limit
Optional
Any number (incl decimals, negatives and sci Notn)
​
ACTIVITY_COMMENT
A comment on the activity measurement. Non-numerical 'values' should be given here. Equivalent to 'TEXT_VALUE' field in many other tables.
Optional
Any character upto a length of 4000
​
CRIDX_CHEMBLID
The CHEMBLID for the CRIDX. Must belong to the SRC_ID_CIDX
Optional
Any character upto a length of 200
CHEMBLID format (regex='^CHEMBL\d+$')
[gp023]
CRIDX_DOCID
The DOCID for the CRIDX. Must belong to the SRC_ID_CIDX
Optional
Any character upto a length of 200
​
ACT_ID
A local ID used to relate records in ACTIVITY_PROPERTIES and Supplementary tables. Not required unless depositing such data.
Optional (Mandatory if there are ACTIVITY_PROPERTIES or ACTIVITY SUPPLEMENTARY files)
Any integer up to a length of 11. Must be unique for each activity.
​
TEOID
Test Occasion ID, grouping together related Activity records. Depositor defined. Usually used to group activities by timepoint and treatment.
Optional
Any integer upto a length of 11
​
TYPE
The type of measurement
Mandatory
Any character upto a length of 250
​
ACTION_TYPE
Specifies the effect of the drug on its target for protein-based assays. Must match one of the names in the ACTION_TYPE table
Optional
Any character up to a length of 50
Must be in the ACTION_TYPE table

ACTIVITY_PROPERTIES

Each ACTIVITY can relate to multiple properties, via the ACT_ID field. Each ACTIVITY_PROPERTIES record can only link back to a single ACTIVITY record. Even if two ACTIVITY records have the same experimental conditions, they both need their own set of ACTIVITY_PROPERTIES records.
Header
Description
Existence
Field type
Additional Constraints
ACT_ID
FK to the ACTIVITY file. Depositor defined.
Mandatory
Any integer upto a length of 11
​
TYPE
The type of property measurement. Must be unique within an ACT_ID. Should be self-explanatory to external researchers
Mandatory
Any character upto a length of 250
​
RELATION
Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>')
Optional (Mandatory if a VALUE is given)
Any character upto a length of 50
Relation symbol
(=, >, <, ~, <=, >=, >>, <<)
[gp022]
VALUE
The numerical value of the property measurement
Optional
Any number (incl decimals, negatives and sci Notn)
Any Number. Decimal, Sci Notn, +/-
[gp005]
UNITS
The units of the property measurement
Optional
Any character upto a length of 100
​
TEXT_VALUE
The text value of non-numerical values
Optional
Any character upto a length of 1000
​
COMMENTS
A comment on the property. Can be used to further describe the TYPE, TEXT_VALUE or VALUE if needed
Optional
Any character upto a length of 4000
​
RESULT_FLAG
A flag to indicate, if set to 1, that this type is a dependent variable/result (e.g., slope) rather than an independent variable/parameter (0, the default).
Optional
Any integer upto a length of 1
0 or 1 (regex='^(0|1)*$')
[gp001]
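The linking rules described for ACTIVITY_PROPERTIES (each record points at exactly one ACTIVITY via ACT_ID, ACT_ID is unique per activity, and TYPE is unique within an ACT_ID) can be sketched as a cross-file check. The function name, messages and example rows below are hypothetical; rows are assumed to be dictionaries keyed by header.

```python
from collections import Counter


def check_activity_links(activity_rows, property_rows):
    """Return problems in the ACT_ID links between ACTIVITY and ACTIVITY_PROPERTIES rows."""
    act_ids = Counter(row["ACT_ID"] for row in activity_rows if row.get("ACT_ID"))
    problems = [
        f"ACT_ID {act_id} is not unique in ACTIVITY"
        for act_id, n in act_ids.items()
        if n > 1
    ]
    seen = set()
    for row in property_rows:
        key = (row["ACT_ID"], row["TYPE"])
        if row["ACT_ID"] not in act_ids:
            problems.append(f"ACT_ID {row['ACT_ID']} not found in ACTIVITY")
        if key in seen:
            problems.append(f"TYPE {row['TYPE']!r} repeated within ACT_ID {row['ACT_ID']}")
        seen.add(key)
    return problems


activities = [{"ACT_ID": "1"}, {"ACT_ID": "2"}]
properties = [
    {"ACT_ID": "1", "TYPE": "DOSE"},
    {"ACT_ID": "1", "TYPE": "DOSE"},
    {"ACT_ID": "3", "TYPE": "DOSE"},
]
link_problems = check_activity_links(activities, properties)
```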

ACTIVITY_SUPPLEMENTARY

Header
Description
Existence
Field Type
Additional Constraints
TYPE
The type of supplementary measurement
Mandatory
Any character upto a length of 250
​
RELATION
Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>')
Optional (Mandatory if a VALUE is given)
Any character upto a length of 50
Relation symbol
(=, >, <, ~, <=, >=, >>, <<)
[gp022]
VALUE
The numerical value of the supplementary measurement
Optional
Any number (incl decimals, negatives and sci Notn)
Any Number. Decimal, Sci Notn, +/-
[gp005]
UNITS
The units of the supplementary measurement
Optional
Any character upto a length of 100
​
TEXT_VALUE
The text value of non-numerical values
Optional
Any character upto a length of 1000
​
COMMENTS
A comment on the record.
Optional
Any character upto a length of 4000
​
REGID
Record Grouping Identifier. Groups together records in the ACTIVITY_SUPP file. Depositor defined. Has often been used to group by (for example) all measurements from a single animal in a study. Analogous to TEOID.
Mandatory
Any integer upto a length of 11
​
SAMID
FK to the SAMID in the ACTIVITY_SUPP file. Usually refers to a single specific measurement, not a group, animal or well. Depositor defined.
Mandatory
Any integer upto a length of 11
​

Summary of Pattern and Dependency Rules

Regexes are only shown for Pattern rules. Dependency rules involve a number of regexes, and so are not easily shown here.
Rule Type
Rule ID
Short Description
Regex
Long Description
DEPENDENCY
gd1
Type of target restricts kind of accession used.
​
The type of target restricts the kind of accession that should be used. For example, if the target type is 'Protein', then a UniProt ID is expected. Target Field='TARGET_ACCESSION'
DEPENDENCY
gd2
A short desc of gd2
​
A longer desc of gd2 Target Field='ASSAY_TAX_ID'
DEPENDENCY
gd3
If populated with an integer, then some text expected in this field
​
If this field is populated with an integer, then the target field should contain some text. Thus if the ASSAY_TAX_ID is given, then a name should also be provided for the organism. Target Field='ASSAY_ORGANISM'
PATTERN
gp001
0 or 1
^(0|1)*$
0 or 1
PATTERN
gp003
Positive integer or '0'
^\d*$
A positive integer or zero
PATTERN
gp005
Any Number. Decimal, Sci Notn, +/-
^\-?(\d+(\.\d+)?|\.\d+)(e\-?\+?\d\d?)?$
A number. May be a decimal, may be positive or negative, and may be in scientific notation. Case insensitive match
PATTERN
gp006
Positive integer
^[1-9]\d*$
A positive integer. Not zero
PATTERN
gp010
A Digital Object Identifier
^(10\.\d\d\d\d+\/.*)$
A Digital Object Identifier
PATTERN
gp011
A Patent Identifier
^(WO|EP|US)\-?\d+.*$
A Patent Identifier. WO,EP,US
PATTERN
gp018
Accepted Assay test types [case Ins]
^(in vitro|in vivo|ex vivo)$
Accepted Assay test types. Case insensitive match
PATTERN
gp019
Accepted Assay types [case Ins]
^(ADMET|A|Functional|F|Binding|B|Unassigned|U|Physiochemical|P|Toxicity|T)$
Accepted Assay types. Case insensitive match
PATTERN
gp021
Accepted Target types [case ins]
^(None|NUCLEIC\-ACID|NUCLEIC ACID|TISSUE|PROTEIN|ORGANISM|CELL\-LINE|CELL\_LINE| CELL LINE|ADMET|UNKNOWN|UNCHECKED|SUBCELLULAR|NO TARGET|PROTEIN COMPLEX|PROTEIN FAMILY|PROTEIN COMPLEX GROUP|CHIMERIC PROTEIN|SELECTIVITY GROUP|PROTEIN\-PROTEIN INTERACTION|SINGLE PROTEIN|MOLECULAR|NON\-MOLECULAR|UNDEFINED|PHENOTYPE|PROTEIN NUCLEIC\-ACID COMPLEX|SMALL MOLECULE|OLIGOSACCHARIDE|METAL|LIPID|MACROMOLECULE| Other)$
Accepted Target types. Case insensitive match.
PATTERN
gp022
relation symbol (=,>,etc).
^(\=|\>|\<|\~|\<\=|\>\=|\>\>|\<\<)\=?$
An accepted relation symbol
PATTERN
gp023
CHEMBLID format
^CHEMBL\d+$
The accepted format for a CHEMBLID
PATTERN
gp024
An integer or a UniProt ID
^(\d+)|([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})
An integer or a UniProt ID. Acceptable 'Accession IDs'
PATTERN
gp031
1900 <= year <= 2049
^(19\d\d|20(0|1|2|3|4)\d)$
1900 <= year <= 2049
PATTERN
gp032
An accepted reference type [case ins]
^(Patent|Publication|Dataset|Book)$
An accepted reference type. Case insensitive match
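To see what a pattern rule accepts in practice, the regexes in the table above can be compiled and probed directly. The sketch below copies four of them verbatim; the `PATTERN_RULES` dictionary and `passes` helper are hypothetical names, and only gp005 is compiled case-insensitively because its long description says so.

```python
import re

# Regexes copied verbatim from the table above.
PATTERN_RULES = {
    "gp001": re.compile(r"^(0|1)*$"),
    "gp005": re.compile(r"^\-?(\d+(\.\d+)?|\.\d+)(e\-?\+?\d\d?)?$", re.IGNORECASE),
    "gp023": re.compile(r"^CHEMBL\d+$"),
    "gp031": re.compile(r"^(19\d\d|20(0|1|2|3|4)\d)$"),
}


def passes(rule_id, value):
    """Return True if value matches the named pattern rule."""
    return PATTERN_RULES[rule_id].match(value) is not None


# Probe the year rule to find the range it really accepts.
accepted_years = [year for year in range(1800, 2200) if passes("gp031", str(year))]
year_range = (accepted_years[0], accepted_years[-1])
```

Probing in this way shows, for instance, that gp031 accepts years from 1900 to 2049, and that gp005 as written allows at most two exponent digits and no leading plus sign.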

ACTION_TYPE valid names

​
ACTION_TYPE
DESCRIPTION
PARENT_TYPE
ACTIVATOR
Positively affects the normal functioning of the protein e.g., activation of an enzyme or cleaving a clotting protein precursor
POSITIVE MODULATOR
AGONIST
Binds to and activates a receptor, often mimicking the effect of the endogenous ligand
POSITIVE MODULATOR
ALLOSTERIC ANTAGONIST
Binds to a receptor at an allosteric site and prevents activation by a positive allosteric modulator at that site
NEGATIVE MODULATOR
ANTAGONIST
Binds to a receptor and prevents activation by an agonist through competing for the binding site
NEGATIVE MODULATOR
ANTISENSE INHIBITOR
Prevents translation of a complementary mRNA sequence through binding and targeting it for degradation
NEGATIVE MODULATOR
BINDING AGENT
Binds to a substance such as a cell surface antigen, targeting a drug to that location, but not necessarily affecting the functioning of the substance itself
OTHER
BLOCKER
Negatively affects the normal functioning of an ion channel e.g., prevents or reduces transport of ions through the channel
NEGATIVE MODULATOR
CHELATING AGENT
Binds to a metal, reducing its availability for further interactions
OTHER
CROSS-LINKING AGENT
Induces cross-linking of proteins or nucleic acids
OTHER
DEGRADER
Binds to or antagonizes a receptor, leading to its degradation
NEGATIVE MODULATOR
DISRUPTING AGENT
Destabilises or disrupts a protein complex, macromolecular assembly, cell membrane etc
OTHER
HYDROLYTIC ENZYME
Hydrolyses a substrate through enzymatic reaction
OTHER
INHIBITOR
Negatively affects (inhibits) the normal functioning of the protein e.g., prevention of enzymatic reaction or activation of downstream pathway
NEGATIVE MODULATOR
INVERSE AGONIST
Binds to and inactivates a receptor
NEGATIVE MODULATOR
METHYLATING AGENT
Methylates or participates in methylation (e.g., through donation of a methyl group) of a substrate molecule
OTHER
MODULATOR
Affects the normal functioning of a protein in some way e.g., mixed agonist/antagonist or unclear whether action is positive or negative
OTHER
NEGATIVE ALLOSTERIC MODULATOR
Reduces or prevents the action of the endogenous ligand of a receptor through binding to a site distinct from that ligand (non-competitive inhibition)
NEGATIVE MODULATOR
NEGATIVE MODULATOR
Negatively affects the normal functioning of a protein e.g., receptor antagonist, inverse agonist or negative allosteric modulator
NEGATIVE MODULATOR
OPENER
Positively affects the normal functioning of an ion channel e.g., facilitates transport of ions through the channel
POSITIVE MODULATOR
OTHER
Other action type, not clearly positively or negatively affecting the normal functioning of a protein e.g., chelation of substances, hydrolysis of substrate
OTHER
OXIDATIVE ENZYME
Oxidises a substrate through enzymatic reaction