Input File structure and requirements
This page describes all aspects of the requirements for loading BioActivity data, including the permitted file names and column headers, and the permitted content of the cells (or 'fields') within these files.
Files must be supplied as tab-separated text files with UTF-8 encoding (See examples, and the walkthrough slides here). Spreadsheets will not work with the loader.
Data deposition ChEMBL slides.pptx
Please note that unless you are loading additional data to existing assays, RIDX is a mandatory field in any file that includes it. Without including your RIDX, records cannot be linked to your references and the load will fail.

The 'rules' referred to are summarized in table form at the end of this page.
- Rules are applied when validating the file.
- Each rule is associated with a single ‘Penalty Score’ (PS) value, which can range from 0 to 9 inclusive.
- The higher the score, the more serious the problem. Scores of 9 cause an automatic load failure.
- Consider limiting numeric data to a small number of decimal places so it is easily compared by users.
- Leave values empty if they are null. If you use placeholders like "-", "None" or "NULL", the data may load into the database while being invalid, or may be difficult to search for, because these values are not recorded as nulls. The loader attempts to convert such values to nulls, but we cannot cover every possibility.
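The advice on nulls can be applied when a file is written. The following is a minimal sketch using Python's standard `csv` module. The placeholder set and the function names `blank_placeholders` and `write_tsv` are illustrative assumptions, not part of the loader.

```python
import csv
import io

# Placeholder strings named on this page; extend the set for your own data (assumption).
NULL_PLACEHOLDERS = {"-", "none", "null"}

def blank_placeholders(row):
    """Return a copy of row with null placeholders replaced by empty strings."""
    cleaned = {}
    for key, value in row.items():
        text = "" if value is None else str(value)
        cleaned[key] = "" if text.strip().lower() in NULL_PLACEHOLDERS else text
    return cleaned

def write_tsv(rows, headers, handle):
    """Write rows as tab-separated text with a header line."""
    writer = csv.DictWriter(handle, fieldnames=headers, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(blank_placeholders(row))

# Example: the "NULL" strain becomes an empty cell.
headers = ["AIDX", "ASSAY_DESCRIPTION", "ASSAY_TYPE", "ASSAY_STRAIN"]
rows = [{"AIDX": "A1", "ASSAY_DESCRIPTION": "Binding assay", "ASSAY_TYPE": "B", "ASSAY_STRAIN": "NULL"}]
buffer = io.StringIO()
write_tsv(rows, headers, buffer)
tsv_text = buffer.getvalue()
```

To write a real file, pass a handle opened with `open(path, "w", encoding="utf-8", newline="")` instead of the in-memory buffer.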
Filenames | Existence | Level | Depositor Defined ID (DDID) defined by this file | Definition of Primary Key | All records in this file must be 'Foreign-Keyed' to... |
ASSAY.tsv | Optional | primary | AIDX | ('AIDX',) | - |
ASSAY_PARAM.tsv | Optional | secondary | - | - | AIDX in ASSAY |
COMPOUND_RECORD.tsv | Optional (But Mandatory if compound records do not exist for your compounds) | primary | CIDX | ('CIDX', 'RIDX') | - |
COMPOUND_CTAB.sdf | Optional | secondary | - | - | CIDX in COMPOUND_RECORD |
REFERENCE.tsv | Optional | primary | RIDX | ('RIDX',) | - |
ACTIVITY.tsv | Optional | tertiary | - | - | - |
ACTIVITY_PROPERTIES.tsv | Optional | not defined | - | - | - |
ACTIVITY_SUPP.tsv | Optional | not defined | - | - | - |
ACTIVITY_SUPP_MAP.tsv | Optional | not defined | - | - | - |
INFO.txt | Irrelevant | not defined | - | - | - |
File: All filenames must have a 3 letter extension (eg: '.txt', '.tsv', '.sdf', etc).
Existence:
- Irrelevant - May exist, but will be ignored if it does.
- Optional - May exist, and will be used if it does.
- Mandatory - Must exist, and the dataset will not load if it does not.
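A quick pre-submission check of file names can catch spreadsheet files or misspelled names early. This sketch assumes the file names listed in the table above are the only ones of interest. The function name `check_filenames` and the messages it returns are hypothetical and do not reproduce the loader's own output.

```python
import re

# File names copied from the table above.
KNOWN_FILES = {
    "ASSAY.tsv", "ASSAY_PARAM.tsv", "COMPOUND_RECORD.tsv", "COMPOUND_CTAB.sdf",
    "REFERENCE.tsv", "ACTIVITY.tsv", "ACTIVITY_PROPERTIES.tsv",
    "ACTIVITY_SUPP.tsv", "ACTIVITY_SUPP_MAP.tsv", "INFO.txt",
}

# A dot followed by exactly three letters or digits at the end of the name.
THREE_LETTER_EXTENSION = re.compile(r"\.[A-Za-z0-9]{3}$")

def check_filenames(names):
    """Return a list of human-readable problems found in the given file names."""
    problems = []
    for name in names:
        if not THREE_LETTER_EXTENSION.search(name):
            problems.append(f"{name}: no 3 letter extension")
        elif name not in KNOWN_FILES:
            problems.append(f"{name}: not a file name listed in the table")
    return problems

problems = check_filenames(["ASSAY.tsv", "ACTIVITY.xlsx", "ASSAY.csv"])
```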
Some fields will only take certain valid identifiers, for example ASSAY.TEST_TYPE must be "in vitro", "in vivo", or "ex vivo". Valid datatypes are described in the tables and also summarised here.
For further information on constraints and error scores for the loader, see this page. For most depositors, the info here should be sufficient.
An ASSAY record is a single instance of an assay, not an assay protocol record. If the same assay protocol is used in multiple datasets, it is still a new ASSAY in each dataset.
Header | Description | Existence | Field type | Additional constraints |
AIDX | The AIDX cited by the depositor. A Primary Key defined header | Mandatory | Any character upto a length of 200 | ​ |
RIDX | The RIDX cited by the depositor. A Self-Referencing field. MUST be owned by depositor [ie: Not a PK Identifier, but a FK to an identifier owned by the depositor] | Optional | Any character upto a length of 200 | ​ |
ASSAY_DESCRIPTION | A description of the assay | Mandatory | Any character upto a length of 4000 | ​ |
ASSAY_TYPE | The type of the assay. B (Binding), F (Functional), A (ADMET), U (Unassigned), P (Physicochemical) or T (Toxicity) | Mandatory | Any character upto a length of 1 | Accepted Assay types (ADMET, A, Functional, F, Binding, B, Unassigned, U, Physiochemical, P, Toxicity, T) [gp019] |
ASSAY_ORGANISM | The assay organism | Optional (Mandatory if ASSAY_TAX_ID is populated) | Any character upto a length of 250 | |
ASSAY_STRAIN | The strain of the assay organism | Optional | Any character upto a length of 200 | ​ |
ASSAY_TAX_ID | NCBI taxonomy ID for the assay organism | Optional | Any integer upto a length of 11 | Positive integer or '0' (regex='^\d*$') [gp003] |
ASSAY_SOURCE | The original source of the assay | Optional | Any character upto a length of 100 | ​ |
ASSAY_TISSUE | The type of tissue used in the assay | Optional | Any character upto a length of 100 | ​ |
ASSAY_CELL_TYPE | The cell line | Optional | Any character upto a length of 100 | ​ |
ASSAY_SUBCELLULAR_FRACTION | The subcellular fraction used in the assay | Optional | Any character upto a length of 100 | ​ |
TARGET_TYPE | The type of target | Optional | Any character upto a length of 25 | |
TARGET_NAME | The name of the target | Optional | Any character upto a length of 400 | ​ |
TARGET_ACCESSION | The accession number of the target (eg: UniProt Acc, NCBI tax ID) | Optional | Any character upto a length of 255 | An integer or a UniProt ID [gp024] |
TARGET_ORGANISM | The target organism | Optional | Any character upto a length of 100 | ​ |
TARGET_TAX_ID | The NCBI taxonomy ID of the target organism | Optional | Any integer upto a length of 11 | Positive integer or '0' (regex='^\d*$') [gp003] |
Header | Description | Existence | Field Type | Additional constraints |
AIDX | The AIDX cited by the depositor. A Self-Referencing field. MUST be owned by depositor [ie: Not a PK Identifier, but a FK to an identifier owned by the depositor] | Mandatory | Any character upto a length of 200 | ​ |
TYPE | The type of parameter. Must be unique within an AIDX | Mandatory | Any character upto a length of 250 | ​ |
RELATION | Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>') | Optional (Mandatory if VALUE is given) | Any character upto a length of 50 | Relation symbol (=, >, <, ~, <=, >=, >>, <<) [gp022] |
VALUE | The numerical value of the parameter. | Optional | Any number (incl decimals, negatives and sci Notn) | Any Number. Decimal, Sci Notn, +/- [gp005] |
UNITS | The units of the parameter measurement | Optional | Any character upto a length of 100 | ​ |
TEXT_VALUE | The text value of non-numerical values | Optional | Any character upto a length of 4000 | ​ |
COMMENTS | A comment on the parameter. | Optional | Any character upto a length of 4000 | ​ |
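Three constraints on ASSAY_PARAM records can be checked before submission: the AIDX must exist in ASSAY, TYPE must be unique within an AIDX, and RELATION is required whenever VALUE is given. The sketch below assumes each row has already been read into a dictionary keyed by header. The function name `check_assay_params` and its messages are illustrative, not loader output.

```python
def check_assay_params(param_rows, assay_aidx):
    """Return (row number, message) pairs for rows that break the constraints above."""
    problems = []
    seen = set()  # (AIDX, TYPE) pairs already encountered
    for number, row in enumerate(param_rows, start=1):
        aidx = row.get("AIDX", "")
        param_type = row.get("TYPE", "")
        if aidx not in assay_aidx:
            problems.append((number, "AIDX not found in ASSAY"))
        if (aidx, param_type) in seen:
            problems.append((number, "TYPE repeated within AIDX"))
        seen.add((aidx, param_type))
        if row.get("VALUE", "") != "" and row.get("RELATION", "") == "":
            problems.append((number, "VALUE given without RELATION"))
    return problems

example_rows = [
    {"AIDX": "A1", "TYPE": "pH", "RELATION": "=", "VALUE": "7.4"},
    {"AIDX": "A1", "TYPE": "pH", "RELATION": "", "VALUE": "7.0"},
    {"AIDX": "A9", "TYPE": "Temperature", "RELATION": "=", "VALUE": "37"},
]
param_problems = check_assay_params(example_rows, {"A1"})
```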
We need you to provide at least one of COMPOUND_NAME, COMPOUND_KEY or COMPOUND_SOURCE in order to make the compound searchable.
Header | Description | Existence | Field Type | Additional constraints |
CIDX | The CIDX cited by the depositor. A Primary Key defined header | Mandatory | Any character upto a length of 200 | ​ |
RIDX | The RIDX cited by the depositor. A Self-Referencing field. MUST be owned by depositor [ie: Not a PK Identifier, but a FK to an identifier owned by the depositor] | Optional | Any character upto a length of 200 | ​ |
COMPOUND_KEY | The local synonym used for this CIDX in the RIDX quoted | Mandatory | Any character upto a length of 250 | ​ |
COMPOUND_NAME | The name used for this CIDX in the RIDX quoted | Mandatory | Any character upto a length of 4000 | ​ |
COMPOUND_SOURCE | The source of this CIDX in the RIDX quoted | Optional | Any character upto a length of 400 | ​ |
This must be in V2000 molfile format. The InChI binaries we use currently do not accept V3000 molfiles, so we cannot accept V3000 structures: we need an InChI Key for ChEMBL.
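In the molfile format, the version tag sits at the end of the counts line, which is the fourth line of each record. A rough way to spot non-V2000 records in an SDF file is sketched below. The function name `non_v2000_records` is hypothetical, and the check only inspects the counts line; it does not validate the structure itself.

```python
def non_v2000_records(sdf_text):
    """Return 1-based positions of SDF records whose counts line lacks the V2000 tag."""
    bad = []
    record = []
    count = 0

    def check(lines):
        nonlocal count
        count += 1
        # The counts line is the fourth line of a molfile and ends with the version tag.
        if len(lines) < 4 or not lines[3].rstrip().endswith("V2000"):
            bad.append(count)

    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":  # record terminator in SDF files
            check(record)
            record = []
        else:
            record.append(line)
    # Handle a final record that is missing its terminator.
    if any(line.strip() for line in record):
        check(record)
    return bad
```

Records flagged by this check would need converting to V2000 with a cheminformatics toolkit before deposition.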
Header | Description | Existence | Field type | Additional Constraints |
CIDX | The CIDX cited by the depositor [but, note that an alternative header label can be set using the -C option] | Mandatory | Any character upto a length of 200 | ​ |
CTAB | The CTAB (Connection table) assigned to this CIDX | Optional | A very large text field | ​ |
The TITLE, ABSTRACT and AUTHORS fields of the DOCS table should also be populated for Datasets, as well as Publications.
These can be brief (i.e. an organisation name can be provided for the AUTHORS field) and the abstract can be a simple summary of the experiments and their overall purpose. It should contain sufficient details for a user to locate your publication or project by reading the reference. For example, the ABSTRACT field could contain a short description of the dataset and a link to the project site.
Header | Description | Existence | Field Type | Additional Constraints |
RIDX | The RIDX cited by the depositor. A Primary Key defined header | Mandatory | Any character upto a length of 200. Will warn if this starts with 0 | ​ |
PUBMED_ID | PubMed ID | Mandatory to have either PMID or DOI if one exists. If your data do not have one, contact us. | Any integer upto a length of 11 | Positive integer (regex='^[1-9]\d*$') [gp006] |
JOURNAL_NAME | Journal name | Optional | Any character upto a length of 50 | ​ |
YEAR | Year of publication | Mandatory | Any integer upto a length of 4 | 1900 <= year < 2050 [gp031] |
VOLUME | The volume of the publication | Optional | Any character upto a length of 50 | ​ |
ISSUE | The issue of the publication | Optional | Any character upto a length of 50 | ​ |
FIRST_PAGE | The first page of the article | Optional | Any character upto a length of 50 | Positive integer (regex='^[1-9]\d*$') [gp006] |
LAST_PAGE | The last page of the article | Optional | Any character upto a length of 50 | Positive integer (regex='^[1-9]\d*$') [gp006] |
REF_TYPE | The type of reference (Publication, Patent, Dataset, Book) | Mandatory | Any character upto a length of 50 | An accepted reference type [case ins] (Patent, Publication, Dataset, Book) [gp032] |
TITLE | The title of the reference | Mandatory | Any character upto a length of 500 | ​ |
DOI | The Digital Object Identifier | Mandatory to have either PMID or DOI if one exists. If your data do not have one, contact us. | Any character upto a length of 200 | A Digital Object Identifier (regex='^(10\.\d\d\d\d+\/.*)$') [gp010] |
PATENT_ID | The Patent Identifier | Optional | Any character upto a length of 200 | A Patent Identifier (regex='^(WO|EP|US)\-?\d+.*$') [gp011] |
ABSTRACT | The abstract of the article. For a dataset, include a description of the dataset here. | Mandatory | A very large text field | ​ |
AUTHORS | A list of the authors of the publication | Mandatory | Any character upto a length of 4000 | ​ |
Header | Description | Existence | Datatype rule | Additional Constraints |
CIDX | The CIDX cited by the depositor | Mandatory | Any character upto a length of 200 | ​ |
CRIDX | The RIDX to be associated with the CIDX in the creation of the compound record. Must belong to SRC_ID_CIDX. | Mandatory | Any character upto a length of 200 | ​ |
SRC_ID_CIDX | The SRC_ID for the CIDX. If not specified, then value is assumed to be the SRC_ID for the depositor | Optional | Any integer upto a length of 4 | Positive integer (regex='^[1-9]\d*$') [gp006] |
AIDX | The AIDX cited by the depositor | Mandatory | Any character upto a length of 200 | ​ |
SRC_ID_AIDX | The SRC_ID for the AIDX. If not specified, then value is assumed to be the SRC_ID for the depositor | Optional | Any integer upto a length of 4 | Positive integer (regex='^[1-9]\d*$') [gp006] |
RIDX | The RIDX cited by the depositor. A Self-Referencing field. MUST be owned by depositor [ie: Not a PK Identifier, but a FK to an identifier owned by the depositor] | Deprecated. Not visible to end-users | Any character upto a length of 200 | ​ |
TEXT_VALUE | The text value of non-numerical values | Optional | Any character upto a length of 1000 | ​ |
RELATION | Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>') | Optional (Mandatory if a VALUE is given) | Any character upto a length of 50 | Relation symbol (=, >, <, ~, <=, >=, >>, <<) [gp022] |
VALUE | The numerical value of the activity measurement (see ACTIVITY_COMMENT for non-numerical values) | Optional | Any number (incl decimals, negatives and sci Notn) | Any Number. Decimal, Sci Notn, +/- |
UPPER_VALUE | Where the activity is a range, this represents the highest value of the range (numerically), while the VALUE column represents the lower value | Optional | Any number (incl decimals, negatives and sci Notn) | |
UNITS | The units of the measurement | Optional | Any character upto a length of 100 | ​ |
SD_MINUS | Standard Deviation Lower limit | Optional | Any number (incl decimals, negatives and sci Notn) | ​ |
SD_PLUS | Standard Deviation Upper limit | Optional | Any number (incl decimals, negatives and sci Notn) | ​ |
ACTIVITY_COMMENT | A comment on the activity measurement. Non-numerical 'values' should be given here. Equivalent to 'TEXT_VALUE' field in many other tables. | Optional | Any character upto a length of 4000 | ​ |
CRIDX_CHEMBLID | The CHEMBLID for the CRIDX. Must belong to the SRC_ID_CIDX | Optional | Any character upto a length of 200 | CHEMBLID format (regex='^CHEMBL\d+$') [gp023] |
CRIDX_DOCID | The DOCID for the CRIDX. Must belong to the SRC_ID_CIDX | Optional | Any character upto a length of 200 | ​ |
ACT_ID | A local ID used to relate records in ACTIVITY_PROPERTIES and Supplementary tables. Not required unless depositing such data. | Optional (Mandatory if there are ACTIVITY_PROPERTIES or ACTIVITY SUPPLEMENTARY files) | Any integer upto a length of 11. Must be unique for each activity. | |
TEOID | Test Occasion ID, grouping together related Activity records. Depositor defined. Usually used to group activities by timepoint and treatment. | Optional | Any integer upto a length of 11 | |
TYPE | The type of measurement | Mandatory | Any character upto a length of 250 | ​ |
ACTION_TYPE | Specifies the effect of the drug on its target for protein-based assays. Must match one of the names in the ACTION_TYPE table | Optional | Any character up to a length of 50 | Must be in the ACTION_TYPE table |
Each ACTIVITY can relate to multiple properties, via the ACT_ID field. Each ACTIVITY_PROPERTIES record can only link back to a single ACTIVITY record. Even if two ACTIVITY records have the same experimental conditions, they both need their own set of ACTIVITY_PROPERTIES records.
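Because each ACTIVITY_PROPERTIES record links to exactly one ACTIVITY, conditions shared by several activities have to be repeated once per ACT_ID. A minimal sketch of that expansion follows. The function name `expand_properties` and the example values are illustrative assumptions.

```python
def expand_properties(act_ids, shared_properties):
    """Create one ACTIVITY_PROPERTIES row per (ACT_ID, shared property) pair."""
    rows = []
    for act_id in act_ids:
        for prop in shared_properties:
            row = dict(prop)        # copy so each activity gets its own record
            row["ACT_ID"] = act_id  # link the copy to a single ACTIVITY record
            rows.append(row)
    return rows

# Two activities measured under the same conditions.
shared = [
    {"TYPE": "Temperature", "RELATION": "=", "VALUE": "37", "UNITS": "degrees C"},
    {"TYPE": "Incubation time", "RELATION": "=", "VALUE": "2", "UNITS": "h"},
]
property_rows = expand_properties([101, 102], shared)
```

The result contains four rows, two for each ACT_ID, so that TYPE stays unique within each ACT_ID.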
Header | Description | Existence | Field type | Additional Constraints |
ACT_ID | FK to the ACTIVITY file. Depositor defined. | Mandatory | Any integer upto a length of 11 | ​ |
TYPE | The type of property measurement. Must be unique within an ACT_ID. Should be self-explanatory to external researchers | Mandatory | Any character upto a length of 250 | |
RELATION | Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>') | Optional (Mandatory if a VALUE is given) | Any character upto a length of 50 | Relation symbol (=, >, <, ~, <=, >=, >>, <<) [gp022] |
VALUE | The numerical value of the property measurement | Optional | Any number (incl decimals, negatives and sci Notn) | Any Number. Decimal, Sci Notn, +/- [gp005] |
UNITS | The units of the property measurement | Optional | Any character upto a length of 100 | ​ |
TEXT_VALUE | The text value of non-numerical values | Optional | Any character upto a length of 1000 | ​ |
COMMENTS | A comment on the property. Can be used to further describe the TYPE, TEXT_VALUE or VALUE if needed | Optional | Any character upto a length of 4000 | ​ |
RESULT_FLAG | A flag to indicate, if set to 1, that this type is a dependent variable/result (e.g., slope) rather than an independent variable/parameter (0, the default). | Optional | Any integer upto a length of 1 | 0 or 1 (regex='^(0|1)*$') [gp001] |
Header | Description | Existence | Field Type | Additional Constraints |
TYPE | The type of supplementary measurement | Mandatory | Any character upto a length of 250 | ​ |
RELATION | Symbol indicating relationship between the Type and the Value (permitted: '>','<','=','~','<=','>=','<<','>>') | Optional (Mandatory if a VALUE is given) | Any character upto a length of 50 | Relation symbol (=, >, <, ~, <=, >=, >>, <<) [gp022] |
VALUE | The numerical value of the supplementary measurement | Optional | Any number (incl decimals, negatives and sci Notn) | Any Number. Decimal, Sci Notn, +/- [gp005] |
UNITS | The units of the supplementary measurement | Optional | Any character upto a length of 100 | ​ |
TEXT_VALUE | The text value of non-numerical values | Optional | Any character upto a length of 1000 | ​ |
COMMENTS | A comment on the record. | Optional | Any character upto a length of 4000 | ​ |
REGID | Record Grouping Identifier. Groups together records in ACTIVITY_SUPP file. Depositor defined. Has often been used to group by (for example) all measurements from a single animal in a study. Analogous to TEOID. | Mandatory | Any integer upto a length of 11 | |
SAMID | FK to the SAMID in the ACTIVITY_SUPP file. Usually referring to a single specific measurement, not a group, animal or well. Depositor defined. | Mandatory | Any integer upto a length of 11 | |
Regexes are only shown for Pattern rules. Dependency rules involve a number of regexes, and so are not easily shown here.
Rule Type | Rule ID | Short Description | Regex | Long Description |
DEPENDENCY | gd1 | Type of target restricts kind of accession used. | ​ | The type of target restricts the kind of accession that should be used. For example, if the target type is 'Protein', then a UniProt ID is expected. Target Field='TARGET_ACCESSION' |
DEPENDENCY | gd2 | A short desc of gd2 | ​ | A longer desc of gd2 Target Field='ASSAY_TAX_ID' |
DEPENDENCY | gd3 | If populated with an integer, then some text expected in this field | ​ | If this field is populated with an integer, then the target field should contain some text. Thus if the ASSAY_TAX_ID is given, then a name should also be provided for the organism. Target Field='ASSAY_ORGANISM' |
PATTERN | gp001 | 0 or 1 | ^(0|1)*$ | 0 or 1 |
PATTERN | gp003 | Positive integer or '0' | ^\d*$ | A positive integer or zero |
PATTERN | gp005 | Any Number. Decimal, Sci Notn, +/- | ^\-?(\d+(\.\d+)?|\.\d+)(e\-?\+?\d\d?)?$ | A number. May be a decimal and may be positive or negative, May be scientific notation. Case Ins |
PATTERN | gp006 | Positive integer | ^[1-9]\d*$ | A positive integer. Not zero |
PATTERN | gp010 | A Digital Object Identifier | ^(10\.\d\d\d\d+\/.*)$ | A Digital Object Identifier |
PATTERN | gp011 | A Patent Identifier | ^(WO|EP|US)\-?\d+.*$ | A Patent Identifier. WO,EP,US |
PATTERN | gp018 | Accepted Assay test types [case Ins] | ^(in vitro|in vivo|ex vivo)$ | Accepted Assay test types. Case insensitive match |
PATTERN | gp019 | Accepted Assay types [case Ins] | ^(ADMET|A|Functional|F|Binding|B|Unassigned|U|Physiochemical|P|Toxicity|T)$ | Accepted Assay types. Case insensitive match |
PATTERN | gp021 | Accepted Target types [case ins] | ^(None|NUCLEIC\-ACID|NUCLEIC ACID|TISSUE|PROTEIN|ORGANISM|CELL\-LINE|CELL\_LINE| CELL LINE|ADMET|UNKNOWN|UNCHECKED|SUBCELLULAR|NO TARGET|PROTEIN COMPLEX|PROTEIN FAMILY|PROTEIN COMPLEX GROUP|CHIMERIC PROTEIN|SELECTIVITY GROUP|PROTEIN\-PROTEIN INTERACTION|SINGLE PROTEIN|MOLECULAR|NON\-MOLECULAR|UNDEFINED|PHENOTYPE|PROTEIN NUCLEIC\-ACID COMPLEX|SMALL MOLECULE|OLIGOSACCHARIDE|METAL|LIPID|MACROMOLECULE| Other)$ | Accepted Target types. Case insensitive match. |
PATTERN | gp022 | relation symbol (=,>,etc). | ^(\=|\>|\<|\~|\<\=|\>\=|\>\>|\<\<)\=?$ | An accepted relation symbol |
PATTERN | gp023 | CHEMBLID format | ^CHEMBL\d+$ | The accepted format for a CHEMBLID |
PATTERN | gp024 | An integer or a UniProt ID | ^(\d+)|([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}) | An integer or a UniProt ID. Acceptable 'Accession IDs' |
PATTERN | gp031 | 1900 <= year < 2050 | ^(19\d\d|20(0|1|2|3|4)\d)$ | A four-digit year from 1900 to 2049 inclusive |
PATTERN | gp032 | An accepted reference type [case ins] | ^(Patent|Publication|Dataset|Book)$ | An accepted reference type. Case insensitive match |
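Several of the Pattern rules above can be tried locally before submitting a file. The sketch below copies a subset of the regexes from the table into Python and applies the case-insensitive flag where the table says so. The function name `violates` is hypothetical, and skipping empty cells is an assumption about how nulls are treated, not a statement about the loader's behaviour.

```python
import re

# Regexes copied from the Pattern rules table; IGNORECASE where the table notes it.
PATTERNS = {
    "gp001": re.compile(r"^(0|1)*$"),
    "gp003": re.compile(r"^\d*$"),
    "gp005": re.compile(r"^\-?(\d+(\.\d+)?|\.\d+)(e\-?\+?\d\d?)?$", re.IGNORECASE),
    "gp006": re.compile(r"^[1-9]\d*$"),
    "gp010": re.compile(r"^(10\.\d\d\d\d+\/.*)$"),
    "gp018": re.compile(r"^(in vitro|in vivo|ex vivo)$", re.IGNORECASE),
    "gp022": re.compile(r"^(\=|\>|\<|\~|\<\=|\>\=|\>\>|\<\<)\=?$"),
    "gp023": re.compile(r"^CHEMBL\d+$"),
    "gp031": re.compile(r"^(19\d\d|20(0|1|2|3|4)\d)$"),
    "gp032": re.compile(r"^(Patent|Publication|Dataset|Book)$", re.IGNORECASE),
}

def violates(rule_id, value):
    """Return True if a non-empty value fails the named pattern rule."""
    if value == "":
        return False  # assumption: empty cells are nulls and are not pattern-checked
    return PATTERNS[rule_id].search(value) is None

# Example: check a few REFERENCE fields against their rules.
reference = {"PUBMED_ID": "0", "YEAR": "2050", "DOI": "10.1000/xyz", "REF_TYPE": "dataset"}
rules = {"PUBMED_ID": "gp006", "YEAR": "gp031", "DOI": "gp010", "REF_TYPE": "gp032"}
failed = [field for field, rule in rules.items() if violates(rule, reference[field])]
```

Here `failed` lists PUBMED_ID (zero is not a positive integer) and YEAR (2050 falls outside the accepted range), while the DOI and the lower-case reference type pass.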
​
ACTION_TYPE | DESCRIPTION | PARENT_TYPE |
---|---|---|
ACTIVATOR | Positively affects the normal functioning of the protein e.g., activation of an enzyme or cleaving a clotting protein precursor | POSITIVE MODULATOR |
AGONIST | Binds to and activates a receptor, often mimicking the effect of the endogenous ligand | POSITIVE MODULATOR |
ALLOSTERIC ANTAGONIST | Binds to a receptor at an allosteric site and prevents activation by a positive allosteric modulator at that site | NEGATIVE MODULATOR |
ANTAGONIST | Binds to a receptor and prevents activation by an agonist through competing for the binding site | NEGATIVE MODULATOR |
ANTISENSE INHIBITOR | Prevents translation of a complementary mRNA sequence through binding and targeting it for degradation | NEGATIVE MODULATOR |
BINDING AGENT | Binds to a substance such as a cell surface antigen, targeting a drug to that location, but not necessarily affecting the functioning of the substance itself | OTHER |
BLOCKER | Negatively affects the normal functioning of an ion channel e.g., prevents or reduces transport of ions through the channel | NEGATIVE MODULATOR |
CHELATING AGENT | Binds to a metal, reducing its availability for further interactions | OTHER |
CROSS-LINKING AGENT | Induces cross-linking of proteins or nucleic acids | OTHER |
DEGRADER | Binds to or antagonizes a receptor, leading to its degradation | NEGATIVE MODULATOR |
DISRUPTING AGENT | Destabilises or disrupts a protein complex, macromolecular assembly, cell membrane etc | OTHER |
HYDROLYTIC ENZYME | Hydrolyses a substrate through enzymatic reaction | OTHER |
INHIBITOR | Negatively affects (inhibits) the normal functioning of the protein e.g., prevention of enzymatic reaction or activation of downstream pathway | NEGATIVE MODULATOR |
INVERSE AGONIST | Binds to and inactivates a receptor | NEGATIVE MODULATOR |
METHYLATING AGENT | Methylates or participates in methylation (e.g., through donation of a methyl group) of a substrate molecule | OTHER |
MODULATOR | Affects the normal functioning of a protein in some way e.g., mixed agonist/antagonist or unclear whether action is positive or negative | OTHER |
NEGATIVE ALLOSTERIC MODULATOR | Reduces or prevents the action of the endogenous ligand of a receptor through binding to a site distinct from that ligand (non-competitive inhibition) | NEGATIVE MODULATOR |
NEGATIVE MODULATOR | Negatively affects the normal functioning of a protein e.g., receptor antagonist, inverse agonist or negative allosteric modulator | NEGATIVE MODULATOR |
OPENER | Positively affects the normal functioning of an ion channel e.g., facilitates transport of ions through the channel | POSITIVE MODULATOR |
OTHER | Other action type, not clearly positively or negatively affecting the normal functioning of a protein e.g., chelation of substances, hydrolysis of substrate | OTHER |
OXIDATIVE ENZYME | Oxidises a substrate through enzymatic reaction |