Common data issues
Placeholder values
You don't generally need to use placeholder values. Very few fields are mandatory, and so you can leave a field empty if necessary.
Default values and RIDXs
Each src_id is associated with a single default document ID; the RIDX associated with this default document ID is always named 'default'. The default RIDX and its associated document_id are always used for incoming data unless a depositor specifies a RIDX during loading. The default doc_id definition can be manually edited by the ChEMBL administrator in consultation with the depositor.
For a given deposition, a depositor may choose not to associate one of their RIDXs with their CIDXs, AIDXs or Activities. In that case, the loader will automatically assign the ‘default’ RIDX for this SRC_ID.
The database enforces a unique SRC_ID / RIDX combination. Because of this constraint, a ‘null’ value cannot be assigned if either of these fields is left blank. Therefore, ‘default’ is used for the RIDX field when no RIDX has been specified.
Attempting to create or edit an RIDX called ‘default’ will result in an error, although ‘default’ can be assigned to the RIDX field in the ACTIVITY file. Leaving the RIDX field blank, or omitting the field altogether, has the same effect.
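The fallback described above can be illustrated with a short Python sketch. This is not the loader's code: the function name resolve_ridx and the representation of an ACTIVITY row as a dict are assumptions made for illustration only.

```python
DEFAULT_RIDX = "default"  # name of the default RIDX, as stated above


def resolve_ridx(activity_row: dict) -> str:
    """Return the RIDX to use for one ACTIVITY row (hypothetical helper).

    A missing RIDX column, a blank value and an explicit 'default'
    all resolve to the same default RIDX.
    """
    value = activity_row.get("RIDX")
    if value is None or not value.strip():
        return DEFAULT_RIDX
    return value.strip()
```

Under these assumptions, resolve_ridx({}) and resolve_ridx({"RIDX": ""}) both give 'default', while a row that names its own RIDX keeps it.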
Sources with multiple depositors
It is possible for a source to have multiple depositors, for example a large study that tests the same panel of drugs at several labs on several sites.
The clearest example of this is Source 52, which contains SARS-CoV-2 assay data from multiple depositors.
The depositors can share RIDXs and CIDXs, but the AIDX must be unique for each assay deposited.
To avoid duplication within a source, we recommend that each site, research group, etc. adds a prefix to the AIDX. The AIDX field is a 200-character field, so the prefix can easily be a name.
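One way to apply such a prefix consistently is sketched below in Python. The helper name prefixed_aidx, the underscore separator and the example prefix are assumptions; only the 200-character limit comes from the text above.

```python
MAX_AIDX_LENGTH = 200  # width of the AIDX field, as stated above


def prefixed_aidx(prefix: str, local_id: str, sep: str = "_") -> str:
    """Build an AIDX of the form <prefix><sep><local_id> (hypothetical helper).

    Raises ValueError if the result would not fit in the AIDX field.
    """
    aidx = f"{prefix}{sep}{local_id}"
    if len(aidx) > MAX_AIDX_LENGTH:
        raise ValueError(
            f"AIDX is {len(aidx)} characters; the limit is {MAX_AIDX_LENGTH}"
        )
    return aidx
```

For example, prefixed_aidx("LabA", "assay1") returns 'LabA_assay1', so two sites that both number their assays from 1 would not collide within the same source.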
Updating existing ChEMBL data using DDIs
If a depositor uses a CIDX of 'XYZ_123' to define a particular chemical structure, but after loading to ChEMBL they subsequently discover that their structure is incorrect, they can 'edit' the deposited structure by re-depositing XYZ_123 with the new correct structure. The new deposition will overwrite the older data.
A depositor can only edit data relating to their own identifiers. Thus, in the example above, if a second depositor from a different src_id owns a CIDX of 'XYZ_123', then this will remain unaltered by the edits undertaken by the first depositor.
ASSAY_PARAM rules
If a deposition contains an ASSAY_PARAM file with multiple rows for the AIDX ‘abc’, and the AIDX ‘abc’ already exists in the database with the same src_id, or in the ASSAY file in the same deposition, then these parameters will be loaded against the ‘abc’ assay, replacing all previous parameters loaded against ‘abc’.
If the same file contains a single record for AIDX ‘def’ with empty values in ‘TYPE’ and ‘VALUE’ columns, then all existing parameters associated with ‘def’ will be deleted from the database, and no new ones created.
If, in the same deposition, the ASSAY file defines or redefines AIDX ‘ghi’ but no records exist for ‘ghi’ in the ASSAY_PARAM file, the outcome depends on whether ‘ghi’ already exists. If ‘ghi’ does not already exist, it will be created with no parameters. If it does already exist, it will be updated, but any existing secondary parameters associated with ‘ghi’ in the ASSAY_PARAMETER table will not be affected. This behaviour prevents the inadvertent deletion of secondary data when an assay is updated.
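The three cases above can be summarised in a small Python model. This is a sketch of the described outcomes, not the loader itself: the function name, the dict-based stand-in for the database, and the choice to raise an error for an ASSAY_PARAM AIDX that is in neither the database nor the ASSAY file are all assumptions.

```python
from collections import defaultdict


def apply_assay_params(db_params, assay_file_aidxs, param_rows):
    """Model the ASSAY_PARAM rules (hypothetical helper).

    db_params:        dict of AIDX -> list of (TYPE, VALUE) already in the database
    assay_file_aidxs: AIDXs defined or redefined in the ASSAY file
    param_rows:       ASSAY_PARAM rows as dicts with AIDX, TYPE and VALUE keys
    Returns a new dict describing the parameters after loading.
    """
    result = {aidx: list(params) for aidx, params in db_params.items()}

    # 'ghi' case: a new assay starts with no parameters; an existing
    # assay keeps whatever parameters it already has.
    for aidx in assay_file_aidxs:
        result.setdefault(aidx, [])

    grouped = defaultdict(list)
    for row in param_rows:
        grouped[row["AIDX"]].append((row.get("TYPE") or "", row.get("VALUE") or ""))

    for aidx, rows in grouped.items():
        if aidx not in result:
            # Assumption: the text above does not describe this case.
            raise ValueError(f"AIDX {aidx!r} not in database or ASSAY file")
        if len(rows) == 1 and rows[0] == ("", ""):
            result[aidx] = []      # 'def' case: delete all, create none
        else:
            result[aidx] = rows    # 'abc' case: replace all previous parameters
    return result
```

In this model, an assay that is absent from the ASSAY_PARAM file is never touched, which mirrors the safeguard against inadvertent deletion described above.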
Updating compound structures
If a COMPOUND_CTAB file contains a record for a new, unique CIDX ‘str1’, then that CTAB defines the structure. A molregno, a unique compound identifier, will be assigned to the new structure.
In the event that the CTAB for ‘str1’ is empty, then ALL records in the COMPOUND_RECORDS table which contain ‘str1’ will be automatically assigned a new blank molregno with no structure assigned to it in ChEMBL.
If ‘str1’ already exists in the database, or in the COMPOUND_RECORD file in the same deposition, then the old CTAB for ‘str1’ will be replaced with the new structure, and ALL records in the COMPOUND_RECORDS table which contain ‘str1’ will be automatically reassigned a new molregno for this new structure.
Similarly, if the COMPOUND_CTAB file does not contain ‘str1’ at all, but the COMPOUND_RECORD file does, then the records for ‘str1’ in the COMPOUND_RECORD table will be updated but the structure and the molregno for the record will not be changed.
Thus, the only way to update the structure for CIDX ‘str1’ from a previously deposited structure to no structure is to deposit a COMPOUND_CTAB file with a record containing a CIDX field ‘str1’, and an empty CTAB.
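The structure rules above can be modelled in a few lines of Python. This is an illustrative sketch only: the function name and the dict-based stand-in for the database are assumptions, molregno assignment is not modelled, and None is used here to mean "no structure". Rejecting a CTAB entry whose CIDX is in neither the COMPOUND_RECORD file nor the database follows the requirement described under "Incomplete datasets".

```python
def apply_structures(db_structures, record_file_cidxs, ctab_records):
    """Model the COMPOUND_CTAB update rules (hypothetical helper).

    db_structures:     dict of CIDX -> CTAB string, or None for no structure
    record_file_cidxs: CIDXs present in the COMPOUND_RECORD file
    ctab_records:      dict of CIDX -> CTAB string from the COMPOUND_CTAB file
                       (an empty string means an empty CTAB)
    Returns a new dict describing the structures after loading.
    """
    result = dict(db_structures)

    # A CIDX that appears only in COMPOUND_RECORD keeps its current
    # structure; a brand-new CIDX starts with no structure.
    for cidx in record_file_cidxs:
        result.setdefault(cidx, None)

    for cidx, ctab in ctab_records.items():
        if cidx not in result:
            raise ValueError(f"CIDX {cidx!r} not in COMPOUND_RECORD file or database")
        # An empty CTAB clears the structure; otherwise the new CTAB replaces the old.
        result[cidx] = ctab if ctab.strip() else None
    return result
```

Note that the only route to "no structure" for a previously loaded CIDX in this model is an explicit entry with an empty CTAB, matching the rule stated above.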
Illegal DDI values
Some values are not permitted for AIDX/RIDX/CIDX. Currently these are:
CR0
default
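Depositors may find it useful to screen their identifiers before submission. The Python sketch below is a hypothetical pre-submission check, not part of the loader; it assumes an exact, case-sensitive match against the two values listed above. It applies to identifiers being defined; as noted earlier, ‘default’ may still appear in the RIDX field of the ACTIVITY file.

```python
RESERVED_DDI_VALUES = frozenset({"CR0", "default"})  # values listed above


def find_reserved_ids(identifiers):
    """Return the identifiers that clash with a reserved value (hypothetical helper).

    Assumption: matching is exact and case-sensitive.
    """
    return [ident for ident in identifiers if ident in RESERVED_DDI_VALUES]
```

For example, find_reserved_ids(["XYZ_123", "CR0"]) returns ['CR0'].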
Unpublished data
You can send us unpublished datasets and we'll assign a DOI. If the work is later accepted for publication, send us the journal details and we'll reassign the DOI to the publication.
If you already have a DOI but the work is not yet published, send us all the details.
If you need a reviewer to be able to see the data during review, then contact us and we can discuss hosting.
Incomplete datasets
The COMPOUND_CTAB file contains CIDXs that are not in COMPOUND_RECORD or not in ACTIVITY
If this results in an ACTIVITY with no legal COMPOUND_RECORD entry, it'll cause a CIDX/CRIDX error.
If there is a CIDX in the CTAB that is not represented in the COMPOUND_RECORD, the loader reports: "CIDX is a self-referencing ID in this file, so must exist in accompanying corresponding primary file or DB"
Under normal operation data will not load into the database if a CIDX in the COMPOUND_CTAB does not exist in either the accompanying COMPOUND_RECORD file or in the database.
The COMPOUND_CTAB file does not contain all the structures in COMPOUND_RECORDS
This will load correctly, but the compounds will be missing structures unless a later load using this dataset is done using the CTAB and REFERENCE file.
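Both situations in this section can be detected before submission by comparing the sets of CIDXs in the two files. The Python sketch below is a hypothetical pre-check, not part of the loader; the function name and its arguments are assumptions, and the CIDXs are assumed to have already been read from the files.

```python
def check_cidx_coverage(ctab_cidxs, record_cidxs, db_cidxs=()):
    """Compare CIDXs across COMPOUND_CTAB and COMPOUND_RECORD (hypothetical helper).

    Returns a pair of sorted lists:
      orphaned - CIDXs in COMPOUND_CTAB that are in neither COMPOUND_RECORD
                 nor the database (these would stop the load)
      missing  - CIDXs in COMPOUND_RECORD with no structure in this
                 COMPOUND_CTAB file (these load, but gain no structure
                 from this deposition)
    """
    ctab = set(ctab_cidxs)
    records = set(record_cidxs)
    known = set(db_cidxs)
    orphaned = sorted(ctab - records - known)
    missing = sorted(records - ctab)
    return orphaned, missing
```

An empty orphaned list means the first problem does not apply; a non-empty missing list is only a warning, since the data will still load.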