# Common data issues

## Placeholder values

You don't generally need to use placeholder values. Very few fields are mandatory, and so you can leave a field empty if necessary.

## Default values and RIDXs

All SRC\_IDs are associated with a single default document ID (DOC\_ID); the RIDX associated with this default DOC\_ID is always named 'default'. The default DOC\_ID definition can be manually edited by the ChEMBL administrator in consultation with the depositor.

The database enforces a unique SRC\_ID / RIDX combination. Because of this constraint, a ‘null’ value cannot be assigned if either of these fields is left blank. Therefore, ‘default’ is used for the RIDX field when no RIDX has been specified.

The default RIDX and associated DOC\_ID are used for incoming data where an RIDX has not been specified during loading, or where the RIDX in the ACTIVITY file doesn't match the RIDX used in the ASSAY, REFERENCE and COMPOUND\_RECORD files. This can happen because a depositor decided not to associate one of their RIDXs with one of their CIDXs, AIDXs or activities, or because of an error in file formatting.

Attempting to create or edit an RIDX called ‘default’ will produce an error, although ‘default’ can be assigned to the RIDX field in the ACTIVITY file. Leaving the RIDX field blank, or omitting the field altogether, has the same effect.
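The fallback described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the loader's actual code: the function name `resolve_ridx` and the example RIDX values are hypothetical.

```python
DEFAULT_RIDX = "default"  # reserved name described above


def resolve_ridx(activity_ridx, known_ridxs):
    """Return the RIDX an ACTIVITY row would end up attached to.

    activity_ridx: value of the RIDX field in the ACTIVITY row (may be None or blank)
    known_ridxs:   RIDXs defined for this source (hypothetical example values below)
    """
    ridx = (activity_ridx or "").strip()
    # A blank or unspecified RIDX falls back to the default.
    if not ridx:
        return DEFAULT_RIDX
    # An RIDX that does not match a defined RIDX also falls back to the default.
    if ridx not in known_ridxs:
        return DEFAULT_RIDX
    return ridx


known = {"paper_2021", "paper_2022"}
print(resolve_ridx("paper_2021", known))  # paper_2021
print(resolve_ridx("", known))            # default
print(resolve_ridx(None, known))          # default
print(resolve_ridx("papr_2021", known))   # default (typo, so no match)
```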

## Sources with multiple depositors

* It is possible for a source to have multiple depositors, for example a large study that tests the same panel of drugs at several labs on several sites.
* The clearest example of this is Source 52, which contains SARS-CoV-2 assay data from multiple depositors.
* The depositors can share RIDXs and CIDXs, but the AIDX must be unique for each assay deposited.
* To avoid duplication within a source, we recommend that each site, research group, etc. adds a prefix to the AIDX. The AIDX field is a 200-character field, so this can easily be a name.
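As a rough illustration of the prefixing recommendation, the sketch below builds a prefixed AIDX and checks it against the 200-character limit. The helper name `prefixed_aidx`, the underscore separator and the example site names are assumptions made for this example, not a required format.

```python
AIDX_MAX_LENGTH = 200  # field length stated above


def prefixed_aidx(site_prefix, local_aidx, separator="_"):
    """Combine a site or group prefix with a locally unique assay ID."""
    aidx = f"{site_prefix}{separator}{local_aidx}"
    if len(aidx) > AIDX_MAX_LENGTH:
        raise ValueError(
            f"AIDX is {len(aidx)} characters; the limit is {AIDX_MAX_LENGTH}"
        )
    return aidx


# Two sites can reuse the same local assay name without clashing.
print(prefixed_aidx("SiteA", "cytotox_01"))  # SiteA_cytotox_01
print(prefixed_aidx("SiteB", "cytotox_01"))  # SiteB_cytotox_01
```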

## ASSAY\_ORGANISM and TARGET\_ORGANISM

#### If the assay only involves one organism, or biological entities from one organism

When all the biological components of an assay have been sourced from the same organism, this organism is the ASSAY\_ORGANISM. For example, a human protein in human cells, or an isolated human protein in a binding assay.

#### If the assay involves biological entities from two organisms

Where there are biological components from more than one organism, the TARGET\_ORGANISM is the organism that the researchers were measuring the effect of the compound on. In the case of a protein or other subcellular target, it's the organism the target biological entity is naturally produced by.  If antiviral effect is being tested in a host cell line or organism, the virus is the target biological entity.

The ASSAY\_ORGANISM is a 'host' organism that is used as part of the assay, but is not the target of the compound. In these examples, the researchers were not measuring the effect of the chemical on the ASSAY\_ORGANISM as the primary endpoint.

* **Human** protein in **human** cells
  * **Assay\_organism** =  Homo sapiens
* Recombinant **human** protein expressed in **mouse** cells.
  * **Target organism** = Homo sapiens
  * **Assay\_organism** = Mus musculus
* Antibacterial activity against ***E. coli*** in an infected **rat**.
  * **Target organism** = Escherichia coli
  * **Assay\_organism** = Rattus rattus
* Recombinant **human** protein expressed in **mouse** cells. Xenografted in **mouse**.
  * **Target organism** = Homo sapiens
  * **Assay\_organism** = Mus musculus
* Antiviral activity against **HIV1** measured in **human** cells.
  * **Target organism** = Human immunodeficiency virus 1
  * **Assay\_organism** = Homo sapiens
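The examples above follow a simple pattern, sketched below in Python. The function `organism_fields` and its argument names are hypothetical. For the single-organism case the sketch only fills ASSAY\_ORGANISM, mirroring the first example.

```python
def organism_fields(measured_on, host=None):
    """Suggest organism fields for an assay.

    measured_on: organism the compound's effect was measured on
                 (or that naturally produces the target)
    host:        organism used as part of the assay but not targeted, if any
    """
    # Only one organism involved: it is the ASSAY_ORGANISM.
    if host is None or host == measured_on:
        return {"ASSAY_ORGANISM": measured_on}
    # Two organisms: the host is the ASSAY_ORGANISM, the other is the target.
    return {"TARGET_ORGANISM": measured_on, "ASSAY_ORGANISM": host}


print(organism_fields("Homo sapiens"))
print(organism_fields("Homo sapiens", host="Mus musculus"))
print(organism_fields("Human immunodeficiency virus 1", host="Homo sapiens"))
```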

## ASSAY\_PARAM rules

If a deposition contains an ASSAY\_PARAM file with multiple rows for the AIDX ‘abc’, and the AIDX ‘abc’ already exists in the database with the same SRC\_ID, or in the ASSAY file in the same deposition, then these parameters will be loaded against the ‘abc’ assay, **replacing** all previous parameters loaded against ‘abc’.

If the same ASSAY\_PARAM file contains a single record for AIDX ‘def’ with empty values in the ‘TYPE’ and ‘VALUE’ columns, where ‘def’ has been used in a previous deposition to that SRC\_ID, then all existing parameters associated with ‘def’ will be deleted from the database, and no new ones created.

If in the same deposition the ASSAY file defines or redefines AIDX ‘ghi’, but no records exist for ‘ghi’ in the ASSAY\_PARAM file:

* If ‘ghi’ does not already exist, then it will be created with no parameters.
* If it already exists, then it will be updated, but any existing secondary parameters associated with ‘ghi’ in the ASSAY\_PARAMETER table will not be affected. This behaviour prevents the inadvertent deletion of secondary data when an assay is updated.
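The three cases above can be modelled with a small in-memory simulation. This is a sketch of the rules as described, not the loader itself: the function `apply_deposition`, the dictionary layout and the parameter names (TEMP, PH, DOSE) are all invented for illustration, and it does not model validation errors.

```python
def apply_deposition(existing_params, assay_file_aidxs, assay_param_rows):
    """Return the parameters per AIDX after a deposition.

    existing_params:  dict of AIDX -> list of (TYPE, VALUE) already in the database
    assay_file_aidxs: AIDXs defined or redefined in the ASSAY file
    assay_param_rows: list of (AIDX, TYPE, VALUE) rows from the ASSAY_PARAM file
    """
    result = {aidx: list(params) for aidx, params in existing_params.items()}

    # A new assay with no ASSAY_PARAM rows is created with no parameters;
    # an existing assay keeps the parameters it already has.
    for aidx in assay_file_aidxs:
        result.setdefault(aidx, [])

    # Group the incoming rows by AIDX.
    incoming = {}
    for aidx, param_type, value in assay_param_rows:
        incoming.setdefault(aidx, []).append((param_type, value))

    # Any AIDX present in the ASSAY_PARAM file has its parameters replaced.
    # A row with empty TYPE and VALUE contributes nothing, so a single empty
    # row deletes all existing parameters for that AIDX.
    for aidx, rows in incoming.items():
        result[aidx] = [(t, v) for t, v in rows if t and v]

    return result


existing = {
    "abc": [("TEMP", "25")],
    "def": [("PH", "7.4")],
    "ghi": [("DOSE", "10")],
}
rows = [("abc", "TEMP", "37"), ("abc", "PH", "7.0"), ("def", "", "")]
after = apply_deposition(existing, ["ghi", "jkl"], rows)

print(after["abc"])  # [('TEMP', '37'), ('PH', '7.0')]  replaced
print(after["def"])  # []                               deleted
print(after["ghi"])  # [('DOSE', '10')]                 untouched
print(after["jkl"])  # []                               new assay, no parameters
```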

## Updating compound structures

If a COMPOUND\_CTAB file contains a record for a new, unique CIDX ‘str1’, then that CTAB defines the structure.  A MOLREGNO, a unique compound identifier, will be assigned to the new structure.

If the CTAB for ‘str1’ is empty, then ALL records in the COMPOUND\_RECORDS table which contain ‘str1’ will be automatically assigned a new blank MOLREGNO with no structure assigned to it in ChEMBL.

If ‘str1’ already exists in the database, or in the COMPOUND\_RECORD file in the same deposition, then the old CTAB for ‘str1’ will be replaced with the new structure, and ALL records in the COMPOUND\_RECORDS table which contain ‘str1’ will be automatically reassigned a new MOLREGNO for this new structure.

Conversely, if the COMPOUND\_CTAB file does not contain ‘str1’ at all, but the COMPOUND\_RECORD file does, then the records for ‘str1’ in the COMPOUND\_RECORD table will be updated but the structure and the MOLREGNO for the record will not be changed.

Thus, the only way to update the structure for CIDX ‘str1’ from a previously deposited structure to no structure is to deposit a COMPOUND\_CTAB file with a record containing a CIDX field ‘str1’, and an empty CTAB.
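The rules in this section reduce to a three-way decision per CIDX, sketched below. The function name `structure_outcome`, the outcome labels and the placeholder CTAB text are assumptions for illustration. The sketch does not model how MOLREGNOs are actually allocated.

```python
def structure_outcome(ctab_records, cidx):
    """Describe what happens to the structure for one CIDX.

    ctab_records: dict of CIDX -> CTAB text from the COMPOUND_CTAB file
                  (an empty string stands for an empty CTAB)
    """
    # CIDX absent from the COMPOUND_CTAB file: structure and MOLREGNO are kept.
    if cidx not in ctab_records:
        return "unchanged"
    # CIDX present with an empty CTAB: the structure is removed.
    if not ctab_records[cidx].strip():
        return "cleared"
    # CIDX present with a CTAB: the structure is set or replaced.
    return "replaced"


ctab_file = {
    "str1": "",                     # empty CTAB
    "str2": "<molfile text here>",  # placeholder for a real CTAB
}
print(structure_outcome(ctab_file, "str1"))  # cleared
print(structure_outcome(ctab_file, "str2"))  # replaced
print(structure_outcome(ctab_file, "str3"))  # unchanged
```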

## Illegal DDI values

Some values are not permitted for AIDX/RIDX/CIDX. Currently these are:

* CR0
* default
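A quick pre-submission check for these reserved values might look like the sketch below. The function name `find_illegal_ids` is hypothetical, and the comparison is an exact, case-sensitive match, which is an assumption: the loader's own matching rules are not described here.

```python
ILLEGAL_IDS = {"CR0", "default"}  # values listed above


def find_illegal_ids(ids):
    """Return the reserved values found among the IDs a depositor defines."""
    # Exact, case-sensitive comparison (an assumption for this sketch).
    return sorted(set(ids) & ILLEGAL_IDS)


print(find_illegal_ids(["assay_1", "default", "assay_2"]))  # ['default']
print(find_illegal_ids(["cmpd_1", "cmpd_2"]))               # []
```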

## Unpublished data

You can send us unpublished datasets and we'll assign a DOI. If the associated manuscript is later accepted for publication, send us the journal details and we'll reassign the DOI to the publication.

If you already have a DOI but the work is not yet published, send us all the details.

If you need a reviewer to be able to see the data during review, then contact us and we can discuss hosting.

## Incomplete datasets

**The COMPOUND\_CTAB file contains CIDXs that are not in COMPOUND\_RECORD or not in ACTIVITY**

* If this results in an ACTIVITY with no legal COMPOUND\_RECORD entry, it will cause a CIDX/CRIDX error.
* If there is a CIDX in the CTAB that is not represented in the COMPOUND\_RECORD, we get *"CIDX is a self-referencing ID in this file, so must exist in accompanying corresponding primary file or DB"*
* Under normal operation data will not load into the database if a CIDX in the COMPOUND\_CTAB does not exist in either the accompanying COMPOUND\_RECORD file or in the database.

**The COMPOUND\_CTAB file does not contain all the structures in COMPOUND\_RECORDS**

* This will load correctly, but the compounds will be missing structures unless a later load using this dataset is done using the CTAB and REFERENCE file.
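Both situations can be spotted before submission by comparing the sets of CIDXs in each file. The sketch below assumes the CIDXs have already been read into Python collections. The function `check_cidx_coverage` and the result keys are invented for this example.

```python
def check_cidx_coverage(ctab_cidxs, record_cidxs, db_cidxs=()):
    """Compare CIDXs across the COMPOUND_CTAB and COMPOUND_RECORD files.

    db_cidxs: CIDXs already in the database for this source, if known.
    """
    ctab, record, db = set(ctab_cidxs), set(record_cidxs), set(db_cidxs)
    return {
        # In the CTAB but in neither COMPOUND_RECORD nor the database:
        # expected to stop the load.
        "ctab_without_record": sorted(ctab - record - db),
        # In COMPOUND_RECORD but not in the CTAB: loads, but no structure
        # is supplied for these compounds in this deposition.
        "record_without_ctab": sorted(record - ctab),
    }


report = check_cidx_coverage(
    ctab_cidxs=["c1", "c2", "c9"],
    record_cidxs=["c1", "c2", "c3"],
)
print(report["ctab_without_record"])  # ['c9']
print(report["record_without_ctab"])  # ['c3']
```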

### Compound normalisation

A compound normalisation procedure is run separately from loading processes, and includes an assessment of the quality of the drawn structures.

Standard values are rounded to three significant figures, or to two decimal places for values > 10.
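The rounding rule can be illustrated with Python's `decimal` module. This is a sketch under assumptions: the function name `round_standard_value` is hypothetical, and half-up rounding and applying the rule to the absolute value are choices made for this example, since the rounding mode and the handling of negative values are not specified here.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_standard_value(value):
    """Round to 2 decimal places above 10, otherwise to 3 significant figures."""
    d = Decimal(str(value))
    if d == 0:
        return d
    if abs(d) > 10:
        # Values above 10: two decimal places.
        return d.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # adjusted() is the exponent of the most significant digit, so keeping
    # three significant figures means rounding at that exponent minus 2.
    quantum = Decimal(1).scaleb(d.adjusted() - 2)
    return d.quantize(quantum, rounding=ROUND_HALF_UP)


print(round_standard_value(123.4567))  # 123.46
print(round_standard_value(5.6789))    # 5.68
print(round_standard_value(0.012345))  # 0.0123
```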

Depositors will be contacted by the ChEMBL team if there are issues with the structures they have supplied.
