Supplementary data in the ACTIVITY_SUPP file

Traditionally, users have been directed to supplementary and supporting data held outside ChEMBL through information contained in the REFERENCE file. This mechanism still exists, but it is now also possible to deposit such supporting or supplementary data in ChEMBL itself.

An example of this might be the individual percentage inhibition points on a curve where an IC50 value has been quoted in the ACTIVITY file. Supporting data such as these are stored in the ACTIVITY_SUPP table and require the introduction of two new identifiers: record grouping ID (REGID) and supplementary activity mapping ID (SAMID).

The ACTIVITY_SUPP file

This table is used for the bulk storage of raw data relating to ACTIVITY values. It holds both the data used directly to calculate the ACTIVITY value and miscellaneous data from the same experiment that were not used in the calculation but may be of interest to the specialist user.

REGID and SAMID identifiers.

A REGID is used to cluster or group together related records in the ACTIVITY_SUPP file, just as the TEOID is used in the ACTIVITY table. An SAMID usually refers to a single specific measurement like a time point, rather than a group, animal or well.

REGID is a required field when an ACTIVITY_SUPP file is included and must contain a positive integer. The REGID only has meaning within the deposition, so values (such as 1, 2, 3, etc.) can be re-used between depositions. When preparing data for loading it can be useful to think of REGIDs as 'row numbers' in a 2D table of data.

The ACTIVITY_SUPP_MAP file acts as an intermediate file to connect data in the ACTIVITY file, represented by ACT_IDs, to data in the ACTIVITY_SUPP file, represented by SAMIDs. In this way it is possible to map which records in the ACTIVITY_SUPP table are supporting evidence for which records in the ACTIVITY table. SAMIDs are mandatory in the ACTIVITY_SUPP_MAP file, but the SAMID field in ACTIVITY_SUPP can be left as null if the supplementary record does not point directly to any particular ACTIVITY value.
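The mapping described above can be pictured with a small sketch. It is illustrative only: the records are invented, the files are modelled as lists of dictionaries, and the function name supporting_records is hypothetical. Only the field names ACT_ID, SAMID, REGID, TYPE and VALUE come from this guide; the real files may contain further columns.

```python
# Hypothetical ACTIVITY_SUPP records (invented values, for illustration only).
activity_supp = [
    {"REGID": 1, "SAMID": 101, "TYPE": "INHIBITION_PCT", "VALUE": 12.0},
    {"REGID": 2, "SAMID": 102, "TYPE": "INHIBITION_PCT", "VALUE": 48.5},
    # A record that does not point at any particular ACTIVITY value.
    {"REGID": 2, "SAMID": None, "TYPE": "PLATE_TEMP_C", "VALUE": 37.0},
]

# Hypothetical ACTIVITY_SUPP_MAP records linking ACT_IDs to SAMIDs.
activity_supp_map = [
    {"ACT_ID": "ACT_1", "SAMID": 101},
    {"ACT_ID": "ACT_1", "SAMID": 102},
]


def supporting_records(act_id, supp_map, supp):
    """Return the ACTIVITY_SUPP records mapped to one ACT_ID."""
    # Collect every SAMID that the map file associates with this ACT_ID.
    samids = {m["SAMID"] for m in supp_map if m["ACT_ID"] == act_id}
    # Keep the supplementary records carrying one of those SAMIDs.
    # Records with a null SAMID are never matched here.
    return [r for r in supp if r["SAMID"] in samids]


evidence = supporting_records("ACT_1", activity_supp_map, activity_supp)
```

In this sketch the two percentage inhibition records are returned as evidence for ACT_1, while the null-SAMID record is reachable only through the REGID it shares with a mapped record.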

Complex result sets as a 2D matrix.

For many scientists one of the most convenient ways of browsing the values in a highly complex result set is in the form of a 2D matrix, such as a spreadsheet. Indeed, this is the most common format for complex result sets presented to ChEMBL administrators for loading into ChEMBL.

The ACTIVITY_SUPP table can be thought of as a transformation of a 2D matrix where REGIDs represent rows, the TYPEs are the column headers, and the VALUEs are the cells. In fact, one of the best ways of creating the ACTIVITY_SUPP file can be to start with a 2D matrix of all the data to be loaded, and transform these data into the ACTIVITY_SUPP file with a script. By doing this, reconstructing the original data set is straightforward when such data are subsequently exported from ChEMBL.
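A minimal sketch of such a transformation script is given here, under stated assumptions: the matrix is held in memory as a header list plus a list of rows, the TYPE names and values are invented, and the function names are hypothetical. A real ACTIVITY_SUPP file is likely to need additional columns that are not shown.

```python
def matrix_to_activity_supp(header, rows, first_regid=1):
    """Unpivot a 2D matrix into one (REGID, TYPE, VALUE) record per cell."""
    records = []
    for offset, row in enumerate(rows):
        regid = first_regid + offset  # one REGID per matrix row
        for type_name, value in zip(header, row):
            if value is None or value == "":
                continue  # empty cells produce no record
            records.append({"REGID": regid, "TYPE": type_name, "VALUE": value})
    return records


def activity_supp_to_matrix(records):
    """Pivot (REGID, TYPE, VALUE) records back into a header and rows."""
    header = []
    by_regid = {}
    for rec in records:
        if rec["TYPE"] not in header:
            header.append(rec["TYPE"])  # keep first-seen column order
        by_regid.setdefault(rec["REGID"], {})[rec["TYPE"]] = rec["VALUE"]
    rows = [[by_regid[regid].get(t) for t in header] for regid in sorted(by_regid)]
    return header, rows


# Hypothetical spreadsheet: column headers become TYPEs, rows become REGIDs.
header = ["CONC_UM", "INHIBITION_PCT"]
rows = [[0.1, 12.0], [1.0, 48.5], [10.0, 91.0]]

records = matrix_to_activity_supp(header, rows)
rebuilt_header, rebuilt_rows = activity_supp_to_matrix(records)
```

Because each record keeps its REGID and TYPE, pivoting the records back recovers the original rows and columns, which is the property that makes exported data easy to reconstruct.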

Integrity Constraints

For data to be loaded correctly, the following integrity constraints must be imposed on the deposition files:

  • All SAMIDs used in a deposition must be present at least once in both ACTIVITY_SUPP and ACTIVITY_SUPP_MAP.

  • All ACTIVITY_SUPP records where SAMID is 'null' must have a REGID that is also present in at least one ACTIVITY_SUPP record where SAMID is non-null.
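The two constraints above could be checked before submission with a short script. The following is a sketch only, not the validation performed by ChEMBL: it assumes the two files have already been read into lists of dictionaries keyed by the field names used in this guide, and the function name check_integrity and the sample records are hypothetical.

```python
def check_integrity(activity_supp, activity_supp_map):
    """Return a list of messages for violations of the two constraints."""
    errors = []

    # Constraint 1: every SAMID must appear in both files.
    supp_samids = {r["SAMID"] for r in activity_supp if r["SAMID"] is not None}
    map_samids = {m["SAMID"] for m in activity_supp_map}
    for samid in sorted(supp_samids - map_samids, key=str):
        errors.append(f"SAMID {samid} is in ACTIVITY_SUPP but not in ACTIVITY_SUPP_MAP")
    for samid in sorted(map_samids - supp_samids, key=str):
        errors.append(f"SAMID {samid} is in ACTIVITY_SUPP_MAP but not in ACTIVITY_SUPP")

    # Constraint 2: a null-SAMID record must share its REGID with at least
    # one record whose SAMID is non-null.
    anchored_regids = {r["REGID"] for r in activity_supp if r["SAMID"] is not None}
    for r in activity_supp:
        if r["SAMID"] is None and r["REGID"] not in anchored_regids:
            errors.append(f"REGID {r['REGID']} has a null-SAMID record but no non-null SAMID")

    return errors


# Hypothetical deposition that satisfies both constraints.
good_supp = [
    {"REGID": 1, "SAMID": 101},
    {"REGID": 1, "SAMID": None},
    {"REGID": 2, "SAMID": 102},
]
good_map = [{"ACT_ID": "ACT_1", "SAMID": 101}, {"ACT_ID": "ACT_1", "SAMID": 102}]

# Hypothetical deposition that breaks each constraint once.
bad_supp = [{"REGID": 1, "SAMID": 101}, {"REGID": 2, "SAMID": None}]
bad_map = [{"ACT_ID": "ACT_1", "SAMID": 101}, {"ACT_ID": "ACT_1", "SAMID": 999}]

good_errors = check_integrity(good_supp, good_map)
bad_errors = check_integrity(bad_supp, bad_map)
```

The sketch covers only the two constraints listed in this section. Other requirements, such as REGID being a positive integer, would need separate checks.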
