Complex result sets

ChEMBL has always handled simple forms of bioactivity data extremely well, such as 'cpdX has affinity (Ki) of Y uM in assay Z.' Such data is represented by a single row in an ACTIVITY file, such as those shown in the Example dataset.
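As a rough illustration of why this simple case fits in one row, the statement above can be modelled as a single flat record. This is only a sketch: the field names (`compound_id`, `assay_id`, `result_type`, `value`, `units`) are hypothetical and are not the actual ACTIVITY file column headers, and the numeric value is a placeholder standing in for Y.

```python
# Hypothetical single-row representation of a simple bioactivity result.
# Field names are illustrative only, not real ACTIVITY file columns.
activity_row = {
    "compound_id": "cpdX",
    "assay_id": "Z",
    "result_type": "Ki",
    "value": 0.5,   # placeholder for "Y"; not a real measurement
    "units": "uM",
}


def describe(row):
    """Render one flat record back into the sentence it encodes."""
    return (
        f"{row['compound_id']} has affinity ({row['result_type']}) "
        f"of {row['value']} {row['units']} in assay {row['assay_id']}"
    )


sentence = describe(activity_row)
```

Everything needed to state the result lives in one record, which is what stops being true for the more complex cases described next.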

However, there is a growing need for more complex data to be captured within ChEMBL. Examples of more complex data could include:

  • Assays where multiple different result types are required to describe the activity of the test compound, not just a single value as in the example above.

  • Data where different values of the same result type must be qualified or contextualised to convey the result faithfully.

  • A need to provide supporting or supplementary data, and not simply a reference to a literature paper.

To describe the model, we discuss datasets with multiple result types and supplementary data. We consider them separately, although in practice many complex assays require a combination of strategies to be properly formatted.

Other practicalities

One practical consideration is that ChEMBL has fairly limited resources, so we will not always have time to understand extremely complex assays to the same level as the creators of the data. Depositors of complex result sets often ask us to reformat the data for loading. For this reason, our process may involve asking depositors to:

  1. Provide a 2D matrix (spreadsheet) of the data to be loaded.

  2. Identify the minimal self-defining, self-referencing 'units' or 'sets' within the data, including both test and control data.

  3. Identify the top-level result types from the set, singly or in combination. Depositors should also advise on how these should be averaged, if necessary, and what properties may be associated with them, for example test concentrations, age ranges or Hill Slopes.

  4. Advise on how the set(s) may be pivoted to one or more top-level result types.
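The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ChEMBL's loading process: the column names, the choice of `compound` plus `assay` as the key that identifies a set, the use of `test_conc_uM` as an associated property, and the arithmetic mean as the averaging rule are all hypothetical. In practice the depositor would advise on each of these.

```python
from collections import defaultdict
from statistics import mean

# Step 1: a hypothetical 2D matrix, one row per replicate measurement.
# All column names and values are illustrative, not ChEMBL fields or real data.
matrix = [
    {"compound": "cpdX", "assay": "Z", "test_conc_uM": 10.0,
     "inhibition_pct": 42.0, "ic50_uM": 1.0},
    {"compound": "cpdX", "assay": "Z", "test_conc_uM": 10.0,
     "inhibition_pct": 46.0, "ic50_uM": 3.0},
]

SET_KEYS = ["compound", "assay"]                # step 2: what identifies one set (assumed)
TOP_LEVEL_TYPES = ["inhibition_pct", "ic50_uM"]  # step 3: top-level result types (assumed)
PROPERTIES = ["test_conc_uM"]                   # step 3: properties attached to results (assumed)


def pivot_to_result_types(rows):
    """Step 4: pivot wide rows into one averaged record per set and result type."""
    grouped = defaultdict(list)
    for row in rows:
        set_id = tuple(row[k] for k in SET_KEYS)
        props = tuple(row[p] for p in PROPERTIES)
        for result_type in TOP_LEVEL_TYPES:
            if row.get(result_type) is not None:
                grouped[(set_id, props, result_type)].append(row[result_type])

    records = []
    for (set_id, props, result_type), values in grouped.items():
        record = dict(zip(SET_KEYS, set_id))
        record.update(zip(PROPERTIES, props))
        record["result_type"] = result_type
        # Arithmetic mean is an assumption; the depositor advises on averaging.
        record["value"] = mean(values)
        record["n"] = len(values)
        records.append(record)
    return records


records = pivot_to_result_types(matrix)
```

Here two replicate rows collapse into two records, one per top-level result type, each carrying its set identifiers, its associated test concentration, and the number of replicates that were averaged.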
