ChEMBL Data Deposition Guide

Complex results sets



ChEMBL has always handled simple forms of bioactivity data well, for example 'cpdX has affinity (Ki) of Y uM in assay Z.' Such data is represented by a single row in an ACTIVITY file, such as those shown in the example dataset.

However, there is a growing need for more complex data to be captured within ChEMBL. Examples of more complex data could include:

  • Assays where multiple different result types are required to describe the activity of the test compound, not just a single value as in the example above.

  • Data where different values of the same result type must be qualified or contextualised to convey the result faithfully.

  • A need to provide supporting or supplementary data, and not simply a reference to a literature paper.

To describe the model, we discuss datasets with multiple result types and supplementary data. We consider them separately, although in practice many complex assays require a combination of strategies to be properly formatted.

Other practicalities

A practical consideration for us at ChEMBL is that our resources are limited, so we will not always have time to understand extremely complex assays as well as the creators of the data do. Depositors of complex result sets often ask us to reformat the data for loading. For this reason, our process may involve asking depositors to:

  1. Provide a 2D matrix (spreadsheet) of the data to be loaded.

  2. Identify the minimal self-defining, self-referencing 'units' or 'sets' within the data, including both test and control data.

  3. Identify the top-level result types from the set, singly or in combination. Depositors should also advise on how these should be averaged, if necessary, and what properties may be associated with them, for example test concentrations, age ranges or Hill slopes.

  4. Advise on how the set(s) may be pivoted to one or more top-level result types.
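
As a rough illustration of steps 1 to 4, the sketch below pivots a small 2D matrix into one row per top-level result type. It is not the ChEMBL loading process: the column names (`IC50_uM`, `Hill_slope`, `SET_ID`, `TYPE`, `VALUE`), the sample values and the use of an arithmetic mean for replicates are all assumptions made for the example, and `SET_ID` is a hypothetical key that simply marks which rows came from the same 'unit'.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical wide "matrix": one row per replicate measurement.
# All column names and values are illustrative assumptions.
WIDE_ROWS = [
    {"CIDX": "cpd1", "AIDX": "assay1", "IC50_uM": 1.0, "Hill_slope": 1.0},
    {"CIDX": "cpd1", "AIDX": "assay1", "IC50_uM": 3.0, "Hill_slope": 2.0},
    {"CIDX": "cpd2", "AIDX": "assay1", "IC50_uM": 10.0, "Hill_slope": 1.5},
]


def pivot_to_long(rows, id_cols, result_cols):
    """Average replicates per unit, then emit one row per result type."""
    # Group replicate values by the identifying columns (the 'unit').
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = tuple(row[col] for col in id_cols)
        for col in result_cols:
            grouped[key][col].append(row[col])

    long_rows = []
    for set_id, (key, results) in enumerate(grouped.items(), start=1):
        for col in result_cols:
            record = dict(zip(id_cols, key))
            record["SET_ID"] = set_id  # hypothetical key linking rows of one unit
            record["TYPE"] = col       # the top-level result type
            record["VALUE"] = mean(results[col])  # assumed averaging rule
            long_rows.append(record)
    return long_rows


LONG_ROWS = pivot_to_long(WIDE_ROWS, ["CIDX", "AIDX"], ["IC50_uM", "Hill_slope"])
for record in LONG_ROWS:
    print(record)
```

The averaging rule is the part most likely to differ between datasets, which is why depositors are asked to advise on it: a geometric mean, for instance, may suit some result types better than the arithmetic mean used here.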

Example