ChEMBL Data Deposition Guide

File types and names


Last updated 2 years ago

For ease of data entry, files can initially be formatted as spreadsheets. File names must match those specified in the Input file structure section: ASSAY, ACTIVITY, COMPOUND_RECORD, etc. Each file should be laid out so that the fields outlined in the Field names and data types sections are the columns and each data point is a row.

Files must be ASCII or UTF-8 encoded, i.e. they must not use any characters outside one of these sets. For example, micromolar units can be written as uM.
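The encoding requirement can be checked before submission. The following is a minimal sketch, not part of any ChEMBL tooling: the function name `check_encoding` is hypothetical, and it only classifies the raw bytes of a file as ASCII, UTF-8, or neither.

```python
def check_encoding(data: bytes) -> str:
    """Classify raw file content as 'ascii', 'utf-8', or 'invalid'.

    Hypothetical helper for illustration; it is not a ChEMBL tool.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Valid ASCII is also valid UTF-8, so this branch is only
        # reached when the content holds non-ASCII bytes.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "invalid"
```

A file can be checked by passing its bytes, for instance `check_encoding(open("ACTIVITY.tsv", "rb").read())`, where the `.tsv` extension is an assumption. A result of "invalid" suggests the file was saved in another encoding by the spreadsheet program and should be re-exported.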

When data entry is complete, files should be saved as TSV (tab-separated values) files, except for the INFO.txt file and the COMPOUND_CTAB.sdf file. Depositors should then create a zip or tar archive of the files and email it to the ChEMBL team.
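For depositors who prepare files with a script rather than a spreadsheet program, the save-and-archive step might look like the sketch below. It uses only the Python standard library. The helper names `write_tsv` and `make_zip` are hypothetical, and the `.tsv` file extension is an assumption; the required columns for each file are those defined in the field name sections of this guide.

```python
import csv
import zipfile
from pathlib import Path


def write_tsv(path, header, rows):
    """Write a header row and data rows as a tab-separated, UTF-8 file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def make_zip(archive_path, file_paths):
    """Bundle the given files into a zip archive, stored without directories."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file_path in file_paths:
            # arcname keeps only the file name, so ASSAY.tsv sits at the archive root.
            archive.write(file_path, arcname=Path(file_path).name)
```

As an illustration, `write_tsv("ASSAY.tsv", ["AIDX", "ASSAY_DESCRIPTION"], rows)` followed by `make_zip("Cambridge-ITC-17-January-2023.zip", ["ASSAY.tsv"])` would produce an archive ready to email; the two columns shown are placeholders, not a complete ASSAY file.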

The name of this archive should be unique and meaningful, so that the ChEMBL team can quickly and easily recognise each deposition. It is good practice to include the name of the submitting group or consortium, the dataset name or type of data, and the date on which the data is submitted to ChEMBL, e.g. Cambridge-ITC-17-January-2023.
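The naming pattern in the example can be generated consistently with a small helper. This is only a sketch of one possible convention: the function `archive_name` and its hyphen-separated layout are assumptions modelled on the example name, not a format required by ChEMBL.

```python
from datetime import date

# English month names are spelled out here so the result does not
# depend on the locale settings of the machine running the script.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def archive_name(group: str, dataset: str, submitted: date) -> str:
    """Build a name of the form Group-Dataset-Day-Month-Year (hypothetical helper)."""
    month = MONTHS[submitted.month - 1]
    return f"{group}-{dataset}-{submitted.day}-{month}-{submitted.year}"
```

Calling `archive_name("Cambridge", "ITC", date(2023, 1, 17))` reproduces the example name Cambridge-ITC-17-January-2023.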

The loading of a collection of files for a single deposition is called a job. Data submitted by registered depositors is loaded into a test database. If loading is successful, the collection of files is given a 'job_id' and the submitter is sent a load report.

Issues with the data may cause the load to fail. If this is the case the ChEMBL team will contact the submitter to discuss the problem(s) with the data.

When data is submitted by an individual or group that is not already registered, we will consider whether the data is appropriate for ChEMBL and get in contact.
