File Types, Naming, and Deposition

How to structure your deposition files and folder, and send them to us

The loading of a collection of files for a single deposition is called a job. The submitted data of registered depositors is loaded into a test database. If loading is successful, the collection of files is given a JOB_ID and the submitter is sent a load report.

Issues with the data may cause the load to fail. If this happens, the ChEMBL team will contact the submitter to discuss the problem(s) with the data.

When data is submitted by an individual or group that is not already registered, we will consider whether the data is appropriate for ChEMBL and get in contact.

File Types

For ease of data entry, files can initially be formatted as spreadsheets. File names must be the same as those specified in the Input file structure section: ASSAY, ACTIVITY, COMPOUND_RECORD, etc. The files should be formatted so that the fields outlined in the Field names and data types sections are columns and each data point is a row in the file.

Files must be ASCII or UTF-8 encoded, i.e. they must not contain characters saved in any other encoding. For example, uM can be used as an ASCII-safe substitute for micromolar units.
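A quick way to check a file before sending it is to try decoding it as UTF-8. The sketch below is only an illustration, not part of the ChEMBL loader; the function name find_encoding_problems is hypothetical. Because ASCII is a subset of UTF-8, a file that passes this check satisfies either option.

```python
def find_encoding_problems(path):
    """Return (line_number, message) pairs for lines that are not valid UTF-8.

    Hypothetical helper for illustration only. ASCII is a strict subset of
    UTF-8, so an ASCII file always passes this check as well.
    """
    problems = []
    # Read raw bytes so that decoding errors are not hidden by open().
    with open(path, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError as error:
                problems.append((line_number, str(error)))
    return problems
```

An empty list means the whole file decoded cleanly. A typical failure is a spreadsheet exported in a legacy encoding such as Latin-1, where a micro sign is stored as a single byte that is not valid UTF-8.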

When data entry is complete, files should be saved as tsv (tab-separated values) files, except for the INFO.txt file and the COMPOUND_CTAB.sdf file. Please ensure you save your tab-separated files with the file ending .tsv, not .txt. It is substantially easier to inspect your files when our software knows they are tab-separated.

Depositors should then create a zip or tar archive of the files and send it to the ChEMBL team as described in the File Transfer section.
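For depositors who prepare their files with a script rather than a spreadsheet, the saving and archiving steps could look roughly like the sketch below. It uses only the Python standard library; the function names write_tsv and zip_deposition are hypothetical, and the column names you pass in must be the ones given in the Field names and data types sections.

```python
import csv
import shutil
from pathlib import Path


def write_tsv(path, fieldnames, rows):
    """Write rows (dicts keyed by field name) to a tab-separated UTF-8 file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def zip_deposition(folder):
    """Create <folder>.zip next to the folder and return the archive path."""
    folder = Path(folder).resolve()
    archive = shutil.make_archive(
        base_name=str(folder),        # ".zip" is appended to this path
        format="zip",                 # "gztar" would give a .tar.gz instead
        root_dir=str(folder.parent),  # archive paths are relative to the parent
        base_dir=folder.name,         # so the folder name is kept in the archive
    )
    return Path(archive)
```

Keeping the folder name inside the archive means the deposition unpacks into its own descriptively named folder rather than as loose files.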

Folder Naming Rules

  • The naming of the archive folder (zip or tar) should be unique and meaningful, so that depositions can be quickly and easily recognised by the team at ChEMBL. We receive multiple datasets for every release and we want to clearly distinguish your data from similar sets.

  • Folder names should describe your institution, group, and what the data is.

  • If you are submitting multiple datasets at once, there should be a descriptively named primary folder containing all the datasets, each of which should be in a subfolder with a name that describes that particular dataset.

  • Please don't use special characters like spaces, full stops . or slashes / in folder names. These can cause problems on UNIX filesystems.

  • Underscores _ and dashes - are the preferred 'space' characters.

  • The data folder must contain only the data files specified in the Input file structure section, and optionally an additional README file.

Good folder names example:

Primary folder: Cambridge_Smith_Lab_T_gondii_09-09-2025
Subfolder 1: Cambridge_Smith_Lab_T_gondii_09-09-2025/Binding_IC50
Subfolder 2: Cambridge_Smith_Lab_T_gondii_09-09-2025/Cell_Death

Bad folder names example:

Primary folder: ChEMBL Deposition/
Subfolder 1: ChEMBL Deposition/Assay.1
Subfolder 2: Smith_Lab_T_gondii_09/09/2025
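The character rules above can be checked automatically before a deposition is sent. The following is a minimal sketch, assuming that only letters, digits, underscores and dashes are acceptable; that allowed set is a stricter reading than the rules themselves, which only name spaces, full stops and slashes as examples. The function name disallowed_characters is hypothetical.

```python
import re

# Assumption: letters, digits, underscores and dashes are the only
# acceptable characters. The rules above name spaces, full stops and
# slashes as examples of characters to avoid.
_ALLOWED = re.compile(r"[A-Za-z0-9_-]")


def disallowed_characters(name):
    """Return the sorted, de-duplicated characters in a single folder name
    (not a full path) that fall outside the allowed set."""
    return sorted(set(_ALLOWED.sub("", name)))


print(disallowed_characters("Cambridge_Smith_Lab_T_gondii_09-09-2025"))  # []
print(disallowed_characters("ChEMBL Deposition"))                        # [' ']
print(disallowed_characters("Assay.1"))                                  # ['.']
print(disallowed_characters("Smith_Lab_T_gondii_09/09/2025"))            # ['/']
```

The check is applied to one folder name at a time because a slash is a legitimate separator in a full path, but not inside a name.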

File Transfer

We now use Globus to receive data. You can log in with an institutional login, GitHub, Google, or ORCID. You can also sign up for an individual account.

By using Globus, we ensure that each depositor has an individual folder containing their most recent deposition. We no longer accept data by email, as it is often unclear which is the latest dataset that someone has sent.

If you are unable to access Globus, please contact us for support at chembl-deposition@ebi.ac.uk.
