# File Types, Naming, and Deposition

The loading of a collection of files for a single deposition is called a job. Data submitted by registered depositors is loaded into a test database. If loading is successful, the collection of files is given a JOB\_ID and the submitter is sent a load report. The JOB\_ID is an internal identifier and is not visible to end users of ChEMBL.

Issues with the data may cause the load to fail. If this happens, the ChEMBL team will contact the submitter to discuss the problem(s) with the data.

When data is submitted by an individual or group that is not already registered, we will consider whether the data is appropriate for ChEMBL and get in contact with the submitter.

### Downloadable Quick Reference Guides

* [The Deposition Overview](https://docs.google.com/document/d/1uiT8mDCvMlOkw-zk4F2CbBMPYJCg5np0Zasnfb6Bzbo/edit?tab=t.0)
* [Uploading data to Globus](https://docs.google.com/document/d/1Rwpswq4wI8RPK_fHYNwf7RKDLzJ5t-PoUn5VUfNJKJc/edit?tab=t.0#heading=h.36bsypln2f0o)
* [Deposition data checklist](https://docs.google.com/document/d/1MGuo3MCk-BP7wgBffOn1bRZ7AdSrrHUDIvFaiOr0oZI/edit#heading=h.ky9sc98p8pao)

### File Types

For ease of data entry, files can initially be formatted as spreadsheets. File names must be the same as those specified in the [Input file structure](https://chembl.gitbook.io/chembl-data-deposition-guide/file-structure) section: ASSAY, ACTIVITY, COMPOUND\_RECORD, etc. The files should be formatted so that the fields outlined in the [Field names and data types](https://chembl.gitbook.io/chembl-data-deposition-guide/file-structure/field-names-and-data-types-minimal-data-submission) sections are columns and each data point is a row in the file.

Files must be [ASCII](https://en.wikipedia.org/wiki/ASCII) or [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoded, i.e. they must not use any characters outside one of these sets. For example, `uM` can be used as a plain-ASCII substitute for micromolar units.
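A quick way to check the encoding before submitting is to decode each file as UTF-8 and list any characters that fall outside ASCII. The following is a minimal sketch rather than an official ChEMBL tool; the function name `find_non_ascii` is hypothetical.

```python
from pathlib import Path


def find_non_ascii(path):
    """Return (line_number, character) pairs for every non-ASCII character.

    Raises UnicodeDecodeError if the file is not valid UTF-8
    (for example, a file saved as Latin-1 or Windows-1252).
    """
    text = Path(path).read_bytes().decode("utf-8")
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ord(ch) > 127:  # anything beyond the 7-bit ASCII range
                hits.append((line_no, ch))
    return hits
```

An empty list means the file is pure ASCII. A non-empty list shows where characters such as the micro sign appear, so they can be reviewed or replaced.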

When data entry is complete, data files should be saved as tsv (tab-separated values) files, except for the COMPOUND\_CTAB.sdf file (Chemical Table File). Please ensure you save your tab-separated files with the file ending `.tsv`, not `.txt`. It is substantially easier to inspect your files when our software knows they are tab-separated.

This can be achieved by saving a Microsoft Excel spreadsheet as a Tab-delimited Text (.txt) file and then, after saving, changing the file extension to `.tsv`.

E.g.

<img src="https://1256054238-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fayyn7ftmmim4k0lK8GCA%2Fuploads%2F4hTn24h5dt6FTf0JBdaZ%2FScreenshot%202025-09-15%20at%2016.22.26.png?alt=media&#x26;token=831d2d22-3961-4024-97a1-7bd6d1cef039" alt="" data-size="line">

ACTIVITY.txt  →  ACTIVITY.tsv
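If there are several files to rename, the extension change can be scripted. This is a minimal sketch with a hypothetical helper name, `rename_txt_to_tsv`. It assumes that any `README` or `INFO`-prefixed file ending in `.txt` is not tab-separated data and should be left alone.

```python
from pathlib import Path

# Assumption: README and INFO-prefixed files are free text, not
# tab-separated data, so they keep their original extension.
SKIP_PREFIXES = ("README", "INFO")


def rename_txt_to_tsv(folder):
    """Rename every data file ending in .txt in `folder` to .tsv.

    Returns the new file names in sorted order.
    """
    renamed = []
    for txt in sorted(Path(folder).glob("*.txt")):
        if txt.name.startswith(SKIP_PREFIXES):
            continue
        tsv = txt.with_suffix(".tsv")
        if tsv.exists():
            # Refuse to overwrite an existing file silently.
            raise FileExistsError(f"{tsv} already exists")
        txt.rename(tsv)
        renamed.append(tsv.name)
    return renamed
```

Only the file name changes. The content must already be tab-delimited, as produced by the Excel export described above.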

Depositors should then create a zip or tar archive of the files and upload it via Globus, as described in the File Transfer section.
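The archiving step can be done with any zip or tar tool. As one illustration, the sketch below uses Python's standard `shutil.make_archive` to zip a deposition folder so that the folder name is preserved inside the archive. The helper name `make_deposition_zip` is hypothetical.

```python
import shutil
from pathlib import Path


def make_deposition_zip(folder):
    """Create <folder>.zip next to `folder` and return its path.

    The archive contains the folder itself (not just its files), so the
    descriptive folder name survives extraction.
    """
    folder = Path(folder).resolve()
    if not folder.is_dir():
        raise NotADirectoryError(folder)
    archive = shutil.make_archive(
        base_name=str(folder.parent / folder.name),  # ".zip" is appended
        format="zip",
        root_dir=str(folder.parent),
        base_dir=folder.name,
    )
    return Path(archive)
```

Passing `format="gztar"` instead would produce a `.tar.gz` archive with the same layout.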

### Folder Naming Rules

* The naming of the archive folder (zip or tar) should be unique and meaningful, so that depositions can be quickly and easily recognised by the team at ChEMBL. We receive multiple datasets for every release and we want to clearly distinguish your data from similar sets.
* **Folder** names should describe your **institution**, **group**, and **what the data is**.
* If you are submitting multiple datasets at once, there should be a descriptively named **primary folder** containing all the datasets, each of which should be in a **subfolder** with a name that describes that particular dataset.
* **Please don't use** **special characters** like spaces, full stops `.`, or slashes `/` in folder names. These can cause problems on UNIX filesystems.
* **Underscores** `_` and **dashes** `-` are the preferred 'space' characters.
* The data folder must contain only the data files specified in the [Input file structure](https://chembl.gitbook.io/chembl-data-deposition-guide/file-structure) section, and optionally an additional `README` file.
* The `README` can be used to capture information that describes the deposition.
* If you need to provide any other supporting information that will help with loading, you must prefix the file name with `INFO`.

#### **Good folder names example**

```
Primary folder: Cambridge_Smith_Lab_T_gondii_09-09-2025
Subfolder 1: Cambridge_Smith_Lab_T_gondii_09-09-2025/Binding_IC50
Subfolder 2: Cambridge_Smith_Lab_T_gondii_09-09-2025/Cell_Death
```

#### **Bad folder names example**

```
Primary folder: ChEMBL Deposition/
Subfolder 1: ChEMBL Deposition/Assay.1
Subfolder 2: Smith_Lab_T_gondii_09/09/2025
```
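The naming rules above can be checked automatically before upload. The sketch below is illustrative only: the function name `folder_name_problems` is hypothetical, and restricting names to letters, digits, underscores, and dashes is a stricter assumption than the rules state. It checks a single folder name, not a full path.

```python
import re

# Assumption: only letters, digits, '_' and '-' are treated as safe.
_ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def folder_name_problems(name):
    """Return a list of rule violations for one folder name (empty if OK)."""
    if not name:
        return ["name is empty"]
    problems = []
    if " " in name:
        problems.append("contains a space")
    if "." in name:
        problems.append("contains a full stop")
    if "/" in name:
        problems.append("contains a slash")
    # Catch any other special character not named explicitly above.
    if not problems and not _ALLOWED.fullmatch(name):
        problems.append("contains characters other than letters, digits, '_' and '-'")
    return problems
```

With this check, `Cambridge_Smith_Lab_T_gondii_09-09-2025` passes, while `ChEMBL Deposition`, `Assay.1`, and `Smith_Lab_T_gondii_09/09/2025` are each flagged.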

### File Transfer

We now use [Globus](https://www.globus.org/) to receive data ([full instructions here](https://docs.google.com/document/d/1Rwpswq4wI8RPK_fHYNwf7RKDLzJ5t-PoUn5VUfNJKJc/edit?tab=t.0#heading=h.36bsypln2f0o)). You can log in with an institutional login, GitHub, Google, or ORCID. You can also sign up for an individual account.

By using Globus, we ensure that each depositor has an individual folder containing their most recent deposition. We no longer accept data by email, as it is often unclear which is the latest dataset that someone has sent.

If you are unable to access Globus, please contact us for support at <chembl-deposition@ebi.ac.uk>.

### Embargoing

Embargoing is managed by administrators. If a data embargo is required, this should be discussed in advance with the ChEMBL administrators.
