File types and names

For ease of data entry, files can initially be formatted as spreadsheets. File names must match those specified in the Input file structure section: ASSAY, ACTIVITY, COMPOUND_RECORD, etc. Each file should be laid out so that the fields outlined in the Field names and data types sections are columns and each data point is a row.

Files must be ASCII or UTF-8 encoded, i.e. they must not contain any characters outside these character sets. For example, uM can be used as a plain-text substitute for the micromolar unit symbol.
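The encoding requirement above can be checked before submission. The following is a minimal sketch, not part of the ChEMBL tooling; the function names classify_encoding and classify_file are hypothetical and chosen for illustration only.

```python
from pathlib import Path


def classify_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8' or 'invalid'."""
    # Pure ASCII is a subset of UTF-8, so test for it first.
    if data.isascii():
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid"
    return "utf-8"


def classify_file(path) -> str:
    """Read a file as bytes and classify its encoding (hypothetical helper)."""
    return classify_encoding(Path(path).read_bytes())
```

A file reported as 'invalid' was probably saved in a legacy encoding such as Latin-1 and would need to be re-saved as UTF-8 before submission.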

When data entry is complete, files should be saved as TSV (tab-separated values) files, except for the INFO.txt file and the COMPOUND_CTAB.sdf file. Depositors should then create a zip or tar archive of the files and email it to the ChEMBL team.
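As an illustration of the saving and archiving steps, the sketch below writes rows to a tab-separated file and bundles files into a zip archive using only the Python standard library. The helper names write_tsv and build_archive are hypothetical, the .tsv file extension is an assumption, and the column names in the usage note are placeholders rather than real ChEMBL field names.

```python
import csv
import zipfile
from pathlib import Path


def write_tsv(path, fieldnames, rows):
    """Write a list of dicts to a UTF-8 encoded, tab-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)


def build_archive(archive_path, files):
    """Create a zip archive holding the given files at its top level."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file_path in files:
            # Store only the file name, not the full local directory path.
            archive.write(file_path, arcname=Path(file_path).name)
```

For example, write_tsv("ASSAY.tsv", ["FIELD_A", "FIELD_B"], [{"FIELD_A": "1", "FIELD_B": "x"}]) followed by build_archive("deposition.zip", ["ASSAY.tsv"]) would produce an archive with one TSV file in it. INFO.txt and COMPOUND_CTAB.sdf can be passed to build_archive unchanged.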

The name of this archive folder should be unique and meaningful, so that depositions can be quickly and easily recognised by the team at ChEMBL. It is good practice to include the name of the submitting group or consortium, the dataset name or type of data in the folder, and the date on which the data is submitted to ChEMBL, e.g. Cambridge-ITC-17-January-2023.
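A name following this pattern could be assembled programmatically. The sketch below is one possible approach, not a ChEMBL requirement; archive_name is a hypothetical helper, and writing the day without a leading zero is an assumption based on the single example above.

```python
from datetime import date

# English month names are spelled out so the result does not depend on locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def archive_name(group: str, dataset: str, submitted: date) -> str:
    """Join group, dataset and submission date into a hyphenated name."""
    parts = [
        group,
        dataset,
        str(submitted.day),
        MONTHS[submitted.month - 1],
        str(submitted.year),
    ]
    # Replace any internal whitespace with hyphens to keep the name compact.
    return "-".join("-".join(part.split()) for part in parts)
```

Calling archive_name("Cambridge", "ITC", date(2023, 1, 17)) returns Cambridge-ITC-17-January-2023, matching the example above.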

The loading of a collection of files for a single deposition is called a job. Data submitted by registered depositors is loaded into a test database. If loading is successful, the collection of files is given a 'job_id' and the submitter will be sent a load report.

Issues with the data may cause the load to fail. If this is the case, the ChEMBL team will contact the submitter to discuss the problem(s) with the data.

When data is submitted by an individual or group that is not already registered, we will consider whether the data is appropriate for ChEMBL and get in contact.
