Submission of data to UniChem
UniChem seeks to integrate chemistry-aware data sources (more background).
If you are in charge of a data source that you believe could be included in UniChem, please contact us.
The most important criterion for including a source is that the source itself hosts a site where users can view compound-specific pages. In other words, each src_compound_id from the source should have its own web page where information relating to the compound can be viewed publicly.
Please note that we are a database that deals with existing compounds. We do not accept data for "virtual" compounds that have not yet been synthesised.
Also, the source should contain some useful information relating to the compound. Ideally, the source should be the primary provider of that information, and not simply act as an aggregator of data from other sources.
Lastly, the source must define the structures of the compounds, and make available regularly updated versions of the mappings between the src_compound_ids and the structures.
There are several mechanisms by which a source's data can be integrated into UniChem. The simplest and preferred scenario is where a source provides a regularly updated SDF file in which the compound ID is given either in the header of each record or under a key. UniChem will then automatically poll for updates and load the new datasets.
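As a rough illustration of the two places an SDF record can carry its compound ID, the sketch below reads the ID from either the header line or a data field. It is not UniChem's loader: the function names, the key name compound_id, and the single-atom water record are assumptions made for the example.

```python
# Minimal sketch: locate the compound ID in an SDF record, either in the
# header (first line of the molblock) or under a data-field key.
# The key name "compound_id" is an assumption chosen for this example.

SDF_TEXT = """CHEBI:15377


  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <compound_id>
CHEBI:15377

$$$$
"""


def split_records(sdf_text):
    """Yield each SDF record as a list of lines; records end with '$$$$'."""
    record = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def compound_id(record, key=None):
    """Return the ID from the header line, or from data field `key` if given."""
    if not record:
        return None
    if key is None:
        # The first line of a molblock is the header (molecule name) line.
        return record[0].strip() or None
    tag = f"<{key}>"
    for i, line in enumerate(record[:-1]):
        # Data-field headers look like "> <key>"; the value is on the next line.
        if line.startswith(">") and tag in line:
            return record[i + 1].strip() or None
    return None


ids_from_header = [compound_id(r) for r in split_records(SDF_TEXT)]
ids_from_key = [compound_id(r, key="compound_id") for r in split_records(SDF_TEXT)]
print(ids_from_header, ids_from_key)
```

A source needs only one of the two locations; the example record fills both so that each lookup can be shown.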
However, other mechanisms are possible if an SDF file is not available. A source can also provide a tabular file (TSV, Parquet, or CSV formats are supported) with two columns: compound_id and smiles. If the source does not have SMILES available, the last option is to provide a tabular file including two columns: compound_id and molfile.
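A two-column tabular file of this kind can be produced with Python's standard csv module. The following is a minimal sketch for the TSV case; the function name write_mapping and the example rows are illustrative, and only the column names compound_id and smiles come from the description above.

```python
import csv
import io


def write_mapping(rows, handle):
    """Write (compound_id, smiles) pairs as tab-separated values with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["compound_id", "smiles"])
    writer.writerows(rows)


# Example rows chosen for illustration: water and ethanol.
rows = [("CHEBI:15377", "O"), ("CHEBI:16236", "CCO")]

# An in-memory buffer stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_mapping(rows, buffer)
print(buffer.getvalue())
```

For the molfile variant, the second column would be named molfile and hold the molblock text. The csv module quotes such multi-line values automatically, though whether UniChem expects that quoting is not stated here.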
You can see some example files here. Note that each compound ID the source reports in the tabular or SDF file is used in the final detail URL. For example, if a source reports the compound ID CHEBI:15377, UniChem uses it to create the URL https://ebi.ac.uk/chebi/CHEBI:15377
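Because the ID ends up in the URL as reported, it can help to preview the resulting links before submitting a file. The sketch below assumes the detail URL is a per-source base URL followed directly by the compound ID, which matches the ChEBI example but is not a documented rule. The helper name detail_url is hypothetical.

```python
def detail_url(base_url, compound_id):
    """Join a source's base URL and a compound ID exactly as reported.

    Assumption: no escaping or rewriting is applied to the ID.
    """
    return f"{base_url}{compound_id}"


# Base URL taken from the ChEBI example in the text.
CHEBI_BASE_URL = "https://ebi.ac.uk/chebi/"

print(detail_url(CHEBI_BASE_URL, "CHEBI:15377"))
```

Running this over every ID in a file is a quick way to spot identifiers that would not resolve to a valid compound page.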
Reasons for data omission
UniChem must be able to calculate a Standard InChI for each compound. Depending on whether a tabular or SDF file is provided, the InChI is derived from either the SMILES string or the molblock. RDKit is used for the calculation. If UniChem cannot calculate the Standard InChI for a compound during loading, that compound is omitted.
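Sources can check their own data with RDKit before submission to see which compounds are likely to be omitted. The sketch below is not UniChem's loading code, and its settings may differ from those used in production. The function name standard_inchi is hypothetical; the RDKit calls (MolFromSmiles, MolFromMolBlock, MolToInchi) are standard.

```python
def standard_inchi(structure, is_molblock=False):
    """Return the Standard InChI for a SMILES string or molblock, or None on failure."""
    # Imported here so the module still loads where RDKit is not installed.
    from rdkit import Chem

    if is_molblock:
        mol = Chem.MolFromMolBlock(structure)
    else:
        mol = Chem.MolFromSmiles(structure)
    if mol is None:
        # RDKit could not parse the structure.
        return None
    # MolToInchi returns an empty string when InChI generation fails.
    return Chem.MolToInchi(mol) or None


if __name__ == "__main__":
    for compound_id, smiles in [("CHEBI:15377", "O"), ("BAD:0001", "C1CC(")]:
        inchi = standard_inchi(smiles)
        status = inchi if inchi else "would be omitted (no Standard InChI)"
        print(compound_id, status)
```

Any compound for which the function returns None is a candidate for omission and is worth fixing at the source.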