Unichem Contents
What sort of sources are used in UniChem?
Any chemically aware database that contains compounds which have been assigned ids (named as src_compound_ids in UniChem) and structures. A source may store structures in a variety of ways.
If standard InChIs are provided by the source, then these are used by UniChem.
However, if a source does not provide standard InChIs, UniChem will produce these during loading of the data.
There need not be a 1:1 relationship between src_compound_ids and InChIs in a source. Many sources define chemical uniqueness by means other than the standard InChI. UniChem will handle 1:many and many:1 relationships between src_compound_ids and standard InChIs.
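The 1:many and many:1 relationships can be pictured with a small sketch. This is only an illustration, not UniChem's own loading code: the src_compound_ids ("A1", "A2", "A3") and the function name `index_assignments` are invented for the example, and the InChIs are simply those of water, methane and ethanol.

```python
from collections import defaultdict

# Hypothetical (src_compound_id, standard InChI) pairs from one source.
# The ids are invented for illustration only.
pairs = [
    ("A1", "InChI=1S/H2O/h1H2"),
    ("A2", "InChI=1S/H2O/h1H2"),                   # many:1, two ids share one InChI
    ("A3", "InChI=1S/CH4/h1H4"),
    ("A3", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),   # 1:many, one id has two InChIs
]


def index_assignments(pairs):
    """Index the pairs in both directions so neither side has to be unique."""
    inchis_by_id = defaultdict(set)
    ids_by_inchi = defaultdict(set)
    for src_compound_id, inchi in pairs:
        inchis_by_id[src_compound_id].add(inchi)
        ids_by_inchi[inchi].add(src_compound_id)
    return dict(inchis_by_id), dict(ids_by_inchi)


inchis_by_id, ids_by_inchi = index_assignments(pairs)
print(sorted(ids_by_inchi["InChI=1S/H2O/h1H2"]))  # ids that share the water InChI
print(len(inchis_by_id["A3"]))                    # number of InChIs assigned to A3
```

Storing sets on both sides means a lookup in either direction returns every match rather than assuming a single one.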
What sources are currently used in UniChem?
These are listed on the 'Sources' page.
Some src_compound_ids are missing from a source in UniChem. Why is this?
This is explained on the 'Reasons for data omission' page.
The source I am trying to create hyperlinks to does not use the src_compound_id to create the URL for compound-specific pages. How does UniChem deal with this?
In these circumstances, UniChem makes use of 'auxiliary data' to create links, as described immediately below.
What is 'auxiliary data' for a src_compound_id and when would I need to use it?
Most sources within UniChem create URLs for compound-specific pages by simply appending a src_compound_id to a 'base URL' (e.g. appending 'CHEMBL59' to 'https://www.ebi.ac.uk/chembldb/compound/inspect/' gives: https://www.ebi.ac.uk/chembldb/compound/inspect/CHEMBL59).
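As a minimal sketch of this simple case, the URL is plain string concatenation. The function name `compound_url` is an assumption made for the example; the base URL and identifier are the ones quoted above.

```python
def compound_url(base_url: str, src_compound_id: str) -> str:
    """Build a compound-specific URL by appending the id to the base URL."""
    return base_url + src_compound_id


CHEMBL_BASE_URL = "https://www.ebi.ac.uk/chembldb/compound/inspect/"
print(compound_url(CHEMBL_BASE_URL, "CHEMBL59"))
# prints https://www.ebi.ac.uk/chembldb/compound/inspect/CHEMBL59
```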
However, a small number of sources in UniChem create URLs for compound-specific pages from strings or identifiers ('auxiliary data') that differ from the src_compound_ids for the source. This is uncommon, and UniChem deals with it through an additional mapping step for these sources, in which the src_compound_ids are mapped to the 'auxiliary data'.
Sources where this step is necessary are marked with a '1' in the 'AUX_FOR_URL' field of the UC_SOURCES table.
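The extra mapping step might be sketched as follows. This is a hedged illustration rather than UniChem's implementation: the function name, the `aux_by_id` dictionary, the example.org base URL and the sample values are all hypothetical, and the sketch assumes the auxiliary data is appended to the base URL in the same way a src_compound_id would be. Only the meaning of the AUX_FOR_URL flag is taken from the text above.

```python
from typing import Dict, Optional


def compound_url_with_aux(
    base_url: str,
    aux_for_url: int,
    src_compound_id: str,
    aux_by_id: Optional[Dict[str, str]] = None,
) -> str:
    """Build a compound URL, using auxiliary data when the source requires it.

    aux_for_url mirrors the AUX_FOR_URL field: 1 means the URL is built from
    auxiliary data rather than from the src_compound_id itself.
    """
    if aux_for_url == 1:
        if aux_by_id is None or src_compound_id not in aux_by_id:
            raise KeyError(f"no auxiliary data for {src_compound_id!r}")
        # Assumption: the auxiliary string takes the place of the id in the URL.
        return base_url + aux_by_id[src_compound_id]
    return base_url + src_compound_id


# Hypothetical source whose pages are keyed by a string other than the id.
aux_by_id = {"X1": "page-abc"}
print(compound_url_with_aux("https://example.org/compound/", 1, "X1", aux_by_id))
print(compound_url_with_aux("https://example.org/compound/", 0, "X1"))
```

Raising an error when the auxiliary data is missing avoids silently producing a link that points nowhere.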
It looks like some data is replicated in multiple sources in UniChem; for example, 'pubchem', 'pubchem_tpharma' and 'pubchem_dotf' all come from PubChem. Why is this?
Some users wish to create hyperlinks to an entire data source, others only to sub-sets of the data within a data source. For this reason, UniChem maintains some data both as a source covering the entire data source and as separate sources covering sub-sets of it.
Some sources in UniChem, such as PubChem, have integrated structures from a wide variety of depositing primary sources. In the case of PubChem, the structures as originally deposited are assigned a different set of identifiers (SIDs) from the 'integrated' equivalent of the molecule (which is assigned a 'CID'). An SID therefore represents the original depositor-defined version of the structure, and the CID represents PubChem's normalized, integrated form of the molecule.
Sometimes these structures are different to one another, as the normalization process may change the structure.
In the case of PubChem, UniChem has included some sub-sets on the basis of the original depositor, and has therefore adopted the following policy:
For the entire PubChem data source, CIDs are used.
For depositor-defined sub-sets of PubChem, SIDs are used instead.
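The policy can be written down as a small lookup. This is a sketch only: the function name is invented, and it assumes that 'pubchem_tpharma' and 'pubchem_dotf', the two sub-set sources named earlier, are depositor-defined sub-sets.

```python
PUBCHEM_WHOLE_SOURCE = "pubchem"
# Assumption: these two source names are depositor-defined sub-sets of PubChem.
PUBCHEM_SUBSET_SOURCES = {"pubchem_tpharma", "pubchem_dotf"}


def pubchem_identifier_type(source_name: str) -> str:
    """Return which PubChem identifier a UniChem source uses as src_compound_id."""
    if source_name == PUBCHEM_WHOLE_SOURCE:
        return "CID"  # normalized, integrated form of the molecule
    if source_name in PUBCHEM_SUBSET_SOURCES:
        return "SID"  # structure as originally deposited
    raise ValueError(f"not a PubChem-derived source: {source_name!r}")


print(pubchem_identifier_type("pubchem"))
print(pubchem_identifier_type("pubchem_dotf"))
```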
Please go here for a full discussion of provenance in relation to UniChem.