What sort of sources are used in UniChem?
  • Any chemically aware database that contains compounds which have been assigned ids (named as src_compound_ids in UniChem) and structures. A source may store structures in a variety of ways.
  • If standard InChIs are provided by the source, then these are used by UniChem.
  • However, if a source does not provide standard InChIs, UniChem will produce these during loading of the data.
  • There need not be a 1:1 relationship between src_compound_ids and InChIs in a source. Many sources define chemical uniqueness by means other than the standard InChI. UniChem handles 1:many and many:1 relationships between src_compound_ids and standard InChIs.
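The 1:many and many:1 relationships described above can be pictured with a minimal sketch. It assumes the data arrives as (src_compound_id, standard InChI) pairs; the identifiers, the InChI strings and the function name `build_indexes` are placeholders invented for illustration, and this is not UniChem's actual loading code.

```python
from collections import defaultdict

# Hypothetical (src_compound_id, standard InChI) pairs for one source.
# Both the ids and the InChI strings are placeholders, not real data.
records = [
    ("ID_A", "InChI=1S/placeholder-1"),
    ("ID_A", "InChI=1S/placeholder-2"),  # one id, two structures (1:many)
    ("ID_B", "InChI=1S/placeholder-3"),
    ("ID_C", "InChI=1S/placeholder-3"),  # two ids, one structure (many:1)
]


def build_indexes(pairs):
    """Index the pairs in both directions without assuming a 1:1 mapping."""
    id_to_inchis = defaultdict(set)
    inchi_to_ids = defaultdict(set)
    for src_compound_id, inchi in pairs:
        id_to_inchis[src_compound_id].add(inchi)
        inchi_to_ids[inchi].add(src_compound_id)
    return id_to_inchis, inchi_to_ids


id_to_inchis, inchi_to_ids = build_indexes(records)
```

Storing sets on both sides means neither direction has to be unique, which is the property the answer above relies on.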
What sources are currently used in UniChem?
  • These are listed on the 'Sources' page...
Some src_compound_ids are missing from a source in UniChem. Why is this?
  • This is explained on the 'Reasons for data omission' page...
The source I am trying to create hyperlinks to does not use the src_compound_id to create the URL for compound-specific pages. How does UniChem deal with this?
  • In these circumstances, UniChem makes use of 'auxiliary data' to create links, as described immediately below.
What is 'auxiliary data' for a src_compound_id and when would I need to use it?
  • Most sources within UniChem create URLs for compound specific pages by simply appending a src_compound_id to a ‘base URL’ (eg: appending 'CHEMBL59' to '' gives:
  • However, a small number of sources in UniChem create URLs for compound-specific pages from strings or identifiers ('auxiliary data') that differ from the source's src_compound_ids. This is uncommon, and UniChem handles it with an additional mapping step for these sources, in which each src_compound_id is mapped to its 'auxiliary data'.
  • Sources where this step is necessary are marked with a '1' in the 'AUX_FOR_URL' field of the UC_SOURCES table.
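The two ways of building a link can be sketched as follows. This is an illustration under stated assumptions, not UniChem's implementation: the dictionary layout, the `BASE_URL` key, the `aux_lookup` argument, the function name `compound_url` and the example.org addresses are all hypothetical. Only the 'AUX_FOR_URL' flag and its value '1' come from the text above.

```python
def compound_url(source, src_compound_id, aux_lookup=None):
    """Build a compound-specific URL for one source.

    source     -- hypothetical dict with 'BASE_URL' and 'AUX_FOR_URL' keys
    aux_lookup -- hypothetical dict mapping src_compound_id -> auxiliary data
    """
    needs_aux = str(source.get("AUX_FOR_URL", "0")) == "1"
    if not needs_aux:
        # Common case: append the src_compound_id to the base URL.
        return source["BASE_URL"] + src_compound_id
    # Less common case: map the id to its auxiliary data first.
    if aux_lookup is None or src_compound_id not in aux_lookup:
        raise KeyError(f"no auxiliary data for {src_compound_id!r}")
    return source["BASE_URL"] + aux_lookup[src_compound_id]


# Placeholder sources; the base URLs are invented for this sketch.
plain_source = {"BASE_URL": "https://example.org/compound/", "AUX_FOR_URL": "0"}
aux_source = {"BASE_URL": "https://example.org/view/", "AUX_FOR_URL": "1"}

plain_link = compound_url(plain_source, "CHEMBL59")
aux_link = compound_url(aux_source, "X1", {"X1": "page-123"})
```

Raising an error when auxiliary data is missing, rather than silently falling back to the src_compound_id, is a design choice made here to avoid producing links that point nowhere.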
It looks like some data is replicated in multiple sources in UniChem. For example, 'pubchem', 'pubchem_tpharma' and 'pubchem_dotf' all come from PubChem. Why is this?
  • Some users wish to create hyperlinks to an entire data source, others only to sub-sets of the data within it. For this reason, UniChem may hold one source entry covering an entire data source alongside separate source entries covering sub-sets of it.
  • Some sources in UniChem, such as PubChem, integrate structures from a wide variety of depositing primary sources. In PubChem, the structures as originally deposited are assigned one set of identifiers (SIDs), while the 'integrated' equivalent of each molecule is assigned a different identifier (a CID). An SID therefore represents the original depositor-defined version of the structure, and a CID represents PubChem's normalized, integrated form of the molecule.
  • Sometimes these structures are different to one another, as the normalization process may change the structure.
  • In the case of PubChem, UniChem has included some sub-sets on the basis of the original depositor, and has therefore adopted the following policy:
    • For the entire PubChem data source, CIDs are used.
    • For depositor-defined sub-sets of PubChem, SIDs are used instead.
  • Please go here for a full discussion of provenance in relation to UniChem.
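The PubChem identifier policy described above can be summarized as a small lookup. The source names are the three mentioned in the question; the table name `PUBCHEM_ID_TYPE` and the function `pubchem_id_type` are hypothetical and exist only for this sketch.

```python
# Identifier type used for each PubChem-derived source named above.
# 'pubchem' is the entire data source; the other two are
# depositor-defined sub-sets.
PUBCHEM_ID_TYPE = {
    "pubchem": "CID",
    "pubchem_tpharma": "SID",
    "pubchem_dotf": "SID",
}


def pubchem_id_type(source_name):
    """Return 'CID' or 'SID' for a known PubChem-derived source name."""
    return PUBCHEM_ID_TYPE[source_name]
```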