Flexible SAMID mapping

For very large and complicated data sets, with multiple controls which may or may not be shared between sets, the creation of the correct SAMID mappings can be an onerous, and often error-prone, curation task. This task can be a considerable barrier to submitting data. For this reason, SAMID use is interpreted flexibly.

When submitting supplementary data, the granularity of the SAMID mapping can be very high or very low, depending on how much curation effort is available. The lowest-granularity, or 'default', mapping simply maps all ACTIVITY values for an assay in a job to all the supplementary data supplied. This default mapping allows supplementary data to be loaded easily and quickly; high granularity mapping can be postponed until curation resources become available, and the entire set can then be updated.

High granularity mapping requires the submitter to assign the correct SAMID(s) to each ACT_ID in the ACTIVITY_SUPP_MAP file, linking individual measurements to processed data.
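The difference between the two mapping levels can be sketched as follows. This is a minimal illustration only, not the ChEMBL loader: the function names, the identifiers `ACT1`, `S1`, etc., and the representation of a mapping as (ACT_ID, SAMID) pairs are all assumptions made for the example.

```python
from itertools import product


def default_mapping(act_ids, samids):
    """Low granularity ('default') mapping: every ACT_ID is paired with every SAMID.

    Hypothetical helper; the real deposition file layout may differ.
    """
    return sorted(product(act_ids, samids))


def fine_mapping(curated):
    """High granularity mapping built from a curated dict {ACT_ID: [SAMID, ...]}.

    Hypothetical helper; the curated dict stands in for the submitter's
    knowledge of which supplementary rows support which activity.
    """
    return sorted(
        (act_id, samid)
        for act_id, samids in curated.items()
        for samid in samids
    )


# Illustrative identifiers only.
act_ids = ["ACT1", "ACT2"]
samids = ["S1", "S2", "S3"]

low = default_mapping(act_ids, samids)
high = fine_mapping({"ACT1": ["S1"], "ACT2": ["S2", "S3"]})
```

In this sketch the high granularity mapping is always a subset of the default one, which is why a job loaded with the default mapping can be refined later without losing any associations that matter.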

Data loaded with low granularity mapping is less useful to a user querying it: the querier can identify the entire data set from which a given ACTIVITY value came, but cannot tell exactly which ACTIVITY_SUPP data values were used to generate which ACTIVITY values.

For this reason, a user may need to distinguish between high and low granularity mapping. On loading to ChEMBL, an option exists to signal whether the supplementary data in the job has been mapped to activities with the highest granularity possible. Depositors of complex data sets must state whether this option should be used (it cannot be encoded within the deposition files). If the option is not used, the job is marked by default as 'incomplete SAMID mapping'. If the option is used, the set is marked as 'SAMID mapping completed', and this information is stored in the database. A querier can therefore determine whether the supplementary data retrieved for a single ACTIVITY record in this data set contains only data that directly supports the ACTIVITY value.

Summary of the advantages of the SAMID + REGID model

Although the loading of supplementary data remains a complex exercise, it is worth emphasising some of the advantages of loading these data using ACTIVITY_SUPP and ACTIVITY_SUPP_MAP files.

  1. The barrier to loading supplementary data is lowered; the fine-grained SAMID mapping of very complex data can be curated at a later date when resources permit, and does not delay loading.

  2. Smaller, simpler supplementary data sets can often be 'SAMID mapped' quite simply, so the supplementary data becomes available for querying immediately in these cases.

  3. When the supplementary data is normalised, the same algorithms are applied as for ACTIVITY and ACTIVITY_PROPERTY, maximising the wider value of the data in the integrated whole of ChEMBL.

  4. In many cases, mapping REGID and SAMID as a transformation of a 2D matrix enables the original matrix to be recreated simply and exported.
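The last point can be illustrated with a small round trip. This is a sketch under stated assumptions, not the ChEMBL implementation: it assumes that matrix rows are keyed by REGID and columns by SAMID, and the function names and identifiers are hypothetical.

```python
def matrix_to_records(regids, samids, matrix):
    """Flatten a 2D matrix into (REGID, SAMID, value) records.

    Assumes rows correspond to REGIDs and columns to SAMIDs.
    """
    return [
        (regid, samid, matrix[i][j])
        for i, regid in enumerate(regids)
        for j, samid in enumerate(samids)
    ]


def records_to_matrix(records, regids, samids):
    """Rebuild the matrix from records; missing cells become None."""
    lookup = {(regid, samid): value for regid, samid, value in records}
    return [[lookup.get((regid, samid)) for samid in samids] for regid in regids]


# Illustrative identifiers and values only.
regids = ["CPD1", "CPD2"]
samids = ["S1", "S2", "S3"]
original = [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]]

records = matrix_to_records(regids, samids, original)
rebuilt = records_to_matrix(records, regids, samids)
```

Because each record carries both identifiers, the flattened form holds everything needed to restore the row and column positions, provided the original ordering of REGIDs and SAMIDs is kept.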
