MAP files

Each quarter we deliver a compound-patent mapping file listing all the compounds found in patents over that period, together with their locations in the patents. The files can be accessed from the SureChEMBL interface or accessed directly.

The files are gzipped and tab-separated. Each row represents a specific compound that has been extracted from a specific section of a specific patent document. The files contain chemistry extracted from the full-text patent documents (including images and Complex Work Unit mol files, where available) from the WIPO, EPO and USPTO authorities. In addition, they contain chemistry from the titles and abstracts of patent documents from JPO.

Each file has the same fields:

SCHEMBL_ID: Unique SureChEMBL compound identifier, e.g. SCHEMBL1001

SMILES: ChemAxon canonical Kekulé-based SMILES representation

INCHI_KEY: ChemAxon-generated standard InChI key

CORPUS_FREQUENCY: Frequency of occurrence of a compound across all sections of all patent documents

PATENT_ID: Standardised representation of a patent number, e.g. WO-2014059185-A1

PUBLICATION_DATE: Patent publication date according to the respective patent authority. The format is YYYY-MM-DD, e.g. 1979-07-15

FIELD: The field that the compound appears in. An integer value, one of:

  • 1 - Description

  • 2 - Claims

  • 3 - Abstract

  • 4 - Title

  • 5 - Image (for patents after 2007)

  • 6 - MOL Attachment (US patents after 2007)

FIELD_FREQUENCY: The number of times the given compound appears in the given field in the document.
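As an illustration of the layout described above, the following is a minimal Python sketch for reading a map file. The names MapRow, FIELD_NAMES, parse_map_lines and read_map_file are hypothetical and not part of any SureChEMBL tooling. The sketch assumes UTF-8 text and that the eight columns appear in the order listed; it skips a header row if one is present, since whether the files carry one is not stated here.

```python
import csv
import gzip
from typing import Iterable, Iterator, NamedTuple

# Meaning of the integer FIELD column, as listed in the field descriptions.
FIELD_NAMES = {
    1: "Description",
    2: "Claims",
    3: "Abstract",
    4: "Title",
    5: "Image",
    6: "MOL Attachment",
}


class MapRow(NamedTuple):
    """One compound occurrence in one field of one patent (hypothetical type)."""
    schembl_id: str
    smiles: str
    inchi_key: str
    corpus_frequency: int
    patent_id: str
    publication_date: str  # kept as the YYYY-MM-DD string from the file
    field: int
    field_frequency: int


def parse_map_lines(lines: Iterable[str]) -> Iterator[MapRow]:
    """Parse tab-separated lines into MapRow records."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for rec in reader:
        # Skip blank lines and a header row, if the file has one (assumption).
        if not rec or rec[0] == "SCHEMBL_ID":
            continue
        yield MapRow(
            rec[0], rec[1], rec[2], int(rec[3]),
            rec[4], rec[5], int(rec[6]), int(rec[7]),
        )


def read_map_file(path: str) -> Iterator[MapRow]:
    """Stream rows from a gzipped map file without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from parse_map_lines(handle)
```

A caller could then iterate with `for row in read_map_file(path)` and look up `FIELD_NAMES[row.field]` to get a readable field name. Streaming row by row is a deliberate choice here, on the assumption that a quarterly file may be too large to hold in memory comfortably.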

The compounds and patents in this map file were preprocessed according to the following criteria:

  • Compound filters:

    • Must not be radical

    • Must have fewer than 4 components and more than 6 carbons

    • Must be organic

    • Molecular weight must be between 100 and 6000

  • Patent filters:

    • A document must have one of the following codes: A01, A23, A24, A61, A62B, C05, C06, C07, C08, C09, C10, C11, C12, C13, C14, G01N

    • The IPC, ECLA, IPCR, and CPC classification systems were checked for the above classification codes.

    • Key for top level areas: A = Human Necessities, C = Chemistry, G = Physics
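The criteria above can be sketched in Python as follows. This is only an illustration, not the preprocessing code actually used: the function names are hypothetical, the compound properties (component count, carbon count, molecular weight, radical and organic flags) are assumed to have been computed beforehand by a chemistry toolkit, the molecular weight bounds are assumed to be inclusive, and classification codes are assumed to match by prefix.

```python
from typing import Iterable

# Classification code prefixes from the patent filter list.
ALLOWED_PREFIXES = (
    "A01", "A23", "A24", "A61", "A62B",
    "C05", "C06", "C07", "C08", "C09", "C10",
    "C11", "C12", "C13", "C14",
    "G01N",
)


def passes_compound_filter(
    n_components: int,
    n_carbons: int,
    mol_weight: float,
    is_radical: bool,
    is_organic: bool,
) -> bool:
    """Apply the compound criteria to precomputed properties (hypothetical helper)."""
    return (
        not is_radical
        and n_components < 4           # fewer than 4 components
        and n_carbons > 6              # more than 6 carbons
        and is_organic
        and 100 <= mol_weight <= 6000  # inclusive bounds are an assumption
    )


def passes_patent_filter(classification_codes: Iterable[str]) -> bool:
    """Return True if any code starts with an allowed prefix (assumed matching rule)."""
    for code in classification_codes:
        normalised = "".join(code.split()).upper()  # drop whitespace, e.g. "A61K 31/00"
        if normalised.startswith(ALLOWED_PREFIXES):
            return True
    return False
```

Under these assumptions a document classified as "A61K 31/00" would pass, while one classified only as "A62C 3/00" would not, because only subclass A62B is listed. The codes passed in would be the union of a document's IPC, ECLA, IPCR and CPC classifications.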
