MAP files
Last updated
Each quarter we deliver a compound-patent mapping file that lists all the compounds found in the patents published over that period, together with their locations in the patents. The files can be accessed from the SureChEMBL interface, or they can be accessed directly.
The files are gzipped and tab-separated. Each row describes a specific compound that has been extracted from a specific section of a specific patent document. Each file contains chemistry extracted from the full-text patent documents (including images and Complex Work Unit mol files, where available) from the WIPO, EPO and USPTO authorities. In addition, it contains chemistry from the titles and abstracts of patent documents from the JPO.
Each file has the same fields, which are described in the table below.
The compounds and patents in this map file were preprocessed according to the following criteria:
Compound filters:
Must not be radical
Must have fewer than 4 components and more than 6 carbons
Must be organic
Molecular weight must be between 100 and 6000
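As a rough illustration of how these four criteria combine, the sketch below applies them to compound properties that have already been computed elsewhere. The `CompoundProps` class and its attribute names are hypothetical, and treating the molecular weight bounds as inclusive is an assumption; this is not the actual SureChEMBL pipeline code.

```python
from dataclasses import dataclass


@dataclass
class CompoundProps:
    """Hypothetical container for properties computed by a chemistry toolkit."""
    is_radical: bool
    n_components: int
    n_carbons: int
    is_organic: bool
    mol_weight: float


def passes_compound_filters(c: CompoundProps) -> bool:
    """Return True if the compound meets all four criteria listed above."""
    return (
        not c.is_radical                  # must not be a radical
        and c.n_components < 4            # fewer than 4 components
        and c.n_carbons > 6               # more than 6 carbons
        and c.is_organic                  # must be organic
        and 100 <= c.mol_weight <= 6000   # assumption: bounds are inclusive
    )
```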
Patent filters:
A document must have one of the following codes: A01, A23, A24, A61, A62B, C05, C06, C07, C08, C09, C10, C11, C12, C13, C14, G01N
The IPC, ECLA, IPCR, and CPC classification systems were checked for the above classification codes.
Key for top-level areas: A = Human Necessities, C = Chemistry, G = Physics
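The patent filter can be pictured as a prefix check over a document's classification codes. The sketch below is only an illustration: the function name is hypothetical, matching by string prefix is an assumption about how "having a code" is decided, and the input is assumed to be the codes pooled from all four classification systems.

```python
# Classification code prefixes listed in the patent filter above.
ALLOWED_PREFIXES = (
    "A01", "A23", "A24", "A61", "A62B",
    "C05", "C06", "C07", "C08", "C09", "C10",
    "C11", "C12", "C13", "C14",
    "G01N",
)


def patent_passes_filter(classification_codes):
    """Return True if any code starts with one of the allowed prefixes.

    `classification_codes` is assumed to be an iterable of strings such as
    "A61K 31/00", gathered from IPC, ECLA, IPCR and CPC.
    """
    return any(
        code.strip().upper().startswith(ALLOWED_PREFIXES)
        for code in classification_codes
    )
```

For example, a document classified under both "H01L 21/00" and "C07D 401/12" would pass because of the C07 code, while one classified only under "H01L 21/00" would not.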
Field | Description |
---|---|
SCHEMBL_ID | Unique SureChEMBL compound identifier, e.g. SCHEMBL1001 |
SMILES | ChemAxon canonical Kekule-based SMILES representation |
INCHI_KEY | ChemAxon-generated standard InChI key |
CORPUS_FREQUENCY | Frequency of occurrence of a compound across all sections of all patent documents |
PATENT_ID | Standardised representation of a patent number, e.g. WO-2014059185-A1 |
PUBLICATION_DATE | Patent publication date according to the respective patent authority. The format is YYYY-MM-DD, e.g. 1979-07-15 |
FIELD | The field that the compound appears in. An integer value, one of: 1 = Description, 2 = Claims, 3 = Abstract, 4 = Title, 5 = Image (for patents after 2007), 6 = MOL Attachment (US patents after 2007) |
FIELD_FREQUENCY | The number of times the given compound appears in the given field in the document |
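A minimal sketch of reading one of these files with the Python standard library is shown below. It relies only on the gzipped, tab-separated layout and the eight fields described above; the function name `read_map_file`, the UTF-8 encoding, and the handling of an optional header row are assumptions rather than documented behaviour.

```python
import csv
import gzip

# Field order as given in the table above.
COLUMNS = [
    "SCHEMBL_ID", "SMILES", "INCHI_KEY", "CORPUS_FREQUENCY",
    "PATENT_ID", "PUBLICATION_DATE", "FIELD", "FIELD_FREQUENCY",
]

# Meaning of the integer FIELD codes.
FIELD_NAMES = {
    1: "Description", 2: "Claims", 3: "Abstract",
    4: "Title", 5: "Image", 6: "MOL Attachment",
}


def read_map_file(path):
    """Yield one dict per row of a gzipped, tab-separated map file."""
    # Assumption: the file is UTF-8 encoded text.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank lines and a header row, should the file contain one.
            if not row or row[0] == COLUMNS[0]:
                continue
            record = dict(zip(COLUMNS, row))
            for key in ("CORPUS_FREQUENCY", "FIELD", "FIELD_FREQUENCY"):
                record[key] = int(record[key])
            record["FIELD_NAME"] = FIELD_NAMES.get(record["FIELD"], "Unknown")
            yield record
```

Because the function is a generator, rows are processed one at a time, which keeps memory use low when a quarterly file is large. Each yielded record carries the original eight fields plus a derived `FIELD_NAME` entry.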