
MAP files


Last updated 1 month ago

Each quarter we release a compound-patent mapping file that lists all the compounds found in the patents over that period, together with their locations in the patents. The files can be accessed from the SureChEMBL interface; they can also be accessed directly on the FTP server.

The files are gzip-compressed and tab-separated. Each row describes a specific compound that has been extracted from a specific section of a specific patent document. The files contain chemistry extracted from the full-text patent documents (including images and Complex Work Unit MOL files, where available) from the WIPO, EPO and USPTO authorities. In addition, they contain chemistry from the titles and abstracts of patent documents from the JPO.

Each file has the same fields:

  • SCHEMBL_ID: Unique SureChEMBL compound identifier, e.g. SCHEMBL1001

  • SMILES: ChemAxon canonical Kekulé-based SMILES representation

  • INCHI_KEY: ChemAxon-generated standard InChI key

  • CORPUS_FREQUENCY: Frequency of occurrence of a compound across all sections of all patent documents

  • PATENT_ID: Standardised representation of a patent number, e.g. WO-2014059185-A1

  • PUBLICATION_DATE: Patent publication date according to the respective patent authority. The format is YYYY-MM-DD, e.g. 1979-07-15

  • FIELD: The field that the compound appears in. An integer value, one of:

    • 1 - Description

    • 2 - Claims

    • 3 - Abstract

    • 4 - Title

    • 5 - Image (for patents after 2007)

    • 6 - MOL Attachment (US patents after 2007)

  • FIELD_FREQUENCY: The number of times the given compound appears in the given field in the document.
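As a rough illustration of the format described above, the following Python sketch reads a gzipped, tab-separated MAP file into typed rows and decodes the integer FIELD code. The names `MapRow`, `read_map_rows` and `FIELD_LABELS` are hypothetical, not part of any SureChEMBL tooling. The sketch assumes UTF-8 text and the column order listed above, and it tolerates an optional header row because this page does not say whether one is present. The sample row is made up for demonstration and is not real SureChEMBL data.

```python
import csv
import datetime
import gzip
import io
from typing import BinaryIO, Iterator, NamedTuple

# Column order as listed in the field descriptions above.
COLUMNS = ["SCHEMBL_ID", "SMILES", "INCHI_KEY", "CORPUS_FREQUENCY",
           "PATENT_ID", "PUBLICATION_DATE", "FIELD", "FIELD_FREQUENCY"]

# Meaning of the integer FIELD codes.
FIELD_LABELS = {1: "Description", 2: "Claims", 3: "Abstract",
                4: "Title", 5: "Image", 6: "MOL Attachment"}


class MapRow(NamedTuple):
    schembl_id: str
    smiles: str
    inchi_key: str
    corpus_frequency: int
    patent_id: str
    publication_date: datetime.date
    field: str  # decoded label, e.g. "Claims"
    field_frequency: int


def read_map_rows(stream: BinaryIO) -> Iterator[MapRow]:
    """Yield one MapRow per data line of a gzipped, tab-separated MAP file."""
    with gzip.open(stream, mode="rt", encoding="utf-8", newline="") as text:
        # QUOTE_NONE: SMILES strings are taken literally, never as quoted text.
        for rec in csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE):
            if not rec or rec == COLUMNS:  # skip blank lines and an optional header
                continue
            if len(rec) != len(COLUMNS):
                raise ValueError(f"expected {len(COLUMNS)} columns, got {len(rec)}")
            row = dict(zip(COLUMNS, rec))
            yield MapRow(
                schembl_id=row["SCHEMBL_ID"],
                smiles=row["SMILES"],
                inchi_key=row["INCHI_KEY"],
                corpus_frequency=int(row["CORPUS_FREQUENCY"]),
                patent_id=row["PATENT_ID"],
                publication_date=datetime.date.fromisoformat(row["PUBLICATION_DATE"]),
                field=FIELD_LABELS[int(row["FIELD"])],
                field_frequency=int(row["FIELD_FREQUENCY"]),
            )


# Made-up sample row (placeholder values, not real SureChEMBL data).
sample_tsv = ("SCHEMBL1001\tCCO\tEXAMPLE-INCHI-KEY\t12\t"
              "WO-2014059185-A1\t1979-07-15\t2\t3\n")
sample_gz = gzip.compress(sample_tsv.encode("utf-8"))

rows = list(read_map_rows(io.BytesIO(sample_gz)))
print(rows[0].patent_id, rows[0].field, rows[0].field_frequency)
```

For a downloaded file, the same function can be given a file opened in binary mode, for example `open(path, "rb")`, where `path` points at the local copy. Because the rows are yielded one at a time, the whole file never needs to be held in memory.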

The compounds and patents in this map file were preprocessed according to the following criteria:

  • Compound filters:

    • Must not be radical

    • Must have fewer than 4 components and more than 6 carbons

    • Must be organic

    • Molecular weight must be between 100 and 6000

  • Patent filters:

    • A document must have one of the following classification codes: A01, A23, A24, A61, A62B, C05, C06, C07, C08, C09, C10, C11, C12, C13, C14, G01N

    • The IPC, ECLA, IPCR, and CPC classification systems were checked for the above classification codes.

    • Key for top-level areas: A = Human Necessities, C = Chemistry, G = Physics
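The criteria above can be expressed as two simple predicates. The Python sketch below is illustrative only and is not the SureChEMBL pipeline. The `CompoundProps` record and both function names are hypothetical, the compound properties are assumed to have been computed beforehand with a cheminformatics toolkit, and the molecular weight bounds are assumed to be inclusive because the page does not say. Classification codes are matched by prefix, which is also an assumption about how the check is applied.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CompoundProps:
    """Hypothetical record of precomputed compound properties."""
    is_radical: bool
    is_organic: bool
    n_components: int
    n_carbons: int
    mol_weight: float


def passes_compound_filters(c: CompoundProps) -> bool:
    """Apply the compound criteria listed above."""
    return (
        not c.is_radical
        and c.is_organic
        and c.n_components < 4               # fewer than 4 components
        and c.n_carbons > 6                  # more than 6 carbons
        and 100 <= c.mol_weight <= 6000      # assumption: bounds are inclusive
    )


# Classification code prefixes accepted by the patent filter.
ALLOWED_PREFIXES = (
    "A01", "A23", "A24", "A61", "A62B",
    "C05", "C06", "C07", "C08", "C09", "C10",
    "C11", "C12", "C13", "C14",
    "G01N",
)


def passes_patent_filter(classification_codes: Iterable[str]) -> bool:
    """True if any code (IPC, ECLA, IPCR or CPC) starts with an allowed prefix."""
    return any(
        code.replace(" ", "").upper().startswith(ALLOWED_PREFIXES)
        for code in classification_codes
    )


# Made-up examples for demonstration.
candidate = CompoundProps(is_radical=False, is_organic=True,
                          n_components=1, n_carbons=9, mol_weight=180.2)
print(passes_compound_filters(candidate))
print(passes_patent_filter(["H04L 9/00", "A61K 31/00"]))
```

Keeping the two checks separate mirrors the list above: a row appears in the map file only when the compound passes the first predicate and its patent passes the second.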

Access to the FTP server from the SureChEMBL interface
MAP file directory on the FTP server