
Bulk data

The single route for downloading SureChEMBL data.


Bulk data are machine-readable and allow any user to download the SureChEMBL data.

Content

The bulk data give access to the whole collection of SureChEMBL compounds, together with the patents from which they were extracted.

compounds

All the compounds in SureChEMBL

| COLUMN_NAME | DATA_TYPE | COMMENT |
| --- | --- | --- |
| id | INT64 | Compound unique identifier |
| smiles | STRING | Canonical SMILES, generated using RDKit |
| inchi | STRING | IUPAC standard InChI for the compound |
| inchi_key | STRING | IUPAC standard InChI key for the compound |
| mol_weight | DOUBLE | Molecular weight of the full compound including any salts |

patents

All the patents containing compounds in SureChEMBL.

| COLUMN_NAME | DATA_TYPE | COMMENT |
| --- | --- | --- |
| id | INT64 | Patent unique identifier |
| patent_number | STRING | Standardised format used for searching the system. Format: CC-PATNO-KK, e.g. WO-2011161255-A2 |
| country | STRING | Publication country |
| publication_date | DATE | The date when the patent was first published |
| family_id | INT64 | Data retrieved from the EPO. Generally, simple families contain all records that share the same priority. Set to -1 if no family ID has been received from DOCDB. |
| cpc | LIST OF STRINGS | Cooperative Patent Classification |
| ipcr | LIST OF STRINGS | International Patent Classification Reform |
| ipc | LIST OF STRINGS | International Patent Classification |
| ecla | LIST OF STRINGS | European Classification (replaced by CPC in 2013) |
| assignee | LIST OF STRINGS | The person or legal entity who owns the entire right, title, and interest in the application. Note that this field contains the assignee information provided by the publishing authority and in most cases doesn’t reflect reassignments. |
| title | STRING | Title of the document |
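The CC-PATNO-KK format of patent_number can be split into its parts with a small helper. The sketch below is illustrative only: the function name, the regular expression, and the reading of KK as a kind code made of one letter plus an optional digit are assumptions, and real records may contain shapes this pattern rejects.

```python
import re
from typing import NamedTuple


class PatentNumber(NamedTuple):
    country: str  # CC part, e.g. "WO"
    number: str   # PATNO part, e.g. "2011161255"
    kind: str     # KK part, e.g. "A2"


# Assumed shape: two-letter country code, alphanumeric number,
# then one letter with an optional digit. Not an official grammar.
_PATENT_RE = re.compile(r"^([A-Z]{2})-([A-Z0-9]+)-([A-Z][0-9]?)$")


def parse_patent_number(value: str) -> PatentNumber:
    """Split a CC-PATNO-KK string into its three components."""
    match = _PATENT_RE.match(value.strip().upper())
    if match is None:
        raise ValueError(f"not in CC-PATNO-KK form: {value!r}")
    return PatentNumber(*match.groups())


# Example taken from the table above.
print(parse_patent_number("WO-2011161255-A2"))
```

Values that do not match raise ValueError, so unexpected formats surface early instead of being split incorrectly.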

patent_compound_map

Patent-compound relationships with the patent field where the compound is found.

| COLUMN_NAME | DATA_TYPE | COMMENT |
| --- | --- | --- |
| patent_id | INT64 | Patent unique identifier |
| compound_id | INT64 | Compound unique identifier |
| field_id | INT64 | Field unique identifier |

fields

Patent field descriptions.

| COLUMN_NAME | DATA_TYPE | COMMENT |
| --- | --- | --- |
| id | INT64 | Field unique identifier |
| fieldname | INT64 | Patent field where the information can be found: 1 - Description; 2 - Claims; 3 - Abstract; 4 - Title; 5 - Image (for patents after 2007); 6 - MOL attachments (US patents after 2007) |
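To make the relationships between the four tables concrete, the following sketch joins them in an in-memory SQLite database. Every row is invented sample data (including the XX-... patent numbers), only a few columns are reproduced, and the field label is stored as text for readability, which is an assumption given that the schema above lists fieldname as INT64.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE compounds (id INTEGER PRIMARY KEY, smiles TEXT, mol_weight REAL);
CREATE TABLE patents (id INTEGER PRIMARY KEY, patent_number TEXT);
CREATE TABLE fields (id INTEGER PRIMARY KEY, fieldname TEXT);
CREATE TABLE patent_compound_map (patent_id INTEGER, compound_id INTEGER, field_id INTEGER);
""")

# Invented sample rows, for illustration only.
con.executemany("INSERT INTO compounds VALUES (?, ?, ?)",
                [(1, "CCO", 46.07), (2, "c1ccccc1", 78.11)])
con.executemany("INSERT INTO patents VALUES (?, ?)",
                [(10, "XX-0000001-A1"), (11, "XX-0000002-B1")])
con.executemany("INSERT INTO fields VALUES (?, ?)",
                [(1, "Description"), (2, "Claims"), (3, "Abstract"),
                 (4, "Title"), (5, "Image"), (6, "MOL attachments")])
con.executemany("INSERT INTO patent_compound_map VALUES (?, ?, ?)",
                [(10, 1, 1), (10, 1, 2), (10, 2, 1), (11, 2, 2)])


def compounds_in_field(connection, fieldname):
    """Return (patent_number, smiles) pairs for compounds found in one patent field."""
    query = """
        SELECT p.patent_number, c.smiles
        FROM patent_compound_map AS m
        JOIN patents AS p ON p.id = m.patent_id
        JOIN compounds AS c ON c.id = m.compound_id
        JOIN fields AS f ON f.id = m.field_id
        WHERE f.fieldname = ?
        ORDER BY p.patent_number, c.id
    """
    return connection.execute(query, (fieldname,)).fetchall()


print(compounds_in_field(con, "Claims"))
```

Each row of patent_compound_map records one occurrence of a compound in one section of one patent, so the same compound can appear several times for a single patent.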

Format

We produce the datasets in Parquet format, which allows us to expose nested information in a machine-readable way. Parquet is a columnar storage format widely used in big data platforms such as Apache Spark and Hadoop. It is designed for efficient querying, particularly for analytics that access only a subset of columns. Benefits include faster read performance, better compression, and support for complex nested data.

While interacting with Parquet files requires using a library (e.g., Pandas, PyArrow, Polars, DuckDB in Python), it’s very similar to querying a SQL database.
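As an illustration of reading only a subset of columns, here is a minimal sketch using PyArrow. The file name compounds.parquet, the helper name, and the default column selection are assumptions; the actual layout of a release may differ.

```python
from typing import Optional, Sequence

# Columns taken from the compounds table described above.
COMPOUND_COLUMNS = ("id", "smiles", "mol_weight")


def read_compounds(path: str,
                   columns: Sequence[str] = COMPOUND_COLUMNS,
                   max_mol_weight: Optional[float] = None):
    """Read selected columns of a compounds Parquet file into a pyarrow.Table."""
    import pyarrow.parquet as pq  # third-party dependency: pip install pyarrow

    filters = None
    if max_mol_weight is not None:
        # Row filter applied while reading, so non-matching rows are skipped.
        filters = [("mol_weight", "<=", max_mol_weight)]
    return pq.read_table(path, columns=list(columns), filters=filters)


# Hypothetical usage, assuming a local file named compounds.parquet:
#   table = read_compounds("compounds.parquet", max_mol_weight=500.0)
#   print(table.num_rows)
#   print(table.to_pylist()[:5])
```

Because Parquet is columnar, columns left out of the selection are not read from disk, which is where most of the speed-up over row-based formats comes from.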

Note: we no longer provide an out-of-the-box solution for database creation (the SureChEMBL data client no longer receives data updates).

Access

The Parquet files are available here.

Frequency of release

The bulk data are updated every two weeks with new Parquet files containing the whole dataset. Because each release is independent of the previous one, the data schema might change to offer more data. In that case, users will be notified.

Usage

Examples of how to use the bulk data will be available soon in this notebook. Ultimately, the data can be loaded into a database.
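One possible way to load the files into a database is DuckDB, which can read Parquet directly. This is a sketch under stated assumptions, not an official loader: the helper names, the database file name, and the Parquet paths are hypothetical, and a release may split each table across several files.

```python
from typing import Dict, List

# Table names as described in the Content section.
TABLES = ("compounds", "patents", "patent_compound_map", "fields")


def build_load_statements(files: Dict[str, str]) -> List[str]:
    """Build one CREATE TABLE statement per (table name, Parquet path) pair."""
    statements = []
    for table, path in files.items():
        if table not in TABLES:
            raise ValueError(f"unknown table: {table}")
        escaped = path.replace("'", "''")  # escape single quotes for the SQL literal
        statements.append(
            f"CREATE OR REPLACE TABLE {table} AS "
            f"SELECT * FROM read_parquet('{escaped}')"
        )
    return statements


def load_into_duckdb(db_path: str, files: Dict[str, str]) -> None:
    """Run the generated statements against a DuckDB database file."""
    import duckdb  # third-party dependency: pip install duckdb

    con = duckdb.connect(db_path)
    try:
        for sql in build_load_statements(files):
            con.execute(sql)
    finally:
        con.close()


# Hypothetical usage with assumed local paths:
#   load_into_duckdb("surechembl.duckdb", {"compounds": "compounds.parquet"})
print(build_load_statements({"compounds": "compounds.parquet"})[0])
```

Since every release contains the whole dataset, replacing the tables on each load keeps the database in step with the latest files.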