Bulk data
The only way of downloading up-to-date SureChEMBL data.
Bulk data are machine-readable and allow any user to download the SureChEMBL data.
Content
The bulk data offer access to the whole collection of SureChEMBL compounds, together with the patents from which they were extracted.

compounds
All the compounds in SureChEMBL.
Column | Type | Description
id | INT64 | Compound unique identifier (PK)
smiles | STRING | Canonical SMILES, generated using RDKit
inchi | STRING | IUPAC standard InChI for the compound
inchi_key | STRING | IUPAC standard InChI key for the compound
mol_weight | DOUBLE | Molecular weight of the full compound, including any salts
patents
All the patents containing compounds in SureChEMBL.
Column | Type | Description
id | INT64 | Patent unique identifier (PK)
patent_number | STRING | Standardised format used to search the system. Format: CC-PATNO-KK, e.g. WO-2011161255-A2
country | STRING | Publication country
publication_date | DATE | The date when the patent was first published
family_id | INT64 | Family identifier retrieved from the EPO. Generally, simple families contain all records that share the same priority. The value is -1 if no family ID has been received from DOCDB.
ecla | LIST OF STRINGS | European Classification (replaced by CPC in 2013)
assignee | LIST OF STRINGS | The person or legal entity who owns the entire right, title, and interest in the application. Note that this field contains the assignee information provided by the publishing authority and in most cases doesn’t reflect reassignments.
title | STRING | Title of the document
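The CC-PATNO-KK format described for patent_number can be split into its parts with plain string handling. The following is a minimal sketch, assuming the three parts are always separated by exactly two hyphens, as in the documented example WO-2011161255-A2. The helper parse_patent_number and the PatentNumber type are illustrative names, not part of SureChEMBL.

```python
from typing import NamedTuple


class PatentNumber(NamedTuple):
    """Parts of a patent number in CC-PATNO-KK form (illustrative type)."""

    country: str  # CC part
    number: str   # PATNO part
    kind: str     # KK part


def parse_patent_number(patent_number: str) -> PatentNumber:
    """Split a CC-PATNO-KK string into its three parts.

    Assumption: the parts are separated by exactly two hyphens.
    """
    parts = patent_number.strip().split("-")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not in CC-PATNO-KK form: {patent_number!r}")
    return PatentNumber(*parts)


# The example value given in the patents table description.
parsed = parse_patent_number("WO-2011161255-A2")
print(parsed.country, parsed.number, parsed.kind)
```

Rejecting malformed values with a ValueError, rather than guessing, keeps unexpected formats visible if a future release changes the convention.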
patent_compound_map
Patent-compound relationships with the patent field where the compound is found.
Column | Type | Description
patent_id | INT64 | Patent unique identifier (FK)
compound_id | INT64 | Compound unique identifier (FK)
field_id | INT64 | Field unique identifier (FK)
fields
Patent field descriptions.
Column | Type | Description
id | INT64 | Field unique identifier (PK)
fieldname | INT64 | Patent field where information can be found: 1 - Description; 2 - Claims; 3 - Abstract; 4 - Title; 5 - Image (for patents after 2007); 6 - MOL attachments (US patents after 2007)
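The primary and foreign keys listed for compounds, patents, patent_compound_map and fields suggest how the tables fit together. Below is a rough sketch using pandas with a few made-up rows that follow the documented column names; real data would be read from the Parquet files instead. The patent numbers are placeholders, the function name compounds_with_patents is hypothetical, and the FIELD_NAMES mapping is copied from the field list in the fields table description.

```python
import pandas as pd

# Field identifiers as listed in the fields table description.
FIELD_NAMES = {
    1: "Description",
    2: "Claims",
    3: "Abstract",
    4: "Title",
    5: "Image",
    6: "MOL attachments",
}

# Made-up rows for illustration only (not real SureChEMBL records).
compounds = pd.DataFrame({"id": [1, 2], "smiles": ["CCO", "c1ccccc1"]})
patents = pd.DataFrame(
    {"id": [10, 11], "patent_number": ["XX-0000001-A1", "XX-0000002-A1"]}
)
patent_compound_map = pd.DataFrame(
    {"patent_id": [10, 10, 11], "compound_id": [1, 2, 1], "field_id": [2, 1, 3]}
)


def compounds_with_patents(compounds, patents, patent_compound_map):
    """Resolve each patent-compound link to readable values."""
    merged = (
        patent_compound_map
        # Follow the patent_id foreign key to patents.id.
        .merge(patents, left_on="patent_id", right_on="id")
        .drop(columns="id")
        # Follow the compound_id foreign key to compounds.id.
        .merge(compounds, left_on="compound_id", right_on="id")
        .drop(columns="id")
    )
    merged["field"] = merged["field_id"].map(FIELD_NAMES)
    return merged[["patent_number", "smiles", "field"]]


result = compounds_with_patents(compounds, patents, patent_compound_map)
print(result)
```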
biomedical_entities
All the non-chemical annotations in SureChEMBL. We provide the captured terms, the corrected terms, and the resolved form when the entity can be mapped to an ontology or controlled vocabulary.
Column | Type | Description
id | INT64 | Biomedical entity unique identifier (PK)
type_id | INT64 | Unique biomedical entity type identifier (FK)
original_text | STRING | Text entity as it was captured in the patent
corrected_text | STRING | Corrected text
resolved_form | STRING | Identifier from a relevant data source based on the entity type: 1 - UniProt & HGNC; 2 - MeSH, Disease Ontology & Wikipedia
biomedical_locations
Column | Type | Description
id | INT64 | Biomedical entity location unique identifier (PK)
entity_id | INT64 | Biomedical entity unique identifier (FK)
patent_id | INT64 | Patent unique identifier (FK)
field_id | INT64 | Field unique identifier (FK)
count | INT64 | Entity-document-field triplet count
biomedical_types
Column | Type | Description
id | INT64 | Unique biomedical entity type identifier (PK)
type_name | STRING | Leadmine entity type name
description | STRING | Entity type description
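The three biomedical tables can be combined in the same way as the compound tables: biomedical_locations points to biomedical_entities through entity_id, and biomedical_entities points to biomedical_types through type_id. The sketch below sums the count column per patent and entity type using pandas. All rows are made up, the type names type_a and type_b are placeholders (the actual Leadmine type names are not listed here), and mentions_per_patent is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows following the documented columns (illustration only).
biomedical_types = pd.DataFrame(
    {"id": [1, 2], "type_name": ["type_a", "type_b"]}  # placeholder names
)
biomedical_entities = pd.DataFrame(
    {"id": [100, 101], "type_id": [1, 2], "original_text": ["term A", "term B"]}
)
biomedical_locations = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "entity_id": [100, 100, 101],
        "patent_id": [10, 10, 10],
        "field_id": [1, 2, 1],
        "count": [3, 1, 2],
    }
)


def mentions_per_patent(locations, entities, types):
    """Sum entity mention counts per patent and entity type."""
    # Rename the primary keys so they line up with the foreign key names.
    ents = entities.rename(columns={"id": "entity_id"})
    typs = types.rename(columns={"id": "type_id"})
    merged = (
        locations.drop(columns="id")
        .merge(ents, on="entity_id")
        .merge(typs, on="type_id")
    )
    return merged.groupby(["patent_id", "type_name"], as_index=False)["count"].sum()


summary = mentions_per_patent(
    biomedical_locations, biomedical_entities, biomedical_types
)
print(summary)
```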
Format
We produce the datasets in Parquet format, which allows us to expose nested information in a machine-readable way. Parquet is a columnar storage format widely used in big data platforms like Apache Spark and Hadoop. It’s designed for efficient querying, particularly for analytics that access only a subset of columns. Benefits include faster read performance, better compression, and support for complex nested data.
Interacting with Parquet files requires a library (e.g., pandas, PyArrow, Polars, or DuckDB in Python), but the experience is very similar to querying a SQL database.
Note: we no longer provide an out-of-the-box solution for database creation (the SureChEMBL data client no longer receives data updates).
Access
The parquet files are available here.
Frequency of release
The bulk data are updated every 2 weeks with new Parquet files containing the whole dataset. Because every release is independent of the previous one, the data schema might change to offer more data. In such cases, users will be notified.
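Since the schema may change between releases, a pipeline could compare the columns it finds against the ones it expects before processing a new release. The following is a minimal sketch under that assumption; EXPECTED_COLUMNS covers only two of the documented tables and schema_diff is a hypothetical helper. With PyArrow, the actual column names of a file could be obtained with pyarrow.parquet.read_schema(path).names.

```python
# Columns documented for two of the tables; extend for the others as needed.
EXPECTED_COLUMNS = {
    "compounds": {"id", "smiles", "inchi", "inchi_key", "mol_weight"},
    "patent_compound_map": {"patent_id", "compound_id", "field_id"},
}


def schema_diff(table, actual_columns):
    """Report columns that are missing or newly added for a table."""
    expected = EXPECTED_COLUMNS[table]
    actual = set(actual_columns)
    return {
        "missing": sorted(expected - actual),
        "added": sorted(actual - expected),
    }


# Hypothetical release in which compounds gained one extra column.
diff = schema_diff(
    "compounds",
    ["id", "smiles", "inchi", "inchi_key", "mol_weight", "new_column"],
)
print(diff)
```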
Usage
Examples of how to use the bulk data will be available soon in this notebook. Ultimately, the data can be loaded into a database.
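One possible way to load the data into a database is sketched below with SQLite from the Python standard library. This is an illustration under assumptions, not an official loader: the table definitions cover only some of the documented columns, the rows are made up, and in practice they would be read from the Parquet files rather than typed in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example

# A reduced version of the documented schema, with the PK/FK relationships.
conn.executescript(
    """
    CREATE TABLE compounds (
        id INTEGER PRIMARY KEY,
        smiles TEXT,
        mol_weight REAL
    );
    CREATE TABLE patents (
        id INTEGER PRIMARY KEY,
        patent_number TEXT
    );
    CREATE TABLE patent_compound_map (
        patent_id INTEGER REFERENCES patents(id),
        compound_id INTEGER REFERENCES compounds(id),
        field_id INTEGER
    );
    """
)

# Made-up rows; the patent numbers are placeholders.
conn.executemany(
    "INSERT INTO compounds VALUES (?, ?, ?)",
    [(1, "CCO", 46.07), (2, "c1ccccc1", 78.11)],
)
conn.executemany(
    "INSERT INTO patents VALUES (?, ?)",
    [(10, "XX-0000001-A1"), (11, "XX-0000002-A1")],
)
conn.executemany(
    "INSERT INTO patent_compound_map VALUES (?, ?, ?)",
    [(10, 1, 2), (10, 2, 1), (11, 1, 3)],
)
conn.commit()

# List the compounds found in each patent.
rows = conn.execute(
    """
    SELECT p.patent_number, c.smiles
    FROM patent_compound_map AS m
    JOIN patents AS p ON p.id = m.patent_id
    JOIN compounds AS c ON c.id = m.compound_id
    ORDER BY p.patent_number, c.smiles
    """
).fetchall()
print(rows)
```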