Biomedical annotations
Note: At the moment, our biomedical annotations are only available through the bulk data download. Bear with us while we work on the UI integration.
We adopted a new approach for non-chemical annotations, moving from our in-house NLP model to LeadMine, a commercial grammar- and dictionary-based system developed by NextMove Software.
LeadMine uses curated and public dictionaries with a custom grammar for fast and accurate text annotation. It has options to automatically correct the spelling mistakes that OCR frequently introduces into patent text. Using the provided dictionaries, it can also resolve an annotation to a unique identifier.
Using these functionalities, we annotated all patents in SureChEMBL for three key biomedical entity types and matched them to the relevant data source when possible:
Gene/Protein: HGNC, UniProt
Disease: MeSH, Human Disease Ontology
Mechanism (terms such as inhibitor, antagonist, modulator, etc.)
LeadMine is fast, robust, and easily scalable: exactly what we need for SureChEMBL production-level annotation!
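To give a feel for what dictionary-based annotation with spelling tolerance and identifier resolution looks like, here is a minimal Python sketch. It is not LeadMine's API or algorithm: the `annotate` function, the toy dictionary, and the `EXAMPLE:` identifiers are all hypothetical, and the fuzzy matching uses Python's standard `difflib` module as a simple stand-in for real OCR-aware spelling correction.

```python
import re
from difflib import get_close_matches

# Toy dictionary: lowercase term -> (entity type, identifier).
# The identifiers are made-up placeholders, NOT real HGNC/UniProt/MeSH entries.
TOY_DICTIONARY = {
    "egfr": ("gene", "EXAMPLE:G1"),
    "asthma": ("disease", "EXAMPLE:D1"),
    "inhibitor": ("mechanism", None),  # no identifier attached in this sketch
}


def annotate(text, dictionary=TOY_DICTIONARY, cutoff=0.85):
    """Return (start, end, surface, entity_type, identifier) for each match."""
    annotations = []
    terms = list(dictionary)
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        token = match.group().lower()
        if token in dictionary:
            # Exact dictionary hit.
            key = token
        else:
            # Fall back to the closest dictionary term, so that OCR-style
            # misspellings can still be matched.
            close = get_close_matches(token, terms, n=1, cutoff=cutoff)
            if not close:
                continue
            key = close[0]
        entity_type, identifier = dictionary[key]
        annotations.append(
            (match.start(), match.end(), match.group(), entity_type, identifier)
        )
    return annotations


# "inhibltor" is a deliberate misspelling of "inhibitor".
result = annotate("An EGFR inhibltor for the treatment of asthma.")
for start, end, surface, entity_type, identifier in result:
    print(start, end, surface, entity_type, identifier)
```

Each annotation keeps the character offsets and the original surface text, so a misspelled mention can still be traced back to its place in the patent while being resolved to the dictionary entry.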

Patent annotation with LeadMine. Colour code: orange = generic chemical name, pink = generic molecule, grey = anatomy, violet = molecule dictionary, turquoise = mechanism, green = PubChem dictionary, dark red = gene, yellow = polymer, light red = journal, khaki = organism, dark orange = disease.
For now, the new biomedical annotations are limited to these three types. More entity types, or custom dictionaries, may follow in later phases.