SUSIE: Pharmaceutical CMC ontology-based information extraction for drug machine

Vipul Mann,Shekhar Viswanath, Shankar Vaidyaraman, Jeya Balakrishnan,Venkat Venkatasubramanian

COMPUTERS & CHEMICAL ENGINEERING(2023)

引用 0|浏览0
暂无评分
摘要
Automatically extracting information from unstructured text in pharmaceutical documents is important for drug discovery and development. This information can be integrated with structured datasets to ultimately accelerate pharmaceutical product development. To this end, we report an end-to-end information extraction framework based on a custom-built pharmaceutical drug development ontology, a weak supervision framework, contextualization algorithms, and a fine-tuned BioBERT model (adaptation of BERT or Bidirectional Encoder Representations from Transformers for biomedical text). The proposed framework, SUSIE (Schema-based Un-supervised Semantic Information Extraction), was trained on ICH (International Conference on Harmonization) documents to identify important entities and relations from unstructured text and auto-generate knowledge graphs representing crucial information in a structured format. On the entity identification task, the framework achieves a test accuracy and F1-score of 96% and 88%, respectively, on out-of-sample documents. A major contribution of this work is to build an automated, unsupervised information extraction framework around a domain-specific, custom-built pharmaceutical drug development ontology without the need for manual curation of training datasets for specific tasks. The efficacy of the approach was tested on out-of-sample documents including an internal Eli Lilly technical document.
更多
查看译文
关键词
Ontology, Pharmaceutical drug development, Information extraction, Hybrid machine learning, Chemistry manufacturing and control
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要