zERExtractor: An Automated Platform for Enzyme-Catalyzed Reaction Data Extraction from Scientific Literature

  • Rui Zhou
  • Haohui Ma
  • Tianle Xin
  • Qiuchen Miao
  • Lixin Zou
  • Qiuyue Hu
  • Hongxi Cheng
  • Jingjing Guo
  • Yuguang Mu
  • Sheng Wang
  • Guoqing Zhang
  • Yanjie Wei
  • Liangzhen Zheng
  • Shenzhen Institute of Advanced Technology
  • University of Chinese Academy of Sciences
  • Shanghai Zelixir Biotech Company Ltd.
  • CAS - Shanghai Institute of Nutrition and Health
  • Nanyang Technological University
  • Shenzhen University of Advanced Technology

Research output: Article, peer-reviewed

Abstract

The rapid expansion of the enzyme reaction literature has created a major bottleneck in database curation, leaving vast numbers of enzyme-substrate-condition relationships unstructured and inaccessible to deep learning (DL)-driven modeling. Making full use of these enzymatic reaction data is an important step toward accurate enzyme activity prediction models. Current DL-based data extraction models rely heavily on large language models (LLMs) and lack both a fidelity check and the ability to evolve continuously. To address these issues, we developed zERExtractor (Zelixir's Enzyme Reaction Data Extractor), an accuracy-oriented and extensible platform for extracting enzyme-catalyzed reaction data from scientific publications. The system offers a unified multimodal information extraction framework (covering molecular reaction diagrams, tables, and text) that integrates enzymatic reaction descriptors into structured storage. We employ fine-tuned LLMs together with DL models in a human-in-the-loop pipeline that evolves through expert validation of data fidelity and active learning. zERExtractor achieves 89.9% accuracy in table recognition and over 98% accuracy in molecular image recognition on synthetic data sets, outperforming the strongest baseline by more than 2% and consistently maintaining accuracy above 95% on realistic benchmarks. zERExtractor bridges the gap in enzyme reaction data with a scalable framework for accurate multimodal extraction, advancing DL-driven enzyme modeling and enabling future applications in computational enzymology and biotechnology. The platform is publicly accessible online at https://zpaper.zelixir.com/.

Original language: English
Pages (from-to): 4296-4309
Number of pages: 14
Journal: Journal of Chemical Information and Modeling
Volume: 66
Issue number: 7
Publication status: Published - 13 Apr 2026
