TY - JOUR
T1 - zERExtractor
T2 - An Automated Platform for Enzyme-Catalyzed Reaction Data Extraction from Scientific Literature
AU - Zhou, Rui
AU - Ma, Haohui
AU - Xin, Tianle
AU - Miao, Qiuchen
AU - Zou, Lixin
AU - Hu, Qiuyue
AU - Cheng, Hongxi
AU - Guo, Jingjing
AU - Mu, Yuguang
AU - Wang, Sheng
AU - Zhang, Guoqing
AU - Wei, Yanjie
AU - Zheng, Liangzhen
PY - 2026/4/13
Y1 - 2026/4/13
N2 - The rapid expansion of enzyme reaction literature has created a major bottleneck in database curation, leaving vast amounts of enzyme-substrate-condition relationships unstructured and inaccessible for deep learning (DL)-driven modeling. Making full use of enzymatic reaction data is an important step toward future accurate enzyme activity prediction models. Current DL-based data extraction models rely heavily on large language models (LLMs) and lack both a fidelity check and the ability to evolve continuously. To address these issues, we developed zERExtractor (Zelixir's Enzyme Reaction Data Extractor), an accuracy-oriented and extensible platform for extracting enzyme-catalyzed reaction data from scientific publications. This system offers a unified multimodal information extraction framework (covering molecular reaction diagrams, tables, and texts) to integrate enzymatic reaction descriptors into structured storage. We employ fine-tuned LLMs together with DL in a human-in-the-loop pipeline that evolves through data fidelity validation by experts and active learning. zERExtractor achieves 89.9% accuracy in table recognition and over 98% accuracy in molecular image recognition on synthetic data sets, outperforming the strongest baseline by more than 2% and consistently maintaining above 95% on realistic benchmarks. zERExtractor bridges the gap in enzyme reaction data with a scalable framework for accurate multimodal extraction, advancing DL-driven enzyme modeling and enabling future applications in computational enzymology and biotechnology. The platform is publicly accessible online at https://zpaper.zelixir.com/.
AB - The rapid expansion of enzyme reaction literature has created a major bottleneck in database curation, leaving vast amounts of enzyme-substrate-condition relationships unstructured and inaccessible for deep learning (DL)-driven modeling. Making full use of enzymatic reaction data is an important step toward future accurate enzyme activity prediction models. Current DL-based data extraction models rely heavily on large language models (LLMs) and lack both a fidelity check and the ability to evolve continuously. To address these issues, we developed zERExtractor (Zelixir's Enzyme Reaction Data Extractor), an accuracy-oriented and extensible platform for extracting enzyme-catalyzed reaction data from scientific publications. This system offers a unified multimodal information extraction framework (covering molecular reaction diagrams, tables, and texts) to integrate enzymatic reaction descriptors into structured storage. We employ fine-tuned LLMs together with DL in a human-in-the-loop pipeline that evolves through data fidelity validation by experts and active learning. zERExtractor achieves 89.9% accuracy in table recognition and over 98% accuracy in molecular image recognition on synthetic data sets, outperforming the strongest baseline by more than 2% and consistently maintaining above 95% on realistic benchmarks. zERExtractor bridges the gap in enzyme reaction data with a scalable framework for accurate multimodal extraction, advancing DL-driven enzyme modeling and enabling future applications in computational enzymology and biotechnology. The platform is publicly accessible online at https://zpaper.zelixir.com/.
UR - https://www.scopus.com/pages/publications/105035656453
U2 - 10.1021/acs.jcim.6c00090
DO - 10.1021/acs.jcim.6c00090
M3 - Article
C2 - 41844379
AN - SCOPUS:105035656453
SN - 1549-9596
VL - 66
SP - 4296
EP - 4309
JO - Journal of Chemical Information and Modeling
JF - Journal of Chemical Information and Modeling
IS - 7
ER -