MicroBERT: Distilling MoE-Based Knowledge from BERT into a Lighter Model

Dashun Zheng, Jiaxuan Li, Yunchu Yang, Yapeng Wang, Patrick Cheong Iao Pang

Research output: Article, peer-reviewed

Abstract

Large language models (LLMs) have greatly improved performance on natural language processing tasks. However, their large parameter counts make them computationally expensive to run and difficult to deploy on resource-constrained devices. To address this problem while maintaining accuracy, techniques such as distillation and quantization have been proposed. Unfortunately, current methods fail to integrate model pruning with downstream tasks and overlook sentence-level semantic modeling, which reduces the efficiency of distillation. To alleviate these limitations, we propose MicroBERT, a novel distilled lightweight model for BERT. The method transfers the knowledge contained in the “teacher” BERT model to a “student” BERT model. A sentence-level feature alignment loss (FAL) distillation mechanism, guided by Mixture-of-Experts (MoE), captures comprehensive contextual semantic knowledge from the “teacher” model to enhance the “student” model’s performance while reducing its parameters. To make the outputs of the “teacher” and “student” models comparable, we introduce the idea of a generative adversarial network (GAN) to train a discriminator. Our experimental results on four datasets show that all steps of our distillation mechanism are effective, and that the MicroBERT (101.14%) model outperforms TinyBERT (99%) by 2.24% in terms of average distillation reductions in various tasks on the GLUE dataset.

Original language: English
Article number: 6171
Journal: Applied Sciences (Switzerland)
Volume: 14
Issue number: 14
Publication status: Published - Jul 2024
