MicroBERT: Distilling MoE-Based Knowledge from BERT into a Lighter Model

Dashun Zheng, Jiaxuan Li, Yunchu Yang, Yapeng Wang, Patrick Cheong Iao Pang

Research output: Contribution to journal › Article › peer-review

Abstract

Large language models (LLMs) have greatly improved performance on natural language processing tasks. However, their large parameter counts make them computationally expensive and difficult to run on resource-constrained devices. To address this problem while maintaining accuracy, techniques such as distillation and quantization have been proposed. Unfortunately, current methods fail to integrate model pruning with downstream tasks and overlook sentence-level semantic modeling, which reduces the efficiency of distillation. To alleviate these limitations, we propose MicroBERT, a novel distilled lightweight model for BERT. This method transfers the knowledge contained in the “teacher” BERT model to a “student” BERT model. The sentence-level feature alignment loss (FAL) distillation mechanism, guided by Mixture-of-Experts (MoE), captures comprehensive contextual semantic knowledge from the “teacher” model to enhance the “student” model’s performance while reducing its parameters. To make the outputs of the “teacher” and “student” models comparable, we borrow the idea of a generative adversarial network (GAN) and train a discriminator. Our experimental results on four datasets show that all steps of our distillation mechanism are effective, and the MicroBERT (101.14%) model outperforms TinyBERT (99%) by 2.24% in terms of average distillation reductions in various tasks on the GLUE dataset.
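The abstract does not give the form of the sentence-level feature alignment loss, so the following is only a minimal sketch of one way such a loss could look. Everything in it is an assumption made for illustration: mean pooling to obtain sentence vectors, linear experts mixed by a softmax gate, a mean-squared-error objective, and all function and parameter names (`mean_pool`, `moe_project`, `feature_alignment_loss`, `gate_weights`, `expert_weights`). The GAN-style discriminator mentioned in the abstract is not modeled here.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors into one sentence-level vector (assumed pooling)."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[d] for tok in token_vectors) / n for d in range(dim)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_project(sentence_vec, gate_weights, expert_weights):
    """Mix the experts' linear projections using softmax gate weights.

    Hypothetical shapes: gate_weights is [num_experts][student_dim];
    each entry of expert_weights is [teacher_dim][student_dim].
    """
    gate = softmax(matvec(gate_weights, sentence_vec))
    expert_outputs = [matvec(w, sentence_vec) for w in expert_weights]
    out_dim = len(expert_outputs[0])
    return [
        sum(g * out[d] for g, out in zip(gate, expert_outputs))
        for d in range(out_dim)
    ]


def feature_alignment_loss(student_tokens, teacher_tokens,
                           gate_weights, expert_weights):
    """MSE between the MoE-projected student sentence vector and the teacher's."""
    student_sentence = mean_pool(student_tokens)
    teacher_sentence = mean_pool(teacher_tokens)
    projected = moe_project(student_sentence, gate_weights, expert_weights)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_sentence)]
    return sum(squared) / len(squared)


# Toy example: a 2-dimensional student, a 3-dimensional teacher, two experts.
student_tokens = [[1.0, 0.0], [0.0, 1.0]]
teacher_tokens = [[1.0, 1.0, 3.0], [1.0, 1.0, 3.0]]
gate_weights = [[0.0, 0.0], [0.0, 0.0]]  # uniform gate: each expert gets 0.5
expert_weights = [
    [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]],
]
loss = feature_alignment_loss(student_tokens, teacher_tokens,
                              gate_weights, expert_weights)
print(loss)
```

In this toy setup the student has a smaller hidden size than the teacher, so the experts also act as the projection that makes the two sentence vectors comparable. In a real training loop the gate and expert weights would be learned jointly with the student.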

Original language: English
Article number: 6171
Journal: Applied Sciences (Switzerland)
Volume: 14
Issue number: 14
Publication status: Published - Jul 2024

Keywords

  • generative adversarial networks
  • knowledge distillation
  • Mixture-of-Experts
  • natural language processing
