TY - JOUR
T1 - A Multimodal Affective Interaction Architecture Integrating BERT-Based Semantic Understanding and VITS-Based Emotional Speech Synthesis
AU - Yuan, Yanhong
AU - Duo, Shuangsheng
AU - Tong, Xuming
AU - Wang, Yapeng
N1 - Publisher Copyright:
© 2025 by the authors.
PY - 2025/8
Y1 - 2025/8
AB - To address coarse emotional representation, low cross-modal alignment efficiency, and insufficient real-time response capabilities in current human–computer emotional language interaction, this paper proposes an affective interaction framework integrating BERT-based semantic understanding with VITS-based emotional speech synthesis. The framework aims to enhance the naturalness, expressiveness, and response efficiency of human–computer emotional interaction. By introducing a modular layered design, a six-dimensional emotional space, a gated attention mechanism, and a dynamic model scheduling strategy, the system overcomes limited emotional representation, modality misalignment, and high-latency responses. Experimental results demonstrate that the framework achieves superior performance in speech synthesis quality (MOS: 4.35), emotion recognition accuracy (91.6%), and response latency (<1.2 s), outperforming baseline models such as Tacotron2 and FastSpeech2. Through model lightweighting, GPU parallel inference, and load-balancing optimization, cross-linguistic tests on English and Chinese corpora validate the system's robustness and generalizability. The modular architecture and dynamic scheduling ensure scalability and efficiency, enabling a more humanized and immersive interaction experience in typical application scenarios such as psychological companionship, intelligent education, and high-concurrency customer service. This study provides an effective technical pathway for developing the next generation of personalized and immersive affective intelligent interaction systems.
KW - affective intelligence
KW - emotion modeling
KW - human–computer interaction
KW - multimodal interaction
KW - speech synthesis
UR - https://www.scopus.com/pages/publications/105014426549
U2 - 10.3390/a18080513
DO - 10.3390/a18080513
M3 - Article
AN - SCOPUS:105014426549
SN - 1999-4893
VL - 18
JO - Algorithms
JF - Algorithms
IS - 8
M1 - 513
ER -