Abstract
Multimodal learning is a promising area of artificial intelligence (AI) that aims to enable models to understand different kinds of data. Existing works typically re-train a new model on top of pre-trained models, which requires large amounts of data, computation power, and time and is therefore difficult in low-resource or small-sample settings. Therefore, we propose VL-Meta, vision-language models for multimodal meta-learning. VL-Meta (1) introduces the vision-language mapper and the multimodal fusion mapper, two lightweight modules that map images into the language feature space of existing pre-trained models, saving training data, computation power, and time; (2) constructs a meta-task pool that needs only a small amount of data to build sufficient training tasks and improves the model's generalization over both data knowledge and task knowledge; (3) proposes token-level training, which aligns inputs with outputs during training to improve model performance; and (4) adopts a multi-task fusion loss so that the model learns multiple abilities jointly. VL-Meta achieves good performance on the Visual Question Answering (VQA) task, which shows the feasibility and effectiveness of the model. This solution can help blind or visually impaired individuals obtain visual information.
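To make the abstract's high-level description more concrete, the sketch below shows one plausible way a lightweight vision-language mapper and a multi-task fusion loss could look in PyTorch. It is a minimal sketch under assumptions: a frozen image encoder producing a pooled feature vector and a frozen GPT-style language model; the class names, dimensions, number of visual tokens, and task weights are illustrative, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class VisionLanguageMapper(nn.Module):
    """Lightweight mapper projecting features from a frozen image encoder
    into the token-embedding space of a frozen language model (illustrative)."""

    def __init__(self, vision_dim: int, lm_dim: int, num_visual_tokens: int = 8):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        self.lm_dim = lm_dim
        # A small two-layer MLP keeps the trainable parameter count low,
        # since both pre-trained backbones stay frozen.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim * num_visual_tokens),
            nn.GELU(),
            nn.Linear(lm_dim * num_visual_tokens, lm_dim * num_visual_tokens),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, vision_dim), pooled output of the frozen image encoder.
        prefix = self.proj(image_features)
        # Reshape into a short sequence of "visual tokens" that can be prepended
        # to the question's token embeddings before the language model.
        return prefix.view(-1, self.num_visual_tokens, self.lm_dim)


def multi_task_fusion_loss(task_losses: dict, task_weights: dict) -> torch.Tensor:
    """Combine per-task losses (e.g. answer generation, image-text alignment)
    into a single weighted training objective."""
    return sum(task_weights[name] * loss for name, loss in task_losses.items())
```

In such a setup, only the mapper modules would receive gradients while the pre-trained backbones stay frozen, which is consistent with the abstract's claim of saving training data, computation power, and time.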
| Original language | English |
|---|---|
| Article number | 286 |
| Journal | Mathematics |
| Volume | 12 |
| Issue number | 2 |
| DOIs | |
| Publication status | Published - Jan 2024 |
News/Media

- Faculty of Applied Sciences Researcher Adds New Findings in the Area of Mathematics (VL-Meta: Vision-Language Models for Multimodal Meta-Learning)
  LAM, C. T., MA, H. & NG, K. K. B., 2 Feb 2024
  1 item of media coverage