Video captioning is the automatic generation of natural-language descriptions of the content of a video. It spans two important fields, computer vision and natural language processing, and has become a notable research topic in smart-life applications. Deep learning has been applied to this task with good results. A video contains information in several modalities, yet most existing solutions approach the task purely from the visual perspective, ignoring the equally important audio modality. How to benefit from additional cues beyond visual information therefore remains a major challenge. In this work, we propose a video caption generation method that fuses multimodal features from the video and adds an attention mechanism to improve the quality of the generated description sentences. Experimental results on the MSR-VTT dataset validate the effectiveness of the method.
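The fusion described above can be sketched minimally as attention-weighted combination of per-modality features. This is an illustrative sketch, not the paper's implementation: the feature vectors, the bilinear scoring matrix `W`, and the query vector are all hypothetical stand-ins (in a real decoder the query would come from the language model's hidden state, and the features from pretrained visual and audio encoders).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-video features, already projected to a common
# dimension d (in practice: a CNN for the visual stream, an audio
# encoder for the acoustic stream).
d = 8
visual_feat = rng.standard_normal(d)
audio_feat = rng.standard_normal(d)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Bilinear attention over the two modalities: score each modality
# feature against a query vector, normalize the scores with softmax,
# and fuse by the weighted sum. W and query are randomly initialized
# here purely for illustration.
W = rng.standard_normal((d, d))
query = rng.standard_normal(d)

modalities = np.stack([visual_feat, audio_feat])  # shape (2, d)
scores = modalities @ W @ query                   # one score per modality
weights = softmax(scores)                         # attention distribution
fused = weights @ modalities                      # fused context vector, shape (d,)
```

The fused context vector would then condition the caption decoder at each generation step, letting the model lean on whichever modality is more informative for the current word.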