An Investigation of CNN-CARU for Image Captioning

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review



The goal of image description is to extract the essential information from an image and describe its content. Such a description can be obtained directly from a human-understandable description of an image of interest (a retrieval-based approach, using an image with its object(s) and a description of their actions), or it can be produced by an encoder–decoder neural network. The challenge for a learning model is that it must project the image features into natural language, producing the description in a different feature domain, so it may misidentify the scene or its semantic elements. In this chapter, we attempt to address these challenges by introducing a novel image captioning framework that combines generation and retrieval. We introduce a CNN-CARU model in which the image is first encoded by a CNN-based network, and multiple captions are then generated for the target image by a CARU-based RNN.

Original language: English
Title of host publication: Signals and Communication Technology
Publisher: Springer Science and Business Media Deutschland GmbH
Number of pages: 9
Publication status: Published - 2024

Publication series

Name: Signals and Communication Technology
Volume: Part F1810
ISSN (Print): 1860-4862
ISSN (Electronic): 1860-4870


  • CARU
  • CNN
  • Encoder–decoder network
  • Image captioning
  • NLP

