Enhancing neural mean teacher learning-based emotion-centric model for image captioning

Publisher

Laurentian University Library & Archives

Abstract

Image captioning is a task at the intersection of computer vision and natural language processing: generating a textual description of an image's content. The goal is a system that accurately recognizes the objects, attributes, and relationships depicted in an image and describes them in natural language, typically as a sentence or short paragraph. One state-of-the-art method for this task is Nemesis (Neural Mean Teacher Learning-based Emotion-centric Speaker), a neural speaker that leverages emotional supervision signals during caption generation. Nemesis has been applied to the recently introduced ArtEmis dataset, the first large-scale dataset for emotion-centric image captioning, which contains 455K emotional descriptions of 80K artworks from WikiArt. In this study, I employ a simple but improved variant of Self-Critical Sequence Training, obtained by changing the baseline function used in the REINFORCE algorithm. Compared with the baseline that relies on greedy decoding, the updated baseline improves performance at no additional cost.
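
The abstract does not say which baseline replaces the greedy-decoding one. As a rough illustration only, the sketch below contrasts the standard self-critical advantage (sample reward minus the greedy caption's reward) with one commonly used alternative, assumed here and not confirmed by the abstract: a leave-one-out baseline, where each sampled caption is compared against the mean reward of the other captions sampled for the same image. The function names and the toy reward values are hypothetical.

```python
from typing import List


def greedy_baseline_advantages(sample_rewards: List[float], greedy_reward: float) -> List[float]:
    """Standard SCST: subtract the greedy-decoded caption's reward from each sample's reward."""
    return [r - greedy_reward for r in sample_rewards]


def leave_one_out_advantages(sample_rewards: List[float]) -> List[float]:
    """Assumed variant: each sample's baseline is the mean reward of the OTHER samples."""
    n = len(sample_rewards)
    if n < 2:
        raise ValueError("need at least two sampled captions per image")
    total = sum(sample_rewards)
    return [r - (total - r) / (n - 1) for r in sample_rewards]


def reinforce_loss(log_probs: List[float], advantages: List[float]) -> float:
    """REINFORCE surrogate loss: negative mean of advantage-weighted sequence log-probabilities."""
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(log_probs)


# Toy example: rewards (e.g. a caption metric score) for three sampled captions of one image.
rewards = [1.0, 2.0, 3.0]
advantages = leave_one_out_advantages(rewards)
loss = reinforce_loss([-1.0, -2.0, -3.0], advantages)
```

Under this assumption, the baseline is computed from captions that are already sampled for training, so no separate greedy decoding pass is needed, which is one way a changed baseline could come at no extra cost.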
