Enhancing neural mean teacher learning-based emotion-centric model for image captioning
Abstract
Image captioning is a task at the intersection of computer vision and natural language processing that involves generating a textual description of the content of an image. The goal is to build a system that accurately recognizes the objects, attributes, and relationships depicted in an image and produces a meaningful description in natural language, typically a sentence or short paragraph. One state-of-the-art method for this task is Nemesis (Neural Mean Teacher Learning-based Emotion-centric Speaker), a neural speaker capable of leveraging emotional supervision signals during caption generation. Nemesis has been applied to the recently introduced ArtEmis dataset, the first large-scale dataset for emotion-centric image captioning, containing 455K emotional descriptions of 80K artworks from WikiArt. In this study, I employed a straightforward but improved version of Self-Critical Sequence Training (SCST), obtained through a simple modification of the choice of baseline function in the REINFORCE algorithm. Compared with the standard baseline computed by greedy decoding, the updated baseline yields better performance at no additional cost.
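To make the baseline modification concrete, the sketch below contrasts the classic SCST baseline (the reward of the greedy-decoded caption) with a commonly used alternative that requires no extra decoding pass: for each sampled caption, the mean reward of the other captions sampled for the same image. The abstract does not spell out the exact baseline used here, so this alternative, as well as the tensor shapes and function names, are illustrative assumptions rather than the actual Nemesis implementation.

    # Illustrative sketch only: rewards (e.g. CIDEr) are assumed to be computed
    # outside this snippet; names and shapes are assumptions, not the Nemesis API.
    import torch

    def scst_loss(seq_log_probs, sample_rewards, baseline):
        """REINFORCE with a baseline: L = -(r - b) * log p(sampled caption)."""
        advantage = (sample_rewards - baseline).detach()   # (batch, k)
        return -(advantage * seq_log_probs).mean()

    def greedy_baseline(greedy_rewards):
        """Classic SCST baseline: reward of the greedy-decoded caption."""
        return greedy_rewards.unsqueeze(1)                 # (batch, 1), broadcast over k samples

    def mean_of_samples_baseline(sample_rewards):
        """Alternative baseline: for each sample, the mean reward of the other
        k-1 captions sampled for the same image (requires k >= 2), so no
        separate greedy decoding pass is needed."""
        k = sample_rewards.size(1)
        others_sum = sample_rewards.sum(dim=1, keepdim=True) - sample_rewards
        return others_sum / (k - 1)

    # Usage sketch, with k sampled captions per image:
    #   seq_log_probs  : (batch, k) summed log-probabilities of each sampled caption
    #   sample_rewards : (batch, k) reward (e.g. CIDEr) of each sampled caption
    # loss = scst_loss(seq_log_probs, sample_rewards,
    #                  mean_of_samples_baseline(sample_rewards))

Under these assumptions, the advantage term still centers each sampled caption's reward, but the baseline is derived from quantities already computed during sampling, which is one way a modified baseline can avoid the cost of the extra greedy decoding pass.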