Enhancing neural mean teacher learning-based emotion-centric model for image captioning

Date

2023-11-09

Publisher

Laurentian University Library & Archives

Abstract

Image captioning is a task at the intersection of computer vision and natural language processing in which a system generates a textual description of an image's content. The goal is to build a system that accurately recognizes the objects, attributes, and relationships depicted in an image and describes them meaningfully in natural language, typically as a sentence or a short paragraph. One state-of-the-art method for image captioning is Nemesis (Neural Mean Teacher Learning-based Emotion-centric Speaker), a neural speaker that leverages emotional supervision signals during caption generation. Nemesis has been applied to the recently introduced ArtEmis dataset, the first large-scale dataset for emotion-centric image captioning, which contains 455K emotional descriptions of 80K artworks from WikiArt. In this study, I employed a simple but improved variant of Self-Critical Sequence Training (SCST). The change lies in the choice of baseline function in the REINFORCE algorithm: standard SCST uses the reward of a caption produced by greedy decoding as the baseline, and I replaced it with an updated baseline. Compared with the greedy-decoding baseline, the updated baseline improves performance without any additional computational cost.
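
The role of the baseline in REINFORCE can be illustrated with a minimal Python sketch. It assumes that several captions are sampled per image and that each receives a scalar reward (for example a CIDEr score). It then compares the standard SCST greedy baseline with a leave-one-out mean baseline, in which each sampled caption is scored against the average reward of the other captions sampled for the same image. The abstract does not state which updated baseline was used. The mean baseline here is therefore an illustrative assumption, not necessarily the method of this thesis, and all function names are hypothetical.

```python
from typing import List, Sequence


def greedy_baseline_advantages(sample_rewards: Sequence[float],
                               greedy_reward: float) -> List[float]:
    """Standard SCST: subtract the reward of the greedy-decoded caption
    from the reward of each sampled caption (requires an extra greedy decode)."""
    return [r - greedy_reward for r in sample_rewards]


def mean_baseline_advantages(sample_rewards: Sequence[float]) -> List[float]:
    """Hypothetical alternative baseline: for each sampled caption, use the
    mean reward of the OTHER k-1 samples of the same image. It reuses rewards
    that are already computed, so no extra decoding pass is needed."""
    k = len(sample_rewards)
    if k < 2:
        raise ValueError("need at least two sampled captions per image")
    total = sum(sample_rewards)
    return [r - (total - r) / (k - 1) for r in sample_rewards]


def reinforce_loss(log_probs: Sequence[float],
                   advantages: Sequence[float]) -> float:
    """Policy-gradient surrogate loss: -(1/k) * sum_i A_i * log p(caption_i).
    In a real model the advantages are treated as constants (no gradient);
    minimizing this loss raises the probability of captions whose reward
    beats the baseline and lowers it for the rest."""
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same length")
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(log_probs)


if __name__ == "__main__":
    rewards = [0.8, 0.5, 0.2]        # illustrative rewards for 3 sampled captions
    log_probs = [-1.0, -2.0, -3.0]   # illustrative caption log-probabilities
    print(greedy_baseline_advantages(rewards, greedy_reward=0.6))
    print(mean_baseline_advantages(rewards))
    print(reinforce_loss(log_probs, mean_baseline_advantages(rewards)))
```

Because the leave-one-out baseline is built only from captions that are already sampled for the policy-gradient update, it skips the greedy-decoding pass. That is one way a baseline can improve on greedy decoding at no extra cost.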

Keywords

Image captioning, Natural language processing, ArtEmis dataset, REINFORCE algorithm