Learning Modality-Invariant Features by Cross-Modality Adversarial Network for Visual Question Answering

Ze Fu, Changmeng Zheng, Yi Cai*, Qing Li, Tao Wang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding · Conference paper · peer-review


Abstract

Visual Question Answering (VQA) is a typical multimodal task with significant development prospects for web applications. To answer a question about a corresponding image, a VQA model must use information from the different modalities efficiently. Although multimodal fusion methods such as attention mechanisms have contributed significantly to VQA, these methods co-learn the multimodal features directly, ignoring the large gap between modalities and therefore aligning their semantics poorly. In this paper, we propose a Cross-Modality Adversarial Network (CMAN) to address this limitation. Our method combines cross-modality adversarial learning with modality-invariant attention learning, aiming to learn modality-invariant features for better semantic alignment and higher answer-prediction accuracy. The model achieves 70.81% accuracy on the test-dev split of the VQA-v2 dataset. Our results also show that the model effectively narrows the gap between modalities and improves the alignment of multimodal information.
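Although the paper's implementation is not reproduced on this page, the cross-modality adversarial learning described above is closely related to domain-adversarial training: a modality discriminator learns to tell image features from question features, while a gradient-reversal layer trains both encoders to produce features the discriminator cannot separate. Below is a minimal PyTorch sketch of that idea; the names (GradReverse, ModalityDiscriminator), the 512-dimensional shared feature space, and the loss weighting are illustrative assumptions, not the authors' code.

# Minimal sketch of cross-modality adversarial feature alignment in the
# spirit of CMAN. Assumes PyTorch; all module names, dimensions, and loss
# weights below are illustrative choices, not the paper's implementation.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; reverses (and scales) gradients in the
    # backward pass, so the encoders are trained to FOOL the discriminator.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class ModalityDiscriminator(nn.Module):
    # Predicts whether a feature vector came from the image encoder or the
    # question encoder (binary classification over a shared feature space).
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 2),  # 0 = image modality, 1 = text modality
        )

    def forward(self, feat, lambd=1.0):
        return self.net(GradReverse.apply(feat, lambd))

# Usage: img_feat and txt_feat stand in for pooled encoder outputs already
# projected into a shared 512-d space, shape (batch, 512).
batch = 8
img_feat = torch.randn(batch, 512, requires_grad=True)
txt_feat = torch.randn(batch, 512, requires_grad=True)

disc = ModalityDiscriminator(dim=512)
logits = disc(torch.cat([img_feat, txt_feat], dim=0), lambd=0.5)
labels = torch.cat([torch.zeros(batch), torch.ones(batch)]).long()

# Adversarial alignment loss: the discriminator minimizes it directly,
# while the reversed gradients push both encoders toward features the
# discriminator cannot separate, i.e. modality-invariant features.
adv_loss = nn.functional.cross_entropy(logits, labels)
adv_loss.backward()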

Original language: English
Title of host publication: Web and Big Data - 5th International Joint Conference, APWeb-WAIM 2021, Proceedings
Editors: Leong Hou U, Marc Spaniol, Yasushi Sakurai, Junying Chen
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 316-331
Number of pages: 16
ISBN (Print): 9783030858957
Publication status: Published - 2021
Event: 5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021 - Guangzhou, China
Duration: 23 Aug 2021 – 25 Aug 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12858 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021
Country/Territory: China
City: Guangzhou
Period: 23/08/2021 – 25/08/2021

Keywords

  • Domain adaptation
  • Modality-invariant co-learning
  • Visual question answering
