Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

TMM Volume 25 | 2023

By:

Kun Zhang; Zhendong Mao; An-An Liu; Yongdong Zhang

Image-text matching, a fundamental cross-modal task, bridges the gap between vision and language. Its core is to accurately learn semantic alignment so as to find the relevant shared semantics of an image and a text. Existing methods typically treat all fragments whose word-region similarity exceeds an empirical threshold of zero as relevant shared semantics, e.g., via a ReLU operation that sets negative similarities to zero and keeps positive ones. However, this fixed threshold is entirely isolated from feature learning, so it cannot adaptively and accurately distinguish the varying distributions of relevant and irrelevant word-region similarities during training, which inevitably limits semantic alignment learning. To solve this issue, we propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism, which incorporates the relevance threshold into a unified learning framework to maximally distinguish the relevant and irrelevant distributions and thereby obtain better semantic alignment. Specifically, our method adaptively learns the optimal relevance boundary between these two distributions, which helps the model learn more discriminative features. The explicit relevance threshold is integrated into similarity matching, which brings two benefits: (1) it excludes the disturbance of irrelevant fragment contents, so that precisely relevant shared semantics are aggregated and matching accuracy improves, and (2) it avoids computation for irrelevant fragment queries, which reduces retrieval time. Experimental results on benchmarks show that UARDA substantially and consistently outperforms state-of-the-art methods, with relative rSum improvements of 2% to 4% (16.9% to 35.3% for the baseline SCAN) and a reduction in retrieval time of 50% to 73%.
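To make the role of the relevance threshold concrete, here is a minimal sketch of threshold-gated word-to-region attention. It is not the UARDA implementation: the function names, the cosine similarity, and the sum normalization are assumptions chosen for illustration, and the threshold is passed in as a plain number, whereas the abstract describes it as learned jointly with the features. Setting `threshold=0.0` corresponds to the fixed ReLU cut-off that existing methods use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def thresholded_attention(word, regions, threshold=0.0):
    """Aggregate region features for one word (illustrative only).

    Regions whose similarity to the word is at or below `threshold`
    get zero weight. `threshold` is a hypothetical stand-in for the
    learned relevance boundary; 0.0 reproduces the plain ReLU case.
    Returns (attended_vector, weights).
    """
    sims = [cosine(word, r) for r in regions]
    # ReLU shifted by the threshold: irrelevant regions contribute nothing.
    weights = [max(s - threshold, 0.0) for s in sims]
    total = sum(weights)
    if total == 0.0:
        # No relevant region: this word could be skipped entirely,
        # which is where a retrieval-time saving would come from.
        return [0.0] * len(word), weights
    weights = [w / total for w in weights]
    dim = len(regions[0])
    attended = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return attended, weights


# Toy data (made up): one word and three regions in a 2-D feature space.
word = [1.0, 0.0]
regions = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]

for t in (0.0, 0.7):
    attended, weights = thresholded_attention(word, regions, threshold=t)
    print(t, weights, attended)
```

With the zero threshold, the weakly related second region still receives weight; raising the threshold to 0.7 removes it, so only the clearly matching region is aggregated.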

Associating vision and language, the two most prevalent modalities, is crucial for artificial intelligence to understand the real world. Image-text matching, which aims to bridge the semantic gap between these two heterogeneous modalities, is a fundamental research topic in the cross-modal field [1]–[8]. It benefits many multimodal applications, such as visual question answering [3], [8] and image captioning [6]. The task is to retrieve images for a given textual description or to find texts for a given image query. Despite exciting progress, image-text matching still faces the challenge of accurately learning semantic alignment to find the relevant shared semantics between modalities for measuring similarity.
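The bidirectional retrieval task and the rSum metric can be sketched as follows. The text above does not define rSum; the sketch assumes the common convention in this line of work, where rSum is the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions. The function names and the toy similarity matrix are hypothetical.

```python
def recall_at_k(sim, relevant, k):
    """Percentage of queries with at least one relevant candidate in the top k.

    sim[q][c] is the similarity of query q to candidate c;
    relevant[q] is the set of candidate indices that match query q.
    """
    hits = 0
    for q, row in enumerate(sim):
        # Rank candidates by descending similarity.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if any(c in relevant[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


def rsum(sim_i2t, rel_i2t, sim_t2i, rel_t2i, ks=(1, 5, 10)):
    """Sum of Recall@K over image-to-text and text-to-image retrieval.

    The default ks follow the assumed convention, not a definition
    taken from the text.
    """
    image_to_text = sum(recall_at_k(sim_i2t, rel_i2t, k) for k in ks)
    text_to_image = sum(recall_at_k(sim_t2i, rel_t2i, k) for k in ks)
    return image_to_text + text_to_image


# Toy example (made up): 3 image queries scored against 3 texts.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
relevant = [{0}, {1}, {2}]

for k in (1, 2, 3):
    print(k, recall_at_k(sim, relevant, k))
```

A relative rSum improvement, as quoted in the abstract, would then be the difference between two models' rSum values divided by the baseline's rSum.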
