Image-text matching, as a fundamental cross-modal task, bridges the gap between vision and language. Its core is to accurately learn semantic alignment, i.e., to find the relevant shared semantics between image and text. Existing methods typically attend to all fragments whose word-region similarity exceeds an empirical threshold of zero as relevant shared semantics, e.g., via a ReLU operation that forces negative similarities to zero and keeps positive ones. However, this fixed threshold is completely isolated from feature learning and therefore cannot adaptively and accurately distinguish the varying distributions of relevant and irrelevant word-region similarities during training, which inevitably limits semantic alignment learning. To solve this issue, we propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism that incorporates the relevance threshold into a unified learning framework so as to maximally separate the relevant and irrelevant distributions and obtain better semantic alignment. Specifically, our method adaptively learns the optimal relevance boundary between these two distributions, which helps the model learn more discriminative features. The explicit relevance threshold is well integrated into similarity matching and kills two birds with one stone: (1) it excludes the disturbance of irrelevant fragment contents so that precisely relevant shared semantics are aggregated, boosting matching accuracy; and (2) it avoids computing queries for irrelevant fragments, reducing retrieval time. Experimental results on benchmarks show that UARDA substantially and consistently outperforms state-of-the-art methods, with relative rSum improvements of 2%-4% (16.9%-35.3% over the baseline SCAN), while reducing retrieval time by 50%-73%.
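To make the mechanism concrete, the following PyTorch sketch contrasts the fixed zero threshold implied by a ReLU with a relevance boundary learned jointly with the features, as the abstract describes. It is a minimal illustration under assumed design choices (module name, cosine similarity on L2-normalized features, and sum-normalized aggregation are ours), not the paper's exact UARDA formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveRelevanceAttention(nn.Module):
    """Illustrative sketch only: word-to-region attention with a learnable
    relevance threshold, generalizing the fixed zero cut-off of a ReLU."""

    def __init__(self, init_threshold: float = 0.0):
        super().__init__()
        # Learnable relevance boundary; a plain ReLU corresponds to the
        # special case threshold = 0, frozen and isolated from feature learning.
        self.threshold = nn.Parameter(torch.tensor(init_threshold))

    def forward(self, words: torch.Tensor, regions: torch.Tensor):
        # words:   (n_words, d)   L2-normalized word features
        # regions: (n_regions, d) L2-normalized region features
        sim = words @ regions.t()                            # cosine similarities
        gated = torch.clamp(sim - self.threshold, min=0.0)   # zero out pairs below the learned boundary
        relevant = gated > 0                                  # mask of relevant word-region pairs
        # Normalize only over regions deemed relevant, so irrelevant
        # fragments contribute nothing to the aggregated shared semantics.
        attn = gated / gated.sum(dim=-1, keepdim=True).clamp(min=1e-8)
        attended = attn @ regions                             # (n_words, d) shared semantics per word
        return attended, relevant


# Toy usage: random unit-norm features for 5 words and 8 regions.
words = F.normalize(torch.randn(5, 256), dim=-1)
regions = F.normalize(torch.randn(8, 256), dim=-1)
attended, relevant = AdaptiveRelevanceAttention()(words, regions)
print(attended.shape, relevant.sum().item())
```

Because the threshold is an `nn.Parameter`, gradients from the matching loss flow into it through the unclamped entries, so the relevance boundary adapts together with the features rather than staying fixed at zero; the `relevant` mask also indicates which fragment queries could be skipped at retrieval time.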
Associating the two most prevalent modalities, vision and language, is crucial for artificial intelligence to understand the real world. Image-text matching, which aims to bridge the semantic gap between these two heterogeneous modalities, is a fundamental research problem in the cross-modal field [1]–[8]. It benefits many multimodal applications, such as visual question answering [3], [8] and image captioning [6]. The matching task aims to retrieve images for a given textual description or to find texts w.r.t. an image query. Despite exciting progress, image-text matching remains challenging in how to accurately learn semantic alignment, i.e., how to find the relevant shared semantics between modalities for measuring similarity.