Most current models for analyzing multimodal sequences often disregard the imbalanced contributions of individual modal representations caused by varying information densities, as well as the inherent multi-relational interactions across distinct modalities. Consequently, a biased understanding of the intricate interplay among modalities may be fostered, limiting prediction accuracy and effectiveness.