Scene-Text Visual Question Answering (STVQA) is a comprehensive task that requires reading and understanding the text in an image in order to answer a question about it. Existing methods that explore the vision-language relationships among questions, images, and scene text have achieved impressive results. However, these methods rely heavily on auxiliary modules, such as external OCR systems and object detection networks, which makes the question-answering process cumbersome and heavily dependent on those modules.