A Multilingual Approach to Scene Text Visual Question Answering

Brugués i Pujolràs, Josep; Gómez i Bigordà, Lluís; Karatzas, Dimosthenis

doi:10.1007/978-3-031-06555-2_5


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13237)

Included in the following conference series:

International Workshop on Document Analysis Systems


Abstract

Scene Text Visual Question Answering (ST-VQA) has recently emerged as a hot research topic in Computer Vision. Current ST-VQA models have great potential for many types of applications, but they cannot perform well on more than one language at a time, both because multilingual data is scarce and because they are trained with monolingual word embeddings. In this work, we explore the possibility of obtaining bilingual and multilingual VQA models. To that end, we take an established VQA model that uses monolingual word embeddings as part of its pipeline and replace them with FastText and BPEmb multilingual word embeddings that have been aligned to English. Our experiments demonstrate that it is possible to obtain bilingual and multilingual VQA models with a minimal loss in performance in languages not used during training, as well as a multilingual model trained on multiple languages that matches the performance of the respective monolingual baselines.
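The idea of embeddings "aligned to English" can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors, the `ALIGNED` table, and the helper names `cosine` and `nearest_english` are all hypothetical, chosen only to show that, once two languages share one vector space, a word from one language lands close to its translation in the other. A real system would load pre-trained FastText or BPEmb vectors instead.

```python
import math

# Hypothetical toy vectors standing in for word embeddings that have
# already been aligned to the English space (real ones have hundreds of dims).
ALIGNED = {
    "en": {"red": [0.90, 0.10, 0.00], "sign": [0.10, 0.90, 0.10], "bus": [0.00, 0.20, 0.90]},
    "es": {"rojo": [0.88, 0.15, 0.02], "cartel": [0.12, 0.85, 0.15], "autobús": [0.05, 0.18, 0.92]},
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_english(word, lang):
    """Return the English word whose vector is closest to `word` from `lang`."""
    query = ALIGNED[lang][word]
    return max(ALIGNED["en"], key=lambda en_word: cosine(query, ALIGNED["en"][en_word]))


for es_word in ALIGNED["es"]:
    print(es_word, "->", nearest_english(es_word, "es"))
```

Because the downstream model only sees vectors, a question word in a language unseen during training can still be mapped to a region of the space the model already understands, which is the intuition behind the transfer described in the abstract.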


Notes

1. An n-gram is a contiguous sequence of n items from a given sample of text or speech.
2. The hubness problem arises when some words (hubs) are the nearest neighbor of too many other words.
3. https://fasttext.cc/
4. https://bpemb.h-its.org/
5. https://github.com/babylonhealth/fastText_multilingual
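
As a small illustration of the n-gram definition in note 1, the sketch below extracts character n-grams from a word. The function name `char_ngrams` and the use of `<` and `>` as word-boundary markers are assumptions made for this example (the markers follow the convention described for subword models such as FastText); it is not code from the paper.

```python
def char_ngrams(word, n):
    """Return the contiguous character n-grams of `word`.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams distinct from those found in the middle of a word.
    """
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("where", 3))
```

A word shorter than n - 2 characters yields an empty list, since the padded string has fewer than n characters.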


Acknowledgment

This work has been supported by: Grant PDC2021-121512-I00 funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR; Project PID2020-116298GB-I00 funded by MCIN/AEI/10.13039/501100011033; Grant PLEC2021-007850 funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR.

Author information

Authors and Affiliations

Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain
Josep Brugués i Pujolràs, Lluís Gómez i Bigordà & Dimosthenis Karatzas

Authors

Josep Brugués i Pujolràs
Lluís Gómez i Bigordà
Dimosthenis Karatzas

Corresponding author

Correspondence to Lluís Gómez i Bigordà.

Editor information

Editors and Affiliations

Kyushu University, Fukuoka, Japan
Seiichi Uchida
Boise State University, Boise, ID, USA
Elisa Barney
LIRIS UMR CNRS, Villeurbanne, France
Véronique Eglin

About this paper

Cite this paper

Brugués i Pujolràs, J., Gómez i Bigordà, L., Karatzas, D. (2022). A Multilingual Approach to Scene Text Visual Question Answering. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_5

DOI: https://doi.org/10.1007/978-3-031-06555-2_5
Published: 18 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06554-5
Online ISBN: 978-3-031-06555-2
eBook Packages: Computer Science, Computer Science (R0)
