Read While You Drive - Multilingual Text Tracking on the Road

  • Conference paper
  • In: Document Analysis Systems (DAS 2022)

Abstract

Visual data obtained during driving scenarios usually contain large amounts of text that conveys semantic information necessary to analyse the urban environment and is integral to the traffic control plan. Yet, research on autonomous driving or driver assistance systems typically ignores this information. To advance research in this direction, we present RoadText-3K, a large driving video dataset with fully annotated text. RoadText-3K is three times bigger than its predecessor and contains data from varied geographical locations, unconstrained driving conditions and multiple languages and scripts. We offer a comprehensive analysis of tracking by detection and detection by tracking methods exploring the limits of state-of-the-art text detection. Finally, we propose a new end-to-end trainable tracking model that yields state-of-the-art results on this challenging dataset. Our experiments demonstrate the complexity and variability of RoadText-3K and establish a new, realistic benchmark for scene text tracking in the wild.
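
The paper's model itself is not reproduced here, but the tracking-by-detection paradigm the abstract refers to can be illustrated with a minimal sketch: per-frame text detections are associated with existing tracks by solving a bipartite matching over IoU overlap, in the spirit of SORT [4], using the Hungarian method [17]. The function names and the IoU threshold below are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian method [17]

def iou(a, b):
    """IoU of two axis-aligned boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_frame(track_boxes, det_boxes, iou_threshold=0.3):
    """Associate one frame's detections with existing tracks by
    maximising total IoU; unmatched detections seed new tracks.
    A real tracker would also keep per-track state (e.g. motion)."""
    if not track_boxes or not det_boxes:
        return [], list(range(len(det_boxes)))
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols)
               if 1.0 - cost[r, c] >= iou_threshold]
    matched_dets = {c for _, c in matches}
    unmatched = [c for c in range(len(det_boxes)) if c not in matched_dets]
    return matches, unmatched
```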

Notes

  1. For CTPN, EAST and FOTS we used unofficial implementations of the original methods; for CRAFT we used the authors' released implementation.

  2. We used the py-motmetrics implementation available at https://github.com/cheind/py-motmetrics.
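
As a usage sketch of the evaluation tooling mentioned in note 2: the CLEAR MOT metrics [3] are typically computed with py-motmetrics by accumulating frame-by-frame ground-truth/hypothesis associations and then summarising. The toy data, variable names and the 0.5 IoU cut-off below are illustrative, not the paper's evaluation protocol.

```python
import motmetrics as mm

# Toy data: one frame with two ground-truth boxes and two hypotheses,
# boxes in [x, y, width, height] format. Illustrative values only.
frames = [
    ([1, 2], [[10, 10, 50, 20], [200, 40, 60, 20]],
     ['a', 'b'], [[12, 11, 50, 20], [205, 42, 58, 19]]),
]

# One accumulator per video sequence; auto_id assigns consecutive frame ids.
acc = mm.MOTAccumulator(auto_id=True)

for gt_ids, gt_boxes, hyp_ids, hyp_boxes in frames:
    # Pairs with IoU below 0.5 are treated as impossible matches (NaN).
    dists = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
    acc.update(gt_ids, hyp_ids, dists)

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=['mota', 'motp', 'idf1'], name='sequence')
print(mm.io.render_summary(summary, formatters=mh.formatters,
                           namemap=mm.io.motchallenge_metric_names))
```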

References

  1. Lukežič, A., Vojíř, T., Čehovin, L., Matas, J., Kristan, M.: Discriminative correlation filter tracker with channel and spatial reliability. IJCV 126, 671–688 (2018)

  2. Baek, Y., Lee, B., Han, D., Yun, S., Lee, H.: Character region awareness for text detection. In: CVPR (2019)

  3. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP J. Image Video Process. 2008, 1–10 (2008)

  4. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP (2016)

  5. Bochkovskiy, A., Wang, C.-Y., Liao, H.-Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)

  6. Cerf, M., Frady, E.P., Koch, C.: Faces and text attract gaze independent of the task: experimental data and computer model. J. Vision 9, 10 (2009)

  7. Cheng, Z., et al.: FREE: a fast and robust end-to-end video text spotter. IEEE Trans. Image Process. 30, 822–837 (2020)

  8. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)

  9. Donoser, M., Bischof, H.: Efficient maximally stable extremal region (MSER) tracking. In: CVPR (2006)

  10. Gomez, L., Karatzas, D.: MSER-based real-time text detection and tracking. In: ICPR (2014)

  11. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: CVPR (2016)

  12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  13. Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: Exploiting the circulant structure of tracking-by-detection with kernels. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 702–715. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_50

  14. Kalal, Z., Mikolajczyk, K., Matas, J.: Forward-backward error: automatic detection of tracking failures. In: ICPR (2010)

  15. Karatzas, D., et al.: ICDAR 2015 competition on robust reading. In: ICDAR (2015)

  16. Karatzas, D., et al.: ICDAR 2013 robust reading competition. In: ICDAR (2013)

  17. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2, 83–97 (1955)

  18. Liao, M., Shi, B., Bai, X.: TextBoxes++: a single-shot oriented scene text detector. TIP 27, 3676–3690 (2018)

  19. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)

  20. Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., Yan, J.: FOTS: fast oriented text spotting with a unified network. In: CVPR (2018)

  21. Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016)

  22. Minetto, R., Thome, N., Cord, M., Leite, N.J., Stolfi, J.: SnooperTrack: text detection and tracking for outdoor videos. In: ICIP (2011)

  23. Misra, D.: Mish: a self regularized non-monotonic neural activation function. arXiv preprint arXiv:1908.08681 (2019)

  24. Nguyen, P.X., Wang, K., Belongie, S.: Video text detection and recognition: dataset and benchmark. In: WACV (2014)

  25. Petter, M., Fragoso, V., Turk, M., Baur, C.: Automatic text detection for mobile augmented reality translation. In: ICCV Workshops (2011)

  26. Reddy, S., Mathew, M., Gomez, L., Rusinol, M., Karatzas, D., Jawahar, C.V.: RoadText-1K: text detection & recognition dataset for driving videos. In: ICRA (2020)

  27. Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.K., Woo, W.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: NeurIPS (2015)

  28. Tian, S., Pei, W.-Y., Zuo, Z.-Y., Yin, X.-C.: Scene text detection in video by learning locally and globally. In: IJCAI (2016)

  29. Tian, S., Yin, X.-C., Su, Y., Hao, H.-W.: A unified framework for tracking based text detection and recognition from web videos. IEEE Trans. Pattern Anal. Mach. Intell. 40(3), 542–554 (2017)

  30. Tian, Z., Huang, W., He, T., He, P., Qiao, Yu.: Detecting text in natural image with connectionist text proposal network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 56–72. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_4

  31. Topolšek, D., Areh, I., Cvahte, T.: Examination of driver detection of roadside traffic signs and advertisements using eye tracking. Transp. Res. Part F: Traffic Psychol. Behav. 43, 212–224 (2016)

  32. Veit, A., Matera, T., Neumann, L., Matas, J., Belongie, S.: COCO-text: dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140 (2016)

  33. Wang, X., et al.: End-to-end scene text recognition in videos based on multi frame tracking. In: ICDAR (2017)

  34. Williams, D.: The Arbitron National In-Car Study. Arbitron Inc., Columbia (2009)

  35. Wu, W., et al.: A bilingual, OpenWorld video text dataset and end-to-end video text spotter with transformer. In: NeurIPS 2021 Track on Datasets and Benchmarks (2021)

  36. Yu, H., Huang, Y., Pi, L., Zhang, C., Li, X., Wang, L.: End-to-end video text detection with online tracking. PR 113, 107791 (2021)

  37. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)

  38. Zhou, X., et al.: EAST: an efficient and accurate scene text detector. In: CVPR (2017)

Acknowledgements

This work has been supported by the Pla de Doctorats Industrials de la Secretaria d’Universitats i Recerca del Departament d’Empresa i Coneixement de la Generalitat de Catalunya; Grant PDC2021-121512-I00 funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR; Project PID2020-116298GB-I00 funded by MCIN/AEI/10.13039/501100011033; Grant PLEC2021-007850 funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR; the Spanish Project NEOTEC SNEO-20211172 from CDTI and CREATEC-CV IMCBTA/2020/46 from IVACE; and IHub-Data at IIIT-Hyderabad.

Author information

Corresponding author

Correspondence to Dimosthenis Karatzas.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Garcia-Bordils, S. et al. (2022). Read While You Drive - Multilingual Text Tracking on the Road. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_51

  • DOI: https://doi.org/10.1007/978-3-031-06555-2_51

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06554-5

  • Online ISBN: 978-3-031-06555-2

  • eBook Packages: Computer Science, Computer Science (R0)
