Read While You Drive - Multilingual Text Tracking on the Road

  • Conference paper
  • In: Document Analysis Systems (DAS 2022)

Abstract

Visual data obtained during driving scenarios usually contain large amounts of text that conveys semantic information necessary to analyse the urban environment and is integral to the traffic control plan. Yet, research on autonomous driving or driver assistance systems typically ignores this information. To advance research in this direction, we present RoadText-3K, a large driving video dataset with fully annotated text. RoadText-3K is three times bigger than its predecessor and contains data from varied geographical locations, unconstrained driving conditions and multiple languages and scripts. We offer a comprehensive analysis of tracking by detection and detection by tracking methods exploring the limits of state-of-the-art text detection. Finally, we propose a new end-to-end trainable tracking model that yields state-of-the-art results on this challenging dataset. Our experiments demonstrate the complexity and variability of RoadText-3K and establish a new, realistic benchmark for scene text tracking in the wild.
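
The paper's model itself is not reproduced here, but the tracking-by-detection paradigm the abstract refers to can be illustrated with a minimal sketch: per-frame text detections are associated with existing tracks by solving a bipartite matching over IoU overlap, in the spirit of SORT [4], using the Hungarian method [17]. The function names and the IoU threshold below are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian method [17]

def iou(a, b):
    """IoU of two axis-aligned boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_frame(track_boxes, det_boxes, iou_threshold=0.3):
    """Associate one frame's detections with existing tracks by
    maximising total IoU; unmatched detections seed new tracks.
    A real tracker would also keep per-track state (e.g. motion)."""
    if not track_boxes or not det_boxes:
        return [], list(range(len(det_boxes)))
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols)
               if 1.0 - cost[r, c] >= iou_threshold]
    matched_dets = {c for _, c in matches}
    unmatched = [c for c in range(len(det_boxes)) if c not in matched_dets]
    return matches, unmatched
```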

Notes

  1. For CTPN, EAST and FOTS we used unofficial implementations of the original methods; for CRAFT we used the authors' released implementation.

  2. We used the py-motmetrics implementation available at https://github.com/cheind/py-motmetrics.
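
As a usage sketch of the evaluation tooling mentioned in note 2: the CLEAR MOT metrics [3] are typically computed with py-motmetrics by accumulating frame-by-frame ground-truth/hypothesis associations and then summarising. The toy data, variable names and the 0.5 IoU cut-off below are illustrative, not the paper's evaluation protocol.

```python
import motmetrics as mm

# Toy data: one frame with two ground-truth boxes and two hypotheses,
# boxes in [x, y, width, height] format. Illustrative values only.
frames = [
    ([1, 2], [[10, 10, 50, 20], [200, 40, 60, 20]],
     ['a', 'b'], [[12, 11, 50, 20], [205, 42, 58, 19]]),
]

# One accumulator per video sequence; auto_id assigns consecutive frame ids.
acc = mm.MOTAccumulator(auto_id=True)

for gt_ids, gt_boxes, hyp_ids, hyp_boxes in frames:
    # Pairs with IoU below 0.5 are treated as impossible matches (NaN).
    dists = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
    acc.update(gt_ids, hyp_ids, dists)

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=['mota', 'motp', 'idf1'], name='sequence')
print(mm.io.render_summary(summary, formatters=mh.formatters,
                           namemap=mm.io.motchallenge_metric_names))
```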

References

  1. Lukežič, A., Vojíř, T., Čehovin, L., Matas, J., Kristan, M.: Discriminative correlation filter tracker with channel and spatial reliability. IJCV 126, 671–688 (2018)

  2. Baek, Y., Lee, B., Han, D., Yun, S., Lee, H.: Character region awareness for text detection. In: CVPR (2019)

  3. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP J. Image Video Process. 2008, 1–10 (2008)

  4. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP (2016)

  5. Bochkovskiy, A., Wang, C.-Y., Liao, H.-Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)

  6. Cerf, M., Frady, E.P., Koch, C.: Faces and text attract gaze independent of the task: experimental data and computer model. J. Vision 9, 10 (2009)

  7. Cheng, Z., et al.: FREE: a fast and robust end-to-end video text spotter. IEEE Trans. Image Process. 30, 822–837 (2020)

  8. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)

  9. Donoser, M., Bischof, H.: Efficient maximally stable extremal region (MSER) tracking. In: CVPR (2006)

  10. Gomez, L., Karatzas, D.: MSER-based real-time text detection and tracking. In: ICPR (2014)

  11. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: CVPR (2016)

  12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  13. Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: Exploiting the circulant structure of tracking-by-detection with kernels. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 702–715. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_50

  14. Kalal, Z., Mikolajczyk, K., Matas, J.: Forward-backward error: automatic detection of tracking failures. In: ICPR (2010)

  15. Karatzas, D., et al.: ICDAR 2015 competition on robust reading. In: ICDAR (2015)

  16. Karatzas, D., et al.: ICDAR 2013 robust reading competition. In: ICDAR (2013)

  17. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2, 83–97 (1955)

  18. Liao, M., Shi, B., Bai, X.: TextBoxes++: a single-shot oriented scene text detector. TIP 27, 3676–3690 (2018)

  19. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)

  20. Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., Yan, J.: FOTS: fast oriented text spotting with a unified network. In: CVPR (2018)

  21. Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016)

  22. Minetto, R., Thome, N., Cord, M., Leite, N.J., Stolfi, J.: SnooperTrack: text detection and tracking for outdoor videos. In: ICIP (2011)

  23. Misra, D.: Mish: a self regularized non-monotonic neural activation function. arXiv preprint arXiv:1908.08681 (2019)

  24. Nguyen, P.X., Wang, K., Belongie, S.: Video text detection and recognition: dataset and benchmark. In: WACV (2014)

  25. Petter, M., Fragoso, V., Turk, M., Baur, C.: Automatic text detection for mobile augmented reality translation. In: ICCV Workshops (2011)

  26. Reddy, S., Mathew, M., Gomez, L., Rusinol, M., Karatzas, D., Jawahar, C.V.: RoadText-1K: text detection & recognition dataset for driving videos. In: ICRA (2020)

  27. Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.K., Woo, W.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: NeurIPS (2015)

  28. Tian, S., Pei, W.-Y., Zuo, Z.-Y., Yin, X.-C.: Scene text detection in video by learning locally and globally. In: IJCAI (2016)

  29. Tian, S., Yin, X.-C., Su, Y., Hao, H.-W.: A unified framework for tracking based text detection and recognition from web videos. IEEE Trans. Pattern Anal. Mach. Intell. 40(3), 542–554 (2017)

  30. Tian, Z., Huang, W., He, T., He, P., Qiao, Yu.: Detecting text in natural image with connectionist text proposal network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 56–72. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_4

  31. Topolšek, D., Areh, I., Cvahte, T.: Examination of driver detection of roadside traffic signs and advertisements using eye tracking. Transp. Res. Part F: Traffic Psychol. Behav. 43, 212–224 (2016)

  32. Veit, A., Matera, T., Neumann, L., Matas, J., Belongie, S.: COCO-text: dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140 (2016)

  33. Wang, X., et al.: End-to-end scene text recognition in videos based on multi frame tracking. In: ICDAR (2017)

  34. Williams, D.: The Arbitron National In-Car Study. Arbitron Inc., Columbia (2009)

  35. Wu, W., et al.: A bilingual, OpenWorld video text dataset and end-to-end video text spotter with transformer. In: NeurIPS 2021 Track on Datasets and Benchmarks (2021)

  36. Yu, H., Huang, Y., Pi, L., Zhang, C., Li, X., Wang, L.: End-to-end video text detection with online tracking. PR 113, 107791 (2021)

  37. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)

  38. Zhou, X., et al.: EAST: an efficient and accurate scene text detector. In: CVPR (2017)

Acknowledgements

This work has been supported by the Pla de Doctorats Industrials de la Secretaria d’Universitats i Recerca del Departament d’Empresa i Coneixement de la Generalitat de Catalunya; Grant PDC2021-121512-I00 funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR; Project PID2020-116298GB-I00 funded by MCIN/AEI/10.13039/501100011033; Grant PLEC2021-007850 funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR; the Spanish Project NEOTEC SNEO-20211172 from CDTI and CREATEC-CV IMCBTA/2020/46 from IVACE; and IHub-Data at IIIT-Hyderabad.

Author information

Corresponding author

Correspondence to Dimosthenis Karatzas.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Garcia-Bordils, S. et al. (2022). Read While You Drive - Multilingual Text Tracking on the Road. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_51

  • DOI: https://doi.org/10.1007/978-3-031-06555-2_51

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06554-5

  • Online ISBN: 978-3-031-06555-2

  • eBook Packages: Computer Science, Computer Science (R0)
