Elsevier

Pattern Recognition

Volume 129, September 2022, 108766

Pay attention to what you read: Non-recurrent handwritten text-Line recognition

https://doi.org/10.1016/j.patcog.2022.108766

Highlights

  • Novel adaptation of transformers to handwriting recognition tasks, bypassing recurrent neural networks.

  • Competitive results achieved in a low-resource scenario with a synthetically pre-trained model.

  • Extensive ablation and comparative studies conducted to understand and adapt the transformer properly for HTR.

  • Implicit language modelling ability demonstrated.

  • State-of-the-art performance achieved on the public IAM dataset.

Abstract

The advent of recurrent neural networks for handwriting recognition marked an important milestone, reaching impressive recognition accuracies despite the great variability observed across different writing styles. Sequential architectures are a perfect fit to model text lines, not only because of the inherent temporal aspect of text, but also to learn probability distributions over sequences of characters and words. However, using such recurrent paradigms comes at a cost at the training stage, since their sequential pipelines prevent parallelization. In this work, we introduce a novel method that bypasses any recurrence during the training process with the use of transformer models. By using multi-head self-attention layers both at the visual and textual stages, we are able to tackle character recognition as well as to learn language-related dependencies of the character sequences to be decoded. Our model is not constrained to any predefined vocabulary, and is thus able to recognize out-of-vocabulary words, i.e. words that do not appear in the training vocabulary. We significantly advance over prior art and demonstrate that satisfactory recognition accuracies are yielded even in few-shot learning scenarios.

Introduction

Handwritten Text Recognition (HTR) frameworks aim to provide machines with the ability to read and understand human calligraphy. From the applications perspective, HTR is relevant both to digitize the textual contents of ancient document images in historic archives and to process contemporary administrative documentation such as cheques, forms, etc. Even though research in HTR began in the early sixties [1], it is still considered an unsolved problem. The main challenge is the huge variability and ambiguity of the strokes composing the words encountered across different writers. Fortunately, in most cases, the words to decipher do follow a well-defined set of language rules that should also be modelled and taken into account in order to discard gibberish hypotheses and yield higher recognition accuracies. As a result, HTR is often approached by combining technologies from both the computer vision and natural language processing communities.

Handwritten text is a sequential signal by nature, usually a sequence of characters written from left to right in Latin languages. Thus, HTR approaches have usually adopted temporal pattern recognition techniques. The early approaches based on Hidden Markov Models (HMMs) [2] evolved towards the use of Deep Learning techniques, in which Bidirectional Long Short-Term Memory (BLSTM) networks [3] became the standard solution. Recently, inspired by their success in applications such as automatic translation or speech-to-text, Sequence-to-Sequence (Seq2Seq) approaches, consisting of encoder-decoder networks guided by attention mechanisms, have started to be applied to HTR [4]. All the above methods are not only a good fit to process images sequentially, but also have, in principle, the inherent power of language modelling, i.e. learning which character is more likely to follow another at each decoding step. Nonetheless, this language modelling ability has proven to be limited, since recognition performance is in most cases still enhanced by using a separate statistical language model as a post-processing step [5].

Despite the fact that attention-based encoder-decoder architectures have started to be used for HTR with impressive results, one major drawback remains. In all of those cases, the attention mechanisms are still used in conjunction with a recurrent network, either BLSTMs or Gated Recurrent Unit (GRU) networks. Such sequential processing hinders parallelization at the training stage, and severely limits effectiveness on longer sequences by imposing substantial memory requirements.

Motivated by the above observations, Vaswani et al. proposed in [6] the seminal work on the Transformer architecture. Transformers rely entirely on attention mechanisms, relinquishing any recurrent design. Encouraged by this advantage, we propose to address the HTR problem with an architecture inspired by transformers, which dispenses with any recurrent network. By using multi-head self-attention layers both at the visual and textual stages, we aim to tackle the character recognition step itself, as well as to learn language-related dependencies of the character sequences to be decoded.
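The parallelism argument can be made concrete with a toy NumPy sketch of a single multi-head self-attention layer: every position attends to every other position through one batched matrix product, with no recurrent state carried between time steps. This is a minimal illustration of the generic operation, not the paper's trained model; the random projection matrices stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, n_heads, rng):
    # x: (seq_len, d_model) sequence of feature vectors.
    seq_len, d_model = x.shape
    assert d_model % n_heads == 0
    d_head = d_model // n_heads
    # Random projections stand in for the learned matrices W_q, W_k, W_v, W_o.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

    # Project the input and split it into heads: (n_heads, seq_len, d_head).
    def heads(W):
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = heads(Wq), heads(Wk), heads(Wv)
    # Scaled dot-product attention over all positions at once -- no recurrence.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    attn = softmax(scores, axis=-1)                      # each row sums to 1
    # Concatenate the heads and apply the output projection.
    out = (attn @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo                                      # (seq_len, d_model)
```

Because all positions are processed in a single matrix product, the whole sequence can be trained in parallel, in contrast to the step-by-step unrolling of a BLSTM or GRU.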

The use of transformers in different language and vision applications has shown higher performance than recurrent networks, while having the edge over BLSTMs or GRUs by being more parallelizable and thus requiring shorter training times. Our method is, to the best of our knowledge, the first non-recurrent approach for HTR. Moreover, the proposed transformer approach is designed to work at character level, instead of at the wordpiece level [7] commonly used in translation or speech recognition applications. With this design we are not restricted to any predefined fixed vocabulary, so we are able to recognize out-of-vocabulary (OOV) words, i.e. words never seen during training. Competitive state-of-the-art results on the public IAM dataset are reached even when using a small portion of the training data.

The main contributions of our work are summarized as follows. i) For the first time, we explore the use of transformers for the HTR task, bypassing any recurrent architecture. We attempt to learn, with a single unified architecture, to recognize character sequences from images as well as to model language, providing context to distinguish between characters or words that might look similar. The proposed architecture works at character level, avoiding the use of predefined lexicons. ii) Through a pre-training step on synthetic data, the proposed approach is able to yield competitive results with a limited amount of real annotated training data. iii) Extensive ablation and comparative experiments are conducted to validate the effectiveness of our approach. Our proposed HTR system achieves new state-of-the-art performance on the public IAM dataset.

Section snippets

Related work

The recognition of handwritten text has commonly been approached with sequential pattern recognition techniques. Text lines are processed along a temporal sequence by models that leverage their sequence of internal states as memory cells, in order to handle variable-length input signals. Whether we analyze the former approaches based on HMMs [2], [8], [9] or the architectures based on deep neural networks such as BLSTMs [3], Multidimensional LSTMs [10], [11]

Problem formulation

Let {X, Y} be a handwritten text dataset, containing images X of handwritten text lines and their corresponding transcription strings Y. The alphabet defining all the possible characters of Y (letters, digits, punctuation signs, white spaces, etc.) is denoted as A. Given pairs of images x_i ∈ X and their corresponding strings y_i ∈ Y, the proposed recognizer has the ability to combine both sources of information, learning both to interpret visual information and to model language-specific rules.
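This character-level formulation can be illustrated with a short sketch that builds the alphabet A from the training transcriptions Y and encodes a string as a sequence of character indices. The special token names (`<pad>`, `<sos>`, `<eos>`) are illustrative assumptions, not necessarily the paper's exact choices.

```python
# Hypothetical special tokens framing each decoder target sequence.
SPECIALS = ["<pad>", "<sos>", "<eos>"]

def build_alphabet(transcriptions):
    """Collect every character appearing in the training strings Y."""
    chars = sorted({c for y in transcriptions for c in y})
    return SPECIALS + chars

def encode(y, char2idx):
    """Map a transcription string to character indices, framed by
    start/end tokens, forming a character-level decoder target."""
    return ([char2idx["<sos>"]]
            + [char2idx[c] for c in y]
            + [char2idx["<eos>"]])
```

Because the vocabulary lives at character level rather than word level, any word whose characters appear in A can be encoded, which is why out-of-vocabulary words pose no structural problem for this formulation.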


Dataset and performance measures

We conduct our experiments on the popular IAM handwriting dataset [31], composed of modern handwritten English texts. We use the RWTH partition, which consists of 6482, 976 and 2914 lines for training, validation and test, respectively. The size of the alphabet |A| is 83, including special symbols, and the maximum length of the output character sequence is set to 89. All the handwritten text images are resized to the same height of 64 pixels while keeping the aspect ratio, which means that the text
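The aspect-ratio-preserving resize described above can be sketched as a single width computation: the height is fixed at 64 pixels and the width scales with it, so line images of different lengths produce feature sequences of different lengths. This is a minimal sketch of that preprocessing step, not the paper's exact pipeline.

```python
def resized_width(orig_h, orig_w, target_h=64):
    """Width of a text-line image after scaling it to a fixed height of
    target_h pixels while preserving its aspect ratio; the resulting
    width varies with the length of the handwritten line."""
    return max(1, round(orig_w * target_h / orig_h))
```

For example, a 128x1024 line image becomes 64x512, while a shorter 128x256 line becomes 64x128.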

Conclusion and future work

In this paper, we have proposed a novel non-recurrent, open-vocabulary method for handwritten text-line recognition. As far as we know, it is the first approach that adopts transformer networks for the HTR task. We have performed a detailed analysis and evaluation of each module, demonstrating the suitability of the proposed approach. Indeed, the presented results prove that our method not only achieves state-of-the-art performance, but also has the capability to deal with few-shot

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work has been partially supported by the grant 140/09421059 from Shantou University, the Spanish project RTI2018-095645-B-C21, the grant 2016-DI-087 from the Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya, the grant FPU15/06264 from the Spanish Ministerio de Educación, Cultura y Deporte, the Ramon y Cajal Fellowship RYC-2014-16831 and the CERCA Program/ Generalitat de Catalunya. We gratefully acknowledge the support of NVIDIA


References (57)

  • J. Michael et al.

    Evaluating sequence-to-sequence models for handwritten text recognition

    Proceedings of the International Conference on Document Analysis and Recognition

    (2019)
  • C. Tensmeyer et al.

    Language model supervision for handwriting recognition model adaptation

    Proceedings of the International Conference on Frontiers in Handwriting Recognition

    (2018)
  • A. Vaswani et al.

    Attention is all you need

    Proceedings of the Neural Information Processing Systems Conference

    (2017)
  • Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google’s...
  • S. España-Boquera et al.

    Improving offline handwritten text recognition with hybrid HMM/ANN models

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2010)
  • A. Graves et al.

    Offline handwriting recognition with multidimensional recurrent neural networks

    Proceedings of the Neural Information Processing Systems Conference

    (2009)
  • J. Puigcerver

    Are multidimensional recurrent layers really necessary for handwritten text recognition?

    Proceedings of the International Conference on Document Analysis and Recognition

    (2017)
  • T. Bluche

    Joint line segmentation and transcription for end-to-end handwritten paragraph recognition

    Proceedings of the Neural Information Processing Systems Conference

    (2016)
  • L. Kang et al.

    Convolve, attend and spell: an attention-based sequence-to-sequence model for handwritten word recognition

    Proceedings of the German Conference on Pattern Recognition

    (2018)
  • A. Chowdhury et al.

    An efficient end-to-end neural model for handwritten text recognition

    Proceedings of the British Machine Vision Conference

    (2018)
  • A.K. Bhunia et al.

    Handwriting recognition in low-resource scripts using adversarial learning

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2019)
  • N. Gurjar et al.

    Learning deep representations for word spotting under weak supervision

    Proceedings of the IAPR International Workshop on Document Analysis Systems

    (2018)
  • P. Krishnan et al.

    HWNet V2: an efficient word image representation for handwritten documents

    Int. J. Doc. Anal. Recogn.

    (2019)
  • J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language...
  • L. Dong et al.

    Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition

    Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing

    (2018)
  • G. Tu et al.

    Context-and sentiment-aware networks for emotion recognition in conversation

    IEEE Trans. Artif. Intell.

    (2022)
  • F. Sheng et al.

    NRTR: a no-recurrence sequence-to-sequence model for scene text recognition

    Proceedings of the International Conference on Document Analysis and Recognition

    (2019)
  • J. Lee et al.

    On recognizing texts of arbitrary shapes with 2D self-attention

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2020)

    Dr. Lei Kang received the BSc degree from Jilin University, Changchun, China in 2012, MSc degree from University of Science and Technology of China, Hefei, China in 2015, and PhD degree from Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain and omni:us, Berlin, Germany in 2020. He is currently a lecturer of Computer Science Dept. at Shantou University, Shantou, China. His main research interests include Transfer Learning, Domain Adaptation, Attention Mechanisms of Seq2Seq Model and GANs applied to the problem of Handwritten Text Recognition and Synthesis.

    Dr. Pau Riba received the BSc degrees in Mathematics and Computer Science, the MSc and PhD degrees in Computer Vision from the Universitat Autònoma de Barcelona, in 2015, 2016 and 2020, respectively. Currently, he works as an AI research engineer at Helsing AI. His main research interests revolve around Self-supervised Learning, Graph-based Representations and Machine Learning. P. Riba has actively participated in the organization of the GMPRDIA tutorial within the ICDAR 2019 and the GREC workshop within the ICDAR 2021. In addition, he has been awarded the “Best paper award” in ICFHR 2020 and ICPR 2018.

Dr. Marçal Rusiñol received his BSc, MSc and PhD degrees in Computer Sciences from the Universitat Autònoma de Barcelona, in 2004, 2006 and 2009 respectively. He worked as a Marie Curie research fellow at Itesoft and at the Université de La Rochelle, France, in 2012 and 2014 respectively. In 2019 he co-founded the spinoff company AllRead MLT where he currently works.

Dr. Alicia Fornés is a senior research fellow at the Universitat Autònoma de Barcelona (UAB) and the Computer Vision Center. She obtained the PhD degree in Computer Science from the UAB in 2009. She was the recipient of the AERFAI (the Spanish branch of the IAPR, International Association for Pattern Recognition) best thesis award 2009–2010, and the IAPR/ICDAR Young Investigator Award in 2017. She has more than 100 publications related to document analysis and recognition. Her research interests include document image analysis, handwriting recognition, optical music recognition, writer identification and digital humanities.

Dr. Mauricio Villegas received the MSc degree in Pattern Recognition and the PhD degree in Computer Science from the Universitat Politècnica de València, in 2008 and 2011, respectively. He is currently a Senior Data Scientist at omni:us, Berlin, Germany. He participated in two EU-funded projects, FP7 and H2020, both related to handwritten text recognition, and organized competitions related to automatic image annotation (ImageCLEF 2013–2016), handwritten text recognition (ICDAR 2017) and handwritten document retrieval (ImageCLEF 2016).
