Pay attention to what you read: Non-recurrent handwritten text-Line recognition
Introduction
Handwritten Text Recognition (HTR) frameworks aim to provide machines with the ability to read and understand human handwriting. From an applications perspective, HTR is relevant both for digitizing the textual content of ancient document images in historical archives and for processing contemporary administrative documentation such as cheques, forms, etc. Even though research in HTR began in the early sixties [1], it is still considered an unsolved problem. The main challenge is the huge variability and ambiguity of the strokes composing words encountered across different writers. Fortunately, in most cases the words to decipher do follow a well-defined set of language rules, which should also be modelled and taken into account in order to discard gibberish hypotheses and yield higher recognition accuracy. As a result, HTR is often approached by combining technologies from both the computer vision and natural language processing communities.
Handwritten text is inherently sequential: in Latin scripts, it is a sequence of characters written from left to right. Thus, HTR approaches have usually adopted temporal pattern recognition techniques. Early approaches based on Hidden Markov Models (HMMs) [2] evolved towards the use of Deep Learning techniques, in which Bidirectional Long Short-Term Memory (BLSTM) networks [3] became the standard solution. Recently, inspired by their success in applications such as automatic translation or speech-to-text, Sequence-to-Sequence (Seq2Seq) approaches, composed of encoder-decoder networks guided by attention mechanisms, have started to be applied to HTR [4]. All of the above methods are not only a good fit for processing images sequentially, but also have, in principle, the inherent power of language modelling, i.e. learning which character is most likely to follow another at each decoding step. Nonetheless, this language modelling ability has proven to be limited, since recognition performance is in most cases still enhanced by using a separate statistical language model as a post-processing step [5].
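The BLSTM-based recognizers mentioned above are typically trained with the Connectionist Temporal Classification (CTC) objective, whose best-path decoding step can be sketched as below. This is a minimal illustration, not the paper's own decoder; the blank symbol `-` and the function name are our own choices.

```python
# Greedy (best-path) CTC decoding, as commonly paired with BLSTM recognizers:
# take the most likely label per time step, collapse adjacent repeats,
# then drop the CTC blank symbol.

BLANK = "-"  # hypothetical blank symbol for this sketch

def ctc_greedy_decode(per_step_labels):
    """Collapse a per-time-step label sequence into a transcription."""
    out = []
    prev = None
    for lab in per_step_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# Frames labelled h,h,-,e,-,l,l,-,l,l,-,o,o decode to "hello":
print(ctc_greedy_decode(list("hh-e-ll-ll-oo")))  # -> hello
```

Note how the blank is what allows genuinely doubled letters (the two l's in "hello") to survive the repeat-collapsing step.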
Although attention-based encoder-decoder architectures have started to be used for HTR with impressive results, one major drawback remains. In all of those cases, the attention mechanisms are still used in conjunction with a recurrent network, either BLSTMs or Gated Recurrent Unit (GRU) networks. Such sequential processing hinders parallelization during training and severely limits the processing of longer sequences by imposing substantial memory requirements.
Motivated by the above observations, Vaswani et al. proposed in [6] the seminal Transformer architecture. Transformers rely entirely on attention mechanisms, relinquishing any recurrent design. Stimulated by this advantage, we propose to address the HTR problem with an architecture inspired by transformers, which dispenses with any recurrent network. By using multi-head self-attention layers at both the visual and textual stages, we aim to tackle both the character recognition step itself from images, and the learning of language-related dependencies of the character sequences to be decoded.
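The core operation of [6] is scaled dot-product attention; a single-head NumPy sketch is shown below (the multi-head layers used in transformer architectures run several such heads in parallel over learned projections, which we omit here for brevity).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Self-attention: queries, keys and values all come from the same sequence,
# so every position attends to every other in a single parallel step.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))                    # 5 positions, d_model = 16
out, w = scaled_dot_product_attention(X, X, X)
print(out.shape)                                    # (5, 16)
```

Because the whole weight matrix is computed at once, there is no step-by-step recurrence to unroll, which is precisely what makes training parallelizable.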
The use of transformers in different language and vision applications has shown higher performance than recurrent networks, while having the edge over BLSTMs or GRUs by being more parallelizable and thus requiring shorter training times. Our method is, to the best of our knowledge, the first non-recurrent approach for HTR. Moreover, the proposed transformer approach is designed to work at character level, instead of at the wordpiece level [7] commonly used in translation or speech recognition applications. With this design we are not restricted to any predefined fixed vocabulary, so we are able to recognize out-of-vocabulary (OOV) words, i.e. words never seen during training. Competitive state-of-the-art results on the public IAM dataset are reached even when using a small portion of the training data.
The main contributions of our work are summarized as follows. i) For the first time, we explore the use of transformers for the HTR task, bypassing any recurrent architecture. With a single unified architecture, we attempt to learn both to recognize character sequences from images and to model language, providing context to distinguish between characters or words that might look similar. The proposed architecture works at character level, waiving the use of predefined lexicons. ii) By pre-training on synthetic data, the proposed approach is able to yield competitive results with a limited amount of real annotated training data. iii) Extensive ablation and comparative experiments are conducted to validate the effectiveness of our approach. Our proposed HTR system achieves new state-of-the-art performance on the public IAM dataset.
Related work
The recognition of handwritten text has commonly been approached with sequential pattern recognition techniques. Text lines are processed along a temporal sequence by learning models that leverage their sequence of internal states as memory cells, in order to handle variable-length input signals. Whether we analyze the earlier approaches based on HMMs [2], [8], [9] or the architectures based on deep neural networks such as BLSTMs [3] and Multidimensional LSTMs [10], [11]
Problem formulation
Let D = {(x_i, y_i)} be a handwritten text dataset, containing images x_i of handwritten text lines and their corresponding transcription strings y_i. The alphabet defining all the possible characters of y_i (letters, digits, punctuation signs, white spaces, etc.) is denoted as Σ. Given pairs of images x_i and their corresponding strings y_i, the proposed recognizer has the ability to combine both sources of information, learning both to interpret visual information and to model language-specific rules.
Dataset and performance measures
We conduct our experiments on the popular IAM handwriting dataset [31], composed of modern handwritten English text. We use the RWTH partition, which consists of 6482, 976 and 2914 lines for training, validation and test, respectively. The size of the alphabet is 83, including special symbols, and the maximum length of the output character sequence is set to 89. All handwritten text images are resized to the same height of 64 pixels while keeping the aspect ratio, which means that the text
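The fixed-height, variable-width resizing described above can be sketched as a simple shape computation (the function name and the minimum-width guard are our own; the paper only fixes the 64-pixel height).

```python
# Resizing text-line images to a fixed height of 64 px while preserving
# aspect ratio: only the target width varies with the original shape.

TARGET_H = 64

def resized_shape(h, w):
    """Return (height, width) after scaling to TARGET_H with fixed aspect ratio."""
    scale = TARGET_H / h
    return TARGET_H, max(1, round(w * scale))

# A 128x1024 line is halved in both dimensions:
print(resized_shape(128, 1024))  # -> (64, 512)
```

Since widths differ across lines, a batch is typically padded to the widest image it contains.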
Conclusion and future work
In this paper, we have proposed a novel non-recurrent, open-vocabulary method for handwritten text-line recognition. To the best of our knowledge, it is the first approach that adopts transformer networks for the HTR task. We have performed a detailed analysis and evaluation of each module, demonstrating the suitability of the proposed approach. Indeed, the presented results show that our method not only achieves state-of-the-art performance, but also has the capability to deal with few-shot
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work has been partially supported by the grant 140/09421059 from Shantou University, the Spanish project RTI2018-095645-B-C21, the grant 2016-DI-087 from the Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya, the grant FPU15/06264 from the Spanish Ministerio de Educación, Cultura y Deporte, the Ramon y Cajal Fellowship RYC-2014-16831 and the CERCA Program/ Generalitat de Catalunya. We gratefully acknowledge the support of NVIDIA
References (57)
- et al., Handwriting word recognition using windowed Bernoulli HMMs, Pattern Recognit. Lett. (2014)
- et al., Offline continuous handwriting recognition using sequence to sequence neural networks, Neurocomputing (2018)
- et al., MASTER: Multi-aspect non-local network for scene text recognition, Pattern Recognit. (2021)
- et al., Accurate, data-efficient, unconstrained text recognition with convolutional neural networks, Pattern Recognit. (2020)
- et al., Candidate fusion: integrating language modelling into a sequence-to-sequence handwritten word recognition architecture, Pattern Recognit. (2021)
- et al., Hidden Markov model-based ensemble methods for offline handwritten text line recognition, Pattern Recognit. (2008)
- et al., Neural network language models for off-line handwriting recognition, Pattern Recognit. (2014)
- et al., A system for automatic recognition of handwritten words, Proceedings of the Fall Joint Computer Conference (1964)
- et al., Dynamic and contextual information in HMM modeling for handwritten word recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
- et al., A novel connectionist system for unconstrained handwriting recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
- Evaluating sequence-to-sequence models for handwritten text recognition, Proceedings of the International Conference on Document Analysis and Recognition
- Language model supervision for handwriting recognition model adaptation, Proceedings of the International Conference on Frontiers in Handwriting Recognition
- Attention is all you need, Proceedings of the Neural Information Processing Systems Conference
- Improving offline handwritten text recognition with hybrid HMM/ANN models, IEEE Trans. Pattern Anal. Mach. Intell.
- Offline handwriting recognition with multidimensional recurrent neural networks, Proceedings of the Neural Information Processing Systems Conference
- Are multidimensional recurrent layers really necessary for handwritten text recognition?, Proceedings of the International Conference on Document Analysis and Recognition
- Joint line segmentation and transcription for end-to-end handwritten paragraph recognition, Proceedings of the Neural Information Processing Systems Conference
- Convolve, attend and spell: an attention-based sequence-to-sequence model for handwritten word recognition, Proceedings of the German Conference on Pattern Recognition
- An efficient end-to-end neural model for handwritten text recognition, Proceedings of the British Machine Vision Conference
- Handwriting recognition in low-resource scripts using adversarial learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Learning deep representations for word spotting under weak supervision, Proceedings of the IAPR International Workshop on Document Analysis Systems
- HWNet V2: an efficient word image representation for handwritten documents, Int. J. Doc. Anal. Recogn.
- Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing
- Context- and sentiment-aware networks for emotion recognition in conversation, IEEE Trans. Artif. Intell.
- NRTR: a no-recurrence sequence-to-sequence model for scene text recognition, Proceedings of the International Conference on Document Analysis and Recognition
- On recognizing texts of arbitrary shapes with 2D self-attention, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Dr. Lei Kang received the BSc degree from Jilin University, Changchun, China in 2012, MSc degree from University of Science and Technology of China, Hefei, China in 2015, and PhD degree from Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain and omni:us, Berlin, Germany in 2020. He is currently a lecturer of Computer Science Dept. at Shantou University, Shantou, China. His main research interests include Transfer Learning, Domain Adaptation, Attention Mechanisms of Seq2Seq Model and GANs applied to the problem of Handwritten Text Recognition and Synthesis.
Dr. Pau Riba received the BSc degrees in Mathematics and Computer Science, the MSc and PhD degrees in Computer Vision from the Universitat Autònoma de Barcelona, in 2015, 2016 and 2020, respectively. Currently, he works as an AI research engineer at Helsing AI. His main research interests revolve around Self-supervised Learning, Graph-based Representations and Machine Learning. P. Riba has actively participated in the organization of the GMPRDIA tutorial within the ICDAR 2019 and the GREC workshop within the ICDAR 2021. In addition, he has been awarded the “Best paper award” in ICFHR 2020 and ICPR 2018.
Dr. Marçal Rusiñol received his BSc, MSc and PhD degrees in Computer Science from the Universitat Autònoma de Barcelona, in 2004, 2006 and 2009, respectively. He worked as a Marie Curie research fellow at Itesoft and at the Université de La Rochelle, France, in 2012 and 2014, respectively. In 2019 he co-founded the spinoff company AllRead MLT, where he currently works.
Dr. Alicia Fornés is a senior research fellow at the Universitat Autònoma de Barcelona (UAB) and the Computer Vision Center. She obtained the PhD degree in Computer Science from the UAB in 2009. She was the recipient of the AERFAI (Spanish brand of the IAPR, International Association for Pattern Recognition) best thesis award 2009–2010, and the IAPR/ICDAR Young Investigator Award in 2017. She has more than 100 publications related to document analysis and recognition. Her research interests include document image analysis, handwriting recognition, optical music recognition, writer identification and digital humanities.
Dr. Mauricio Villegas received MSc degree on Pattern Recognition and PhD degree on Computer Science from the Universitat Politècnica de València, in 2008 and 2011, respectively. He is currently a Senior Data Scientist at omni:us, Berlin, Germany. He participated in two EU funded projects FP7 and H2020, both related to handwritten text recognition, and organized competitions related to automatic image annotation (ImageCLEF 2013–2016), handwritten text recognition (ICDAR 2017) and handwritten document retrieval (ImageCLEF 2016).